Batch text file conversion with Python

Attending conferences and presenting research is a frequent event in the life of an academic. Conference organizing committees that plan these events have a lot on their plate. Even small conferences, such as the one I organized in 2015 (iSLC 2015), can be very demanding.

One thing that conference organizers have to consider is how to implement an article and/or abstract submission process that allows attendees, reviewers, and organizers to fluidly access and distribute these documents. Some of these services are free while others are a paid service. Some services provide better and a more adaptive pipeline for this process.

An important feature of these abstract submission sites is allowing the tagging of abstracts so that organizers can appropriately distribute the content to the best reviewers.

I was approached awhile back because some conference organizers were using an abstract submission site that did not have this feature, so how to distribute the submissions was an open question. Furthermore, the submission website had a suboptimal interface that forced the organizers to download each individual abstract separately rather than as a whole. As a final annoyance to the organizers, the submission process was one such that researchers could submit their work in a variety of formats, including .doc, .docx, .odt, and .pdf files and the conference organizers did not have a way to automatically analyze each of these file types.

To tackle these issues, we settled on the following procedure:

  1. Automatically authenticate with website
  2. Scrape the conference abstracts from the conference website, downloading all attachments
  3. Convert each document into plain text using a variety of packages and combine all text documents into a large listing of all abstracts
  4. Use the combined abstract file to extract individual submissions and use text mining packages to categorize the abstracts
  5. Distribute the abstracts to the appropriate reviewers

Here is the code I wrote to perform steps 1–3. The conference organizers did the text mining and distribution. After the automatic extraction, there were still a tiny number of abstracts that were not converted properly, so this was done by hand. In the end, this saved a massive amount of time and allowed the conference preparation to proceed in an efficient manner.

Improvements on this could include creating more modular code, making the authentication process more secure, and creating a function that accepts a file and returns the text from that document. I hope some of this code may be useful in your future tasks!