Turning a PDF into a Pandas DataFrame

While looking for some specific NYC school information, the only source I could originally find was in the form of a PDF. Nobody wants to go through a PDF and manually enter a bunch of info, so I decided to see if I could extract the data from the PDF with python. Since I was eventually hoping to merge this info into a larger dataframe, I figured I would go ahead and put the PDF into a pandas DataFrame for easy manipulation.

It turns out that I could not actually use the info from the PDF for a couple of reasons, but the value of extracting data from a PDF was immediately obvious, so I thought I would write down the process for future use (and maybe someone else will find it useful). With this particular PDF, we are lucky in that it is already set up in a table.

Thankfully, the PyPDF2 library already exists to extract text from PDFs, so the heavy lifting has been done. We just have to do some cleaning up. First, make sure you have PyPDF2 installed on your environment, then we will import our libraries.

Now we can take a look at the first page of the PDF, by creating an object and then extracting the text (note that the PDF pages are zero-indexed). We can see that its really messy and comes in the form of one really long string, but there is enough order in the chaos with which we can work.

The first thing I did was to strip away the leading 24 characters, which are the unnecessary page header info. The easiest way I could see to convert this to a dataframe was to turn it into a csv. As with most things in code, there are a few different ways to approach this, but this is what I chose to do:

Find a good place to separate text variables with commas

Create breaks to split the text into separate strings for each school/row of data

Clean up the mess

Looking at the raw printout, it turned out that strings of ‘\n \n’ were intuitive places for commas and leftover ‘\n’ strings could be removed.

Line breaks will come after one of three strings: ‘Point of Contact’, ‘.org’, or ‘.gov’. I was having issues with some of the email addresses getting truncated, so I inserted a ‘ break’ at the end of these three strings, which would use to, well, break the string into separate lines of text.

Now we just need to clean up the dataframe a bit. First, we can rename the headers and drop the first row. Next, I decided to keep the different elements of the address separated and just to drop the state, since they are all in NY. This seemed more useful to me for future use, but you do you.

That is as far as I’m going to go, but there is definitely still some work to be done. For example, if you take a look at the tail of the dataframe there are two obvious problems.

There is at least one school which must have had an extra comma inserted into the name.

There is an empty row at the bottom of possibly every page.

These are fixable issues, but that’s a matter of cleaning, and, like I said, I can’t actually use this data, so I’m going to stop spending more time on it. I hope someone (including my future self) finds this helpful.