Use the tm package

tm is the go-to package when it comes to doing text mining/analysis in R.

install.packages("tm")

For our problem, it will help us import a PDF document into R while keeping its structure intact. Plus, it leaves the text ready for any analysis you want to do later.

The readPDF function from the tm package doesn't actually read a PDF file the way pdf_text from the previous example did. Instead, it helps you create your own reader function, the benefit being that you can choose whatever PDF-extracting engine you want.
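For instance, a minimal sketch of the typical pattern looks like this (the file name is a placeholder; the xpdf engine assumes the pdftotext command-line tool is installed, and its -layout option preserves the page layout):

library(tm)

# readPDF() returns a reader *function* rather than the text itself
read <- readPDF(engine = "xpdf", control = list(text = "-layout"))

# Build a one-document corpus from the PDF, then pull out its text,
# one element per line of the extracted document
document <- Corpus(URISource("document.pdf"),
                   readerControl = list(reader = read))
doc <- content(document[[1]])
head(doc)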

Notice the difference from the excerpt produced by the first method: new empty lines appear, corresponding more closely to the layout of the document. In this case, that can help identify where the header stops.

Another difference is how pages are managed. With the second method, you get the whole text at once, with page breaks symbolized by the \f symbol. With the first method, you simply had a list where 1 page = 1 element.

page_breaks <- grep("\f", doc)
doc[page_breaks[1]]

## [1] "\fA/71/PV.62\t13/12/2016"

This is the first line of the second page, with an added \f in front of it.

Extract the right information

Naturally, you don’t want to stop there. Once you have the PDF document in R, you want to extract the actual pieces of text that interest you, and get rid of the rest.

That’s what this part is about.

I will use a few common tools for string manipulation in R (illustrated briefly after this list):

The grep and grepl functions.

Base string manipulation functions (such as strsplit).

The stringr package.
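If these functions are new to you, here is a tiny, self-contained illustration of each (the example strings are made up for this purpose):

library(stringr)

lines <- c("Mr. Smith (Canada):", "The President:", "Agenda item 31")

grep("Mr\\.", lines)          # indices of matching elements: 1
grepl("President", lines)     # logical vector: FALSE TRUE FALSE
strsplit(lines[1], " ")[[1]]  # base R: split a string into words
str_detect(lines, "Agenda")   # stringr equivalent of grepl()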

My goal is to extract all the speeches given by the speakers in the document we've worked on so far (this one), but I don't care about the speeches from the President.

Here are the steps I will follow:

Clean the headers and footers on all pages.

Get the two columns together.

Find the rows of the speakers.

Extract the correct rows.

I will use regular expressions (regex) regularly in the code. If you have absolutely no knowledge of them, I recommend following a tutorial first, because regex is essential as soon as you start handling text data.

If you have some basic knowledge, that should be enough. I’m not a big expert either.

1. Clean the headers and footers on all pages.

Notice how each page contains text at the top and at the bottom that will interfere with our extraction.
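Here is a rough sketch of one way to remove them. The patterns below are assumptions about what the headers and footers contain (the document ID in the header, a job number such as 16-12345 in the footer), not the exact ones from the original code:

# Trim surrounding whitespace so the patterns anchor cleanly
doc <- trimws(doc)

# Assumed patterns: the header repeats the document ID on every page,
# and the footer carries a job number (hypothetical format)
drop <- grepl("A/71/PV\\.62", doc) | grepl("^[0-9]{2}-[0-9]{5}", doc)
doc <- doc[!drop]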

Finally, we could get all the speeches in a list. We can now analyze what each country's representative talks about, how this evolves across more documents and over the years, depending on the topic discussed, and so on.
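For illustration, here is a hedged sketch of what steps 3 and 4 (finding the speaker rows and extracting the speeches) can look like. The speaker pattern is an assumption about how the record introduces speakers; adapt it to your document:

# Assumed format: speeches start with "Mr./Mrs./Ms. Name (Country):"
# or "The President:"
speaker_rows <- grep("^(Mr\\.|Mrs\\.|Ms\\.|The President)", doc)

# Each speech runs from one speaker row to just before the next one
starts <- speaker_rows
ends <- c(speaker_rows[-1] - 1, length(doc))
speeches <- Map(function(s, e) paste(doc[s:e], collapse = " "), starts, ends)

# Keep only the country representatives, dropping the President
speeches <- speeches[!grepl("^The President", doc[speaker_rows])]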

Now, one could argue that for a single document, it would be easier to extract the text semi-manually (by specifying the row numbers by hand, for example). This is true.

But the idea here is to replicate this same process over hundreds, or even thousands, of such documents.

This is where the fun begins, as each document will have its own quirks: the format might evolve, words are sometimes misspelled, and so on. In fact, even with this example, the extraction is not perfect! You can try to improve it if you want.