Reading PDF files into R for text mining

8-DEC-2016 UPDATE: Added section on using the pdftools package for reading PDF files into R.

Let’s say we’re interested in text mining the opinions of the Supreme Court of the United States from the 2014 term. The opinions are published as PDF files at the following web page: http://www.supremecourt.gov/opinions/slipopinion/14. We would probably want to look at all 76 opinions, but for the purposes of this introductory tutorial we’ll just look at the last three of the term, which are the first three listed on the page: (1) Glossip v. Gross, (2) State Legislature v. Arizona Independent Redistricting Comm’n, and (3) Michigan v. EPA. To follow along with this tutorial, download the three opinions by clicking on the name of each case. (If you want to download all the opinions, you may want to look into using a browser extension such as DownThemAll.)
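If you prefer to script the downloads, here is a minimal sketch using base R. The URL path below is an assumption pieced together from the file names that appear later in this article, so verify it against the actual links on the slip opinion page before running.

# Hypothetical sketch: download the three opinions with base R.
# The "14pdf" directory is an assumption; check the links on
# http://www.supremecourt.gov/opinions/slipopinion/14 first.
base_url <- "http://www.supremecourt.gov/opinions/14pdf/"
pdfs <- c("13-1314_3ea4.pdf", "14-46_bqmc.pdf", "14-7955_aplc.pdf")
for(f in pdfs) download.file(paste0(base_url, f), destfile = f, mode = "wb")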

To begin we load the tm package. The tm package provides functions for scanning in text, converting it to a corpus, and creating a term-document matrix.

install.packages("tm") # only need to do once
library(tm)

Next, create a vector of PDF file names using the list.files function. The pattern argument tells it to grab only those files whose names end with “pdf” (the “$” anchors the match to the end of the file name):

files <- list.files(pattern = "pdf$")

The “files” vector contains all the PDF file names. We’ll use this vector to automate the process of reading in the text of the PDF files.
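Assuming the three downloaded opinions are the only PDF files in the working directory, printing the vector should show their names in alphabetical order:

files

[1] "13-1314_3ea4.pdf" "14-46_bqmc.pdf"   "14-7955_aplc.pdf"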

Now we need to read in the text from the PDF files. The tm package provides a readPDF function, but it doesn’t actually read in the PDF files. The purpose of the function is to create a function that reads in PDF files! If we read the documentation we see it provides support for four different PDF extraction engines. We’ll use the first one documented (also the default): xpdf.

The xpdf engine is available at http://www.foolabs.com/xpdf/download.html. Download the precompiled binaries for your platform (Linux, Windows, or Mac) and extract the files. You should now have a folder called something like “xpdfbin-win-3.04” containing the xpdf programs. To install, copy everything in that folder to an installation directory. For example, on Windows you might copy everything to C:/Program Files/xpdf. In the “xpdf” folder you should have three more folders: bin32, bin64 and doc. The first two contain programs for 32- and 64-bit systems, respectively. The last contains documentation on the programs. The tm package uses two of the programs: pdfinfo.exe and pdftotext.exe. Finally, you may need to update your system path so it points to where the xpdf tools are installed. This differs depending on your system. Here’s a good set of instructions for doing it on a computer running Windows 7: http://geekswithblogs.net/renso/archive/2009/10/21/how-to-set-the-windows-path-in-windows-7.aspx
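Once the path is updated, a quick way to verify from within R that the xpdf tools can be found is the base Sys.which function. An empty result means the directory is not on your path yet.

Sys.which("pdftotext")
Sys.which("pdfinfo")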

With that done, we’re ready to use readPDF to create our function to read in PDF files. We call the function Rpdf, but you can name it whatever you like.

Rpdf <- readPDF(control = list(text = "-layout"))

The readPDF function has a control argument which we use to pass options to our PDF extraction engine. This has to be in the form of a list, so we wrap our options in the list function. There are two control parameters for the xpdf engine: info and text. info passes parameters to pdfinfo.exe and text passes parameters to pdftotext.exe. We only pass one parameter setting to pdftotext: “-layout”. This tells pdftotext.exe to maintain (as best as possible) the original physical layout of the text.
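Because Rpdf is itself a function, we can test the setup by calling it directly on a single file. This sketch assumes tm’s usual reader interface of an element containing a uri, plus a language and an id:

doc <- Rpdf(elem = list(uri = files[1]), language = "en", id = files[1])
head(content(doc)) # peek at the first few lines of extracted text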

Using our Rpdf function we can proceed to read in the text of the opinions. What we want to do is convert the PDF files to text and store them in a corpus, which is basically a database for text. We can do all that with the following code:

opinions <- Corpus(URISource(files), 
                   readerControl = list(reader = Rpdf))

The Corpus function creates a corpus. The first argument to Corpus is what we want to use to create the corpus. In this case, it’s the vector of PDF files. To do this, we use the URISource function to indicate that the files vector is a URI source. URI stands for Uniform Resource Identifier. In other words, we’re telling the Corpus function that the vector of file names identifies our resources. The second argument, readerControl, tells Corpus which reader to use to read in the text from the PDF files. That would be Rpdf, the function we created. The readerControl argument requires a list of control parameters, one of which is reader, so we enter list(reader = Rpdf). Finally we save the result to an object called “opinions”.
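As a quick sanity check, the corpus should contain one document per PDF:

length(opinions) # should be 3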

And we’re done! We read in the PDF files without manually converting them to text. Just what we wanted to do. Now we’re ready to do some text mining on our corpus. For example, we can create a term-document matrix, or TDM for short. A TDM stores counts of terms for each document. The tm package provides a function to create a TDM called TermDocumentMatrix.

opinions.tdm <- TermDocumentMatrix(opinions, control = list(removePunctuation = TRUE,
                                                         stopwords = TRUE,
                                                         tolower = TRUE,
                                                         stemming = TRUE,
                                                         removeNumbers = TRUE,
                                                         bounds = list(global = c(3, Inf)))) 

The first argument is our corpus. The second argument is a list of control parameters. In our example we tell the function to clean up the corpus before creating the TDM: remove punctuation, remove stopwords (e.g., the, of, in), convert text to lower case, stem the words, and remove numbers. The bounds option filters terms by the number of documents they appear in; with global = c(3, Inf) and only three documents, we keep just those terms that appear in all three opinions (which is why the resulting matrix has 0% sparsity). We save the result to an object called “opinions.tdm”.
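For reference, most of these cleaning steps are also available as standalone tm transformations, which can be handy for checking how a given step behaves. A small illustration on a toy string:

x <- "The Court's 76 opinions!"
removePunctuation(x) # "The Courts 76 opinions"
removeNumbers(x)     # "The Court's  opinions!"
tolower(x)           # "the court's 76 opinions!"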

To inspect the TDM and see what it looks like, we can use the inspect function. Below we look at the first 10 terms:

inspect(opinions.tdm[1:10,]) 

<<TermDocumentMatrix (terms: 10, documents: 3)>>
Non-/sparse entries: 30/0
Sparsity           : 0%
Maximal term length: 10
Weighting          : term frequency (tf)

            Docs
Terms        13-1314_3ea4.pdf 14-46_bqmc.pdf 14-7955_aplc.pdf
  abandon                   1              1                8
  abdic                     1              1                1
  absent                    5              2                2
  accept                    6              4               12
  accompani                 1              2                2
  accomplish                4              1                1
  accord                   12             10               13
  account                   1             26                8
  accur                     1              3                1
  achiev                    1             15                3

We see, for example, that the term “abandon” appears in the third PDF file 8 times. Also notice that words have been stemmed. The word “achiev” is the stemmed version of “achieve”, “achieved”, “achieves”, and so on.
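We can see the stemmer in action by calling tm’s stemDocument directly on a character vector (this relies on the SnowballC package being installed):

stemDocument(c("achieve", "achieved", "achieves"))

[1] "achiev" "achiev" "achiev"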

The tm package includes a few functions for summary statistics. We can use the findFreqTerms function to quickly find frequently occurring terms. To find words that occur at least 100 times:

findFreqTerms(opinions.tdm, lowfreq = 100, highfreq = Inf)

 [1] "also"      "amend"     "ant"       "case"      "cite"      "claus"     "congress" 
 [8] "constitut" "cost"      "court"     "decis"     "dissent"   "district"  "effect"   
[15] "elect"     "execut"    "feder"     "find"      "justic"    "law"       "major"    
[22] "make"      "may"       "one"       "opinion"   "petition"  "power"     "reason"   
[29] "requir"    "see"       "state"     "time"      "tion"      "unit"      "use"    

To see the counts of those words we could save the result and use it to subset the TDM, like so:

ft <- findFreqTerms(opinions.tdm, lowfreq = 100, highfreq = Inf)
inspect(opinions.tdm[ft,]) 

<<TermDocumentMatrix (terms: 36, documents: 3)>>
Non-/sparse entries: 108/0
Sparsity           : 0%
Maximal term length: 9
Weighting          : term frequency (tf)

           Docs
Terms       13-1314_3ea4.pdf 14-46_bqmc.pdf 14-7955_aplc.pdf
  also                    24             13               74
  amend                   60              9               84
  ant                     38             36               46
  case                    67             12              109
  cite                    52             27               78
  claus                  123              4                1
  congress                70             43                3
  constitut              190              4               81
  cost                     1            220                8
  court                  197             57              343
  decis                   27             41               33
  dissent                 77             44              125
  district                90              4               81
  effect                  10             26              130
  elect                  178              1                4
  execut                  14              5              290
  feder                   78              8               28
  find                     9             60               54
  justic                  44              7               74
  law                    103             15               30
  major                   83             42                9
  make                    45             41               32
  may                     77             17               48
  one                     53             24               67
  opinion                 87             33              112
  petition                 3             11              126
  power                   98            115                8
  reason                  13             50               42
  requir                  22             40               53
  see                    101             66              182
  state                  530             25              260
  time                    29             19               63
  tion                    56             17               47
  unit                    63             22               38
  use                     37             13              140
  year                    22             19               92

To see the total counts for those words, we could save the matrix and apply the sum function across the rows:

ft.tdm <- inspect(opinions.tdm[ft,])
apply(ft.tdm, 1, sum)

     also     amend       ant      case      cite     claus  congress constitut      cost 
      111       153       120       188       157       128       116       275       229 
    court     decis   dissent  district    effect     elect    execut     feder      find 
      597       101       246       175       166       183       309       114       123 
   justic       law     major      make       may       one   opinion  petition     power 
      125       148       134       118       142       144       232       140       221 
   reason    requir       see     state      time      tion      unit       use      year 
      105       115       349       815       111       120       123       190       133
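One caveat: whether inspect returns the underlying matrix depends on the version of tm. A version-proof alternative is to convert the subsetted TDM explicitly with as.matrix:

ft.tdm <- as.matrix(opinions.tdm[ft,])
apply(ft.tdm, 1, sum)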

Many more analyses are possible. But again, the main point of this tutorial was to show how to read text from PDF files into R for text mining. Hopefully this provides a template to get you started.

Using the pdftools package

We can also use the pdftools package to read in PDF files. The nice thing about the pdftools package is that it should work out-of-the-box. You shouldn’t need to install a separate PDF extraction engine outside of R. Just install the package as you would any other and load it.

install.packages("pdftools")
library(pdftools)

The pdftools function for extracting text is pdf_text. Here is how we could use it to read in the Supreme Court PDF files using the same vector of file names we created above.

opinions2 <- lapply(files, pdf_text)

This creates a list object with three elements, one for each document. Each element is a character vector that contains the text of one PDF file, with one entry per page. The first vector has length 81 because the first PDF file has 81 pages.
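A couple of quick checks confirm the structure:

length(opinions2)         # 3, one element per opinion
sapply(opinions2, length) # pages in each PDF; the first is 81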

We can now create a corpus as follows:

corp <- Corpus(VectorSource(opinions2))

Notice we had to use the VectorSource function. This tells the Corpus function to interpret each element of the opinions2 object as a document.

Now if we proceed to creating the term-document matrix as before and inspect the result we notice something funny:

opinions.tdm2 <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
                                                         stopwords = TRUE,
                                                         tolower = TRUE,
                                                         stemming = TRUE,
                                                         removeNumbers = TRUE,
                                                         bounds = list(global = c(3, Inf)))) 

inspect(opinions.tdm2[1:10,])

<<TermDocumentMatrix (terms: 10, documents: 3)>>
Non-/sparse entries: 30/0
Sparsity           : 0%
Maximal term length: 6
Weighting          : term frequency (tf)

        Docs
Terms     1 2  3
  —————— 26 6 21
  —decid  1 1  1
  “all    1 1  2
  “each   5 1  1
  “in    14 3  5
  “is     3 1  4
  “it     6 3  8
  “not    1 4  6
  “on     1 1  3
  “that   2 1  2

There are words preceded by double quotes and dashes, even though we specified removePunctuation = TRUE. We even see a series of dashes being treated as a word. What happened? It’s hard to say exactly, but apparently the pdf_text function preserved the Unicode curly quotes and em dashes used in the PDF files, whereas the xpdf extraction engine replaced them with standard ASCII quotes and dashes. The Unicode characters survived because punctuation removal in the tm package apparently does not cover certain Unicode characters. This may change in the future. The versions of pdftools and tm used when writing this article were 0.5 and 0.6-2, respectively. (The R version was 3.3.2.)

With some investigation we determined that the Unicode code points for the opening quote, closing quote, and em dash are \u201c, \u201d, and \u2014, respectively. One way we can remove these characters is as follows:

opinions2 <- lapply(opinions2, function(x) gsub("(\u201c|\u201d|\u2014)", "", x))

This applies the gsub function to each element of the list; gsub finds every occurrence of these Unicode characters and replaces it with nothing, effectively deleting it.
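A quick sanity check on a toy string containing curly quotes and an em dash:

gsub("(\u201c|\u201d|\u2014)", "", "\u201call\u201d \u2014decid")

[1] "all decid"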

Now when we create our corpus and TDM we get the same results as before.

corp <- Corpus(VectorSource(opinions2))
opinions.tdm2 <- TermDocumentMatrix(corp, control = list(removePunctuation = TRUE,
                                                         stopwords = TRUE,
                                                         tolower = TRUE,
                                                         stemming = TRUE,
                                                         removeNumbers = TRUE,
                                                         bounds = list(global = c(3, Inf))))

inspect(opinions.tdm2[1:10,])

<<TermDocumentMatrix (terms: 10, documents: 3)>>
Non-/sparse entries: 30/0
Sparsity           : 0%
Maximal term length: 10
Weighting          : term frequency (tf)

            Docs
Terms         1  2  3
  abandon     1  1  8
  abdic       1  1  1
  absent      5  2  2
  accept      6  4 12
  accompani   1  2  2
  accomplish  4  1  1
  accord     12 10 13
  account     1 26  8
  accur       1  3  1
  achiev      1 15  3

However, notice one slight difference: the column headers are numbers instead of the original file names. The column headers are determined by metadata associated with the corpus, specifically the “id” tag. We can change the “id” tag for one document at a time like this:

meta(corp[[1]], tag = "id") <- files[1]

To do it for all documents, we could use a for loop:

for(i in seq_along(corp)){
  meta(corp[[i]], tag = "id") <- files[i]
}
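To confirm the change took effect, we can query the “id” tag of the first document; given the alphabetical order of the files vector, it should now be the first file name:

meta(corp[[1]], tag = "id")

[1] "13-1314_3ea4.pdf"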

Now when we create the TDM we’ll see file names for the column headers instead of numbers.

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
April 14, 2016