Reading PDF files into R for text mining

Let’s say we’re interested in text mining the opinions of the Supreme Court of the United States from the 2014 term. The opinions are published as PDF files at the following web page: http://www.supremecourt.gov/opinions/slipopinion/14. We would probably want to look at all 76 opinions, but for the purposes of this introductory tutorial we’ll just look at the last three of the term: (1) Glossip v. Gross, (2) State Legislature v. Arizona Independent Redistricting Comm’n, and (3) Michigan v. EPA. These are the first three listed on the page. To follow along with this tutorial, download the three opinions by clicking on the name of each case. (If you want to download all the opinions, you may want to look into using a browser extension such as DownThemAll.)
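If you would rather fetch the files from R, something like the following should work. This is only a sketch: the file names match the ones used later in this tutorial, but the URL paths are an assumption about how the slip opinions are linked, so verify them against the actual links on the page.

# Hypothetical direct links; verify against the links on the slip opinion page
urls <- c("http://www.supremecourt.gov/opinions/14pdf/14-7955_aplc.pdf",  # Glossip v. Gross
          "http://www.supremecourt.gov/opinions/14pdf/13-1314_3ea4.pdf",  # State Legislature v. AIRC
          "http://www.supremecourt.gov/opinions/14pdf/14-46_bqmc.pdf")    # Michigan v. EPA
# mode = "wb" ensures the PDFs are written as binary files on Windows
for (u in urls) download.file(u, destfile = basename(u), mode = "wb")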

To begin, we load the pdftools package, which provides functions for extracting text from PDF files.

# install.packages("pdftools")
library(pdftools)

Next, create a vector of PDF file names using the list.files function. The pattern argument says to grab only those files ending with “pdf”:

files <- list.files(pattern = "pdf$")

NOTE: The code above only works if your working directory is set to the folder where you downloaded the PDF files. A quick way to do this in RStudio is Session > Set Working Directory > Choose Directory.
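For example, assuming the PDFs were saved to a folder called “scotus” in your home directory (a hypothetical path; substitute your own), you could set the working directory and build the vector in one step. The stricter pattern below also requires a literal dot before “pdf” and tolerates upper-case extensions:

# "~/scotus" is a hypothetical path; change it to wherever you saved the PDFs
setwd("~/scotus")
files <- list.files(pattern = "\\.pdf$", ignore.case = TRUE)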

The “files” vector contains the three PDF file names:

files

## [1] "13-1314_3ea4.pdf" "14-46_bqmc.pdf"   "14-7955_aplc.pdf"

We’ll use this vector to automate the process of reading in the text of the PDF files.

The pdftools function for extracting text is pdf_text. Using the lapply function, we can apply the pdf_text function to each element in the “files” vector and create an object called “opinions”.

opinions <- lapply(files, pdf_text)

This creates a list object with three elements, one for each document. The length function verifies it contains three elements:

length(opinions)
## [1] 3

Each element is a character vector that contains the text of one PDF file, with one element per page. The length of each vector therefore corresponds to the number of pages in the PDF file. For example, the first vector has length 81 because the first PDF file has 81 pages. We can apply the length function to each element to see this:

lapply(opinions, length) 
## [[1]]
## [1] 81
## 
## [[2]]
## [1] 47
## 
## [[3]]
## [1] 127
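As an optional convenience, we can label the list elements with the file names so that later printouts identify each opinion by file rather than by position:

# Label each element of the list with its source file name
names(opinions) <- files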

And we’re pretty much done! The PDF files are now in R, ready to be cleaned up and analyzed. If you want to see what has been read in, you could enter the following in the console, but it’s going to produce unpleasant blocks of text littered with character escapes such as \r and \n.

opinions
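A gentler way to look at what was read in is to print one page at a time with cat, which renders the escapes as actual line breaks. For example, to view the first page of the first opinion:

# Print the first page of the first opinion with \n and \r rendered as line breaks
cat(opinions[[1]][1])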

When text has been read into R, we typically proceed to some sort of analysis. Here’s a quick demo of what we could do with the tm package (tm stands for “text mining”).

First we load the tm package and then create a corpus, which is basically a database for text. Notice that instead of working with the opinions object we created earlier, we start over.

# install.packages("tm")
library(tm)
corp <- Corpus(URISource(files),
               readerControl = list(reader = readPDF))

The Corpus function creates a corpus. The first argument to Corpus is what we want to use to create the corpus. In this case, it’s the vector of PDF files. To do this, we use the URISource function to indicate that the files vector is a URI source. URI stands for Uniform Resource Identifier. In other words, we’re telling the Corpus function that the vector of file names identifies our resources. The second argument, readerControl, tells Corpus which reader to use to read in the text from the PDF files. That would be readPDF, a tm function. The readerControl argument requires a list of control parameters, one of which is reader, so we enter list(reader = readPDF). Finally we save the result to an object called “corp”.
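A quick sanity check that the corpus contains what we expect never hurts. The exact printout may vary by tm version, but the corpus should hold three documents, each carrying its file name in its metadata:

# The corpus should contain one document per PDF file
length(corp)
# View the metadata of the first document, including its id
meta(corp[[1]])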

It turns out that the readPDF function in the tm package doesn’t read PDF files itself; it creates a function that does. The documentation (see ?readPDF) tells us it uses the pdftools::pdf_text function by default, which is the same function we used above.
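We can see this for ourselves: calling readPDF reads nothing; it builds and returns a reader function, which Corpus then applies to each file. A minimal sketch (the engine argument explicitly requests pdftools, which per the documentation is the default):

# readPDF() does no reading itself; it returns a reader function
pdf_reader <- readPDF(engine = "pdftools")
class(pdf_reader)  # a plain function, ready to hand to Corpus via readerControl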

Now that we have a corpus, we can create a term-document matrix, or TDM for short. A TDM stores counts of terms for each document. The tm package provides a function to create a TDM called TermDocumentMatrix.

opinions.tdm <- TermDocumentMatrix(corp, 
                                   control = 
                                     list(removePunctuation = TRUE,
                                          stopwords = TRUE,
                                          tolower = TRUE,
                                          stemming = TRUE,
                                          removeNumbers = TRUE,
                                          bounds = list(global = c(3, Inf)))) 

The first argument is our corpus. The second argument is a list of control parameters. In our example we tell the function to clean up the corpus before creating the TDM: remove punctuation, remove stopwords (e.g., the, of, in), convert text to lower case, stem the words, and remove numbers. The bounds argument sets lower and upper limits on document frequency; global = c(3, Inf) keeps only terms that appear in at least 3 documents, which with three documents means a term must appear in every opinion to be counted. We save the result to an object called “opinions.tdm”.
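Before inspecting terms, it is worth confirming the shape of the result: rows are terms and columns are documents, so we expect three columns. A quick check:

# Rows are terms, columns are documents; the second number should be 3
dim(opinions.tdm)
nTerms(opinions.tdm)  # number of distinct (stemmed) terms retained
nDocs(opinions.tdm)   # number of documents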

To inspect the TDM and see what it looks like, we can use the inspect function. Below we look at the first 10 terms:

inspect(opinions.tdm[1:10,]) 

## <<TermDocumentMatrix (terms: 10, documents: 3)>>
## Non-/sparse entries: 30/0
## Sparsity           : 0%
## Maximal term length: 6
## Weighting          : term frequency (tf)
## Sample             :
##         Docs
## Terms    13-1314_3ea4.pdf 14-46_bqmc.pdf 14-7955_aplc.pdf
##   ——————               26              6               21
##   —decid                1              1                1
##   “all                  1              1                2
##   “each                 5              1                1
##   “in                  14              3                5
##   “is                   3              1                4
##   “it                   6              3                8
##   “not                  1              4                6
##   “on                   1              1                3
##   “that                 2              1                2

We see words preceded with double quotes and dashes even though we specified removePunctuation = TRUE. We even see a series of dashes being treated as a word. What happened? The pdf_text function preserved the Unicode curly quotes and em-dashes used in the PDF files, and the default punctuation removal only strips standard ASCII punctuation, so those characters slipped through.

One way to take care of this is to apply the removePunctuation function with tm_map; both functions are in the tm package. The removePunctuation function has an argument called ucp that, when set to TRUE, looks for Unicode punctuation. Here’s how we can use it to remove punctuation from the corpus:

corp <- tm_map(corp, removePunctuation, ucp = TRUE)
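If you want to verify the change, search a document’s content for one of the offending characters, such as a left curly quote (Unicode \u201c). After the tm_map step the search should come up empty. A small sketch:

# Does any page of the first document still contain a left curly quote?
# Should return FALSE after the tm_map step above
any(grepl("\u201c", content(corp[[1]])))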

Now we can re-create the TDM, this time without the removePunctuation = TRUE argument.

opinions.tdm <- TermDocumentMatrix(corp, 
                                   control = 
                                     list(stopwords = TRUE,
                                          tolower = TRUE,
                                          stemming = TRUE,
                                          removeNumbers = TRUE,
                                          bounds = list(global = c(3, Inf)))) 

And this appears to have taken care of the punctuation problem:

inspect(opinions.tdm[1:10,]) 

## <<TermDocumentMatrix (terms: 10, documents: 3)>>
## Non-/sparse entries: 30/0
## Sparsity           : 0%
## Maximal term length: 10
## Weighting          : term frequency (tf)
## Sample             :
##             Docs
## Terms        13-1314_3ea4.pdf 14-46_bqmc.pdf 14-7955_aplc.pdf
##   abandon                   1              1                8
##   abdic                     1              1                1
##   absent                    5              2                2
##   accept                    6              4               12
##   accompani                 1              2                2
##   accomplish                4              1                1
##   accord                   12             10               13
##   account                   1             26                8
##   accur                     1              3                1
##   achiev                    1             15                3

We see, for example, that the term “abandon” appears in the third PDF file 8 times. Also notice that words have been stemmed. The word “achiev” is the stemmed version of “achieve”, “achieved”, “achieves”, and so on.
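tm does its stemming via the SnowballC package’s Porter stemmer, so we can preview how any word will be stemmed before it lands in the TDM. For example:

# All three inflections reduce to the stem stored in the TDM
SnowballC::wordStem(c("achieve", "achieved", "achieves"), language = "english")
## [1] "achiev" "achiev" "achiev"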

The tm package includes a few functions for summary statistics. We can use the findFreqTerms function to quickly find frequently occurring terms. To find words that occur at least 100 times:

findFreqTerms(opinions.tdm, lowfreq = 100, highfreq = Inf)

 [1] "also"      "amend"     "ant"       "case"      "cite"      "claus"     "congress" 
 [8] "constitut" "cost"      "court"     "decis"     "dissent"   "district"  "effect"   
[15] "elect"     "execut"    "feder"     "find"      "justic"    "law"       "major"    
[22] "make"      "may"       "one"       "opinion"   "petition"  "power"     "reason"   
[29] "requir"    "see"       "state"     "time"      "tion"      "unit"      "use"    

To see the counts of those words, we can save the result and use it to subset the TDM. Notice we have to use as.matrix to see a printout of the subsetted TDM.

ft <- findFreqTerms(opinions.tdm, lowfreq = 100, highfreq = Inf)
as.matrix(opinions.tdm[ft,]) 

##            Docs
## Terms       13-1314_3ea4.pdf 14-46_bqmc.pdf 14-7955_aplc.pdf
##   also                    24             13               74
##   amend                   57              9               84
##   ant                     38             36               46
##   case                    67             12              109
##   cite                    52             27               78
##   claus                  123              4                1
##   congress                70             43                3
##   constitut              190              4               81
##   cost                     1            220                8
##   court                  197             57              343
##   decis                   27             41               33
##   dissent                 77             44              124
##   district                90              4               81
##   effect                  10             26              130
##   elect                  178              1                4
##   execut                  14              5              290
##   feder                   77              8               28
##   find                     9             60               54
##   justic                  44              7               74
##   law                    102             15               30
##   major                   83             42                9
##   make                    45             41               32
##   may                     77             17               48
##   one                     53             24               67
##   opinion                 87             33              112
##   petition                 3             11              127
##   power                   98            115                8
##   reason                  13             50               42
##   requir                  22             40               52
##   see                    101             66              182
##   state                  529             25              260
##   time                    29             19               63
##   tion                    56             17               47
##   unit                    63             22               38
##   use                     37             13              140
##   year                    22             19               92

To see the total counts for those words, we could save the matrix and apply the sum function across the rows:

ft.tdm <- as.matrix(opinions.tdm[ft,])
sort(apply(ft.tdm, 1, sum), decreasing = TRUE)

##     state     court       see    execut constitut   dissent   opinion 
##       814       597       349       309       275       245       232 
##      cost     power       use      case     elect  district    effect 
##       229       221       190       188       183       175       166 
##      cite     amend       law       one       may  petition     major 
##       157       150       147       144       142       141       134 
##      year     claus    justic      find      unit       ant      tion 
##       133       128       125       123       123       120       120 
##      make  congress    requir     feder      also      time    reason 
##       118       116       114       113       111       111       105 
##     decis 
##       101
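As an aside, converting to a dense matrix with as.matrix is fine for a TDM this small, but it can exhaust memory on a large corpus. The slam package, which tm uses internally to store the TDM as a sparse matrix, provides a memory-friendly alternative; a sketch:

# install.packages("slam")
library(slam)
# Sum term counts across documents without densifying the TDM
sort(row_sums(opinions.tdm[ft,]), decreasing = TRUE)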

Many more analyses are possible. But again, the main point of this tutorial was to show how to read text from PDF files into R for text mining. Hopefully this provides a template to get you started.

For questions or clarifications regarding this article, contact the UVa Library StatLab: statlab@virginia.edu

Clay Ford
Statistical Research Consultant
University of Virginia Library
April 14, 2016