For the basic theory behind these practical steps, go have a look at the main page. This page takes that theory and walks through the details, with the resulting text and CSV files at each stage of the process for illustrative purposes.
For public demonstration, I’m using a sample from Napoleon III’s History of Julius Caesar (because why not; available from Project Gutenberg here). I copied and pasted rather than using the HTML source, since a number of paywalled sites either don’t make it easy to get at the HTML source or are simply faster to copy and paste from.
The Napoleon text in its original form is divided by book, chapter, section and page, with books and chapters clearly labeled and sections labeled only with an uppercase roman numeral followed by a period. Each section is prefixed with a subtitle that needs to be moved. Pages are marked with arabic numerals in curly braces, like so: {9}
For the purpose of illustrating a citation system that isn’t based on page numbers, I’m stripping the page numbers and dividing on a book/chapter/section basis for this example.
The only thing we might want to remove for the purposes of text mining is the footnotes, which appear in square brackets like so: [1]
I used regular expressions in TextWrangler to clean up the NapoleonIIIBook1.txt file and ready it for tokenization. The first two examples have only a search expression; the matched text should be replaced with a blank. The remaining examples have one line for the search expression and one for the replace expression.
Remove bracketed footnotes:
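A minimal pattern, assuming every footnote marker is one or more digits in square brackets (the replace field stays empty):

Search: \[\d+\]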
Remove page numbers
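Much the same idea, assuming page numbers are always digits in curly braces (again replaced with a blank):

Search: \{\d+\}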
Set up section order (note the sections are roman numerals) and move the section title below the section number.
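One possible pair, assuming the subtitle sits on the single line immediately before the roman numeral and the file uses \r line endings (swap in \r\n if your platform needs it):

Search: ^(.+)\r([IVXLCDM]+\.)
Replace: \2\r\1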
A less clean file might require more work at this stage, but ours looks something like this:
At this point, we tokenize the text. I’ve done this using simple regular expressions (except for one terribly long roman-numeral pattern) as a way of demonstrating some of the things that can go wrong (like dashed words and comma-separated numbers).
Search for a word, any punctuation that may follow it, and a space.
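A rough first pass, assuming a token is a run of word characters plus any trailing punctuation, and that each token should land on its own line with its punctuation split off by a tab (note the trailing space at the end of the search expression):

Search: (\w+)([^\w\s]*) 
Replace: \1\t\2\r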
Quotation marks that sit before a word will need some attention.
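One way to handle them, assuming the pass above leaves an opening quote stranded at the start of a token’s line and you want it folded into the punctuation column of the line above:

Search: \r(["“])
Replace: \1\r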
Multiple punctuation marks need to be joined.
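A sketch for that, assuming stray marks end up on lines of their own and belong with the punctuation of the preceding token:

Search: \r([^\w\s]+)\r
Replace: \1\r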
Comma-separated numbers need to be recombined.
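Something like this works, assuming the tokenizing pass left the thousands separator in one token’s punctuation column and the rest of the number on the next line:

Search: (\d)\t,\r(\d)
Replace: \1,\2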
Double tabs show up occasionally, so replace \t\t with \t. Blank lines show up too, so replace \r\t\r with \r (or \r\n\t\r\n with \r\n, depending on your OS).
Dashed words are a problem, so replace \t- with -.
Finally, let’s tackle book/chapter/section notation.
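Books and chapters are already clearly labeled, so only the bare section numerals need converting into something a script can recognize. This is where the terribly long roman-numeral expression mentioned above comes in; a sketch, assuming each section numeral sits alone on its own line followed by a period (the "Section" label is just a convention for the citation script to key on):

Search: ^((?=[MDCLXVI])M*(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3}))\.\r
Replace: Section \1\r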
Our file is more or less tokenized, like this, and can be copied into an Excel file with the appropriate column labels.
Each word needs an ID, so we’ll simply enter a 1 in the first column and then autoincrement (=SUM(A1+1), or simply =A1+1) all the way down. Once these values are filled in, copy the column and then paste-special using only values to cement these numbers in place. Do this first so you can put things back in order easily if necessary.
Word_Clean should ideally be lower-case, so use Excel’s =LOWER() function to draw the value from the Word_Word column (e.g. =LOWER(B1) if Word_Word sits in column B).
Spelling, OCR and other orthographic inconsistencies can create issues with topic modeling in particular, and this step can be invaluable. In this file, there are a few ligatures that could be corrected (æ to ae, œ to oe), along with some accented characters like ô and è. These can be cleaned in OpenRefine or in a text editor, now that the Word_ID column provides a numeric sort key for putting each word in the text back in order.
You can skip straight to the export step if you want. But please don’t. The Word_ID column will put the text back in order, but skipping ahead means lots of extra work later on to get citations. A little work here to add line-by-line citations will save lots of time later.
Give each token a full citation by copying the book/chapter/section values into those columns in Excel. This is not fun in Excel because it requires a lot of manual copying and pasting.
It’s much easier to do this with a script, so if you only venture into really unfamiliar territory once, make it here using a script that loops through the tokenized file line by line and adds citations and word order. The script logic would look something like this:
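Here’s a minimal Python sketch of that logic, assuming one tab-separated token per line and that the cleanup passes above left structural labels on their own lines beginning with BOOK, CHAPTER and Section; the file names and label markers are placeholders to adjust:

import csv

# Placeholder file names; swap in your own.
infile_name = "NapoleonIIIBook1-tokenized.txt"
outfile_name = "NapoleonIIIBook1-cited.csv"

book = chapter = section = ""

with open(infile_name, encoding="utf-8") as infile, \
        open(outfile_name, "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["Word_ID", "Word_Word", "Word_Clean",
                     "Book", "Chapter", "Section", "Word_Order"])
    word_id = 0
    word_order = 0
    for line in infile:
        line = line.strip("\r\n")
        if not line.strip():
            continue
        # Structural labels update the running citation rather than
        # being written out as tokens.
        if line.startswith("BOOK "):
            book = line[len("BOOK "):]
            continue
        if line.startswith("CHAPTER "):
            chapter = line[len("CHAPTER "):]
            continue
        if line.startswith("Section "):
            section = line[len("Section "):]
            word_order = 0  # word order restarts with each section
            continue
        # Everything else is a token line: word, tab, trailing punctuation.
        word = line.split("\t", 1)[0]
        word_id += 1
        word_order += 1
        writer.writerow([word_id, word, word.lower(),
                         book, chapter, section, word_order])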
The end result is a file with full citations and a sortable Word_Order column that adds backup sorting capabilities for the Word_ID column.
The Word_Clean column is now your main reference point. Copy the whole column to a text file, or copy chunks of the column to many text files, and you have minable data, plus the ability to search or reshape your data without losing its original form and citations.