Workflows for Paywalled Texts: The "difficult" database/scripting version
Try it out!
The SQL/scripting version of this process does some pretty cool things, like wild-card searches and contextual output.
Let’s say you’ve just put a bunch of text from Napoleon III’s Life of Julius Caesar into MALLET and identified some interesting vocabulary, including references to “Rome”, “constitution” and “construction”. How do you get from that vocabulary to a citation?
The end result of this process gives you output something like the search results sketched below.
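As a rough sketch, assuming Python and SQLite, with the table and field names taken from the data structure described below: a wildcard search for something like “constitut%” can return every hit with a few words of context and a full citation. The cite_search function, the context window, and the exact output format are my own choices here rather than a fixed part of the workflow.

```python
import sqlite3

def cite_search(db_path, pattern, window=5):
    conn = sqlite3.connect(db_path)
    # Wildcard search: every token whose cleaned form matches the pattern.
    hits = conn.execute("""
        SELECT Word_SourceID, Word_Cite1, Word_Cite2, Word_Cite3, Word_Order
        FROM Word
        WHERE Word_Clean LIKE ?
    """, (pattern,)).fetchall()
    for source_id, c1, c2, c3, order in hits:
        # Contextual output: a few words on either side, from the same section.
        context = conn.execute("""
            SELECT Word_Word FROM Word
            WHERE Word_SourceID = ? AND Word_Cite1 = ? AND Word_Cite2 = ?
              AND Word_Cite3 = ? AND Word_Order BETWEEN ? AND ?
            ORDER BY Word_Order
        """, (source_id, c1, c2, c3, order - window, order + window)).fetchall()
        title = conn.execute("SELECT Source_Title FROM Source WHERE Source_ID = ?",
                             (source_id,)).fetchone()[0]
        # Print the hit in context, followed by its citation.
        print(" ".join(w for (w,) in context), f"({title} {c1}.{c2}.{c3})")
    conn.close()

cite_search("corpus.db", "constitut%")   # "constitution", "constitutional", ...
```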
If steps 1-3 are taken care of, you’ll need a little extra info.
Data structure
The “Word” table
If you’ve read the simpler version, you’ve seen the basics. Three extra fields give you a little more control:
Word_Type: If you’re running named-entity recognition or working with any kind of semantic markup that differentiates between types of word entries, this field gives you a place to record that distinction.
Word_Lemma and Word_SpeechPart: If you’re working with a language your text-mining tools don’t fully support and need roots or lemmas, these fields attach natural-language-processing information to each token in the corpus.
Additionally, you can completely recreate the text by printing Word_Word and Word_Punctuation, separated by spaces and sorted by (in order) Word_SourceID, Word_Cite1, Word_Cite2, Word_Cite3, and Word_Order.
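To make that concrete, here is a minimal sketch of the Word table and the reconstruction query, assuming Python and SQLite (in MySQL you’d use VARCHAR/INT column types instead). The field names follow the description above; the types, comments, and the Word_ID key are assumptions to adapt to your own material.

```python
import sqlite3

conn = sqlite3.connect("corpus.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS Word (
    Word_ID          INTEGER PRIMARY KEY,
    Word_SourceID    INTEGER,   -- points at Source.Source_ID
    Word_Cite1       TEXT,      -- e.g. book
    Word_Cite2       TEXT,      -- e.g. chapter
    Word_Cite3       TEXT,      -- e.g. section or page
    Word_Order       INTEGER,   -- position within the smallest citation unit
    Word_Word        TEXT,      -- the token as it appears in the text
    Word_Clean       TEXT,      -- lower-cased, punctuation-stripped form
    Word_Punctuation TEXT,      -- trailing punctuation, if any
    Word_Type        TEXT,      -- named-entity or other semantic tag
    Word_Lemma       TEXT,      -- dictionary headword
    Word_SpeechPart  TEXT       -- part of speech
);
""")

# Recreate the text: Word_Word plus Word_Punctuation, separated by spaces,
# sorted by source and citation fields, then by Word_Order.
rows = conn.execute("""
    SELECT Word_Word || IFNULL(Word_Punctuation, '')
    FROM Word
    ORDER BY Word_SourceID, Word_Cite1, Word_Cite2, Word_Cite3, Word_Order
""")
print(" ".join(chunk for (chunk,) in rows))
```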
The “Source” table
Tracking sources extends your ability to search and sort the data across sources, which is very difficult to do in the simpler Excel-based version.
Source refers to some discrete boundary in your text: a novel, a play, a saint’s life. Consider how you might cite the text in a print publication, and use that as the smallest division. To make this data description clearer, we’ll use novels as our example of individual sources and page numbers as our example of a citational record.
At minimum, Source contains a numeric ID for each source (Source_ID) and a title (Source_Title). Additional fields allow for more effective subcorpus analysis: period (Source_Period, Source_Century), geography (Source_Geography), genre or other taxonomy (Source_Type), or author (Source_Author). Source_Notes leaves room for additional documentation (a full citation, a URL, etc.).
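A matching sketch of the Source table, again assuming Python and SQLite. The column names come from the description above; the types and the sample subcorpus query at the end (nineteenth-century novels) are purely illustrative.

```python
import sqlite3

conn = sqlite3.connect("corpus.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS Source (
    Source_ID        INTEGER PRIMARY KEY,
    Source_Title     TEXT,
    Source_Author    TEXT,
    Source_Period    TEXT,
    Source_Century   TEXT,
    Source_Geography TEXT,
    Source_Type      TEXT,   -- genre or other taxonomy
    Source_Notes     TEXT    -- full citation, URL, etc.
);
""")

# Subcorpus analysis: pull only the tokens from nineteenth-century novels.
tokens = [w for (w,) in conn.execute("""
    SELECT w.Word_Clean
    FROM Word w JOIN Source s ON w.Word_SourceID = s.Source_ID
    WHERE s.Source_Century = '19' AND s.Source_Type = 'novel'
    ORDER BY w.Word_SourceID, w.Word_Cite1, w.Word_Cite2, w.Word_Cite3, w.Word_Order
""")]
```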
Differences in cleaning, tokenization, and import
The biggest difference between the scripted SQL version and the Excel version is that these three steps can be combined into a single script (sketched after the list below):
Use a script to split the file on Book/Chapter/Section or other citation structure.
Loop through the citation structure at its smallest level, splitting each section into tokens on word divisions (\s).
If your citation structure uses Roman numerals, you can also convert them to Arabic numerals so they sort numerically; I’ve done that here. Depending on the use case, it may instead be helpful to allow varchar input into the three Cite fields but CAST them as unsigned in any SELECT queries so they still sort in numeric order.
Import each token into the database, restarting Word_Order at 1 with each change in the most granular of the Cite fields (e.g. if you’re using Cite3, start Word_Order over at 1 each time Cite3 changes; if you’re only chunking by Cite1 and Cite2, start Word_Order over when Cite2 changes).
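Pulled together, the combined pass might look something like this, assuming Python, SQLite, the Word table sketched above, and a plain-text file whose citation structure is marked with lines like “BOOK II” and “CHAPTER 3”. The marker patterns, the roman() helper, and the cleaning rules are illustrations to adapt to your own text.

```python
import re
import sqlite3

ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman(numeral):
    """Convert a Roman numeral to an Arabic integer so citations sort numerically."""
    total = 0
    for ch, nxt in zip(numeral, list(numeral[1:]) + [None]):
        total += -ROMAN[ch] if nxt and ROMAN[nxt] > ROMAN[ch] else ROMAN[ch]
    return total

conn = sqlite3.connect("corpus.db")
source_id = 1            # the Source_ID assigned to this text
cite1 = cite2 = None     # book, chapter
order = 0                # Word_Order within the current chapter

for line in open("caesar.txt", encoding="utf-8"):
    book = re.match(r"^BOOK\s+([IVXLCDM]+)", line)
    chapter = re.match(r"^CHAPTER\s+(\d+)", line)
    if book:
        cite1 = roman(book.group(1))
        continue
    if chapter:
        # A change in the most granular Cite field: restart Word_Order at 1.
        cite2, order = int(chapter.group(1)), 0
        continue
    # Split each section into tokens on word divisions (\s).
    for token in re.split(r"\s+", line.strip()):
        if not token:
            continue
        order += 1
        stripped = token.rstrip(".,;:!?\"')")
        conn.execute(
            """INSERT INTO Word (Word_SourceID, Word_Cite1, Word_Cite2,
                                 Word_Order, Word_Word, Word_Clean, Word_Punctuation)
               VALUES (?, ?, ?, ?, ?, ?, ?)""",
            (source_id, cite1, cite2, order, token,
             re.sub(r"\W", "", token).lower(), token[len(stripped):] or None))

conn.commit()
```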
Exporting
An export script with a few options will give you better control over what gets sent to your text-mining software; a sketch follows the questions below.
Which texts do you want? All of them? Just those in a particular genre or time period?
Which field do you want? Word_Word, the original, or Word_Clean, with carefully cleaned, lower-case data?
Do you want tokenized line-by-line output or paragraph style output with punctuation?
Do you want files chunked by author? Text? By one or more of the Word_Cite fields?
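One way such an export script might look, assuming Python, the SQLite database sketched earlier, and plain-text output files of the kind MALLET expects; the export() function, its arguments, and the one-file-per-chunk layout are assumptions, not a prescribed interface.

```python
import os
import sqlite3

def export(db_path, out_dir, period=None, genre=None,
           field="Word_Clean", tokenized=True, chunk_by="Source"):
    os.makedirs(out_dir, exist_ok=True)
    conn = sqlite3.connect(db_path)

    # Which texts? Build a WHERE clause from whatever filters were passed in.
    where, params = [], []
    if period:
        where.append("s.Source_Period = ?"); params.append(period)
    if genre:
        where.append("s.Source_Type = ?"); params.append(genre)
    where_clause = "WHERE " + " AND ".join(where) if where else ""

    # Which field? {field} is Word_Word (original) or Word_Clean (cleaned).
    # CAST keeps varchar Cite fields in numeric order (MySQL: CAST(... AS UNSIGNED)).
    sql = f"""
        SELECT s.Source_Author, s.Source_Title, w.Word_Cite1,
               w.{field}, IFNULL(w.Word_Punctuation, '')
        FROM Word w JOIN Source s ON w.Word_SourceID = s.Source_ID
        {where_clause}
        ORDER BY w.Word_SourceID, CAST(w.Word_Cite1 AS INTEGER),
                 CAST(w.Word_Cite2 AS INTEGER), CAST(w.Word_Cite3 AS INTEGER),
                 w.Word_Order
    """

    files = {}
    for author, title, cite1, word, punct in conn.execute(sql, params):
        # How should files be chunked? By author, by text, or by Cite1 unit.
        key = {"Author": author, "Source": title}.get(chunk_by, f"{title}_{cite1}")
        if key not in files:
            files[key] = open(os.path.join(out_dir, f"{key}.txt"), "w", encoding="utf-8")
        if tokenized:
            files[key].write(word + "\n")          # one token per line, no punctuation
        else:
            files[key].write(word + punct + " ")   # paragraph style, punctuation kept
    for handle in files.values():
        handle.close()

# e.g. export("corpus.db", "mallet_input", genre="novel", field="Word_Word")
```

From there, the output directory can go straight into your text-mining tool (for MALLET, via import-dir).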