The world of text mining (particularly low-barrier-to-entry topic modeling with MALLET and work with AntConc or NLTK) opens up a whole variety of analytical options for scholars interested in pursuing distant reading. In turn, distant reading projects are often based on open-data collections, a phrase that conjures up visions of herds of information roaming free across the digital plains waiting to be corralled by avid scholars. It’s easy to do full-text analysis on these open corpora with unrestricted copyrights. Just download the full corpus, and off you go.
The reality of the data landscape for digital humanists is much more complicated. Here and there, a few roaming hand-transcribed sources that make up comparatively small collections pop their heads up, but many of the largest, most accurate collections of digitized and transcribed texts are fenced in: ProQuest, Early English Books Online, or (selfishly) the Brepols database of medieval critical editions. The prevalence of copyrighted digitized texts, many with very restrictive copyright and usage guidelines that limit reproduction, can also limit applications of text mining, hGIS and network-theoretical approaches.
Let’s say you’ve just put a bunch of text from Napoleon III’s Life of Julius Caesar into MALLET and identified some interesting vocabulary, including references to “Rome”, “constitution” and “construction”. How do you get from that vocabulary to a citation?
The end result of this process gives you output that looks something like this:
Want to search for something custom? This assumes a wild-card search, so “rom” will match “Rome”, “Romans”, “romanization”, etc.
These links demonstrate the more complicated version using SQL and scripting, but the process is also documented for scholars working mostly with Excel and basic text-editing skills.
As a historian, I gravitate toward big questions that require lots of sources and the integration of two discrete skillsets developed over 10 years in industry and 10 years in the academy. Simply put, I’m a better historian with the digital than without it.
My current question is about medieval conflict resolution undertaken in informal settings, and I’m approaching the question by looking at how textual authority is constructed and then used to bolster real-world authority. For an example of this, have a look at a recent conference paper, “Between Miracles and Memory: Min(d)ing the gap in construction of authority in early medieval episcopal saint’s lives and deeds of bishops” and the list of citations I generated from paywalled data using the process detailed here.
Even more narrowly, I want to understand how divine agency works as it moves from textual account to real-world conflict resolution. How does divine, saintly or otherworldly intervention help the subjects of these biographies, and their successors, as they remember, replicate, reinforce and restructure their own agency as they seek to resolve conflict in the real world? How do these patterns change over time? By role? By geographic context?
To answer these questions, I’m looking at medieval biography–saints’ lives, deeds of bishops, biographies of kings–to understand how informal conflict resolution worked outside the boundaries of formal legal or sanctioned military conflict.
That all adds up to a giant text-mining project. Because the boundaries of text mining are fairly well established, the parameters are fairly simple. I need a corpus of medieval biographies sorted by time period, genre, geography, and author, and then prepped for topic modeling, corpus linguistics and a little semantic analysis based on part-of-speech tagging. On the face of it, that doesn’t seem all that difficult.
Until you consider that most transcribed critical editions of medieval sources are paywalled. And all of them are in Latin.
With copyrighted, paywalled corpora,[1] built-in full-text download and off-the-shelf analysis are often not an option. As such, in-text citations become an absolutely vital part of the digital analysis. However, readily available topic modeling tools like MALLET strip the citation data that scholars, digital and analog alike, need to participate in a scholarly debate.
This “how did they make that” project describes the basics of a workflow that bridges the gap between open-data ideals and paywalled sources. It helps scholars working with restricted text by providing a way to maintain intact word-by-word citation information in a reasonably simple format (though there are more complex versions of this process out there). This process preserves citations in a way that accommodates the copyright restrictions of providers of paywalled data while still providing the results in broadly reproducible form.
The data-management process starts with data scraping, clean-up and import approaches that provide individual scholars with private corpora that maintain intact word-by-word citation information. The resulting database can then be used for text mining in analytical tools that still maintain a tie between distant-reading analysis and the original citations for the germane word (or words) of interest.[2]
If you’re using any paywalled data for a digital history project, you’ll need to provide word-by-word citations.
It’s also helpful if you have bad OCR that needs some manual cleanup (for instance, alphabetical sorting) before you put it back together.
Finally, it’s just good practice to keep your citations in place, so even scholars using open data might benefit from a similar process.
At a high level, the workflow is basic, as is the data description.
At its most basic, a single table called “Word” contains the fields necessary to preserve citations. Each word will be its own row or record. This flat file format can live in Excel (slow but workable) or be imported into an SQL platform.
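The single-table layout described above can be sketched in SQLite through Python's built-in `sqlite3` module. This is a minimal illustration under stated assumptions, not the exact schema used in the project: `Cite1`, `Cite2`, `Cite3`, `Word_Word` and `Word_Clean` are names that appear in this write-up, while `Word_ID`, `Source` and `Word_Order`, along with the meaning given to `Word_Clean`, are hypothetical additions. The sample rows are invented.

```python
import sqlite3

# In-memory database for illustration; pass a file path instead to keep a private corpus on disk.
conn = sqlite3.connect(":memory:")

conn.execute("""
    CREATE TABLE Word (
        Word_ID    INTEGER PRIMARY KEY,  -- assumed: running token number
        Source     TEXT NOT NULL,        -- assumed: short title of the source text
        Cite1      TEXT,                 -- first citation division, e.g. book
        Cite2      TEXT,                 -- second citation division, e.g. chapter
        Cite3      TEXT,                 -- third citation division, e.g. line or page
        Word_Order INTEGER NOT NULL,     -- assumed: position of the token within its division
        Word_Word  TEXT NOT NULL,        -- the token as it appears in the source
        Word_Clean TEXT NOT NULL         -- assumed: lowercased, punctuation-free form
    )
""")

# Invented sample rows, not drawn from any real edition.
sample_rows = [
    ("Sample Source", "1", "2", "", 1, "Rome,", "rome"),
    ("Sample Source", "1", "2", "", 2, "then,", "then"),
    ("Sample Source", "1", "2", "", 3, "prospered.", "prospered"),
]
conn.executemany(
    "INSERT INTO Word (Source, Cite1, Cite2, Cite3, Word_Order, Word_Word, Word_Clean) "
    "VALUES (?, ?, ?, ?, ?, ?, ?)",
    sample_rows,
)
conn.commit()

columns = [row[1] for row in conn.execute("PRAGMA table_info(Word)")]
token_count = conn.execute("SELECT COUNT(*) FROM Word").fetchone()[0]
print(columns)
print(token_count)
```

The same columns work as spreadsheet headers for the Excel version; the database simply makes sorting and searching faster as the corpus grows.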
Consider the interaction between on- and off-line citation hunting: if your discipline requires page numbers in a novel but will accept book/chapter notation, it’s much easier to use book/chapter notation, which generally travels between editions of a text, rather than page number, which is edition specific.
Make sure there are headers or footers dividing the text of each page into your chosen divisions and assign these divisions to Cite1, Cite2 and Cite3. Consistency within each source is necessary, but it is possible to mix notational styles from source to source (page number vs book->chapter->line) within a single corpus because the sorting process described here always uses multiple fields to sort on. Just don’t mix notational styles within a source, or getting all the tokenized words back into order will be difficult.
Automated web scraping (a good series here) is less time-consuming than manual collection and makes it easier to maintain the header tags that will pass citation information to the database during the cleaning process.
In practical terms, there are often limitations for paywalled data that mean wget or other automated scraping methods for data gathering give way to manual copy and paste. In these cases, it’s fairly easy to maintain discrete citation information.
Any OCR cleaning that works at a high level–full-text search and replace–works best here.
As with any data-management process, acquisition and cleaning are the most time-consuming parts of the project. Ultimately, however, there are two considerations for web scraping:
If you’ve identified book-chapter-page number division in a source, make sure the headers or footers are consistent enough within a single source that you can search for those headers/footers. Using these consistent data-structure divisions, it’s fairly easy to see chunks in the text.
Once an individual source is clean enough–that is, it’s a reasonably accurate text file with occasional clear markers for citation divisions–we import. The import process should do three things:
If you have a good text editor and some basic skill in regular expressions, it’s fairly easy to combine chunking, tokenizing, and importing.
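For scholars who would rather script this step than do it in a text editor, here is one possible sketch of chunking, tokenizing and importing in a single pass. It assumes a hypothetical marker convention, a line such as `[[1.2]]` meaning book 1, chapter 2, left in the cleaned text; real sources will need their own pattern. The `Source` and `Word_Order` columns, the `rows_from_text` helper and the sample text are likewise invented for illustration.

```python
import re
import sqlite3

# Hypothetical marker convention: a line such as "[[1.2]]" opens book 1, chapter 2.
MARKER = re.compile(r"^\[\[(\d+)\.(\d+)\]\]$")
TOKEN = re.compile(r"\S+")  # whitespace-delimited tokens

def rows_from_text(source, text):
    """Yield one (Source, Cite1, Cite2, Word_Order, Word_Word, Word_Clean) row per token."""
    cite1 = cite2 = ""
    order = 0
    for line in text.splitlines():
        marker = MARKER.match(line.strip())
        if marker:
            # Chunking: a new citation division starts, so restart the word counter.
            cite1, cite2 = marker.groups()
            order = 0
            continue
        for token in TOKEN.findall(line):
            # Assumed cleaning rule: drop non-word characters and lowercase.
            clean = re.sub(r"\W", "", token).lower()
            if not clean:
                continue  # skip tokens that are pure punctuation
            order += 1
            yield (source, cite1, cite2, order, token, clean)

# Invented sample text, not taken from any paywalled edition.
sample = """[[1.1]]
Rome was not built in a day.
[[1.2]]
The Romans wrote a constitution.
"""

rows = list(rows_from_text("Sample", sample))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Word (Source TEXT, Cite1 TEXT, Cite2 TEXT, "
    "Word_Order INTEGER, Word_Word TEXT, Word_Clean TEXT)"
)
conn.executemany("INSERT INTO Word VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()

print(rows[0])
print(len(rows))
```

Restarting the counter at each marker keeps every token addressable by its citation division plus its position, which is what lets the text be put back in order later.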
Export the Word_Word column or the Word_Clean column, depending on your needs, into a text file so that each token is on a single line, and voilà, you have tokenized text for use in MALLET, AntConc, Voyant, etc.
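A rough sketch of that export, again using `sqlite3`: the query sorts on several fields at once so that tokens come out in reading order, then writes one token per line. The `Source` and `Word_Order` columns, the `export_tokens` helper and the sample rows are assumptions made for this example, and the `CAST` calls assume numeric citation divisions (page-only or non-numeric notation would need a different sort).

```python
import os
import sqlite3
import tempfile

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Word (Source TEXT, Cite1 TEXT, Cite2 TEXT, Cite3 TEXT, "
    "Word_Order INTEGER, Word_Word TEXT, Word_Clean TEXT)"
)
# Invented rows, deliberately inserted out of reading order.
conn.executemany("INSERT INTO Word VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("Sample", "1", "10", "", 1, "Senate.", "senate"),
    ("Sample", "1", "2", "", 2, "Rome", "rome"),
    ("Sample", "1", "2", "", 1, "In", "in"),
])

def export_tokens(conn, path, column="Word_Clean"):
    """Write one token per line, sorted back into reading order."""
    if column not in ("Word_Word", "Word_Clean"):
        raise ValueError("column must be Word_Word or Word_Clean")
    # Casting to INTEGER keeps chapter 2 ahead of chapter 10; plain text sorting would not.
    query = (
        f"SELECT {column} FROM Word "
        "ORDER BY Source, CAST(Cite1 AS INTEGER), CAST(Cite2 AS INTEGER), "
        "CAST(Cite3 AS INTEGER), Word_Order"
    )
    with open(path, "w", encoding="utf-8") as out:
        for (token,) in conn.execute(query):
            out.write(token + "\n")

out_path = os.path.join(tempfile.mkdtemp(), "tokens.txt")
export_tokens(conn, out_path)

with open(out_path, encoding="utf-8") as handle:
    exported = handle.read().splitlines()
print(exported)
```

Because only the token column leaves the database, the exported file carries no citation data; the tie back to the source lives in the table.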
Once you have some analytical results from the text mining and can track the words or phrases that are useful, the original document becomes a searchable database that corroborates the text mining. It’s also possible to sort results alphabetically, find entries for the word or words you’re interested in and then copy and paste the list of citations for the appearance of that word.
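In a database, that lookup might be sketched as follows. The query treats the search term as a prefix, so “rom” finds “Rome” and “Romans”, and returns each match with a citation assembled from the Cite fields. The `find_citations` helper, the `Source` and `Word_Order` columns, the dotted citation format and the sample rows are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Word (Source TEXT, Cite1 TEXT, Cite2 TEXT, Cite3 TEXT, "
    "Word_Order INTEGER, Word_Word TEXT, Word_Clean TEXT)"
)
# Invented sample rows.
conn.executemany("INSERT INTO Word VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("Sample", "1", "1", "", 1, "Rome", "rome"),
    ("Sample", "1", "1", "", 2, "prospered.", "prospered"),
    ("Sample", "2", "3", "", 1, "Romans", "romans"),
    ("Sample", "2", "3", "", 2, "built", "built"),
])

def find_citations(conn, term):
    """Return (word, citation) pairs for every token whose clean form starts with term."""
    pattern = term.lower() + "%"  # trailing % makes this a prefix (wild-card) match
    query = (
        "SELECT Word_Word, Source, Cite1, Cite2, Cite3 FROM Word "
        "WHERE Word_Clean LIKE ? "
        "ORDER BY Source, CAST(Cite1 AS INTEGER), CAST(Cite2 AS INTEGER), "
        "CAST(Cite3 AS INTEGER), Word_Order"
    )
    hits = []
    for word, source, cite1, cite2, cite3 in conn.execute(query, (pattern,)):
        # Join only the citation divisions this source actually uses.
        location = ".".join(part for part in (cite1, cite2, cite3) if part)
        hits.append((word, source + " " + location))
    return hits

hits = find_citations(conn, "rom")
for word, citation in hits:
    print(word, citation)
```

The resulting list can be pasted into a footnote or checked against the original edition, which is the point of keeping the citation fields attached to every token.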
There are two versions.
The simplest version requires:
Scholars with more technical skill can parlay the theory behind this into a more complex version that involves
This version also provides better support for scholars hoping to apply text mining and natural-language processing analysis to undersupported languages.
None and none. The initial forays into data like this on a smaller scale can be done on a laptop with Excel. NB: It does help to have institutional affiliation in order to get access to these very expensive paywalled sources.
At the interim stage–the dangerous stage–a laptop equipped with a Bitnami stack (mine is PHP for a quick and dirty web search) provides access to MySQL, a scripting language and basic file-write capabilities.
When I reached the million-token threshold, I did need additional processing power, which is provided via supercomputing access at my current institution. However, that need stems from a combination of requirements. I’m dealing both with paywalled data and with data in badly supported languages, so I have natural-language-processing information in several additional tables that document relationships between words in the database for network analysis of grammar. These networks of words get very demanding on the processing side.
[1] Or corpora in less-well-supported languages. The process described here, plus one additional step, makes it easier to tackle text mining in languages without the support that English and other western-Roman-character-set languages have in spades. Occasional footnotes provide a basic explanation, but the entire process is also documented explicitly for scholars working in less-well-supported languages (paywalled or not) at http://www.kalanicraig.com/workflow/workflow-for-unsupported-languages-addendum/.
[2] For scholars working with undersupported languages, an additional field, “root”, provides the ability to reconstruct a lemmatized corpus created from the paywalled data for use in MALLET, AntConc or a linguistic network analysis, again with the original citations left intact.
[3] The lemmatization step for unsupported languages happens between tokenization and export.
Kalani Craig, 2025