The SQL/scripting version of this process does some pretty cool things, like wild-card searches and contextual output.
Let’s say you’ve just put a bunch of text from Napoleon III’s Life of Julius Caesar into MALLET and identified some interesting vocabulary, including references to “Rome”, “constitution” and “construction”. How do you get from that vocabulary to a citation?
The end result of this process gives you output that looks something like this:
Want to search for something custom? This assumes a wild-card search, so “rom” will match “Rome”, “Romans”, “romanization”, etc.
The harder version assumes several things:
If steps 1-3 are taken care of, you’ll need a little extra info.
If you’ve read the simpler version, you’ve seen the basics. Three extra fields give you a little more control:
Additionally, you can use these fields to completely recreate the text by printing Word_Word and Word_Punctuation, divided by spaces and sorting by (in order) Word_SourceID, Word_Cite1, Word_Cite2, Word_Cite3, Word_Order.
CREATE TABLE `Word` (
`Word_ID` int(11) NOT NULL AUTO_INCREMENT,
`Word_SourceID` int(11) NOT NULL,
`Word_Cite1` varchar(50) NOT NULL,
`Word_Cite2` varchar(25) NOT NULL,
`Word_Cite3` varchar(25) DEFAULT NULL,
`Word_Word` varchar(50) DEFAULT NULL,
`Word_Clean` varchar(50) DEFAULT NULL,
`Word_Punctuation` varchar(25) NOT NULL DEFAULT '',
`Word_Order` int(20) DEFAULT NULL,
`Word_Type` varchar(25) DEFAULT NULL,
`Word_Lemma` varchar(50) DEFAULT NULL,
`Word_SpeechPart` varchar(50) DEFAULT NULL,
PRIMARY KEY (`Word_ID`)
)
Tracking sources extends your ability to search and sort the data across sources, which is very difficult to do in the simpler Excel-based version.
Source refers to some discrete boundary in your text: a novel, a play, a saint’s life. Consider how you might cite a text in a print publication, and use that as the smallest division. To make this data description more clear, we’ll use novels as an example of individual sources and page numbers as our example of a citational record.
At minimum, Source contains a numeric ID for each source (Source_ID) and a title (Source_Title). Variations allow for more effective subcorpus analysis. Examples here include by period (Source_Period, Source_Century), geography (Source_Geography), genre or other taxonomy (Source_Type), or Author (Source_Author). Source_Notes allows for additional documentation (a full citation, a URL, etc.).
CREATE TABLE `Source` (
`Source_ID` int(11) unsigned NOT NULL AUTO\_INCREMENT,
`Source_Title` text,
`Source_Author` varchar(250) DEFAULT NULL,
`Source_Century` int(11) DEFAULT NULL,
`Source_Period` enum('Time Period 1', 'Time Period 2', 'Time Period 3') DEFAULT NULL,
`Source_Type` enum('Genre 1', 'Genre 2', 'Genre 3') DEFAULT NULL,
`Source_Note` text,
PRIMARY KEY (`Source_ID`)
)
The biggest difference between a scripted SQL version and an Excel version is that these three steps can be combined.
An output script with a few options will give you better control over the output to text mining software.
This site built with Foundation 6. Kalani Craig, 2025