Corpus linguistics is an adaptation of close-reading techniques to computationally-sized corpora. Text clustering (things like topic modeling and word embedding) uses distributional statistics and matrix algebra to cluster chunks of a corpus together in a variety of ways. Fancier algorithms in the text-clustering world underlie the LLMs (Large Language Models) that power artificial intelligence like ChatGPT. Don’t worry, we’ve got non-numbery readings to help you make sense of these.
Reading: Our independent reading will focus on topic modeling because it’s the easiest of the clustering approaches to learn and has the most GUI-based point-and-click tools available. Reading about topic modeling will also help prepare you to think about some of the more complex approaches to text clustering. See Week 5 Reading and Discussion for a guide that makes topic modeling slightly less opaque.
Lab: We’ll work with Google Co-Lab, the online “programming notebook” approach we used in Week 3 to split one file into many files. You’ll have one challenge at the beginning, which is to adapt my Google Co-Lab notebook’s approach to storing files in Google Drive to a Google Co-Lab notebook that has a host of topic modeling approaches built in. We also have 2 intermediate and advanced labs from The Programming Historian available for your use. See Week 5 Lab: Google Co-Lab approaches to topic modeling for a full at-home walkthrough. This week’s lab is designed for you to explore at home and troubleshoot/discuss in class. Note that the reading is shorter to make time for that.
Collaborative data management: Collaborative Data Week 5 NAME REDACTED
Theory and Methods Reading: Ted Underwood, “Topic modeling made just simple enough”, https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/. We will do a hands-on sticky-note version of the bucket analogy in class.
Exemplar Reading: Craig, Diaz, Kloster, “The Coded Language of Empire: Digital History, Archival Deep Dives, and the Imperial United States in Cuba’s Third War of Independence”, The American Historical Review, 2024, 129(2), 474–516, https://doi.org/10.1093/ahr/rhae179. The article is also in our Canvas Files as rhae179_Craig.pdf so you don’t have to have an AHR subscription.
Discussion:
Further resources:
This week, we’ll go back to the Google Co-Lab environment we looked at in Week 3, but this time, we’ll actively use a Google Co-Lab notebook to run some topic models.
The most common way to create a Co-Lab notebook is to copy one that already exists.
You’ll use the one I’ve modified and debugged.
I’m adapting a notebook by Jonathan Soma (https://github.com/jsoma/).
However, there are some code chunks that don’t work, so I’ve done a good bit of debugging to make a useful version of the tutorial package that follows. (The code chunks don’t work because Soma’s notebook was written with a version of Python that has since been updated, and a few of the references to code libraries in it have been renamed. File this under “caveats and regrets.”)
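To give you a flavor of what that debugging looks like, here is my own illustration, not code from Soma’s notebook; the specific libraries named below are just examples of the kind of rename involved.

```python
# Illustration only: the actual renamed libraries in Soma's notebook may differ.

# Fix #1: pin the older library version the notebook was written against.
# (Run in its own Colab cell; the leading "!" hands the line to the shell.)
#   !pip install "gensim==3.8.3"

# Fix #2: update the code to use the new name. A real-world example of this
# kind of rename: scikit-learn moved train_test_split to a new module in v0.20.
try:
    from sklearn.model_selection import train_test_split   # current location
except ImportError:
    from sklearn.cross_validation import train_test_split  # pre-0.20 location
```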
Google Co-Lab notebooks have different “sections”:
executable sections: things with code in them that you can “play”. Executable sections use a different font and a grey background.
Something in a notebook that looks like old-school typewriter text is usually executable code.
documentation sections: text-only sections that explain what’s happening in the executable sections. These documentation sections use a sans-serif font like the one you’re reading now.
These sections stack on top of each other and let you scroll through a “program”, or a series of executable sections that does things. Below, we have two documentation sections (“Using topic modeling to extract topics from documents” and “Prep work: Downloading necessary files”) and an executable section with the header “# Make data directory if it doesn’t exist.”
Each executable section has a “play” button.
The square brackets “[ ]” next to the “# Make data directory if it doesn’t exist.” line turn into a PLAY button when you mouse over the space between the brackets.
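For reference, a cell with that header usually contains something like the sketch below. This is my reconstruction of the typical pattern, not necessarily the exact code in the notebook:

```python
# Make data directory if it doesn't exist
import os

os.makedirs("data", exist_ok=True)  # create ./data, but don't complain if it already exists
```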
NOTA BENE: Ignore the blue buttons that show up after you play each executable section. These are new additions from Google’s AI; they’re not part of the original notebook, which has very thoughtful, detailed explanations of its own.
Now we get to use the contents of the tutorial notebook itself. This notebook is spectacular in its explanation of topic modeling. It’s also funny. (It actually contains the sentence “Let’s be honest with ourselves: we expected something a bit better.”)
Can you run these topic models on your own corpus? Why, yes! Yes, you can. If you recall the Google Co-Lab notebook I used to split one file into many files in Week 3, I had an executable section that connected to, and then loaded text files from, Google Drive.
We can use the basic principles there to Frankenstein together our own corpus with the “Attempt Two” section of the topic-modeling tutorial notebook.
REDACTED
This uses a folder of .txt files on Google Drive to run the topic models, but it’s missing some variables that make the change-over-time stream graph in the original work properly.
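If you want to see the whole pattern in one place, here’s a minimal sketch: mount Google Drive, read every .txt file in a folder into a list, and hand that list to a topic model. The folder path and the number of topics are placeholders I made up, and I’m using scikit-learn’s LDA here for brevity; the tutorial notebook’s own “Attempt Two” section is the version you should actually adapt.

```python
# Sketch only: the folder path, topic count, and preprocessing are placeholders.
import glob
from google.colab import drive
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# 1. Mount Google Drive (Colab will ask you to authorize access).
drive.mount("/content/drive")

# 2. Read every .txt file in a (hypothetical) corpus folder into a list of strings.
paths = sorted(glob.glob("/content/drive/MyDrive/my-corpus/*.txt"))
documents = [open(p, encoding="utf-8").read() for p in paths]

# 3. Turn the documents into a word-count matrix and fit a topic model.
vectorizer = CountVectorizer(stop_words="english", max_df=0.9, min_df=2)
counts = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)

# 4. Print the top ten words in each topic.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-10:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```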
Word embedding offers a different view of statistical distribution. It uses something like network analysis (basically matrix algebra, which I can explain briefly in class if someone asks) to put words into a many-dimensional vector space, which usually gets squashed down to 2D or 3D so we can look at it.
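Here’s a tiny, hand-waving sketch of that idea using gensim’s Word2Vec. The three-sentence “corpus” is made up and far too small to produce meaningful vectors; it’s only there to show the mechanics.

```python
# Illustration only: a made-up, tiny corpus; real embeddings need much more text.
from gensim.models import Word2Vec

sentences = [
    ["the", "bishop", "founded", "the", "monastery"],
    ["the", "monk", "lived", "in", "the", "monastery"],
    ["the", "bishop", "governed", "the", "diocese"],
]

# Each word becomes a 50-dimensional vector; words that appear in similar
# contexts end up near each other in that space.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)

# With a corpus this small the output is basically noise, but this is the call
# you'd use to ask "which words live near 'bishop'?"
print(model.wv.most_similar("bishop", topn=3))
```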
If the mathy distribution description doesn’t make sense, how do we teach people how topic modeling works with sticky notes?
As with our word-barf proposal, I’m not looking for perfection at this point in the semester. Instead, we’re going to think about the second stage of project development: 1-pagers.
How do you get a whole project plan onto one page? The answer: a table.
Format your assignment like this:
Sub-question | Primary-source research needs | Data cleaning needs | Analysis methods | Sample historiography |
---|---|---|---|---|
Here, you write a question that is a smaller, more manageable section of your big research question in 1 sentence. This is the only prose in the table. (e.g. “How do episcopal saints’ lives in 500-700 differ in their focus on bishops vs monks between vitae and gesta?”) | Here, you bullet-point the specific archives/primary sources that will contribute to evidence for this subargument. Try to make document-number or temporal estimates here (e.g. “estimated 150 documents from BNF” or “estimated 250 chapters of 2-3 pages of episcopal gesta from between 500 and 700 CE”) | Here, you bullet-point the data cleaning you’ll need to do to the sources from the archival research. Try to make time estimates here (e.g. “sources need to be transcribed; 20 minutes per source IDed” or “sources are computer-readable but need to be hand-cleaned; 3 minutes per document”) | Here, you bullet-point how you will segment the data, what methods you will use to analyze it, and how each segment of data treated with that specific method will provide evidence to help you answer your question | Here, you provide the top 3-5 secondary-reading or DH-project citations that your subquestion responds to or is in conversation with. No annotations. That happens in a separate document. |
Each subquestion should have its own line in the table. Some subquestions might have overlapping sources (e.g. sources that answer my question about genre differences between 500-700 might also overlap with sources that help me compare gesta b/w 500-700 and gesta b/w 700-1000).
Please use OneDrive or Google Drive to create whatever documents you want to share with me. Make sure you’ve shared with craigkl@iu.edu, and then submit the sharing link.
Draft assignments will always be due before class. We’ll spend 20 minutes in class doing guided revisions, and then you’ll have another 24-36 hours to revise what’s in the Google doc you initially shared for your submission. I’ll start reading and commenting on Wednesday evening or Thursday morning.
Due at: Sep 23, 2024 at 12am
Grading Type: Points
Points: 0.0
Submitting: Online URL
NB: This was claimed by a student and we worked on file renaming.