Corpus linguistics is an adaptation of close-reading techniques to computationally-sized corpora. Text clustering (things like topic modeling and word embedding) uses distributional statistics and matrix algebra to cluster chunks of a corpus together in a variety of ways. Fancier algorithms in the text-clustering world underlie the LLMs (Large Language Models) that power artificial intelligence like ChatGPT. Don’t worry, we’ve got non-numbery readings to help you make sense of these.
Reading: Our independent reading will focus on topic modeling because it’s the easiest of the clustering approaches to learn and has the most GUI-based point-and-click tools available. Reading about topic modeling will also help prepare you to think about some of the more complex approaches to text clustering. See Week 5 Reading and Discussion for a guide that makes topic modeling slightly less opaque.
Lab: We’ll work with Google Co-Lab, the online “programming notebook” approach we used in Week 3 to split one file into many files. You’ll have one challenge at the beginning, which is to adapt my Google Co-Lab notebook’s approach to storing files in Google Drive to a Google Co-Lab notebook that has a host of topic modeling approaches built in. We also have 2 intermediate and advanced labs from The Programming Historian available for your use. See Week 5 Lab: Google Co-Lab approaches to topic modeling for a full at-home walkthrough. This week’s lab is designed for you to explore at home and troubleshoot/discuss in class. Note that the reading is shorter to make time for that.
Collaborative data management: Collaborative Data Week 5 NAME REDACTED
Theory and Methods Reading: Ted Underwood, “Topic modeling made just simple enough”, https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/. We will do a hands-on sticky-note version of the bucket analogy in class.
Exemplar Reading: Craig, Diaz, Kloster, “The Coded Language of Empire: Digital History, Archival Deep Dives, and the Imperial United States in Cuba’s Third War of Independence”, The American Historical Review, 2024, 129(2), 474–516, https://doi.org/10.1093/ahr/rhae179. The article is also in our Canvas Files as rhae179_Craig.pdf so you don’t have to have an AHR subscription.
Discussion:
Further resources:
This week, we’ll go back to the Google Co-Lab environment we looked at in Week 3, but this time, we’ll actively use a Google Co-Lab notebook to run some topic models.
The most common way to create a Co-Lab notebook is to copy one that already exists.
You’ll use the one I’ve modified and debugged.
I’m adapting a notebook by Jonathan Soma (https://github.com/jsoma/).
However, there are some code chunks that don’t work, so I’ve done a good bit of debugging to make a useful version of the tutorial package that follows. (The code chunks don’t work because Soma’s notebook was written with a version of Python that has since been updated, and a few of the references to code libraries in it have been renamed. File this under “caveats and regrets.”)
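To give you a flavor of what that debugging looks like, here is my own illustration, not code from Soma’s notebook; the specific libraries named below are just examples of the kind of rename involved.

```python
# Illustration only: the actual renamed libraries in Soma's notebook may differ.

# Fix #1: pin the older library version the notebook was written against.
# (Run in its own Colab cell; the leading "!" hands the line to the shell.)
#   !pip install "gensim==3.8.3"

# Fix #2: update the code to use the new name. A real-world example of this
# kind of rename: scikit-learn moved train_test_split to a new module in v0.20.
try:
    from sklearn.model_selection import train_test_split   # current location
except ImportError:
    from sklearn.cross_validation import train_test_split  # pre-0.20 location
```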
Google Co-Lab notebooks have different “sections”:
executable sections: things with code in them that you can “play”. Executable sections use a different font and a grey background.
Something in a notebook that looks like old-school typewriter text is usually executable code.
documentation sections: text-only sections that explain what’s happening in the executable sections. These documentation sections use a sans-serif font like the one you’re reading now.
These sections stack on top of each other and let you scroll through a “program”, or a series of executable sections that does things. Below, we have two documentation sections (“Using topic modeling to extract topics from documents” and “Prep work: Downloading necessary files”) and an executable section with the header “# Make data directory if it doesn’t exist.”
Each executable section has a “play” button.
The square brackets “[ ]” next to the “# Make data directory if it doesn’t exist.” line turn into a PLAY button when you mouse over the space between the brackets.
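For reference, a cell with that header usually contains something like the sketch below. This is my reconstruction of the typical pattern, not necessarily the exact code in the notebook:

```python
# Make data directory if it doesn't exist
import os

os.makedirs("data", exist_ok=True)  # create ./data, but don't complain if it already exists
```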
NOTA BENE: Ignore the blue buttons that show up after you play each executable section. These are new additions from Google’s AI; they’re not part of the original notebook, which has very thoughtful, detailed explanations of its own.
Now we get to use the contents of the tutorial notebook itself. This notebook is spectacular in its explanation of topic modeling. It’s also funny. (It actually contains the sentence “Let’s be honest with ourselves: we expected something a bit better.”)
Can you run these topic models on your own corpus? Why, yes! Yes, you can. If you recall the Google Co-Lab notebook I used to split one file into many files in Week 3, I had an executable section that connected to, and then loaded text files from, Google Drive.
We can use the basic principles there to Frankenstein together our own corpus with the “Attempt Two” section of the topic-modeling tutorial notebook.
REDACTED
This uses a folder of .txt files on Google Drive to run the topic models, but it’s missing some variables that make the change-over-time stream graph in the original work properly.
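If you want to see the whole pattern in one place, here’s a minimal sketch: mount Google Drive, read every .txt file in a folder into a list, and hand that list to a topic model. The folder path and the number of topics are placeholders I made up, and I’m using scikit-learn’s LDA here for brevity; the tutorial notebook’s own “Attempt Two” section is the version you should actually adapt.

```python
# Sketch only: the folder path, topic count, and preprocessing are placeholders.
import glob
from google.colab import drive
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# 1. Mount Google Drive (Colab will ask you to authorize access).
drive.mount("/content/drive")

# 2. Read every .txt file in a (hypothetical) corpus folder into a list of strings.
paths = sorted(glob.glob("/content/drive/MyDrive/my-corpus/*.txt"))
documents = [open(p, encoding="utf-8").read() for p in paths]

# 3. Turn the documents into a word-count matrix and fit a topic model.
vectorizer = CountVectorizer(stop_words="english", max_df=0.9, min_df=2)
counts = vectorizer.fit_transform(documents)
lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(counts)

# 4. Print the top ten words in each topic.
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-10:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```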
Word embedding offers a different view of statistical distribution. It uses something like network analysis (basically matrix algebra, which I can explain briefly in class if someone asks) to put words into a many-dimensional vector space, which usually gets squashed down to 2D or 3D so we can look at it.
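Here’s a tiny, hand-waving sketch of that idea using gensim’s Word2Vec. The three-sentence “corpus” is made up and far too small to produce meaningful vectors; it’s only there to show the mechanics.

```python
# Illustration only: a made-up, tiny corpus; real embeddings need much more text.
from gensim.models import Word2Vec

sentences = [
    ["the", "bishop", "founded", "the", "monastery"],
    ["the", "monk", "lived", "in", "the", "monastery"],
    ["the", "bishop", "governed", "the", "diocese"],
]

# Each word becomes a 50-dimensional vector; words that appear in similar
# contexts end up near each other in that space.
model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, seed=1)

# With a corpus this small the output is basically noise, but this is the call
# you'd use to ask "which words live near 'bishop'?"
print(model.wv.most_similar("bishop", topn=3))
```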
If the mathy distribution description doesn’t make sense, how do we teach people how topic modeling works with sticky notes?
As with our word-barf proposal, I’m not looking for perfection at this point in the semester. Instead, we’re going to think about the second stage of project development: 1-pagers.
How do you get a whole project plan onto one page? The answer: a table.
Format your assignment like this:
Sub-question | Primary-source research needs | Data cleaning needs | Analysis methods | Sample historiography |
---|---|---|---|---|
Here, you write a question that is a smaller, more manageable section of your big research question in 1 sentence. This is the only prose in the table. (e.g. “How do episcopal saints’ lives in 500-700 differ in their focus on bishops vs monks between vitae and gesta?”) | Here, you bullet-point the specific archives/primary sources that will contribute to evidence for this subargument. Try to make document-number or temporal estimates here (e.g. “estimated 150 documents from BNF” or “estimated 250 chapters of 2-3 pages of episcopal gesta from between 500 and 700 CE”) | Here, you bullet-point the data cleaning you’ll need to do to the sources from the archival research. Try to make time estimates here (e.g. “sources need to be transcribed; 20 minutes per source IDed” or “sources are computer-readable but need to be hand-cleaned; 3 minutes per document”) | Here, you bullet-point how you will segment the data, what methods you will use to analyze it, and how each segment of data treated with that specific method will provide evidence to help you answer your question | Here, you provide the top 3-5 secondary-reading or DH-project citations that your subquestion responds to or is in conversation with. No annotations. That happens in a separate document. |
Each subquestion should have its own line in the table. Some subquestions might have overlapping sources (e.g. sources that answer my question about genre differences between 500-700 might also overlap with sources that help me compare gesta b/w 500-700 and gesta b/w 700-1000).
Please use OneDrive or Google Drive to create whatever documents you want to share with me. Make sure you’ve shared with craigkl@iu.edu, and then submit the sharing link.
Draft assignments will always be due before class. We’ll spend 20 minutes in class doing guided revisions, and then you’ll have another 24-36 hours to revise what’s in the Google doc you initially shared for your submission. I’ll start reading and commenting on Wednesday evening or Thursday morning.
Due at: Sep 23, 2024 at 12am
Grading Type: Points
Points: 0.0
Submitting: Online URL
NB: This was claimed by a student and we worked on file renaming.