Tag Archives: text mining

Text-Mining Praxis: Poetry Portfolios Over Time

For this praxis assignment I assembled a corpus of three documents, each produced over a comparable three-year period:

  1. The poetry I wrote before my first poetry workshop (2004-07);
  2. The final portfolios for each of my undergraduate poetry workshops (2007-10); and
  3. My MFA thesis (2010-13).

A few preliminary takeaways:

I used to be more prolific, though much less discriminate. Before I took my first college poetry workshop, I had already written over 20,500 words, equivalent to a 180-page PDF. During undergrad, that number halved, dropping to about 10,300 words, or an 80-page PDF. My MFA thesis topped out at 6,700 words in a 68-page PDF. I have no way of quantifying “hours spent writing” during these three intervals, but anecdotally that time at least doubled at each new stage. This double movement toward more writing (time) and away from more writing (stuff) suggests a growing commitment to revision as well as a more discriminate eye for what “makes it into” the final manuscript in the end.

Undergrad taught me to compress; grad school to expand. In terms of words-per-sentence (wps), my pre-workshop poetry was coming in at about 26wps. My poetry instructor in college herself wrote densely-packed lyric verse, so it’s not surprising to see my own undergraduate poems tightening up to 20wps as images came to the forefront and exposition fell by the wayside. We were also writing in and out of a number of poetic forms–sonnet, villanelle, pantoum, terza rima–which likely further compressed the sentences making up these poems. When I brought to my first graduate workshop one of these sonnet-ish things that went halfway down the page and halfway across it, I was immediately told the next poem needed to fill the page, with lines twice as long and twice as many of them. In my second year, I took a semester-long hybrid seminar/workshop on the long poem, which positioned poetry as a time art and held up more poetic modes of thinking such as digression, association, and meandering as models for reading and producing this kind of poem. I obviously internalized this advice, as, by the time I submitted my MFA thesis, my sentences were nearly twice as long as they’d ever been before, sprawling out to a feverish and ecstatic 47wps.

Things suddenly stopped “being like” other things. Across the full corpus, “like” turns out to be my most commonly-used word, appearing 223 different times. Curiously, only 13 of these are in my MFA thesis, 4 of which appear together in a single stanza of one poem. Which isn’t to say the figurative language stopped, but that it became more coded: things just started “being” (rather than “being like”) other things. For example:

Tiny errors in the Latin Vulgate
have grown horns from the head of Moses.

It is radiant. The deer has seen the face of God

spent a summer living in his house sleeping on his floor.

This one I like. But earlier figurative language was, at best, the worst, always either heavy-handed or confused–and often both. In my pre-MFA days, these were things that were allowed to be “like” other things:

  • “loose leaves sprinkled like finely chopped snow” (chopped snow?)
  • “lips that pull back like wrapping paper around her teeth” (what? no.)
  • “lights of a distant airplane flickering like fireflies on a heavy playhouse curtain” (ugh.)
  • “tossing my wrapper along the road like fast silver ash out a casual window” (double ugh.)

Other stray observations. I was still writing love poems in college, but individual names no longer appeared (Voyant shows that most of the “distinctive words” in the pre-workshop documents were names or initials of ex-girlfriends). “Love” appears only twice in the later poems.

Black, white, and red are among the top-15 terms used across the corpus, and their usage was remarkably similar from document to document (black is ominous; white is ecstatic or otherworldly; red is to call attention to something out of place). The “Left-Term-Right” feature in Voyant is really tremendous in this regard.

And night-time conjures different figures over time: in the pre-workshop poems, people walk around alone at night (“I stand exposed / naked as my hand / beneath the night’s skylight moon”); in the college workshop poems, people come together at night for a party or rendezvous (“laughs around each bend bouncing like vectors across the night”); and, in the MFA thesis, night is the time for prophetic animals to arrive (“That night a deer chirped not itself by the thing so small I could not see it that was on top of it near it or inside of it & and how long had it been there?”).

Kavanaugh and Ford: A Mindful Exercise in Text-Mining

I’ve been fairly excited to utilize Voyant to do some textual analysis. I wanted to choose text to analyze that would engage with structural political issues to draw attention to inequalities within our societal structures. Thus, I’m particularly interested in engaging in discussions surrounding systems of power and privilege in modern America. This is why I’ve chosen to do a text analysis comparing and contrasting Brett Kavanaugh’s opening statement for the Senate Judiciary Committee with  Dr. Christine Blasey Ford’s opening statement.

Before going any further, I would like to issue a trigger warning for topics of sexual violence and assault. I recognize that these past few weeks have been difficult and overwhelming for some (myself included), and I would like to be transparent as I move forward with my analysis on topics that may come up.

My previous research has centered around topics of feminist theory and rape culture, so the recent events regarding Supreme Court Justice nominee Judge Brett Kavanaugh have been particularly significant to my past work and personal interests.

To provide some very brief context on current events, Kavanaugh was recently nominated by President Donald Trump on July 9, 2018 to replace retiring Associate Supreme Court Justice Anthony Kennedy. When it became known that Kavanaugh would most likely become the nominee, Dr. Christine Blasey Ford came forward with allegations that Kavanaugh had sexually assaulted her in the 1980s while she was in high school. In addition to Dr. Ford’s allegations, two other women came forward with allegations of sexual assault against Kavanaugh as well. To read a full timeline of the events that occurred (and continue to occur) surrounding Kavanaugh’s appointment, I suggest checking out the current New York Times politics section or BuzzFeed’s news tab regarding the Brett Kavanaugh vote.

The Senate Judiciary Committee considering Kavanaugh’s potential appointment invited both Kavanaugh and Ford to provide testimony about the allegation on September 27, 2018. Both Ford and Kavanaugh agreed to testify. Each gave a prepared speech (initial testimony) that day and then was asked questions by the committee. For this project, I am only comparing each opening statement–not the questions asked and answers given after the statements were provided. In the future, I believe much could be learned from a full and more thorough analysis including both the statements and the questions/responses given; however, for the breadth of this current research and assignment I am only very briefly looking at both individuals’ opening statements.

This research is primarily exploratory in that I have no concrete hypothesis on what I will find. Rather, I am interested in engaging with each text to see if there are any conclusions that can be drawn from the language. Specifically, do either of the texts have implications regarding structurally oppressive systems of patriarchy and rape culture? Can the language of each speech tell us something about the ways in which sexual assault accusations are handled in the United States by the ways an accuser and the accused present themselves via issued statements? While this is only one example, I would be curious to see what type of questions can be raised from the text.

To begin, I googled both “Kavanaugh Opening Statement” and “Ford Opening Statement” to obtain the text for the analysis.

Here is the link I utilized to access Brett Kavanaugh’s opening statement.

Here is the link I utilized to access Dr. Christine Blasey Ford’s opening statement.

Next, I utilized Voyant, the open source web-based application for performing text analysis.

Here are my findings from Voyant

Kavanaugh’s Opening Statement:


Ford’s Opening Statement:

Comparison (Instead of directly copying and pasting the text in as I had done separately above, I simply entered both links to the text into Voyant)

There are several questions that can be raised from this data. In fact, an entire essay could be written on a variety of discussions and arguments that compare and contrast the text and further look at how they compare with other cases like this one; however for the breadth of this short post I will only pull together a few key points that I noted, centering how the text potentially relates to the structural oppression of women in the United States.

First, I thought it was interesting that Kavanaugh’s opening statement was significantly longer (5,266 total words) than Ford’s (2,510 total words). Within a patriarchal society, women are traditionally taught (both directly and indirectly) to take up less space (physically and metaphorically), so I wondered if this could be relevant when considering the fact that Ford’s opening statement was significantly shorter than Kavanaugh’s. Does the internalized oppression of sexism in female-identified individuals contribute to the length of women’s responses to sexual violence–i.e. do women who experience sexual violence take up less space (potentially without even noticing or directly trying to) in regard to their statements than the accused (in this case men)? Perhaps a larger sample of research comparing the sexual assault statements of both accusers and the accused (specifically when the accuser is female and the accused is male) could provide more insight on this.

Another observation I had while comparing and contrasting the texts with one another concerned the most used words within each text. Specifically, one of the most used words in Kavanaugh’s (the accused) speech was “women,” which I found to be interesting. Do other people (specifically men) who are accused of sexual violence often use the word “women” in statements regarding sexual violence? Is this repetitive use of the word used to somehow prove that an individual would not harm women (even when they are being accused of just that)? It makes me consider an aspect of rape culture that is often seen when dealing with sexual violence–the justification that one could ultimately not commit crimes of sexual violence because they are a “good man” who has many healthy relationships (friendships or romantic) with women. There is no evidence that just because a man has some positive relationships with women that he is less likely to commit sexual assault; however there is data that states that people are more likely to be sexually assaulted by someone they know (RAINN). I would be curious to look into this further by utilizing tools like Voyant to consider the most used words in other statements from people accused of sexual violence.

Ultimately, this was a brief and interesting exercise in investigation and exploration. I think that there could be many different interesting and important research opportunities utilizing tools like Voyant that look at statements provided by sexual violence survivors and those who are accused of sexual violence. This was just a starting point and by no means the necessary and extensive research that must be done on this topic; rather, it remains the beginning for further questions to be asked and analyzed. I’m eager to dive into more in-depth research on these topics in the future, possibly using Voyant or other text-mining web-based applications.

Text Mining – The Rap Songs of the Syrian Revolution/War

The purpose of this text mining assignment is to understand the main recurrent themes, phrases and terms in the rap songs of the Syrian revolution/war (originally in Arabic) and their relation (if any) to the overall unfolding of the Syrian war events, battles and displacement. In what follows, I will highlight the main findings and limitations of the tool for this case study.

The rap songs can be found on The Creative Memory of the Syrian Revolution, an online platform aiming to archive Syrian artistic expression (plastic art, poetry, songs, calligraphy, etc.) in the age of the revolution and war. Interestingly, the website also incorporates digital tools (mapping) to map the location of demonstrations, battles, and the cities in which or for which songs were composed. It’s useful to mention that I’ve worked for the website/songs & music section since March 2016, and thus translated most of these songs’ lyrics. Overall, the songs cover a variety of themes elucidating the horror of war, evoking the angry echo of death, and expressing aspirations for freedom and peace.

To begin with, I went over the 390 songs archived to pick the translated lyrics of the 32 rap songs stretching from 2011 until this day (take for example, Tetlayt). 


I then entered the lyrics, from the most recent to the oldest, into Voyant. And here:

fig. 1

fig. 2


Unsurprisingly, the top 4 trends are: people, country, want, revolution (fig. 1 & 2).






The analysis shows that the word “like” comes fourth, but the word mostly appears in a song where the rapper repeats “like [something/someone]” for amplification (fig. 2 & 3).



fig. 3

Next, I looked into when or at what phase of the revolution/war some terms were most used. It was revealing to see the terms “want” and “leave” (fig. 4 & 5) were popular at the beginning of the revolution in 2011, the time when the leading slogan was “Leave, Leave, oh Bashar” and “the people want to bring down the regime“.

fig. 4

fig. 5

fig. 6

On another note, it doesn’t seem that Voyant can group the singulars and plurals of the same word (child/children in fig. 6). Or is there a way we can group several words together?
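Outside Voyant, one workaround is to fold word forms together before counting. Here is a minimal Python sketch of that idea, assuming a hand-written mapping: `FORM_GROUPS` and the sample sentence are hypothetical, and irregular plurals such as child/children need an explicit entry because no simple suffix rule catches them.

```python
import re
from collections import Counter

# Hypothetical, hand-made mapping from variant forms to a single label.
FORM_GROUPS = {
    "children": "child",
    "countries": "country",
}

def grouped_counts(text, groups=FORM_GROUPS):
    """Count words after folding each variant form into its group label."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(groups.get(word, word) for word in words)

# Invented sample text, not a lyric from the corpus.
sample = "The child sings. The children want their country back."
counts = grouped_counts(sample)
print(counts["child"])
```

The same grouped counts could then be compared song by song, so that "child" and "children" rise and fall as one term.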






So although the analysis gives a good insight into general trends, I would argue that song texts require a tool that is adaptable to the special characteristics of the genre. After all, music reformulates language in performance, and what may be revealed as a trend in text may very well not be the case through the experience of singing and listening. Beyond text, rap songs (any song really) are a play on paralinguistic features such as tones, rhythms, intonations, pauses; and musical ones, such as scales, tone systems, rhythmic temporal structures, and musical techniques–all of which, of course, a tool like Voyant cannot capture. I know there is speech recognition software that is widely used for transcription, but that’s not what I’m interested in. I’m thinking of tools that analyze speech as speech/sound. I’m curious to know what my colleagues who did speech analysis thought of this.

From allegation to cloture: text mining US Senators’ formal statements on Kavanaugh

# overview

For this project I examined Senators’ formal public statements on the Kavanaugh nomination in the wake of Dr. Christine Blasey Ford’s allegation that he attempted to rape her as a teenager. I edited this out initially, but including now that this is an attempt to do something productive with how sick I feel at how hostile American culture remains toward women, our sexuality, and our safety.


## process

I built my corpus for analysis by visiting every single one of the 99* official (and incredibly banal) US Senator websites and searching the term “Kavanaugh” using the search function on each site. I reviewed the first 20 search results** on each website and harvested the first result(s) (up to three) which met my criteria. My criteria were that they be direct, formal press-release statements about Kavanaugh issued on or after September 15, 2018 up until the time of my data collection, which took place from 5pm-10pm EST on October 5th, 2018. Some Senators had few or no formal statements in that period. I did not include in my results any speeches, video, news articles or shows, or op-eds. I only included formal statements, including officially issued press-release comments. For instances in which statements included quoted text and text outside of the quote area, I included only the quote area.

I have publicly posted all of my data and results.

My working list of Senators and their official websites is from an XML file I downloaded from the United States Senate website.

I opened the XML file in Excel and removed information not relevant to my text mining project, such as each Senate member’s office address. I kept each member’s last name, first name, state represented, party affiliation, and official webpage URL. This is my master list, posted to Google Sheets here.

I created a second sheet for the statements. It contains the Senators’ last name along with date, title and content of the statement. I did a search for quote marks and effectively removed most or all of them. This statement content data is available in a Google Sheet here.

I joined the two sheets in Tableau (outer join to accommodate future work I may do with this), and used Tableau’s filtering capabilities to get plain text files separating out the Democrat statements, Republican statements, and Independent statements, along with a fourth file which is a consolidation of all statements. The plan was to perform topic modeling on each and compare.


### in the mangle

Mallet wasn’t too hard to install following these instructions. I input (inputted?) my consolidated Democrat, Republican, and Independent statements and had it output a joined mallet file with stopwords removed. Then I ran the train-topics command, and here I really don’t know what I was doing other than closely following the instructions. It worked? It made the 3 files it was supposed to make – two text files and a compressed .gz file. I have no idea what to do with any of them. Honestly, this is over my head and the explanations on the Mallet site presuppose more familiarity than I have with topic modeling. Here is a link to the inputs I fed Mallet and what it gave back to me.
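Of the output files, the topic-keys text file is usually the easiest place to start reading. Below is a minimal Python sketch for listing its contents, assuming the tab-separated layout that Mallet tutorials describe (topic number, a weight, then the topic's top words). The function name `parse_topic_keys` and the sample text are invented for illustration, not taken from this project's output.

```python
def parse_topic_keys(text):
    """Turn Mallet-style topic-keys text into (topic, weight, words) tuples."""
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Assumed layout: topic id <TAB> weight <TAB> space-separated words
        topic_id, weight, words = line.split("\t", 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics

# Invented sample in the assumed layout, not real output from this corpus.
sample = "0\t0.5\tcourt senate vote\n1\t0.5\tinvestigation fbi allegations\n"
topics = parse_topic_keys(sample)
for topic_id, weight, words in topics:
    print(topic_id, weight, " ".join(words))
```

Each printed line is one topic; reading the word lists side by side for the Democrat, Republican, and Independent runs would be one way to begin the comparison.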


#### discussion

At this point I’m frustrated with Mallet and my ignorance thereof (and, in the spirit of showing obstacles along the way, I’m cranky from operating without full use of my right arm which was injured a few days ago). I’d like to know more about topic modeling, but I’d like the learning process to be at least somewhat guided by an actual in-person person who knows what they’re doing. The readings this week are not adequate as sole preparation or context for trying to execute topic modeling or text mining, and my supplemental research didn’t make a significant difference.

I like my topic and corpus. Something I found interesting when I was collecting my data is that not all Senators issued formal press release statements on Kavanaugh during the period I examined. I was surprised by some who didn’t. Kamala Harris, Elizabeth Warren and Kirsten Gillibrand issued no formal statements referencing Kavanaugh between September 15th and the date of writing (October 5th), whereas Lindsey Graham issued four. This is not to say the former Senators were silent on the topic. Just that they did not choose to issue formal statements. Somewhat alarmingly, searching for “Kavanaugh” on Chuck Schumer’s site returned no results at all. Thinking this was in error, I manually reviewed his press release section going back to September 15th. Indeed, though Schumer issued very many press releases during that period, Kavanaugh was not mentioned a single time in the title of any.

And here’s where I need collaborators, perhaps a political scientist and/or public relations expert who could contextualize the role that formal statements play in politics and why different Senators make different choices about issuing them.

There were other interesting findings as well. The search functions on the websites visited were all over the yard. Many had terrible indexing, returning the same result over and over in the list. Cory Booker’s website returned 2,080 results for “Kavanaugh”. Dianne Feinstein’s site returned 6. The majority of Senators did engage with the Kavanaugh nomination through the vehicle of formal statements. Only ten Senators’ websites either lacked a search function entirely or the search returned zero results for Kavanaugh.

I will likely run the data I gathered through Voyant or perform a different analysis tomorrow. If so, I will update this post accordingly.


##### update 10/7

I wonder if I should be feeding Mallet the statements individually, rather than in consolidated text files grouped by party affiliation. I also realized I wanted to have these individually, rather than as cells in a CSV, so that I can feed into Voyant and see the comparisons between statements. I don’t know how to write macros in Excel, but this seemed like a great application for a Python script. I’ve been trying to learn Python so decided to write a script that would import a CSV and export parts of the individual records as individual text files.

I wrote some Python code and got it working (with an assist from Reddit when an extraneous variable was tripping me up, and suggestions on how I could improve a future iteration from Patrick Smyth). I’ve posted the individual statements in a shared folder here. The filenaming convention is as follows. Filenames start with “D”, “R”, or “I” to indicate which party the senator belongs to (Democrat/Republican/Independent), followed by the Senator’s surname, and a number that kept multiple statements from the same senator from overwriting each other.
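The script itself isn't reproduced here. What follows is a minimal sketch of the same idea rather than the actual code, assuming a CSV with hypothetical column names `party`, `last_name`, and `statement`; the underscore separators in the filenames are also an assumption.

```python
import csv
import io
import tempfile
from collections import defaultdict
from pathlib import Path

def export_statements(csv_file, out_dir):
    """Write each row's statement to its own text file; return the filenames."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    seen = defaultdict(int)  # statements written so far, per senator
    written = []
    for row in csv.DictReader(csv_file):
        key = (row["party"][0].upper(), row["last_name"])
        seen[key] += 1  # the counter keeps repeat statements from colliding
        name = f"{key[0]}_{key[1]}_{seen[key]}.txt"
        (out_dir / name).write_text(row["statement"], encoding="utf-8")
        written.append(name)
    return written

# Invented rows standing in for the real spreadsheet export.
sample = io.StringIO(
    "party,last_name,statement\n"
    "Republican,Graham,First statement.\n"
    "Republican,Graham,Second statement.\n"
    "Independent,Sanders,Another statement.\n"
)
with tempfile.TemporaryDirectory() as tmp:
    names = export_statements(sample, tmp)
    first = (Path(tmp) / names[0]).read_text(encoding="utf-8")
print(names)
```

With a real export, the same function would take a file opened with `open(path, newline="", encoding="utf-8")` and a permanent output folder in place of the temporary one.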

I plan to try analyzing these individual statements tomorrow.


###### update 10/8

I took the statements I broke out in Python and ran them through Voyant. I ran the 56 statements from Democrats separately from the 42 statements from Republicans. I did not analyze the 4 statements from Independents, 3 of which were from Bernie Sanders.

Voyant seems to be a bit buggy. I added “Kavanaugh,” and “judge” to Voyant’s default stopword list, as “Judge Kavanaugh” appeared in every single result, but it took a couple of tries and ultimately only worked on the Cirrus tool. Voyant refused to acknowledge my stopword list on the other tools. I’d also attempted to suppress “Kavanaugh’s”, but Voyant kept showing it, including on the Cirrus tool, despite my adding it to the stopwords list. “Fire” is on the default stoplist, and I think it shouldn’t be. Voyant also would not honor font changes, though there was a dropdown menu to do so.
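When a tool's stopword list won't cooperate, the same filtering can be checked by hand. This is a minimal Python sketch, assuming a tiny illustrative stopword set (not Voyant's default list) and stripping a trailing possessive so that “Kavanaugh’s” is caught along with “Kavanaugh”. The function name and sample text are invented.

```python
import re
from collections import Counter

# Illustrative stopwords only; Voyant's default list is much longer.
STOPWORDS = {"the", "a", "of", "to", "and", "kavanaugh", "judge"}

def top_terms(text, stopwords=STOPWORDS, n=3):
    """Return the n most frequent words, ignoring stopwords and possessives."""
    words = re.findall(r"[a-z]+(?:['’][a-z]+)?", text.lower())
    # Drop a trailing 's (straight or curly apostrophe) before filtering.
    cleaned = [re.sub(r"['’]s$", "", word) for word in words]
    counts = Counter(word for word in cleaned if word not in stopwords)
    return counts.most_common(n)

# Invented sample text, not a quotation from any statement.
sample = ("Judge Kavanaugh's record and the Senate process. "
          "The Senate owes Judge Kavanaugh a fair process.")
print(top_terms(sample))
```

Normalizing the possessive before the stopword check is the step that would have handled the “Kavanaugh’s” problem described above.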

Both groups showed great variability in length. Democrats’ statements ranged from 24 to 612 words. Republicans’ statements ranged from 48 to 887 words.

The Collocates tool was interesting but mysterious. There was a little slidey bar at the bottom that changed the results, but there were no labels or other support to interpret why that was happening or what was being measured. I made sure to keep both my Democrat and Republican analyses at “5” so at least I had consistency. I searched for more information on the tool in the documentation, but the Collocates tool isn’t even listed.

Republicans often linked Dr. Ford’s name with verbs such as heard, said, appear, provide, and named. Democrats used more descriptors, such as credible, courage, and bravely.
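One plausible reading is that the slider sets how many words on either side of a keyword count as its context, though the documentation did not confirm this. A minimal Python sketch of window-based collocate counting under that assumption (the function name and sample sentence are invented):

```python
import re
from collections import Counter

def collocates(text, keyword, window=5):
    """Count the words appearing within `window` words of each keyword hit."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, word in enumerate(words):
        if word != keyword:
            continue
        start = max(0, i - window)
        # Words before the hit plus words after it, excluding the hit itself.
        neighbors = words[start:i] + words[i + 1:i + 1 + window]
        counts.update(neighbors)
    return counts

# Invented sample text, not a quotation from any statement.
sample = ("Dr. Ford was credible. The committee heard Dr. Ford "
          "and her courage was clear.")
result = collocates(sample, "ford", window=2)
print(result.most_common())
```

Running the same function with the same `window` on each party's statements would keep the comparison consistent, which is what holding the slider at “5” was meant to do.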

Collocator graphs from Voyant Tools

It was fun watching the Mandalas tool build, showing relationships between the documents in the corpus and the top 10 terms used. The Democrat mandala (shown first) built off the words “court”, “ford”, “dr”, “senate”, “investigation”, “supreme”, “fbi”, “allegations”, “sexual”, and “assault”. The Republican mandala (shown second) built off their top 10 words, which were “dr”, “committee”, “senate”, “process”, “court”, “ford”, “fbi”, “supreme”, “evidence”, and “judiciary”. The Democrats’ statements called attention to the specific nature of the allegations, while the Republicans’ statements focused on the legal process.

Voyant Tools text analysis mandala visualization

Voyant tools text analysis mandala visualization

Another fun but under-documented tool is called the StreamGraph. This seems more about visual interest than effectively communicating information, as the areas of the different segments are quite hard to compare. Again, the Democrats’ statements visualization is shown first, followed by the Republican. The Democrats highlight “investigation”, whereas the Republicans highlight “process.”

Voyant text mining Stream graph

Voyant text analysis Stream graph

####### text mining tools review

In closing, here are my reviews of the text mining tools I used.

Voyant: buggy, unreliable, good fun but about as rigorous as a party game
Mallet: a machine may have learned something, but I didn’t



*Jon Kyl, John McCain’s replacement, does not yet have an official Senate website of his own. A quick Google search revealed no official press release statements in the first 20 results.

**Bob Corker, Cindy Hyde-Smith, and John Kennedy did not have a search function on their sites. The search function on Rand Paul’s site was not functioning. Each has a news or media section of their site, and that is where I looked for press releases. Chuck Schumer and Tina Smith’s sites’ search functions returned zero results for “Kavanaugh”. I reviewed titles of all press releases on their sites since September 15th and found no reference to Kavanaugh.


My process with the Praxis 1 Text Mining Assignment began with a seed that was planted during the self-Googling audits we did in the first weeks of class, because I found an obituary for a woman of my same name (sans middle name or initial).

From this, my thoughts went to the exquisite obituaries that were written by The New York Times after 9-11 which were published as a beautiful book titled Portraits. One of my dearest friends has a wonderful father who was engaged to a woman who perished on that most fateful of New York Tuesdays. My first Voyant text mining text, therefore, was of his fiancee’s NYT obituary. And the last text I mined for this project was the obituary for the great soprano Montserrat Caballé, when I heard the news of her passing as I was drafting this post.

The word REVEAL that appears above the Voyant text box is an understatement. When the words appeared as visuals, I felt like I was learning something about her and them as a couple that I would never have been able to grasp by just reading her obituary. Indeed, I had read it many times prior. Was it the revelation of some extraordinary kind of subtext? Is this what “close reading” is or should be? The experience hit me in an unexpected way between the eyes as I looked at the screen and in the gut.

My process then shifted immediately to song lyrics because, as a singer myself who moonlights as a voice teacher and vocal coach, I’m always reviewing, teaching and learning lyrics. I saw the potential value of using Voyant in this way in high relief. I got really juiced by the prospect of all the subtexts and feeling tones that would be revealed to actors/singers via Voyant. When I started entering lyrics, this was confirmed a thousand fold on the screen. So, completely unexpectedly, I now have an awesome new tool in my music skill set. The most amazing thing about this is that I will be participating in “Performing Knowledge” an all-day theatrical offering at The Segal Center on Dec. 10 for which I submitted the following proposal that was accepted by the Theater Dept.:

“Muscle Memory: How the Body + Voice Em”body” Songs, Poems, Arias, Odes, Monologues & Chants: Learning vocal/spoken word content, performing it, and recording it with audio technology is an intensely physical/psychological/organic process that taps into and connects with a performer’s individually unique “muscle memory”, leading to the creation of vocal/sound art with the body + voice as the vehicle of such audio content. This proposed idea seeks to analyze “songs” as “maps” in the Digital Humanities context. Participants are highly encouraged to bring a song, poem, monologue, etc. with lyric/text sheet to “map out”. The take-away will be a “working map” that employs muscle memory toward learning, memorizing, auditioning, recording and performing any vocal/spoken word content. –Conceived, written and submitted by Carolyn A. McDonough, Monday, Sept. 17, 2018.” [I’m excited to add that during the first creative meeting toward this all-day production, I connected my proposed idea to readings of Donna Haraway and Katherine Hayles from ITP Core 1]

What better way to celebrate this, than to “voyant” song/lyric content and today’s “sad news day” obituary of a great operatic soprano. Rather than describe these Voyant Reveals through writing further, I was SO struck by the visuals generated on my screen that I wanted to show and share these as the findings of my research.

My first choice was “What I Did For Love” from A Chorus Line (on a sidenote, I’ve seen the actual legal pad that lyricist Edward Kleban wrote the score on at the NYPL Lincoln Center performing arts branch, and I thought I had a photo, but alas I do not as I really wanted to include it to show the evolution from handwritten word/text to Voyant text analysis.)

I was screaming as the results JUMPED out of the screen at me of the keyword “GONE” that is indeed the KEY to the emotional subtext an actor/singer needs to convey within this song in an audition or performance which I KNOW from having heard, studied, taught, and seen this song performed MANY times. And it’s only sung ONCE! How does Voyant achieve this super-wordle superpower?

I then chose “Nothing” also from A Chorus Line as both of these songs are sung by my favorite character, Diana Morales, aka Morales.

Can you hear the screams of discovery?!

Next was today’s obit for a great soprano which made me sad to hear on WQXR this morning because I once attended one of her rehearsals at Lincoln Center:

A complex REVEAL of a complex human being and vocal artist by profession.

AMAZING. Such visuals of texts, especially texts I know “by heart” are extremely powerful.

Lastly, over the long weekend, I’m going to “Voyant” this blog post itself, so that its layers of meaning can be revealed to me even further. –CAM

Text Mining Game Comments (Probably Too Many at Once!)

To tell the truth, I’ve been playing with Voyant a lot, trying to figure out what the most interesting thing is that I could do with it! Tenen could critique my analysis on the grounds that it’s definitely doing some things I don’t fully understand; Underwood would probably quibble with my construction of a corpus and my method of selecting words to consider.  Multiple authors could very reasonably take issue with the lack of political engagement in my choice. However, if the purpose here is to get my feet wet, I think it’s a good idea to start with a very familiar subject matter, and in my case, that means board games.

Risk Legacy was published in 2011. This game reimagined the classic Risk as a series of scenarios, played by the same group, in which players would make changes to the board between (or during!) scenarios. Several years later,* the popularity and prevalence of legacy-style, campaign-style, and scenario-based board games have skyrocketed. Two such games, Gloomhaven and Pandemic Legacy, are the top two games on BoardGameGeek as of this writing.

I was interested in learning more about the reception of this type of game in the board gaming community. The most obvious source for such information is BoardGameGeek (BGG).  I could have looked at detailed reviews, but since I preferred to look at reactions from a broader section of the community, I chose to look at the comments for each game.  BGG allows users to rate games and comment on them, and since all the games I had in mind were quite popular, there was ample data for each.  Additionally, BGG has an API that made extracting this data relatively easy.**
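A minimal Python sketch of pulling comments from that API follows. It assumes the XML API2 `thing` endpoint with `comments=1` and `pagesize=100`, which is my understanding of the interface and not something stated in this post, so check BGG's current API documentation and access rules before relying on it. The game id is a placeholder and the sample XML is invented so the parsing step runs offline.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = "https://boardgamegeek.com/xmlapi2/thing"  # assumed endpoint

def comments_url(game_id, page=1):
    """Build the request URL for one page of comments (assumed parameters)."""
    query = urlencode(
        {"id": game_id, "comments": 1, "page": page, "pagesize": 100}
    )
    return f"{BASE_URL}?{query}"

def parse_comments(xml_text):
    """Collect the text of every non-empty <comment value="..."> element."""
    root = ET.fromstring(xml_text)
    return [c.get("value") for c in root.iter("comment") if c.get("value")]

def fetch_comments(game_id, page=1):
    """Download and parse one page of comments (needs network access)."""
    with urlopen(comments_url(game_id, page)) as response:
        return parse_comments(response.read().decode("utf-8"))

# Invented sample in the assumed response shape.
sample = """<items><item type="boardgame" id="1">
  <comments page="1" totalitems="2">
    <comment username="a" rating="9" value="Great campaign game."/>
    <comment username="b" rating="N/A" value="Too long for my group."/>
  </comments>
</item></items>"""
comments = parse_comments(sample)
print(len(comments))
```

Writing each game's parsed comments to its own text file would give one document per game, ready to load into Voyant.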

As I was only able to download the most recent 100 comments for each game, this is where I started.  I listed all the games of this style that I could think of, created a file for each set of comments, and loaded them into Voyant. Note that I personally have only played five of these nine games. The games in question are:

  • The 7th Continent, a cooperative exploration game
  • Charterstone, a worker-placement strategy game
  • Gloomhaven, a cooperative dungeon crawl
  • Star Wars: Imperial Assault, a game based on the second edition of the older dungeon crawl, Descent, but with a Star Wars theme. It’s cooperative, but with the equivalent of a dungeon master.
  • Near and Far, a strategy game with “adventures” which involve reading paragraphs from a book. This is a sequel to Above and Below, an earlier, simpler game by the same designer.
  • Pandemic Legacy Season One, a legacy-style adaptation of the popular cooperative game, Pandemic
  • Pandemic Legacy Season Two, a sequel to Pandemic Legacy Season One
  • Risk Legacy, described above
  • Seafall, a competitive nautical-themed game with an exploration element

The 7th Continent is a slightly controversial inclusion in this list; I have it here because it is often discussed with the others. I excluded Descent because it isn’t often considered as part of this genealogy (although perhaps it should be). Both these decisions felt a little arbitrary; I can certainly understand why building a corpus is such an important and difficult part of the text-mining process!

These comments included 4,535 unique word forms, with the length of each document varying from 4,059 words (Risk Legacy) to 2,615 (7th Continent).  Voyant found the most frequent words across this corpus, but also the most distinctive words for each game. The most frequent words weren’t very interesting: game, play, games, like, campaign.*** Most of these words would probably be the most frequent for any set of game comments I loaded into Voyant! However, I noticed some interesting patterns among the distinctive words. These included:

Game Jargon referring to scenarios. That includes: “curse” for The 7th Continent (7 instances), “month” for Pandemic Legacy (15 instances), and “skirmish” for Imperial Assault (15 instances). “Prologue” was mentioned 8 times for Pandemic Legacy Season 2, in reference to the practice scenario included in the game.

References to related games or other editions. “Legacy” was mentioned 15 times for Charterstone, although it is not officially a legacy game. “Descent” was mentioned 15 times for Imperial Assault, which is based on Descent. “Below” was mentioned 19 times for Near and Far, which is a sequel to the game Above and Below. “Above” was also mentioned much more often for Near and Far than for other games; I’m not sure why it didn’t show up among the distinctive words.

References to game mechanics or game genres. Charterstone, a worker placement game, had 20 mentions of “worker” and 17 of “placement.” The word “worker” was also used 9 times for Near and Far, which also has a worker placement element; “threats” (another mechanic in the game) were mentioned 8 times. For Gloomhaven, a dungeon crawl, the word “dungeon” turned up 20 times.  Risk Legacy had four mentions of “packets” in which the new materials were kept. The comments about Seafall included 6 references to “vp” (victory points).  Near and Far and Charterstone also use victory points, but for some reason they were mentioned far less often in reference to those games.

The means by which the game was published. Kickstarter, a crowdfunding website, is very frequently used to publish board games these days. In this group, The 7th Continent, Gloomhaven, and Near and Far were all published via Kickstarter. Curiously, both the name “Kickstarter” and the abbreviation “KS” appeared with much higher frequency in the comments on the 7th Continent and Near and Far than in the comments for Gloomhaven. 7th Continent players were also much more likely to use the abbreviation than to type out the full word; I have no idea why this might be.

Thus, it appears that most of the words that stand out statistically (in this automated analysis) in the comments refer to facts about the game, rather than directly expressing an opinion. The exception to this rule was Seafall, which is by far the lowest-ranked of these games and which received some strongly negative reviews when it was first published. The distinctive words for Seafall included two very ominous ones: “willing” and “faq” (each used five times).

In any case, I suspected I could find more interesting information outside the selected terms. Here, again, Underwood worries me; if I select terms out of my own head, I risk biasing my results. However, I decided to risk it, because I wanted to see what aspects of the campaign game experience commenters found important or at least noteworthy. If I had more time to work on this, it would be a good idea to read through some reviews for good words describing various aspects of this style of game, or perhaps go back to a podcast where this was discussed, and see how the terms used there were (or weren’t) reflected in the comments. Without taking this step, I’m likely to miss things; for instance, the fact that the word “runaway” (as in, runaway leader) constitutes 0.0008 of the words used to describe Seafall, and is never used in the comments of any of the other games except Charterstone, where it appears at a much lower rate.**** As it is, however, I took the unscientific step of searching for the words that I thought seemed likely to matter. My results were interesting:

(Please note that, because of how I named the files, Pandemic Legacy Season Two is the first of the two Pandemics listed!)

It’s very striking to me how different each of these bars looks. Some characteristics are hugely important to some of the games but not at all mentioned in the others! “Story*” (including both story and storytelling) is mentioned unsurprisingly often when discussing Near and Far; one important part of that game involves reading story paragraphs from a book. It’s interesting, though, that story features so much more heavily in the first season of Pandemic Legacy than the second. Of course, the mere mention of a story doesn’t mean that the story of a game met with approval; most of the comments on Pandemic Legacy’s story are positive, while the comments on Charterstone’s are a bit more mixed.

Gloomhaven comments are much more about characters than any of the other terms I used; one of the distinguishing characteristics of this game is the way that characters change over time. Many of the comments also mentioned that the characters do not conform to common dungeon crawl tropes. However, the fact that characters are mentioned in every game except for two suggests that characters are important to players of campaign-style games.

I also experimented with some of the words that appeared in the word cloud, but since this post is already quite long, I won’t detail everything I noticed! It was interesting, for instance, to note how the use of words like “experience” and “campaign” varied strongly among these games.  (For instance: “experience” turned out to be a strongly positive word in this corpus, and applied mainly to Pandemic Legacy.)

In any case, I had several takeaways from this experience:

  • Selecting an appropriate corpus is difficult. Familiarity with the subject matter was helpful, but someone less familiar might have selected a less biased corpus.
  • The more games I included, the more difficult this analysis became!
  • My knowledge of the subject area allowed me to more easily interpret the prevalence of certain words, particularly those that constituted some kind of game jargon.
  • Words often have a particularly positive or negative connotation throughout a corpus, though they may not have that connotation outside that corpus. (For instance: rulebook. If a comment brings up the rulebook of a game, it is never to compliment it.)
  • Even a simple tool like this includes some math that isn’t totally transparent to me. I can appreciate the general concept of “distinctive words,” but I don’t know exactly how they are calculated. (I’m reading through the help files now to figure it out!)
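On the question of how “distinctive words” are calculated: Voyant’s help pages describe the measure as based on TF-IDF. The sketch below is a generic textbook TF-IDF, not Voyant’s exact formula; the tokenizer, function names, and toy comments are assumptions made up for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace, trimming surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def tf_idf(documents):
    """Score each word as relative frequency times log(N / document frequency)."""
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    n_docs = len(documents)
    doc_freq = Counter()
    for counter in counts.values():
        doc_freq.update(counter.keys())
    scores = {}
    for name, counter in counts.items():
        total = sum(counter.values())
        scores[name] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counter.items()
        }
    return scores


def most_distinctive(scores, name, top_n=3):
    """Return the top_n highest-scoring words for one document."""
    ranked = sorted(scores[name].items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


# Made-up toy comments, not the BGG data.
toy = {
    "charterstone": "worker placement game with worker buildings",
    "gloomhaven": "dungeon crawl game with dungeon cards",
}
scores = tf_idf(toy)
print(most_distinctive(scores, "charterstone", top_n=1))  # ['worker']
```

A word that appears in every document, such as “game” here, scores zero because log(N / N) is zero. That is why the most frequent words across a corpus rarely show up as distinctive for any single document.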

I also consolidated all the comments on each game into a single file, which was very convenient for this analysis, but prevented me from distinguishing among the commenters.  This could be important if, for example, all five instances of a word were by the same author.
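If the comments had been kept as separate (commenter, text) pairs, the check could look something like the sketch below. The data structure, the function name, and the sample comments are hypothetical.

```python
from collections import Counter


def word_spread(comments, word):
    """Return (total occurrences, number of distinct commenters using the word)."""
    per_author = Counter()
    for username, text in comments:
        tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
        hits = tokens.count(word.lower())
        if hits:
            per_author[username] += hits
    return sum(per_author.values()), len(per_author)


# Made-up comments for illustration only.
sample = [
    ("user_a", "Willing to try again, willing to forgive."),
    ("user_b", "Not willing to finish the campaign."),
    ("user_c", "Great exploration."),
]
print(word_spread(sample, "willing"))  # (3, 2)
```

A result like (5, 1) would mean that one person accounts for every use of the word, which is much weaker evidence about the community than (5, 5).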

*Note that there was a lag of several years due to the immense amount of playtesting and design work required for this type of game.

**Thanks to Olivia Ildefonso who helped me with this during Digital Fellows’ office hours!

***Note that “like” and “game” are both ambiguous terms. “Like” is used both to express approval and to compare one game to another. “Game” could refer to the overall game or to one session of it (e.g. “I didn’t enjoy my first game of this, but later I came to like it.”).

****To be fair, it is unlikely anyone would complain of a runaway leader in 7th Continent, Gloomhaven, Imperial Assault, or either of the Pandemics, as they are all cooperative games.

Text mining the Billboard Country Top 10

My apologies to anyone who read this before the evening of October 8. I set this to post automatically, but for the wrong date and without all that I wanted to include.

I’m a big fan of music but as I’ve gotten further away from my undergrad years, I’ve become less familiar with what is currently playing on the radio. Thanks to my brother’s children, I have some semblance of a grasp on certain musical genres, but I have absolutely no idea what’s happening in the world of country music (I did at one point, as I went to undergrad in Virginia).

I decided to use Voyant Tools to do a text analysis of the first 10 songs on the Billboard Country chart from the week of September 8, 2018. The joke about country music is that it’s about dogs, trucks, and your wife leaving you. When I was more familiar with country music, I found it to be more complex than this, but a lot could have changed since I last paid attention. Will a look at the country songs with the most sales/airplay during this week support these assumptions? For the sake of uniformity, I accepted the lyrics on Genius.com as being correct and removed all extraneous words from the lyrics (chorus, bridge, etc.).

The songs in the top 10 are as follows:

  1. Meant to Be – Bebe Rexha & Florida Georgia Line
  2. Tequila – Dan + Shay
  3. Simple – Florida Georgia Line
  4. Drowns the Whiskey – Jason Aldean featuring Miranda Lambert
  5. Sunrise, Sunburn, Sunset – Luke Bryan
  6. Life Changes – Thomas Rhett
  7. Heaven – Kane Brown
  8. Mercy – Brett Young
  9. Get Along – Kenny Chesney
  10. Hotel Key – Old Dominion

If you would like to view these lyrics for yourself, I’ve left the files in a Google folder.

As we can see, the words “truck,” “dog,” “wife,” and “left” were not among the most frequently used, although it may not be entirely surprising that “ain’t” was.

The most frequently used word in the corpus, “it’s,” appeared only 19 times, showing that there is quite a bit of diversity in these lyrics. I looked for other patterns, such as whether vocabulary density or average words per sentence was related to a song’s position on the chart, but found no correlation.
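For readers who want to repeat this kind of check outside Voyant, here is a minimal sketch. It assumes vocabulary density means unique word forms divided by total words, and it uses a plain Pearson correlation; the function names and the placeholder lyrics are made up and do not come from the actual songs.

```python
import math


def vocabulary_density(text):
    """Unique word forms divided by total words."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    return len(set(words)) / len(words) if words else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Placeholder lyrics keyed by chart position, not the actual songs.
lyrics_by_position = {
    1: "la la la la",
    2: "one two one two",
    3: "every word here differs",
}
positions = sorted(lyrics_by_position)
densities = [vocabulary_density(lyrics_by_position[p]) for p in positions]
print(densities)  # [0.25, 0.5, 1.0]
print(pearson(positions, densities))
```

With only ten songs, a coefficient near zero is weak evidence either way. Because chart positions are ranks, a rank-based measure might suit the data better.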

Text-Mining the MTA Annual Report

After some failed attempts at text-mining other sources [1], I settled on examining the New York Metropolitan Transportation Authority’s annual reports. The MTA offers online access to its annual reports going back to the year 2000 [2]. As a daily rider and occasional critic of the MTA, I thought this might provide insight into its sometimes murky motivations.

I decided to compare the 2017, 2009, and 2001 annual reports. I chose these because 2017 was the most current, 2009 was the first annual report after the Great Recession became a steady factor in New York life, and 2001 was the annual report after the 9/11 attacks on the World Trade Center. I thought there might be interesting differences between the most recent annual report and the annual reports written during periods of intense social and financial stress.

Because the formats of the annual reports vary from year to year, I was worried that some differences emerging from text-mining might be due to those formatting changes rather than operational changes. So at first I tried to minimize this by finding sections of the annual reports that seemed analogous in all three years. After a few tries, though, I finally realized that dissecting the annual reports in this manner had too much risk of leaving out important information. It would therefore be better to simply use the entirety of the text in each annual report for comparison, since any formatting changes to particular sections would probably not change the overall tone of the annual report (and the MTA in general).

I downloaded the PDFs of the annual reports [3], copied the full text within, and ran that text through Voyant’s online text-mining tool (https://voyant-tools.org/).

The 20 most frequent words for each annual report are listed below. It is important to note that these lists track specific spellings of words, but it is sometimes more important to track all related words (words with the same root, like “complete” and “completion”). Voyant allows users to search for roots instead of specific spellings, but the user needs to already know which root to search for.
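The difference between tracking one spelling and tracking a root can be sketched as a simple prefix match. This is only an approximation of a wildcard search, not Voyant’s implementation, and the function name and sample sentence are invented for illustration.

```python
from collections import Counter


def forms_for_root(text, root):
    """Count every word form that starts with the given root."""
    forms = Counter()
    for raw in text.split():
        word = raw.strip(".,;:!?\"'()").lower()
        if word.startswith(root.lower()):
            forms[word] += 1
    return forms


sample = "Work completed in May. Completion of the project will complete the line."
forms = forms_for_root(sample, "complet")
print(sum(forms.values()), sorted(forms))  # 3 ['complete', 'completed', 'completion']
```

Listing the matched forms, and not only their total, makes it easier to spot over-matching: a root like “safe” would also pick up “safeguard,” which may or may not belong in the count.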

2001 Top 20:
mta (313); new (216); capital (176); service (154); financial (146); transit (144); year (138); operating (135); december (127); tbta (125); percent (121); authority (120); york (120); bonds (112); statements (110); total (105); million (104); long (103); nycta (93); revenue (93)

2009 Top 20:
new (73); bus (61); station (50); mta (49); island (42); street (41); service (39); transit (35); annual (31); long (31); report (31); completed (30); target (30); page (29); avenue (27); york (24); line (23); performance (23); bridge (22); city (22)

2017 Top 20:
mta (421); new (277); million (198); project (147); bus (146); program (140); report (136); station (125); annual (121); service (110); total (109); safety (105); pal (100); 2800 (98); page (97); capital (94); completed (89); metro (85); north (82); work (80)

One of the most striking differences to me was the use of the word “safety” and other words sharing the root “safe.” Before text-mining, I would have thought that “safe” words would be most common in the 2001 annual report, reflecting a desire to soothe public fears of terrorist attacks after 9/11. Yet the most frequent use by far of “safe” words was in 2017. This was a matter not simply of raw volume, but also of frequency rate. “Safe” words were mentioned almost four times as often in 2017 (frequency rate: 0.0038) as in 2001 (0.001). “Secure” words might at first seem more evenly balanced between 2001 (0.0017) and 2017 (0.0022). However, these results are skewed, because in 2001, many of the “secure” words were used in their financial sense, not their public-safety sense (e.g., “Authority’s investment policy states that securities underlying repurchase agreements must have a market value…”).

This much higher recent focus on safety might be explained by the fact that the 9/11 attacks were not the fault of the MTA, so any safety disruptions in 2001 could generally have been seen as understandable. The 2001 annual report mentioned that the agency was mostly continuing to follow the “MTA all-agency safety initiative, launched in 1996.” However, by 2017, a series of train and bus crashes (one of which happened just one day ago), and heavy media coverage of the MTA’s financial corruption and faulty equipment, were possibly shifting blame for safety issues to the MTA’s own internal problems. Therefore, the MTA might now be feeling a greater need to emphasize its commitment to safety, whereas that commitment was more readily assumed before.

In a similar vein, “replace” words were five times more frequent in 2017 (0.0022) than in 2001 (0.0004). “Repair” words were also much more frequent in 2017 (0.0014) than 2001 (0.00033). In 2001, the few mentions of “repair” were often in terms of maintaining “a state of good repair,” which might indicate that the MTA thought the system was already working pretty well. By 2017, public awareness of the system’s dilapidation might have changed that. Many mentions of repair and replacement in the 2017 annual report are also in reference to damage done by Hurricane Sandy (which happened in 2012).

In contrast to 2017’s focus on safety and repair, the 2001 annual report is more concerned with financial information than later years. Many of the top twenty words are related to economics, such as “capital,” “revenue,” and “bonds.” In fact, as mentioned above, the 2001 annual report often uses the word “security” with its financial meaning.

The 2009 annual report was far shorter (6,272 words) than the 2001 (36,126 words) and 2017 (29,706 words) reports. Perhaps the Great Recession put such a freeze on projects that there simply wasn’t as much to discuss. However, even after accounting for the prevalence of “New York,” 2009 still had a much higher frequency rate for the word “new.” (The prevalence of “new” in every year at first made me think that the MTA was obsessed with promoting new projects, but the Links tool in Voyant reminded me that this was largely because of “New York.”) Maybe even though there weren’t many new projects to trumpet, the report tried particularly hard to highlight what there was.
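One way to separate “new” from “New York” is to count only the uses of “new” that are not immediately followed by “york.” The sketch below illustrates that idea on a made-up sentence; it is not how the Links tool works, and the function name is hypothetical.

```python
def count_new(text):
    """Return (all uses of 'new', uses not followed by 'york')."""
    tokens = [t.strip(".,;:!?\"'()").lower() for t in text.split()]
    total = standalone = 0
    for i, token in enumerate(tokens):
        if token != "new":
            continue
        total += 1
        following = tokens[i + 1] if i + 1 < len(tokens) else ""
        if following != "york":
            standalone += 1
    return total, standalone


sample = "New York City Transit opened a new station and ordered new buses in New York."
print(count_new(sample))  # (4, 2)
```

Dividing the standalone count by the document’s word count gives a frequency rate for “new” that no longer includes the place name.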

The recession might also be why “rehabilitate” and related words were used almost zero times in 2001 and 2017, but were used heavily in 2009 (0.0043). Rehabilitating current infrastructure might be less costly than completely new projects, yet still allow for the word “new” to be used. “Rehabilitate” words were used even more frequently in 2009 than the word “York.”

One significant flaw in Voyant is that it doesn’t seem to provide the frequency rate of a word for the entire document. Instead, it only provides the frequency rate for each segment of the document. The lowest possible number of segments that a user can search is two. This means that users have to calculate the document-length frequency rate themselves by dividing the number of instances by the number of words in the document. If the document-length frequency rate is available somewhere in the Voyant results, it doesn’t seem intuitive and it isn’t explained in the Voyant instructions.
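The manual calculation is just a division. As a small worked sketch, the numbers below are taken from the figures reported earlier in this post, and the function names are made up for illustration.

```python
def relative_frequency(instances, total_words):
    """Occurrences of a term divided by the number of words in the document."""
    return instances / total_words


def times_as_frequent(rate_a, rate_b):
    """How many times as frequent rate_a is compared with rate_b."""
    return rate_a / rate_b


# "safety" appears 105 times in the 2017 report, which is 29,706 words long.
safety_2017 = relative_frequency(105, 29706)
print(round(safety_2017, 4))  # 0.0035

# Rates reported for all "safe" words: 0.0038 (2017) and 0.001 (2001).
print(round(times_as_frequent(0.0038, 0.001), 1))  # 3.8
```

The rate for “safety” alone (about 0.0035) is a little lower than the 0.0038 reported for all “safe” words, as expected, since the root also matches other forms.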

Although I generally found Voyant to be an interesting and useful tool, it always needs to be combined with traditional analysis of the text. Without keeping an eye on the context of the results, it would be easy to make false assumptions about why particular words are being used. Helpfully, Voyant has “Contexts” and “Reader” windows that let users quickly check for themselves how a word is being used in the text.

[1] I first ran Charles Darwin’s “Origin of Species” and “Descent of Man” through Voyant, but the results were not particularly surprising. The most common words were ones like “male,” “female,” “species,” “bird,” etc.

In a crassly narcissistic decision, I then pasted one of my own unpublished novels into Voyant. This revealed a few surprises about my writing style (the fifth most common word was “like,” which either means I love similes or being raised in Southern California during the 1980s left a stronger mark than I thought). I also apparently swear a lot. However, this didn’t seem socially relevant enough to center an entire report around.

Then I thought it might be very relevant to text-mine the recent Supreme Court confirmation hearings of Brett Kavanaugh and compare them to his confirmation hearings when he was nominated to the D.C. Circuit Court of Appeals. Unfortunately, there are no full transcripts available yet of the Supreme Court hearings. The closest approximation that I found was the C-SPAN website, which has limited closed-caption transcripts, but their user interface doesn’t allow for copying the full text of the hearing. The transcripts for Kavanaugh’s 2003 and 2006 Circuit Court hearings were available from the U.S. Congress’s website, but the website warned that transcripts of hearings can take years to be made available. Since the deadline for this assignment is October 9, I decided that was too much of a gamble. I then tried running Kavanaugh’s opening statements through Voyant, but that seemed like too small a sample to draw any significant conclusions. (Although it’s interesting that he used the word “love” a lot more in 2018 than he did back in 2003.)

[2] 2017: http://web.mta.info/mta/compliance/pdf/2017_annual/SectionA-2017-Annual-Report.pdf
2009: http://web.mta.info/mta/compliance/pdf/2009%20Annual%20Report%20Narrative.pdf
2001: http://web.mta.info/mta/investor/pdf/annualreport2001.pdf

[3] It’s important to download the PDFs before copying text. Copying directly from websites can result in text that has a lot of formatting errors, which then requires data-cleaning and can lead to misleading results.