Tag Archives: voyant

Text-Mining Praxis: Poetry Portfolios Over Time

For this praxis assignment I assembled a corpus of three documents, each produced over a comparable three-year period:

  1. The poetry I wrote before my first poetry workshop (2004-07);
  2. The final portfolios for each of my undergraduate poetry workshops (2007-10); and
  3. My MFA thesis (2010-13).

A few preliminary takeaways:

I used to be more prolific, though much less discriminating. Before I took my first college poetry workshop, I had already written over 20,500 words, equivalent to a 180-page PDF. During undergrad, that number halved, dropping to about 10,300 words, or an 80-page PDF. My MFA thesis topped out at 6,700 words in a 68-page PDF. I have no way of quantifying “hours spent writing” during these three intervals, but anecdotally that time at least doubled at each new stage. This double movement toward more writing (time) and away from more writing (stuff) suggests a growing commitment to revision as well as a more discriminating eye for what “makes it into” the final manuscript in the end.

Undergrad taught me to compress; grad school to expand. In terms of words-per-sentence (wps), my pre-workshop poetry was coming in at about 26wps. My poetry instructor in college herself wrote densely-packed lyric verse, so it’s not surprising to see my own undergraduate poems tightening up to 20wps as images came to the forefront and exposition fell by the wayside. We were also writing in and out of a number of poetic forms–sonnet, villanelle, pantoum, terza rima–which likely further compressed the sentences making up these poems. When I brought to my first graduate workshop one of these sonnet-ish things that went halfway down the page and halfway across it, I was immediately told the next poem needed to fill the page, with lines twice as long and twice as many of them. In my second year, I took a semester-long hybrid seminar/workshop on the long poem, which positioned poetry as a time art and held up more poetic modes of thinking such as digression, association, and meandering as models for reading and producing this kind of poem. I obviously internalized this advice, as, by the time I submitted my MFA thesis, my sentences were nearly twice as long as they’d ever been before, sprawling out to a feverish and ecstatic 47wps.
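For anyone who wants to check a words-per-sentence figure outside of Voyant, here is a minimal Python sketch. It assumes that a "sentence" is whatever falls between runs of periods, exclamation points, and question marks, which is cruder than whatever Voyant does internally and will inflate the number for lightly punctuated poems. The function name is made up for illustration.

```python
import re

def words_per_sentence(text: str) -> float:
    """Average words per sentence, using naive punctuation-based splitting."""
    # Split on runs of ., !, or ? and drop empty leftovers.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    # Count word-like tokens (letters, digits, apostrophes) in each sentence.
    word_counts = [len(re.findall(r"[\w'’]+", s)) for s in sentences]
    return sum(word_counts) / len(sentences)

sample = "It is radiant. The deer has seen the face of God."
print(words_per_sentence(sample))
```

Running the same function over each of the three documents would give figures comparable to one another, even if they differ somewhat from the ones Voyant reports.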

Things suddenly stopped “being like” other things. Across the full corpus, “like” turns out to be my most commonly-used word, appearing 223 different times. Curiously, only 13 of these are in my MFA thesis, 4 of which appear together in a single stanza of one poem. Which isn’t to say the figurative language stopped, but that it became more coded: things just started “being” (rather than “being like”) other things. For example:

Tiny errors in the Latin Vulgate
have grown horns from the head of Moses.

It is radiant. The deer has seen the face of God

spent a summer living in his house sleeping on his floor.

This one I like. But earlier figurative language was, at best, the worst, always either heavy-handed or confused–and often both. In my pre-MFA days, these were things that were allowed to be “like” other things:

  • “loose leaves sprinkled like finely chopped snow” (chopped snow?)
  • “lips that pull back like wrapping paper around her teeth” (what? no.)
  • “lights of a distant airplane flickering like fireflies on a heavy playhouse curtain” (ugh.)
  • “tossing my wrapper along the road like fast silver ash out a casual window” (double ugh.)

Other stray observations. I was still writing love poems in college, but individual names no longer appeared (Voyant shows that most of the “distinctive words” in the pre-workshop documents were names or initials of ex-girlfriends). “Love” appears only twice in the later poems.

Black, white, and red are among the top-15 terms used across the corpus, and their usage was remarkably similar from document to document (black is ominous; white is ecstatic or otherworldly; red is to call attention to something out of place). The “Left-Term-Right” feature in Voyant is really tremendous in this regard.
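The “Left-Term-Right” view is a keyword-in-context listing: every hit for a term, shown with a few words on either side. A rough Python sketch of the idea follows. It is an illustration of the general technique and not Voyant's implementation; the function name and the three-word window are assumptions.

```python
import re

def keyword_in_context(text: str, term: str, width: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, hit, right context) for each occurrence of term."""
    tokens = re.findall(r"[\w'’]+", text)
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == term.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

line = "lips that pull back like wrapping paper around her teeth"
for left, hit, right in keyword_in_context(line, "like"):
    print(f"{left:>20} | {hit} | {right}")
```

Swapping in “black,” “white,” or “red” and a whole document for `line` would produce the kind of side-by-side contexts that make color usage easy to compare.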

And night-time conjures different figures over time: in the pre-workshop poems, people walk around alone at night (“I stand exposed / naked as my hand / beneath the night’s skylight moon”); in the college workshop poems, people come together at night for a party or rendezvous (“laughs around each bend bouncing like vectors across the night”); and, in the MFA thesis, night is the time for prophetic animals to arrive (“That night a deer chirped not itself by the thing so small I could not see it that was on top of it near it or inside of it & and how long had it been there?”).

Text mining praxis: mining for evidence of course learning outcomes in student writing

I’ve been hearing more and more about building corpora of student writing of late, and while I haven’t actually consulted any of these, I was happy to have the opportunity to see what building a small corpus of student writing would be like in Voyant. I was particularly excited about using samples from ENGL 21007: Writing for Engineering which I taught at City College in Spring 2018, because I had a great time teaching that course and know the writing samples well.

Of the four essays written in ENGL 21007 I chose the first assignment, a memo, because it is all text (the subsequent assignments contain graphs, charts and images and I wasn’t sure how these would behave in Voyant). I downloaded the student essays from Blackboard as .docx and redacted them in Microsoft Word. This was a bad move because Microsoft Word 365 held on to the metadata, so student email accounts showed up when I uploaded my corpus to Voyant. I quickly removed my corpus from Voyant and googled “how do I remove the metadata,” then decided that it would be faster to convert all .docx to .pdf and redact them with Acrobat Pro (I got a one-week free trial), so I did this, zipped it up and voilà.
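For readers who hit the same metadata problem, here is a hedged Python sketch of a different route from the Acrobat one described above. It relies on the fact that a .docx file is a zip archive whose `docProps/core.xml` part usually holds the author and last-modified-by fields. The function name is hypothetical, and the sketch only blanks those two fields: names or emails can also survive in comments, tracked changes, or the body text, so the output would still need checking before upload.

```python
import re
import zipfile

# Fields in docProps/core.xml that commonly carry a person's name or account.
PERSONAL_FIELDS = ("dc:creator", "cp:lastModifiedBy")

def scrub_docx_metadata(src_path: str, dst_path: str) -> None:
    """Copy a .docx, blanking the author fields in its core-properties part."""
    with zipfile.ZipFile(src_path) as src, \
         zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == "docProps/core.xml":
                xml = data.decode("utf-8")
                for field in PERSONAL_FIELDS:
                    # Keep the opening and closing tags, drop the text between them.
                    xml = re.sub(
                        rf"(<{field}[^>]*>).*?(</{field}>)",
                        r"\1\2", xml, flags=re.DOTALL,
                    )
                data = xml.encode("utf-8")
            # Every other part of the archive is copied through unchanged.
            dst.writestr(item.filename, data)
```

A call such as `scrub_docx_metadata("memo_01.docx", "memo_01_clean.docx")` (file names invented here) would write a cleaned copy next to the original rather than overwriting it.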

22 Essays Written by Undergraduate Engineering Majors at City College of New York, Spring 2018

I love how Voyant automatically saves my corpus to the web. No registration, no logging in and out. There must be millions of corpora out there.

I was excited to see how the essays looked in Voyant and what I could do with them there. I decided to get the feeling of Voyant by first asking a simple question: what did students choose to write about? The assignment was to locate something on the City College campus, in one’s community or on one’s commute to college that could be improved with an engineering solution.

Cirrus view shows most frequently used words in 22 memos written by engineering majors in ENGL 21007

What strikes me as I look at the word cloud is that students’ concern with “time” (61 occurrences) was only slightly less marked than the reasonable – given that topics had to be related to the City College campus – concern with “students” (66 occurrences). I was interested to see that “escalators” (48 occurrences)  got more attention than “windows” (40 occurrences), but I think we all felt strongly about both. “Subway” (56 occurrences) and “MTA” (50 occurrences), which are the same thing, were a major concern. Uploading samples of student writing and seeing them magically visualized in a word cloud summarizes the topics we addressed in ENGL 21007 in a useful and powerful way.
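The counting behind a word cloud like this is simple enough to sketch in Python. The version below assumes a tiny hand-made stopword list (Voyant's own list is much longer) and uses two invented sentences in place of the actual memos, so the numbers it prints are illustrative only.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real one would be far longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "on", "are"}

def top_terms(documents: list[str], n: int = 5) -> list[tuple[str, int]]:
    """Count non-stopword tokens across all documents and return the top n."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z']+", doc.lower())
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

# Invented stand-ins for the student memos.
memos = [
    "The escalators on campus are broken and students lose time.",
    "Subway delays cost students time on the commute.",
]
print(top_terms(memos, n=2))
```

Note that a plain count like this treats “Subway” and “MTA” as unrelated words, which is exactly why the two show up as separate entries in the cloud.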

Secondly, and in a more pedagogical vein, I wanted to see how Voyant could be used to measure the achievement of course learning outcomes in a corpus of student writing. This turned out to be a way more difficult question than my first, simple “what did students write about?” The challenge lies in figuring out what query will tell me whether the eight English 21007 course learning outcomes listed on the CCNY First Year Writing Program website were achieved through the essay assignment that produced the 22 samples I put in Voyant, and whether evidence of having achieved or not achieved these outcomes can be mined from student essays with Voyant. Two of the course learning outcomes seemed more congenial to the Memo assignment than others. These are:

“Negotiate your own writing goals and audience expectations regarding conventions of genre, medium, and rhetorical situation.”

“Practice using various library resources, online databases, and the Internet to locate sources appropriate to your writing projects.”

To answer the question of whether students were successful in negotiating their writing goals would require knowing what their goals were. Not knowing this, I set this part of the question aside. Audience expectations was easier. In the assignment prompt I had told students that the memo had to be addressed to the department, office or institution that had the power to approve the implementation of proposed engineering solutions or the power to move engineering proposals on to the department, office or institution that could eventually approve these. There are, however, many differently named addressees in the student memos I put in this corpus. Furthermore, addressing the memo to an official body does not by itself achieve the course learning outcome. My question therefore becomes, what general conventions of genre, medium and rhetorical situation do all departments, offices or institutions expect to see in a memorandum, and how do I identify these in a query? What words or what combinations of words constitute memo-speak? To make things worse (or better :)!), I had told students that they could model their memos on the examples of memos I gave them or, if they preferred, they could model them differently so long as they were coherent and good. I therefore cannot rely on form to measure convention of genre. I’m sorry to say I have no answers to my questions as of yet; I’m still trying to figure out how to ask my corpus if students negotiated audience expectations regarding conventions of genre, medium and rhetorical situation (having said this, I think I can rule out medium, because I asked students to write the memo in Microsoft Word).

The second course learning outcome I selected – that students practice using library resources, online databases and the internet – strikes me as more quantifiable than the first.

Only one of 22 memos contains the words “Works Cited”

Unfortunately, I hadn’t required students do research for the memos I put in the corpus. When I looked for keywords that would indicate that students had done some research Voyant came up with one instance of “Bibliography,” one instance of “Works Cited” and no instances of “references” or “sources.” The second course learning outcome I selected is not as congenial to the memo assignment – or the memo assignment not congenial to that course learning outcome – as I first thought.
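The same kind of keyword check can be run per document in a few lines of Python, which makes it easy to see how many memos contain each marker of research. This is a sketch under assumptions: the file names and memo snippets are invented, and whole-word matching is used so that “references” does not match inside “preferences.”

```python
import re

def documents_containing(documents: dict[str, str], phrases: list[str]) -> dict[str, list[str]]:
    """Map each phrase to the names of the documents that contain it."""
    hits = {}
    for phrase in phrases:
        # \b anchors keep the match to whole words; case is ignored.
        pattern = re.compile(rf"\b{re.escape(phrase)}\b", re.IGNORECASE)
        hits[phrase] = [name for name, text in documents.items() if pattern.search(text)]
    return hits

# Invented stand-ins for two of the memos.
corpus = {
    "memo_01.txt": "To: Facilities. Subject: Escalators. Works Cited: one source.",
    "memo_02.txt": "To: MTA. Subject: Subway delays and rider preferences.",
}
for phrase, names in documents_containing(corpus, ["works cited", "bibliography", "references"]).items():
    print(f"{phrase!r}: {len(names)} of {len(corpus)} memos")
```

A count of documents, rather than a count of occurrences, fits the question better here, since one memo with a long reference list should not outweigh several memos with none.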

I tried asking Veliza the bot for help in figuring out whether course learning outcomes had been achieved (Veliza is the sister of Eliza, a psychotherapist, and says she isn’t too good at text analysis yet). Veliza answers questions with expressions of encouragement or more questions but she’s not much help. The “from text” button in the lower right corner of the Veliza tool is kind of fun because it fetches sentences from the text (according to what criteria I haven’t a clue) but conversation quickly gets surreal because Veliza doesn’t really engage.

In conclusion, I am unsure how to use text mining to measure course learning outcomes in student writing done in 200-level courses. I think that Voyant may work better for measuring course learning outcomes in courses with more of an emphasis on vocabulary and grammar, such as, for example, EAL. It’s a bit of a dilemma for me, because I think that the achievement of at least some course learning outcomes should be measurable in the writing students produce in a course.

Text Mining – The Rap Songs of the Syrian Revolution/War

The purpose of this text mining assignment is to understand the main recurrent themes, phrases and terms in the rap songs of the Syrian revolution/war (originally in Arabic) and their relation (if any) to the overall unfolding of the Syrian war events, battles and displacement. In what follows, I will highlight the main findings and limitations of the tool for this case study.

The rap songs can be found on The Creative Memory of the Syrian Revolution, an online platform that aims to archive Syrian artistic expression (plastic art, poetry, songs, calligraphy, etc.) in the age of the revolution and war. Interestingly, the website also incorporates digital tools (mapping) to map the location of demonstrations, battles, and the cities in which or for which songs were composed. It’s useful to mention that I’ve worked for the website/songs & music section since March 2016, and have thus translated most of these songs’ lyrics. Overall, the songs cover a variety of themes, elucidating the horror of war, evoking the angry echo of death, and expressing aspirations for freedom and peace.

To begin with, I went over the 390 songs archived to pick the translated lyrics of the 32 rap songs stretching from 2011 until this day (take for example, Tetlayt). 


I then entered the lyrics, from the most recent to the oldest, into Voyant. And here:

fig. 1

fig. 2


Unsurprisingly, the top 4 trends are: people, country, want, revolution (fig. 1 & 2).

The analysis shows that the word “like” comes fourth, although the word mostly appears in a song where the rapper repeats “like [something/someone]” for amplification (fig. 2 & 3).



fig. 3

Next, I looked into when or at what phase of the revolution/war some terms were most used. It was revealing to see the terms “want” and “leave” (fig. 4 & 5) were popular at the beginning of the revolution in 2011, the time when the leading slogan was “Leave, Leave, oh Bashar” and “the people want to bring down the regime.”

fig. 4

fig. 5

fig. 6

On another note, it doesn’t seem that Voyant can group the singulars and plurals of the same word (child/children in fig. 6). Or is there a way we can group several words together?
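Outside of Voyant, one simple way to group word forms is to map each form to a shared label before counting. The Python sketch below assumes a small hand-made mapping (a real project might use a lemmatizer instead), and the lyric line is invented for illustration.

```python
import re
from collections import Counter

# Hypothetical hand-made mapping from surface forms to one shared label.
GROUPS = {"child": "child", "children": "child", "kid": "child", "kids": "child"}

def grouped_counts(text: str, groups: dict[str, str]) -> Counter:
    """Count tokens after replacing any mapped form with its group label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(groups.get(tok, tok) for tok in tokens)

# Invented line standing in for translated lyrics.
lyrics = "The children want peace. Every child wants peace."
counts = grouped_counts(lyrics, GROUPS)
print(counts["child"])
```

The mapping does the real work here, so the results are only as good as the list of forms it covers. In this example, “want” and “wants” are still counted separately because they were left out of the mapping.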

So although the analysis gives a good insight into general trends, I would argue that song texts require a tool that is adaptable to the special characteristics of the genre. After all, music reformulates language in performance, and what may be revealed as a trend in text may very well not be the case through the experience of singing and listening. Beyond text, rap songs (any song really) are a play on paralinguistic features such as tones, rhythms, intonations, pauses; and musical ones, such as scales, tone systems, rhythmic temporal structures, and musical techniques–all of which, of course, a tool like Voyant cannot capture. I know there is speech recognition software that is widely used for transcription, but that’s not what I’m interested in. I’m thinking of tools that analyze speech as speech/sound. I’m curious to know what my colleagues who did speech analysis thought of this.


My process with the Praxis 1 Text Mining Assignment began with a seed that was planted during the self-Googling audits we did in the first weeks of class, because I found an obituary for a woman of my same name (sans middle name or initial).

From this, my thoughts went to the exquisite obituaries that were written by The New York Times after 9-11 which were published as a beautiful book titled Portraits. One of my dearest friends has a wonderful father who was engaged to a woman who perished on that most fateful of New York Tuesdays. My first Voyant text mining text, therefore, was of his fiancée’s NYT obituary. And the last text I mined for this project was the obituary for the great soprano Montserrat Caballé, when I heard the news of her passing as I was drafting this post.

The word REVEAL that appears above the Voyant text box is an understatement. When the words appeared as visuals, I felt like I was learning something about her and them as a couple that I would never have been able to grasp by just reading her obituary. Indeed, I had read it many times prior. Was it the revelation of some extraordinary kind of subtext? Is this what “close reading” is or should be? The experience hit me in an unexpected way between the eyes as I looked at the screen and in the gut.

My process then shifted immediately to song lyrics because, as a singer myself who moonlights as a voice teacher and vocal coach, I’m always reviewing, teaching and learning lyrics. I saw the potential value of using Voyant in this way in high relief. I got really juiced by the prospect of all the subtexts and feeling tones that would be revealed to actors/singers via Voyant. When I started entering lyrics, this was confirmed a thousandfold on the screen. So, completely unexpectedly, I now have an awesome new tool in my music skill set. The most amazing thing about this is that I will be participating in “Performing Knowledge,” an all-day theatrical offering at The Segal Center on Dec. 10, for which I submitted the following proposal that was accepted by the Theater Dept.:

“Muscle Memory: How the Body + Voice Em“body” Songs, Poems, Arias, Odes, Monologues & Chants – Learning vocal/spoken word content, performing it, and recording it with audio technology is an intensely physical/psychological/organic process that taps into and connects with a performer’s individually unique “muscle memory”, leading to the creation of vocal/sound art with the body + voice as the vehicle of such audio content. This proposed idea seeks to analyze “songs” as “maps” in the Digital Humanities context. Participants are highly encouraged to bring a song, poem, monologue, etc. with lyric/text sheet to “map out”. The take-away will be a “working map” that employs muscle memory toward learning, memorizing, auditioning, recording and performing any vocal/spoken word content. –Conceived, written and submitted by Carolyn A. McDonough, Monday, Sept. 17, 2018.” [I’m excited to add that during the first creative meeting toward this all-day production, I connected my proposed idea to readings of Donna Haraway and Katherine Hayles from ITP Core 1]

What better way to celebrate this, than to “voyant” song/lyric content and today’s “sad news day” obituary of a great operatic soprano. Rather than describe these Voyant Reveals through writing further, I was SO struck by the visuals generated on my screen that I wanted to show and share these as the findings of my research.

My first choice was “What I Did For Love” from A Chorus Line (on a sidenote, I’ve seen the actual legal pad that lyricist Edward Kleban wrote the score on at the NYPL Lincoln Center performing arts branch, and I thought I had a photo, but alas I do not as I really wanted to include it to show the evolution from handwritten word/text to Voyant text analysis.)

I was screaming as the results JUMPED out of the screen at me of the keyword “GONE” that is indeed the KEY to the emotional subtext an actor/singer needs to convey within this song in an audition or performance which I KNOW from having heard, studied, taught, and seen this song performed MANY times. And it’s only sung ONCE! How does Voyant achieve this super-wordle superpower?

I then chose “Nothing” also from A Chorus Line as both of these songs are sung by my favorite character, Diana Morales, aka Morales.

Can you hear the screams of discovery?!

Next was today’s obit for a great soprano which made me sad to hear on WQXR this morning because I once attended one of her rehearsals at Lincoln Center:

A complex REVEAL of a complex human being and vocal artist by profession.

AMAZING. Such visuals of texts, especially texts I know “by heart” are extremely powerful.

Lastly, over the long weekend, I’m going to “Voyant” this blog post itself, so that its layers of meaning can be revealed to me even further. –CAM

Text Mining Game Comments (Probably Too Many at Once!)

To tell the truth, I’ve been playing with Voyant a lot, trying to figure out what the most interesting thing is that I could do with it! Tenen could critique my analysis on the grounds that it’s definitely doing some things I don’t fully understand; Underwood would probably quibble with my construction of a corpus and my method of selecting words to consider.  Multiple authors could very reasonably take issue with the lack of political engagement in my choice. However, if the purpose here is to get my feet wet, I think it’s a good idea to start with a very familiar subject matter, and in my case, that means board games.

Risk Legacy was published in 2011. This game reimagined the classic Risk as a series of scenarios, played by the same group, in which players would make changes to the board between (or during!) scenarios. Several years later,* the popularity and prevalence of legacy-style, campaign-style, and scenario-based board games have skyrocketed. Two such games, Gloomhaven and Pandemic Legacy, are the top two games on BoardGameGeek as of this writing.

I was interested in learning more about the reception of this type of game in the board gaming community. The most obvious source for such information is BoardGameGeek (BGG).  I could have looked at detailed reviews, but since I preferred to look at reactions from a broader section of the community, I chose to look at the comments for each game.  BGG allows users to rate games and comment on them, and since all the games I had in mind were quite popular, there was ample data for each.  Additionally, BGG has an API that made extracting this data relatively easy.**
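For readers curious what pulling comments from an API like this might look like, here is a hedged Python sketch. The endpoint, the query parameters, and the shape of the XML are all assumptions based on BGG's XML API2 as commonly described, and access rules may have changed since this post was written, so check the current documentation before relying on any of it. The sample response and the two comments in it are invented, and the sketch parses that sample instead of making a network request.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

def comments_url(game_id: int, page: int = 1) -> str:
    """Build a comments request URL (assumed endpoint and parameters)."""
    query = urlencode({"id": game_id, "comments": 1, "page": page, "pagesize": 100})
    return f"https://boardgamegeek.com/xmlapi2/thing?{query}"

def extract_comments(xml_text: str) -> list[str]:
    """Pull comment text out of a response shaped like SAMPLE below."""
    root = ET.fromstring(xml_text)
    return [c.get("value") for c in root.iter("comment") if c.get("value")]

# Invented sample in the assumed response format; not real BGG data.
SAMPLE = """<items>
  <item type="boardgame" id="0">
    <comments page="1" totalitems="2">
      <comment username="player_a" rating="9" value="Great campaign, loved the story." />
      <comment username="player_b" rating="N/A" value="The rulebook is a mess." />
    </comments>
  </item>
</items>"""

print(extract_comments(SAMPLE))
```

Writing each game's extracted comments to its own text file would reproduce the one-file-per-game layout used for the Voyant upload.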

As I was only able to download the most recent 100 comments for each game, this is where I started.  I listed all the games of this style that I could think of, created a file for each set of comments, and loaded them into Voyant. Note that I personally have only played five of these nine games. The games in question are:

  • The 7th Continent, a cooperative exploration game
  • Charterstone, a worker-placement strategy game
  • Gloomhaven, a cooperative dungeon crawl
  • Star Wars: Imperial Assault, a game based on the second edition of the older dungeon crawl, Descent, but with a Star Wars theme. It’s cooperative, but with the equivalent of a dungeon master.
  • Near and Far, a strategy game with “adventures” which involve reading paragraphs from a book. This is a sequel to Above and Below, an earlier, simpler game by the same designer
  • Pandemic Legacy Season One, a legacy-style adaptation of the popular cooperative game, Pandemic
  • Pandemic Legacy Season Two, a sequel to Pandemic Legacy Season One
  • Risk Legacy, described above
  • Seafall, a competitive nautical-themed game with an exploration element

The 7th Continent is a slightly controversial inclusion to this list; I have it here because it is often discussed with the others. I excluded Descent because it isn’t often considered as part of this genealogy (although perhaps it should be). Both these decisions felt a little arbitrary; I can certainly understand why building a corpus is such an important and difficult part of the text-mining process!

These comments included 4,535 unique word forms, with the length of each document varying from 4,059 words (Risk Legacy) to 2,615 (7th Continent).  Voyant found the most frequent words across this corpus, but also the most distinctive words for each game. The most frequent words weren’t very interesting: game, play, games, like, campaign.*** Most of these words would probably be the most frequent for any set of game comments I loaded into Voyant! However, I noticed some interesting patterns among the distinctive words. These included:

Game Jargon referring to scenarios. That includes: “curse” for The 7th Continent (7 instances), “month” for Pandemic Legacy (15 instances), and “skirmish” for Imperial Assault (15 instances). “Prologue” was mentioned 8 times for Pandemic Legacy Season 2, in reference to the practice scenario included in the game.

References to related games or other editions. “Legacy” was mentioned 15 times for Charterstone, although it is not officially a legacy game. “Descent” was mentioned 15 times for Imperial Assault, which is based on Descent. “Below” was mentioned 19 times for Near and Far, which is a sequel to the game Above and Below. “Above” was also mentioned much more often for Near and Far than for other games; I’m not sure why it didn’t show up among the distinctive words.

References to game mechanics or game genres. Charterstone, a worker placement game, had 20 mentions of “worker” and 17 of “placement.” The word “worker” was also used 9 times for Near and Far, which also has a worker placement element; “threats” (another mechanic in the game) were mentioned 8 times. For Gloomhaven, a dungeon crawl, the word “dungeon” turned up 20 times.  Risk Legacy had four mentions of “packets” in which the new materials were kept. The comments about Seafall included 6 references to “vp” (victory points).  Near and Far and Charterstone also use victory points, but for some reason they were mentioned far less often in reference to those games.

The means by which the game was published. Kickstarter, a crowdfunding website, is very frequently used to publish board games these days. In this group, The 7th Continent, Gloomhaven, and Near and Far were all published via Kickstarter. Curiously, both the name “Kickstarter” and the abbreviation “KS” appeared with much higher frequency in the comments on the 7th Continent and Near and Far than in the comments for Gloomhaven. 7th Continent players were also much more likely to use the abbreviation than to type out the full word; I have no idea why this might be.

Thus, it appears that most of the words that stand out statistically (in this automated analysis) in the comments refer to facts about the game, rather than directly expressing an opinion. The exception to this rule was Seafall, which is by far the lowest-ranked of these games and which received some strongly negative reviews when it was first published. The distinctive words for Seafall included two very ominous ones: “willing” and “faq” (each used five times).

In any case, I suspected I could find more interesting information outside the selected terms. Here, again, Underwood worries me; if I select terms out of my own head, I risk biasing my results. However, I decided to risk it, because I wanted to see what aspects of the campaign game experience commenters found important or at least noteworthy. If I had more time to work on this, it would be a good idea to read through some reviews for good words describing various aspects of this style of game, or perhaps go back to a podcast where this was discussed, and see how the terms used there were (or weren’t) reflected in the comments. Without taking this step, I’m likely to miss things; for instance, the fact that the word “runaway” (as in, runaway leader) constitutes 0.0008 of the words used to describe Seafall, and is never used in the comments of any of the other games except Charterstone, where it appears at a much lower rate.**** As it is, however, I took the unscientific step of searching for the words that I thought seemed likely to matter. My results were interesting:

(Please note that, because of how I named the files, Pandemic Legacy Season Two is the first of the two Pandemics listed!)

It’s very striking to me how different each of these bars looks. Some characteristics are hugely important to some of the games but not at all mentioned in the others! “Story*” (including both story and storytelling) is mentioned unsurprisingly often when discussing Near and Far; one important part of that game involves reading story paragraphs from a book. It’s interesting, though, that story features so much more heavily in the first season of Pandemic Legacy than the second. Of course, the mere mention of a story doesn’t mean that the story of a game met with approval; most of the comments on Pandemic Legacy’s story are positive, while the comments on Charterstone’s are a bit more mixed.
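A comparison like this boils down to a relative frequency: how many of a document's words match the search term, divided by the document's total word count. The Python sketch below mimics a wildcard search such as “story*” with a simple prefix match. The two comment snippets are invented, and a prefix match is broader than it looks, since it would also catch words like “storyline.”

```python
import re

def relative_frequency(text: str, prefix: str) -> float:
    """Share of tokens in text that start with prefix (like searching story*)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    matches = sum(1 for tok in tokens if tok.startswith(prefix.lower()))
    return matches / len(tokens)

# Invented stand-ins for two games' consolidated comments.
comments = {
    "game_a": "The story is great and the storytelling shines",
    "game_b": "Solid worker placement but no plot at all",
}
for game, text in comments.items():
    print(game, round(relative_frequency(text, "story"), 3))
```

Dividing by document length matters because the comment files differ in size, so raw counts alone would favor the longer ones.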

Gloomhaven comments are much more about characters than any of the other terms I used; one of the distinguishing characteristics of this game is the way that characters change over time. Many of the comments also mentioned that the characters do not conform to common dungeon crawl tropes. However, the fact that characters are mentioned in every game except for two suggests that characters are important to players of campaign-style games.

I also experimented with some of the words that appeared in the word cloud, but since this post is already quite long, I won’t detail everything I noticed! It was interesting, for instance, to note how the use of words like “experience” and “campaign” varied strongly among these games.  (For instance: “experience” turned out to be a strongly positive word in this corpus, and applied mainly to Pandemic Legacy.)

In any case, I had several takeaways from this experience:

  • Selecting an appropriate corpus is difficult. Familiarity with the subject matter was helpful, but someone less familiar might have selected a less biased corpus.
  • The more games I included, the more difficult this analysis became!
  • My knowledge of the subject area allowed me to more easily interpret the prevalence of certain words, particularly those that constituted some kind of game jargon.
  • Words often have a particularly positive or negative connotation throughout a corpus, though they may not have that connotation outside that corpus. (For instance: rulebook. If a comment brings up the rulebook of a game, it is never to compliment it.)
  • Even a simple tool like this includes some math that isn’t totally transparent to me. I can appreciate the general concept of “distinctive words,” but I don’t know exactly how they are calculated. (I’m reading through the help files now to figure it out!)
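On the question of how “distinctive words” might be calculated: one common measure for this kind of thing is TF-IDF, which scores a word highly when it is frequent in one document and rare in the others. Whether Voyant uses exactly this formula is an assumption to check against its help files. The Python sketch below uses invented toy texts built from words mentioned in this post.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    return re.findall(r"[a-z']+", text.lower())

def distinctive_words(documents: dict[str, str], top_n: int = 3) -> dict[str, list[str]]:
    """Rank each document's words by TF-IDF: frequent here, rare elsewhere."""
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    n_docs = len(documents)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for c in counts.values() for word in c)
    result = {}
    for name, c in counts.items():
        total = sum(c.values())
        scores = {
            word: (freq / total) * math.log(n_docs / doc_freq[word])
            for word, freq in c.items()
        }
        # Highest score first; ties broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        result[name] = ranked[:top_n]
    return result

# Invented toy texts, not the real comment files.
corpus = {
    "charterstone": "worker placement game worker placement fun",
    "gloomhaven": "dungeon crawl game dungeon characters fun",
    "seafall": "game faq willing faq fun",
}
print(distinctive_words(corpus, top_n=2))
```

Words that appear in every document, such as “game” and “fun” here, score zero, which matches the observation that the most frequent words across the corpus were not the interesting ones.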

I also consolidated all the comments on each game into a single file, which was very convenient for this analysis, but prevented me from distinguishing among the commenters.  This could be important if, for example, all five instances of a word were by the same author.

*Note that there was a lag of several years due to the immense amount of playtesting and design work required for this type of game.

**Thanks to Olivia Ildefonso who helped me with this during Digital Fellows’ office hours!

***Note that “like” and “game” are both ambiguous terms. “Like” is used both to express approval and to compare one game to another. “Game” could refer to the overall game or to one session of it (e.g. “I didn’t enjoy my first game of this, but later I came to like it.”).

****To be fair, it is unlikely anyone would complain of a runaway leader in 7th Continent, Gloomhaven, Imperial Assault, or either of the Pandemics, as they are all cooperative games.