Tag Archives: text analysis

New York Times: Sentiment Analysis and Selling You Stuff

Something related to textual analysis:

The New York Times is researching how to match the ads they show to the feelings an article is likely to inspire. I’m not a fan. They claim that learning ads perform better on emotional articles won’t influence the newsroom, but we’ll see. At least they’re being transparent about doing this work: they’ve published an article describing how they developed their sentiment analysis algorithm (link below).

There’s an explanation of the types of models they used and why. The initial steps were linear and tree-based textual analysis models, followed by a deep learning phase intended to “focus on language patterns that signaled emotions, not topics.” This outperformed the linear models some of the time, but not all of the time.

From what I can tell, the training set used a survey showing articles with images to establish a baseline, but the linear predictive models focus purely on text. I may be misunderstanding this or information may be missing. I expect that image selection can enhance or diminish the emotionality of an article. Perhaps sensational or graphic images would prove to drive more (or fewer) ad clicks. Despite the buffer the NYT cites between their newsroom and marketing arms, this feels like morally hazardous territory. So to answer the question in the title of the NYT piece, this article makes me feel disturbed. But I still didn’t click an ad.

It’s a quick read. Check it out.


Text-Mining Praxis: Poetry Portfolios Over Time

For this praxis assignment I assembled a corpus of three documents, each produced over a comparable three-year period:

  1. The poetry I wrote before my first poetry workshop (2004-07);
  2. The final portfolios for each of my undergraduate poetry workshops (2007-10); and
  3. My MFA thesis (2010-13).

A few preliminary takeaways:

I used to be more prolific, though much less discriminating. Before I took my first college poetry workshop, I had already written over 20,500 words, equivalent to a 180-page PDF. During undergrad, that number halved, dropping to about 10,300 words, or an 80-page PDF. My MFA thesis topped out at 6,700 words in a 68-page PDF. I have no way of quantifying “hours spent writing” during these three intervals, but anecdotally that time at least doubled at each new stage. This double movement toward more writing (time) and away from more writing (stuff) suggests a growing commitment to revision, as well as a more discriminating eye for what “makes it into” the final manuscript.

Undergrad taught me to compress; grad school, to expand. In terms of words per sentence (wps), my pre-workshop poetry was coming in at about 26wps. My poetry instructor in college herself wrote densely packed lyric verse, so it’s not surprising to see my own undergraduate poems tightening up to 20wps as images came to the forefront and exposition fell by the wayside. We were also writing in and out of a number of poetic forms–sonnet, villanelle, pantoum, terza rima–which likely further compressed the sentences making up these poems. When I brought to my first graduate workshop one of these sonnet-ish things that went halfway down the page and halfway across it, I was immediately told the next poem needed to fill the page, with lines twice as long and twice as many of them. In my second year, I took a semester-long hybrid seminar/workshop on the long poem, which positioned poetry as a time art and held up more poetic modes of thinking such as digression, association, and meandering as models for reading and producing this kind of poem. I obviously internalized this advice: by the time I submitted my MFA thesis, my sentences were nearly twice as long as they’d ever been before, sprawling out to a feverish and ecstatic 47wps.
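The words-per-sentence figures above can be approximated with a short script. This is a rough sketch, not the actual method used: splitting on end punctuation is a blunt heuristic (it over-counts around abbreviations, and lineated poems without punctuation would need different handling).

```python
import re

def words_per_sentence(text: str) -> float:
    """Rough average words per sentence: split on ., !, or ? and count words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return len(text.split()) / len(sentences)

# Toy example (not from the actual corpus):
sample = "The deer has seen the face of God. It is radiant."
print(words_per_sentence(sample))  # 11 words / 2 sentences = 5.5
```

Running this over each of the three documents would reproduce the 26 / 20 / 47 wps comparison.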

Things suddenly stopped “being like” other things. Across the full corpus, “like” turns out to be my most commonly used word, appearing 223 times. Curiously, only 13 of these are in my MFA thesis, 4 of which appear together in a single stanza of one poem. Which isn’t to say the figurative language stopped, but that it became more coded: things just started “being” (rather than “being like”) other things. For example:

Tiny errors in the Latin Vulgate
have grown horns from the head of Moses.

It is radiant. The deer has seen the face of God

spent a summer living in his house sleeping on his floor.

This one I like. But earlier figurative language was, at best, the worst, always either heavy-handed or confused–and often both. In my pre-MFA days, these were things that were allowed to be “like” other things:

  • “loose leaves sprinkled like finely chopped snow” (chopped snow?)
  • “lips that pull back like wrapping paper around her teeth” (what? no.)
  • “lights of a distant airplane flickering like fireflies on a heavy playhouse curtain” (ugh.)
  • “tossing my wrapper along the road like fast silver ash out a casual window” (double ugh.)
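The per-document “like” counts reported above can be reproduced outside Voyant with a few lines of Python. A minimal sketch, using toy stand-in texts rather than the real documents:

```python
import re

def count_term(texts: dict, term: str) -> dict:
    """Count case-insensitive whole-word occurrences of `term` per document."""
    pattern = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    return {name: len(pattern.findall(text)) for name, text in texts.items()}

corpus = {  # toy stand-ins for the three portfolio documents
    "pre-workshop": "loose leaves sprinkled like finely chopped snow",
    "undergrad": "lips that pull back like wrapping paper",
    "mfa-thesis": "It is radiant.",
}
print(count_term(corpus, "like"))
```

The `\b` word boundaries matter here: without them, “likely” and “unlike” would inflate the count.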

Other stray observations. I was still writing love poems in college, but individual names no longer appeared (Voyant shows that most of the “distinctive words” in the pre-workshop documents were names or initials of ex-girlfriends). “Love” appears only twice in the later poems.

Black, white, and red are among the top 15 terms used across the corpus, and their usage is remarkably similar from document to document (black is ominous; white is ecstatic or otherworldly; red calls attention to something out of place). The “Left-Term-Right” feature in Voyant is really tremendous in this regard.

And night-time conjures different figures over time: in the pre-workshop poems, people walk around alone at night (“I stand exposed / naked as my hand / beneath the night’s skylight moon”); in the college workshop poems, people come together at night for a party or rendezvous (“laughs around each bend bouncing like vectors across the night”); and, in the MFA thesis, night is the time for prophetic animals to arrive (“That night a deer chirped not itself by the thing so small I could not see it that was on top of it near it or inside of it & how long had it been there?”).

Kavanaugh and Ford: A Mindful Exercise in Text-Mining

I’ve been excited to use Voyant to do some textual analysis. I wanted to choose a text that would engage with structural political issues and draw attention to inequalities within our societal structures. I’m particularly interested in discussions surrounding systems of power and privilege in modern America, which is why I’ve chosen to do a text analysis comparing and contrasting Brett Kavanaugh’s opening statement for the Senate Judiciary Committee with Dr. Christine Blasey Ford’s opening statement.

Before going any further, I would like to issue a trigger warning for topics of sexual violence and assault. I recognize that these past few weeks have been difficult and overwhelming for some (myself included), and I would like to be transparent as I move forward with my analysis on topics that may come up.

My previous research has centered on topics of feminist theory and rape culture, so the recent events regarding Supreme Court Justice Nominee Judge Brett Kavanaugh have been particularly significant to my past work and personal interests.

To provide some very brief context on current events, Kavanaugh was nominated by President Donald Trump on July 9, 2018 to replace retiring Associate Supreme Court Justice Anthony Kennedy. When it became known that Kavanaugh would most likely become the nominee, Dr. Christine Blasey Ford came forward with allegations that Kavanaugh had sexually assaulted her in the 1980s while she was in high school. Two other women subsequently came forward with allegations of sexual assault against Kavanaugh as well. To read a full timeline of the events that occurred (and continue to occur) surrounding Kavanaugh’s appointment, I suggest checking out the current New York Times politics section or Buzzfeed’s news tab regarding the Brett Kavanaugh vote.

The Senate Judiciary Committee invited both Kavanaugh and Ford to provide testimony about the allegation, and both agreed to testify. Each gave a prepared opening statement (initial testimony) on September 24, 2018 and then took questions from the committee. For this project, I am only comparing each opening statement–not the questions asked and answers given after the statements were provided. Much could be learned from a fuller, more thorough analysis including both the statements and the questions/responses; however, for the scope of this assignment I am only briefly looking at the two individuals’ opening statements.

This research is primarily exploratory: I have no concrete hypothesis about what I will find. Rather, I am interested in engaging with each text to see whether any conclusions can be drawn from the language. Specifically, does either text have implications regarding structurally oppressive systems of patriarchy and rape culture? Can the language of each speech tell us something about the ways sexual assault accusations are handled in the United States, through the ways an accuser and the accused present themselves via issued statements? While this is only one example, I am curious to see what kinds of questions can be raised from the text.

To begin, I googled both “Kavanaugh Opening Statement” and “Ford Opening Statement” to obtain the text for the analysis.

Here is the link I utilized to access Brett Kavanaugh’s opening statement.

Here is the link I utilized to access Dr. Christine Blasey Ford’s opening statement.

Next, I utilized Voyant, the open source web-based application for performing text analysis.

Here are my findings from Voyant:

Kavanaugh’s Opening Statement:


Ford’s Opening Statement:

Comparison (instead of directly copying and pasting the text in as I had done separately above, I simply input both links to the text into Voyant)

There are several questions that can be raised from this data. In fact, an entire essay could be written comparing and contrasting the texts and examining how they compare with other cases like this one; however, for the scope of this short post I will only pull together a few key points I noted, centering on how the texts potentially relate to the structural oppression of women in the United States.

First, I thought it was interesting that Kavanaugh’s opening statement was significantly longer (5,266 total words) than Ford’s (2,510 total words). Within a patriarchal society, women are traditionally taught (both directly and indirectly) to take up less space (physically and metaphorically), so I wondered if this could be relevant to the difference in length. Does the internalized oppression of sexism contribute to the length of women’s responses to sexual violence–i.e., do women who experience sexual violence take up less space in their statements (potentially without even noticing or directly trying to) than the accused (in this case men)? Perhaps a larger study comparing statements from accusers and the accused (specifically when the accuser is female and the accused is male) could provide more insight.

Another observation from comparing and contrasting the texts was the most used words within each. Specifically, one of the most used words in Kavanaugh’s (the accused’s) speech was “women,” which I found interesting. Do other people (specifically men) who are accused of sexual violence often use the word “women” in their statements? Is this repetition meant to somehow prove that an individual would not harm women (even while being accused of just that)? It brings to mind an aspect of rape culture often seen in discussions of sexual violence: the justification that one could ultimately not commit crimes of sexual violence because one is a “good man” who has many healthy relationships (friendships or romantic) with women. There is no evidence that a man with some positive relationships with women is less likely to commit sexual assault; there is, however, data showing that people are more likely to be sexually assaulted by someone they know (RAINN). I would be curious to look into this further by using tools like Voyant to consider the most used words in other statements from people accused of sexual violence.

Ultimately, this was a brief and interesting exercise in investigation and exploration. I think there could be many different interesting and important research opportunities using tools like Voyant to look at statements provided by sexual violence survivors and those accused of sexual violence. This was just a starting point, and by no means the necessary and extensive research that must be done on this topic; rather, it is a beginning for further questions to be asked and analyzed. I’m eager to dive into more in-depth research on these topics in the future, possibly using Voyant or other text-mining web-based applications.

From allegation to cloture: text mining US Senators’ formal statements on Kavanaugh

# overview

For this project I examined Senators’ formal public statements on the Kavanaugh nomination in the wake of Dr. Christine Blasey Ford’s allegation that he attempted to rape her as a teenager. I edited this out initially, but am including it now: this project is an attempt to do something productive with how sick I feel at how hostile American culture remains toward women, our sexuality, and our safety.


## process

I built my corpus for analysis by visiting every single one of the 99* official (and incredibly banal) US Senator websites and searching the term “Kavanaugh” using the search function on each site. I reviewed the first 20 search results** on each website and harvested the first result(s) (up to three) which met my criteria: direct, formal press-release statements about Kavanaugh issued on or after September 15, 2018, up until the time of my data collection, which took place from 5pm to 10pm EST on October 5th, 2018. Some Senators had few or no formal statements in that period. I did not include any speeches, video, news articles or shows, or op-eds–only formal statements, including officially issued press-release comments. For instances in which statements included quoted text and text outside of the quote area, I included only the quote area.

I have publicly posted all of my data and results.

My working list of Senators and their official websites is from an XML file I downloaded from the United States Senate website.

I opened the XML file in Excel and removed information not relevant to my text mining project, such as each Senate member’s office address. I kept each member’s last name, first name, state represented, party affiliation, and official webpage URL. This is my master list, posted to Google Sheets here.
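The same trimming done in Excel could be scripted. A minimal sketch with Python’s `xml.etree` is below; note the element names here are hypothetical stand-ins, since the real senate.gov XML schema may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure; the actual senate.gov XML schema may differ.
xml_data = """
<contact_information>
  <member>
    <last_name>Feinstein</last_name>
    <first_name>Dianne</first_name>
    <state>CA</state>
    <party>D</party>
    <website>https://www.feinstein.senate.gov</website>
  </member>
</contact_information>
"""

def members(xml_text):
    """Keep only the fields relevant to the text mining project."""
    root = ET.fromstring(xml_text)
    return [
        {field: m.findtext(field) for field in
         ("last_name", "first_name", "state", "party", "website")}
        for m in root.iter("member")
    ]

print(members(xml_data))
```

Fields like office address would simply be omitted from the tuple of kept fields.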

I created a second sheet for the statements. It contains the Senators’ last name along with the date, title and content of each statement. I searched for quote marks and removed most or all of them. This statement content data is available in a Google Sheet here.

I joined the two sheets in Tableau (an outer join, to accommodate future work I may do with this), and used Tableau’s filtering capabilities to produce plain text files separating out the Democrat statements, Republican statements, and Independent statements, along with a fourth file consolidating all statements. The plan was to perform topic modeling on each and compare.
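For readers without Tableau, the same join-then-filter step can be sketched in plain Python. This is an illustrative reconstruction, not the actual workflow; the field names are assumptions.

```python
def outer_join(members, statements, key="last_name"):
    """Outer join: every member row appears, paired with any matching statements."""
    by_member = {}
    for s in statements:
        by_member.setdefault(s[key], []).append(s)
    joined = []
    for m in members:
        for s in by_member.get(m[key], [None]):  # None keeps members with no statements
            joined.append({**m, "statement": s["content"] if s else None})
    return joined

def by_party(joined, party):
    """Consolidate one party's statements into a single plain-text blob."""
    return "\n\n".join(r["statement"] for r in joined
                       if r["party"] == party and r["statement"])

members = [{"last_name": "Booker", "party": "D"},
           {"last_name": "Graham", "party": "R"}]
statements = [{"last_name": "Booker", "content": "Statement text."}]
print(by_party(outer_join(members, statements), "D"))
```

The outer join is what preserves senators who issued no statements at all, which turns out to matter for the findings below about silent senators.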


### in the mangle

Mallet wasn’t too hard to install following these instructions. I input my consolidated Democrat, Republican, and Independent statements and had it output a joined mallet file with stopwords removed. Then I ran the train-topics command, and here I really don’t know what I was doing other than closely following the instructions. It worked? It made the three files it was supposed to make–two text files and a compressed .gz file. I have no idea what to do with any of them. Honestly, this is over my head, and the explanations on the Mallet site presuppose more familiarity than I have with topic modeling. Here is a link to the inputs I fed Mallet and what it gave back to me.
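For reference, the two Mallet commands from instructions like these look roughly as follows. File names, the input directory, and the topic count are illustrative, not the actual values used here.

```shell
# Import a directory of plain-text statements into one .mallet file,
# removing Mallet's default English stopwords.
bin/mallet import-dir --input statements/ --output statements.mallet \
  --keep-sequence --remove-stopwords

# Train a topic model. The three outputs are the compressed state file
# plus two text files: the top words per topic, and the topic mix per document.
bin/mallet train-topics --input statements.mallet --num-topics 10 \
  --output-state topic-state.gz \
  --output-topic-keys statements_keys.txt \
  --output-doc-topics statements_composition.txt
```

The `--output-topic-keys` file is usually the place to start reading: each line is a topic number, a weight, and its most probable words.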


#### discussion

At this point I’m frustrated with Mallet and my ignorance thereof (and, in the spirit of showing obstacles along the way, I’m cranky from operating without full use of my right arm which was injured a few days ago). I’d like to know more about topic modeling, but I’d like the learning process to be at least somewhat guided by an actual in-person person who knows what they’re doing. The readings this week are not adequate as sole preparation or context for trying to execute topic modeling or text mining, and my supplemental research didn’t make a significant difference.

I like my topic and corpus. Something I found interesting while collecting my data is that not all Senators issued formal press-release statements on Kavanaugh during the period I examined. I was surprised by some who didn’t. Kamala Harris, Elizabeth Warren and Kirsten Gillibrand issued no formal statements referencing Kavanaugh between September 15th and the date of writing (October 5th), whereas Lindsey Graham issued four. This is not to say the former Senators were silent on the topic, just that they did not choose to issue formal statements. Somewhat alarmingly, searching for “Kavanaugh” on Chuck Schumer’s site returned no results at all. Thinking this was in error, I manually reviewed his press-release section going back to September 15th. Indeed, though Schumer issued many press releases during that period, Kavanaugh was not mentioned a single time in the title of any.

And here’s where I need collaborators, perhaps a political scientist and/or public relations expert who could contextualize the role that formal statements play in politics and why different Senators make different choices about issuing them.

There were other interesting findings as well. The search functions on the websites I visited were all over the map. Many had terrible indexing, returning the same result over and over in the list. Cory Booker’s website returned 2,080 results for “Kavanaugh”; Dianne Feinstein’s site returned 6. The majority of Senators did engage with the Kavanaugh nomination through the vehicle of formal statements. Only ten Senators’ websites either lacked a search function entirely or returned zero results for Kavanaugh.

I will likely run the data I gathered through Voyant or perform a different analysis tomorrow. If so, I will update this post accordingly.


##### update 10/7

I wonder if I should be feeding Mallet the statements individually, rather than in consolidated text files grouped by party affiliation. I also realized I wanted the statements as individual files, rather than as cells in a CSV, so that I could feed them into Voyant and see the comparisons between statements. I don’t know how to write macros in Excel, but this seemed like a great application for a Python script. I’ve been trying to learn Python, so I decided to write a script that would import a CSV and export parts of the individual records as individual text files.

I wrote some Python code and got it working (with an assist from Reddit when an extraneous variable was tripping me up, and suggestions from Patrick Smyth on how I could improve a future iteration). I’ve posted the individual statements in a shared folder here. The file-naming convention is as follows: filenames start with “D”, “R”, or “I” to indicate the senator’s party (Democrat/Republican/Independent), followed by the Senator’s surname and a number that keeps multiple statements from the same senator from overwriting each other.
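A script along these lines might look like the sketch below. This is a hedged reconstruction, not the actual code: the CSV column names (`party`, `last_name`, `content`) are assumptions.

```python
import csv
import os

def split_statements(csv_path, out_dir):
    """Write each statement row to its own text file, named
    <party initial><Surname><n>.txt, e.g. DBooker1.txt, DBooker2.txt,
    so multiple statements from one senator don't overwrite each other.
    (Column names here are assumptions about the CSV layout.)"""
    os.makedirs(out_dir, exist_ok=True)
    counts = {}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            key = row["party"] + row["last_name"]
            counts[key] = counts.get(key, 0) + 1  # per-senator counter
            name = f"{key}{counts[key]}.txt"
            with open(os.path.join(out_dir, name), "w", encoding="utf-8") as out:
                out.write(row["content"])
```

The per-senator counter dictionary is the piece that implements the numbering in the file-naming convention described above.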

I plan to try analyzing these individual statements tomorrow.


###### update 10/8

I took the statements I broke out in Python and ran them through Voyant. I ran the 56 statements from Democrats separately from the 42 statements from Republicans. I did not analyze the 4 statements from Independents, 3 of which were from Bernie Sanders.

Voyant seems to be a bit buggy. I added “Kavanaugh” and “judge” to Voyant’s default stopword list, as “Judge Kavanaugh” appeared in every single result, but it took a couple of tries and ultimately only worked on the Cirrus tool; Voyant refused to acknowledge my stopword list on the other tools. I’d also attempted to suppress “Kavanaugh’s”, but Voyant kept showing it, including on the Cirrus tool, despite my adding it to the stopwords list. “Fire” is on the default stoplist, and I think it shouldn’t be. Voyant also would not honor font changes, though there was a dropdown menu to do so.
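When a hosted tool won’t honor a stopword list, the same filtering can be done locally before (or instead of) uploading. A minimal sketch, where the stopword set stands in for Voyant’s default list plus the additions described above:

```python
from collections import Counter
import re

# Stand-in for Voyant's default stoplist plus my additions.
STOPWORDS = {"the", "a", "and", "kavanaugh", "kavanaugh's", "judge"}

def top_terms(text, stopwords, n=10):
    """Most frequent words after lowercasing and stopword removal."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in stopwords).most_common(n)

sample = "Judge Kavanaugh said the committee heard Judge Kavanaugh's statement."
print(top_terms(sample, STOPWORDS))
```

Because the filtering happens in your own code, there is no ambiguity about whether the possessive form “kavanaugh’s” was suppressed.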

Both groups showed great variability in length. Democrats’ statements ranged from 24 to 612 words. Republicans’ statements ranged from 48 to 887 words.

The Collocates tool was interesting but mysterious. There was a little slidey bar at the bottom that changed the results, but there were no labels or other support to interpret why that was happening or what was being measured. I made sure to keep both my Democrat and Republican analyses at “5” so at least I had consistency. I searched for more information on the tool in the documentation, but the Collocates tool isn’t even listed.
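My guess (an assumption, since the tool is undocumented) is that the slider controls the context window: how many words on either side of a term count as co-occurring. A sketch of that idea:

```python
from collections import Counter

def collocates(text, target, window=5):
    """Count words appearing within `window` positions of each occurrence
    of `target` (a simplified guess at what Voyant's Collocates measures)."""
    words = text.lower().split()
    hits = Counter()
    for i, w in enumerate(words):
        if w == target:
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            hits.update(words[lo:hi])
            hits[target] -= 1  # don't count the target word itself
    return hits

sample = "dr ford said she heard the credible account dr ford provided"
print(collocates(sample, "ford", window=3).most_common(3))
```

Widening the window pulls in more distant words, which would explain why moving the slider changes the results.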

Republicans often linked Dr. Ford’s name with verbs such as heard, said, appear, provide, and named. Democrats used more descriptors, such as credible, courage, and bravely.

Collocator graphs from Voyant Tools

It was fun watching the Mandalas tool build, showing relationships between the documents in the corpus and the top 10 terms used. The Democrat mandala (shown first) built off the words “court”, “ford”, “dr”, “senate”, “investigation”, “supreme”, “fbi”, “allegations”, “sexual”, and “assault”. The Republican mandala (shown second) built off its top 10 words: “dr”, “committee”, “senate”, “process”, “court”, “ford”, “fbi”, “supreme”, “evidence”, and “judiciary”. The Democrats’ statements called attention to the specific nature of the allegations, while the Republicans’ statements focused on the legal process.

Voyant Tools text analysis mandala visualization

Voyant tools text analysis mandala visualization

Another fun but under-documented tool is called the StreamGraph. This seems more about visual interest than effectively communicating information, as the areas of the different segments are quite hard to compare. Again, the Democrats’ statements visualization is shown first, followed by the Republican. The Democrats highlight “investigation”, whereas the Republicans highlight “process.”

Voyant text mining Stream graph

Voyant text analysis Stream graph

####### text mining tools review

In closing, here are my reviews of the text mining tools I used.

Voyant: buggy, unreliable, good fun but about as rigorous as a party game
Mallet: a machine may have learned something, but I didn’t



*Jon Kyl, John McCain’s replacement, does not yet have an official Senate website of his own. A quick Google search revealed no official press release statements in the first 20 results.

**Bob Corker, Cindy Hyde-Smith, and John Kennedy did not have a search function on their sites. The search function on Rand Paul’s site was not functioning. Each has a news or media section of their site, and that is where I looked for press releases. Chuck Schumer and Tina Smith’s sites’ search functions returned zero results for “Kavanaugh”. I reviewed titles of all press releases on their sites since September 15th and found no reference to Kavanaugh.


My process with the Praxis 1 Text Mining Assignment began with a seed planted during the self-Googling audits we did in the first weeks of class, when I found an obituary for a woman with my same name (sans middle name or initial).

From this, my thoughts went to the exquisite obituaries The New York Times wrote after 9-11, which were published as a beautiful book titled Portraits. One of my dearest friends has a wonderful father who was engaged to a woman who perished that most fateful of New York Tuesdays. My first Voyant text mining text, therefore, was his fiancée’s NYT obituary. And the last text I mined for this project was the obituary of the great soprano Montserrat Caballé, whose passing I heard news of as I was drafting this post.

The word REVEAL that appears above the Voyant text box is an understatement. When the words appeared as visuals, I felt like I was learning something about her, and about them as a couple, that I would never have been able to grasp by just reading her obituary–and indeed, I had read it many times prior. Was it the revelation of some extraordinary kind of subtext? Is this what “close reading” is or should be? The experience hit me between the eyes and in the gut in a way I did not expect.

My process then shifted immediately to song lyrics because, as a singer myself who moonlights as a voice teacher and vocal coach, I’m always reviewing, teaching and learning lyrics. I saw in high relief the potential value of using Voyant this way. I got really juiced by the prospect of all the subtexts and feeling tones that would be revealed to actors/singers via Voyant. When I started entering lyrics, this was confirmed a thousandfold on the screen. So, completely unexpectedly, I now have an awesome new tool in my music skill set. The most amazing thing is that I will be participating in “Performing Knowledge,” an all-day theatrical offering at The Segal Center on Dec. 10, for which I submitted the following proposal that was accepted by the Theater Dept.:

“Muscle Memory: How the Body + Voice Em”body” Songs, Poems, Arias, Odes, Monologues & Chants — Learning vocal/spoken word content, performing it, and recording it with audio technology is an intensely physical/psychological/organic process that taps into and connects with a performer’s individually unique “muscle memory”, leading to the creation of vocal/sound art with the body + voice as the vehicle of such audio content. This proposed idea seeks to analyze “songs” as “maps” in the Digital Humanities context. Participants are highly encouraged to bring a song, poem, monologue, etc. with lyric/text sheet to “map out”. The take-away will be a “working map” that employs muscle memory toward learning, memorizing, auditioning, recording and performing any vocal/spoken word content. –Conceived, written and submitted by Carolyn A. McDonough, Monday, Sept. 17, 2018.” [I’m excited to add that during the first creative meeting toward this all-day production, I connected my proposed idea to readings of Donna Haraway and Katherine Hayles from ITP Core 1.]

What better way to celebrate this than to “voyant” song/lyric content and today’s “sad news day” obituary of a great operatic soprano. Rather than describe these Voyant Reveals in further writing, I was SO struck by the visuals generated on my screen that I wanted to show and share them as the findings of my research.

My first choice was “What I Did For Love” from A Chorus Line. (On a sidenote, I’ve seen the actual legal pad on which lyricist Edward Kleban wrote the lyrics, at the NYPL Lincoln Center performing arts branch. I thought I had a photo–I really wanted to include it to show the evolution from handwritten word/text to Voyant text analysis–but alas I do not.)

I was screaming as the results JUMPED out of the screen at me: the keyword “GONE” is indeed the KEY to the emotional subtext an actor/singer needs to convey within this song in an audition or performance, which I KNOW from having heard, studied, taught, and seen this song performed MANY times. And it’s only sung ONCE! How does Voyant achieve this super-wordle superpower?

I then chose “Nothing,” also from A Chorus Line, as both of these songs are sung by my favorite character, Diana Morales, aka Morales.

Can you hear the screams of discovery?!

Next was today’s obit for a great soprano which made me sad to hear on WQXR this morning because I once attended one of her rehearsals at Lincoln Center:

A complex REVEAL of a complex human being and vocal artist by profession.

AMAZING. Such visuals of texts, especially texts I know “by heart” are extremely powerful.

Lastly, over the long weekend, I’m going to “Voyant” this blog post itself, so that its layers of meaning can be revealed to me even further. –CAM