Monthly Archives: October 2018

Speclab and Drucker: Theoretical and Practical Design and Computation

Patrick Grady O’Malley

Like the littérateur concerned with their prose, a Digital Humanist seeks to express their humanistic interests with the digital tool-kit provided by modernity and the laws of technology. How does one use HTML appropriately to express one’s thoughts and vision? Which markup language is most appropriate for the task at hand? Not only must the language be spot on when creating a digital work (language here meaning that of a literary project), but so must its code. Understanding expansive (and growing) digital languages to put one’s dream on a screen is the plight we all face as emerging Digital Humanists.

In order to successfully render a quality project, consider the rules of design that dictate the visual arts. Aesthetically pleasing work is mandatory not just in fine art or graphic design but in our world too. Drucker’s observation that “features such as sidebars, hot links, menus, and tabs have become so rapidly conventionalized that their character as representations has become invisible. Under scrutiny, the structural hierarchy of information coded into buttons, bars, windows, and other elements of the interface reveals the rhetoric of display [9]” reiterates the importance of design choice, but it also brings to light the notion that certain design elements may easily be overlooked by the user as a commodity to be expected. In other words, our hard work in choosing how to visualize our project may barely be noticed, at least by those who aren’t really looking. Nonetheless, these tedious decisions must be made for the relevance of the project and the objects it represents.

Users expect good-looking interfaces that are founded in functionality. When coupled with text to be explored, I could see how it would be easy to overlook the functionality of one versus the other (design/text), but I suppose that is the nature of collaboration amongst experts who bring different skill sets to the project. Only then can something worthwhile and exceptional be achieved. The coding of both the design and the text is a skill in and of itself, furthering the idea that “Humanists are skilled at complexity and ambiguity. Computers, as is well known, are not. [7]” A computer will only do what you tell it to, so artistic and intellectual integrity remains with us, and for as much as people say that computers make people lazy, I’d say we all have good evidence that such is not the case in any sense of the word.

With regard to all of these considerations, the author clearly takes the stance that attention to detail is of the essence. When discussing how to chunk or tag texts in XML, the author states that “Such decisions might seem trivial, hairsplitting, but not if attention to material features of a text is considered important. [13]” In other words, while it may be tempting to leave certain elements alone, the finished project only suffers for it, and worthy reputations are diminished. This is certainly not the path I hope to travel, even though in my daily life I’m frequently looking for nice ways to cut a corner. But we do what we do for the expansion of scholarship, “art for art’s sake!” so to speak.

The modeling and structuring of a project is the true core of what is being visualized. “It is all an expression of form that embodies a generalized idea of the knowledge it is presenting. [16]” Without a thorough intellectual plan that takes into account the many considerations of design (“visualization, psychology and physiology of vision, and cultural history of visual epistemology [21]”) and computation (statistics, coding, logic theory), the end result is not thorough: “the metatext is only as good as the model of knowledge it encodes. [15]” I have heard of TEI and been exposed to it “under the roof” in a minimal sense, but I know little of the dictates of the “organization setting standards for knowledge representation [14]” in a broader sense, or really in any way that I could work with it at this point in my early career. But I am aware there are rules of functionality that must be interpreted for appropriate text layout. As I broaden my skill set in text analysis, I’m sure the process will become more and more intuitive; however, I’d be lying if I said it wasn’t a bit intimidating right now.

Are the objects we are creating tangible in nature? Or do they only stem from tangible products (books, paintings, song lyrics)? Is there value in discerning between the two? Is the output we create secondary to the primary source it comes from? Or do our projects take on a new life of their own? “A discourse field is indeterminate, neither random, chaotic, nor fixed, but probabilistic. It is also social, historical, rooted in real and traceable material artifacts. [28]” As Digital Humanists, without criteria of standards that dictate the work we do, or an underlying philosophy of a project, what are we left with? There is little point in even bothering to make anything if you can’t summarize what its purpose is, intellectually, from the outset.

Every object has its place in history, and I believe it is our job to bring that historicity into modernity in order to illuminate the changing nature of the humanities over centuries. “We are far less concerned with making devices to do things – sort, organize, list, order, number, compare – than with creating ways to expose any form of expression (book, work, text, image, scholarly debate, bibliographical research, description, or paraphrase) as an act of interpretation (and any interpretive act as a subjective deformance). [26]” In other words, we are learning to read between the blurry lines of theory and practicality, and to create work that harbors the two amongst a host of scholarly concerns and quandaries.

Where Not to Screw Around

In “The Hermeneutics of Screwing Around,” Ramsay is very interested in the concept of serendipity – the ways that readers unexpectedly come across just the right information at just the right time.  Of course, one of the things you learn in library school is that serendipity is an illusion created by the infrastructure of the library, as Ramsay acknowledges. Serendipity comes from the fact that someone has put all this information in order so that readers can find it.

Ramsay distinguishes between two ways of encountering information: purposeful searching through the use of indexes and the like, and browsing, which he characterizes rather delightfully as “wander[ing] around in a state of insouciant boredom.” He argues that because the internet is not organized in a human-readable way, search engines are very good at searching but poor at browsing, which means that it’s more difficult to find that thing you didn’t know you wanted.

I’m with Ramsay up to this point.  One major problem with electronic resources is that they don’t support browsing well. There have, at least among libraries, been some attempts to replicate some of the tools that make browsing work offline, but they don’t work as well. OneSearch has a feature that lets you look at books nearby on the shelf, and I’d love to know whether anyone in this class has ever used this, because I suspect not! Part of the problem is that the process still has to begin with a specific search; part of the problem is that, unlike when you are browsing the physical shelves, you don’t have access to the full text of the book.

What really brought me up short when I was reading this, though, was the very casual way he tosses out the notion that algorithms can fix this problem.

A few weeks ago, Data & Society published a report, “Alternative Influence: Broadcasting the Reactionary Right on YouTube.” It’s definitely worth your time to read the whole thing if you can, but it examines the network of white supremacist/white nationalist/alt-right microcelebrities on YouTube. Because many of these individuals have strong relationships with their audiences and are able to “sell” them on their ideologies, they are considered “influencers.” The report maps out the relationships among them based on their appearances on each other’s shows. It reveals connections between the extremists and those who are or describe themselves as more moderate libertarians or conservatives.

The process by which these YouTubers’ audience members become radicalized is a browsing process; the description of how an audience member moves from one video channel to another isn’t, mechanically, that different from Ramsay’s description of moving from Zappa to Varèse, as different as it obviously is in subject matter. He happens to come across one artist, who then points him to another.

For example, influencers often encourage audience members to reject the mainstream media in favor of their content, thus priming their audiences for a destabilized worldview and a rejection of popular narratives around current events. Then, when libertarian and conservative influencers invite white nationalists onto their channels, they expose their audiences to alternative frameworks for understanding the world. Thus, audiences may quickly move from following influencers who criticize feminism to those promoting white nationalism. (35)

Algorithms facilitate this process by recommending similar videos, but the report points out that this is really a social problem; even if the recommendations didn’t exist, these YouTubers would still be promoting each other and radicalizing their viewers. Certainly we could argue that the larger problem is not the mechanism of discovery but just the fact that there’s so much of this kind of content out there. After all, older kinds of browsing can turn up similar content, and let’s not pretend that the Library of Congress headings don’t have a ton of baked-in bias.

All the same, there are clearly some problems with the rather blithe suggestion that the problems of browsing online can be solved by algorithms when this kind of content is so common on the internet, and especially when these algorithms are largely written by companies that are making money on this kind of content. I’m guessing we’ll talk about this further when we get to Algorithms of Oppression…

On the Sound Workshop

In this week’s reading, Lauren Klein teased an important critique of the distant reading method and its tools by asking: “what might be hidden in [the corpus we are analyzing]? And are there methods we might use to bring out significant texts, or clusters of words, that the eye cannot see?” Today’s workshop on sound recording and analysis engaged with that critique by presenting the different ways in which we can analyze and present data through sound. “Sound” as a category that relates to sensual perception has long been overlooked by humanistic inquiry and literature, which privilege visual and textual data and analysis. “Sonification” therefore refers to methods of displaying data through sound. The workshop was led by digital fellow Kelsey Chatlosh, who is a PhD candidate in cultural anthropology and an active voice in the Sound Studies and Methods Working Group here at the GC.

Instead of summarizing the entire workshop, I will just share the topics and tools that engaged directly with the approach raised above. I will also share the link to the presentation: https://docs.google.com/presentation/d/11dEYJrUepH_75WKgQkDlhwPkp4dCGCegCpCMqvFTCEE/edit#slide=id.p3

First, it’s important that we think about what falls under the category of sound/audio that we want to record, analyze or share. Here we are speaking of music, remixes, oral histories, radio, podcasts, sound art, soundscapes, sound installation/sound walk, sound map, sound therapy, etc.

Second, various scholarly works in sound studies and anthropology have theorized sound. The most significant contributions in the field are those of anthropologist Steven Feld and composer R. Murray Schafer. Feld develops the notion of acoustemology, which regards sound “as a modality of knowing and being in the world” (2000). Schafer introduced the concept of the soundscape to address growing concerns about acoustic ecology and created The World Soundscape Project. Both contributions were widely received and expanded on within their disciplines. On that note, I wonder how much and in what ways other disciplines in the humanities may be open to these scholarly approaches.

Third, we were introduced to:

John Barber’s sound resources archive (he also teaches at the DH Summer Institute)

If you want to share your oral archive or create a website to share it, you can check out the Oral History Metadata Synchronizer

Software for (qualitative) coding and analyzing:
Atlas.ti, the software widely used by anthropologists (it’s not open source, but it’s available on the GC PCs)
Trint, for transcribing
Jacket, open source software for analyzing performative speech

Google Speech-to-Text, non-open source, for transcribing; it requires some knowledge of Python to operate (see the sketch after this list)

High Performance Sound Technologies for Access and Scholarship, an open source audio toolkit for analyzing and preparing audio for machine learning

And finally, my favorite: ethnopoetic transcription, which addresses an issue I wrestled with when doing the praxis assignment, namely the challenge of representing semantically loaded spoken words with words on paper.

  • Sound art projects: works by artist Zach Poff, and by Markus Kison, whose “Touched Echo” WWII memorial installations transmit the sounds of cities devastated in 1945 using bone conduction technology
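Since Google Speech-to-Text is the one tool above that needs programming, here is a minimal sketch of what driving it from Python can look like, assuming the google-cloud-speech client library and a configured Google Cloud account (the bucket URI is a placeholder, and the exact API surface has changed across library versions):

    from google.cloud import speech

    client = speech.SpeechClient()
    audio = speech.RecognitionAudio(uri="gs://my-bucket/interview.wav")  # hypothetical recording
    config = speech.RecognitionConfig(
        encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
        sample_rate_hertz=16000,
        language_code="en-US",
    )

    # Synchronous recognition suits short clips; longer recordings
    # would use client.long_running_recognize instead.
    response = client.recognize(config=config, audio=audio)
    for result in response.results:
        print(result.alternatives[0].transcript)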

Overall, the workshop served as a great introduction to sound/audio tools and DH projects that deal with sound. It still seems to me, however, that most software/tools are designed to render sound into text that can be analyzed. The reason for that may be related to the nature of sound itself. After all, sound is ephemeral. Even when captured in a recording, it can never be grasped in its entirety. What I think this reveals (about us people working in academe) is perhaps an intellectual anxiety at the challenges raised by immediate sensual experience and what it engenders at the moment of its unfolding, which I believe is what our actual work (writing/text) cannot rightly encompass.

Finally, I hope you find my summary useful!

Grad School in Wonderland

This assignment, as much as this semester, has felt somewhat like an Alice in Wonderland experience. Weightlessly bumping up and down in slow motion, I am still slightly confused while gazing into this new universe with wide-eyed awe.

As Alice stands bewildered in the forest deciding upon the right path to take, so have I been confused about where to start, and exploring text mining has been yet another trip to Wonderland.


Text

For the project I chose to use a scene from “Antony and Cleopatra” by William Shakespeare. The play revolves around the romance between a Roman general and the Queen of Egypt, set against the politics of territorial gain and political power. I used scene xv of the fourth act, a short but significant scene where Antony dies, whereupon Cleopatra decides to beat the Romans to it and “make death proud to take her.”

Purpose

The text was sampled from Project Gutenberg and was chosen based on a desire to get a sense of how text mining works and how useful it could be in the context of analyzing a play for research purposes. For an actor studying for a part, text mining does not seem to be of much use, as actors try to connect with the feeling of words rather than their frequency, but for scholars researching Shakespeare it is a tool that could prove practical.

Tools 

Initially the text was analyzed in Voyant with ambiguous results. The most frequent words came up as Antony, Cleopatra, Charmian, come, and women. The program, however, wasn’t able to distinguish character names from other frequently used words such as “come.” The word “Antony” appears 17 times, but no further categorization is made differentiating how the word is used. When manually searching through the text, it appears that seven of those occurrences indicate a line spoken, seven refer to his name being uttered, and three refer to the name being used in stage directions, demonstrating that Voyant does not differentiate between character name, spoken name, and name in stage directions. Used as a tool for detailed information, such as how many times a character speaks, Voyant can therefore produce misleading results. It can, however, produce an overview of what a scene is about by singling out the most common words describing the actions that take place, as shown below, where the words death, dying, and dead pretty much sum it up.
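As a rough illustration of that manual count, a few regular expressions can separate the three uses, assuming a Project Gutenberg plain-text file in which speech prefixes appear as the speaker’s name in capitals at the start of a line (the filename and formatting details are assumptions; Gutenberg editions vary):

    import re

    # Hypothetical file holding Act IV, scene xv
    text = open("antony_and_cleopatra_4_15.txt", encoding="utf-8").read()

    # Speech prefixes: the name in capitals opening a line, e.g. "ANTONY."
    speech_prefixes = len(re.findall(r"^\s*ANTONY\.", text, flags=re.MULTILINE))

    # Stage directions: lines with Enter/Exit/Exeunt markers naming Antony
    stage_directions = len(
        re.findall(r"^\s*\[?(?:Enter|Exit|Exeunt|Re-enter).*Antony", text, flags=re.MULTILINE)
    )

    # Every occurrence of the name, however used
    total = len(re.findall(r"\bAntony\b", text, flags=re.IGNORECASE))

    # What remains should be the name spoken in dialogue
    spoken_mentions = total - speech_prefixes - stage_directions
    print(speech_prefixes, stage_directions, spoken_mentions)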

In order to obtain a more nuanced filtering of words and character names, I tried different text mining tools without success. The tools tested, such as “Bookworm” and especially “Orange,” seemed promising, but I was not able to operate them properly and thus could not utilize their potential.

Challenges

As for challenges, I have come across many. Coming from a non-tech background means starting from scratch, trying to figure out the basics, covering everything from understanding concepts to converting files into different formats. I may have gone into this new field slightly naïve, thinking that technology has become so advanced, yet simple and intuitive, that computer programs can be navigated with ease. So far I can say that technology holds the potential for some adventurous trips; there’s just a lot of planning and packing to get done before one can set off.

Text Analysis with Voyant

I initially thought that I’d compare the inaugural speeches of Presidents Obama and Trump (which I did), and then thought I’d look at the first and second inaugural speeches of Presidents Bush (43) and Obama. While I found these comparisons somewhat interesting, I didn’t think that they were as enlightening as I’d hoped. I was actually surprised by the way that Trump’s speech appeared, in that it did not portray the somewhat bleak picture of the nation that I thought his speech conveyed.

(Somehow I was not able to display the slides here, so I added the links…)

https://voyant-tools.org/?corpus=063848fbc1f9d4e357b0bace3c0ea0f4

I thought that comparing speeches of presidents was an obvious choice, so then I decided to compare the opening statements of Brett Kavanaugh and Dr. Blasey Ford during the recent Senate hearing. Here again, while interesting, I was somewhat disappointed by what the software displayed.

Kavanaugh’s Opening Statement:

https://voyant-tools.org/?corpus=ee640979b2271bc5b58bcb570983468d

If one were to rely on the links and word trends screens, the picture is accurate in that he was speaking of his high school years, and the frequent words paint the picture of an adolescent’s focus on friends, parties, beer, and boys and girls. Sounds innocent enough, right? But what did not get conveyed was his adamant denial of the allegations, the outright partisan statements that he made, and the overall impression he left on many who questioned his judicial temperament and his ability to be impartial and balanced.

Dr Ford’s opening statement:

https://voyant-tools.org/?corpus=cace51eabd9b1f94a2ebaa28c083ec30

The word links and word trends window for Dr. Ford’s statement appear to be a more accurate representation of her testimony.  One could infer that it is more specific and focused on the event.  But again, what is missing here is the general impression that the speaker imparts to the world, which is conveyed in tone of voice, cadence, emphasis, and overall demeanor.

I also want to temper my impressions here: this is the first time that I’m using this tool, and I could be missing something, not parsing the data correctly, deleting certain words in a way that skews the results, or overlooking any number of techniques that would make the results more meaningful. Which is to say that with any tool, understanding HOW to use it correctly, and WHEN, is critical.

I then looked at some tutorials on YouTube to learn more. The question that I find interesting is: what are the best uses of this tool for providing data and insights that are not evident from reading specific texts? I stumbled upon a presentation given by Stéfan Sinclair (McGill), one of the developers of Voyant. There was one slide that I found very interesting:

https://www.youtube.com/watch?v=fYmngzBtrLI

At the 13:00 mark: this was a comparison of the texts from advertising for toys, for boys and for girls. These two images are striking in that they really say it all, and no other explanation is needed to illustrate the gendered stereotypes that are still reinforced by advertising to children.

If I were to make a conclusion, perhaps this tool is more instructive when looking at a body of work, as opposed to evaluating a single text.  Then patterns that cannot easily be gleaned from reading individual documents or transcripts might be teased out of the analysis that this type of a tool could provide.

Text Mining Praxis: Remember the Past? Or Find Lost Time?

I start with a provocation from Ted Underwood in answer to his own question of whether it is necessary to have a large collection of texts to do text mining (well, distant reading, really):

This is up for debate. But I tend to think the answer is “yes.”

Not because bigger is better, or because “distant reading” is the new hotness. It’s still true that a single passage, perceptively interpreted, may tell us more than a thousand volumes.

But if you want to interpret a single passage, you fortunately already have a wrinkled protein sponge that will do a better job than any computer. Quantitative analysis starts to make things easier only when we start working on a scale where it’s impossible for a human reader to hold everything in memory. Your mileage may vary, but I’d say, more than ten books? (Underwood, “Where to start with text mining”)

Naturally, my text mining investigation transgresses this theory and, ultimately, doesn’t disprove it.

At first, I wanted to compare translations of Rimbaud’s Illuminations, the first by Wallace Fowlie and the second by John Ashbery.  I wasn’t able to easily get the latter in digital form, so I looked at a set of corpora I had “lying around” and chose to do the same with translations of Proust I found in my digital desk drawer.

The canonical translation of A La Recherche du Temps Perdu was, for many years, the C.K. Scott Moncrieff one.  More recently, Terence Kilmartin reworked the Moncrieff translation. And then D.J. Enright reworked Moncrieff in light of the Kilmartin translation.  In the process, the translated title “Remembrance of Things Past” became “In Search of Lost Time”.

In attempting to avoid lost time myself, I chose to render just the first volume, Swann’s Way (in both translations).

Clearly this is not ten texts, but I am very interested in whether empirical investigation into different translations of an individual work can uncover interesting information about translation in general, along with many other questions about translation that I certainly don’t address here. Here is just a start.

Methodology:

I started with an HTML file of Enright’s translation of Swann’s Way, copied it into a plain text file, and started to clean it manually. This involved removing a bunch of text from the beginning and end. I did the same for the Moncrieff translation, except that I had an epub file, which I converted to plain text using an online conversion tool. The Moncrieff file came with the “Place-Names: The Name” section separated from the end, which was better left off anyway.

I made the assumption that this process wasn’t introducing any error that wasn’t already in the texts as I got them.  If I were doing this for an actual study, I would not make that assumption.

My first thought was to try to reproduce the methodology of a Stanford group that had compared translations of Rilke’s The Notebooks of Malte Laurids Brigge. This would have required me to use two tools: Python’s NLTK to “tokenize” the texts into sentences, and then Stanford’s Lexical Parser to analyze them syntactically. As I sat down to do this over the weekend, away from home in the cold north with family and under conditions of intermittent and slow internet, I realized the time investment was beyond the scope of a weekend. Syntactic analysis is thus a huge gap in my efforts, and it was really what I was interested in initially.
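For what it’s worth, the sentence-tokenization step on its own is short. A minimal sketch with NLTK’s Punkt tokenizer (the filename stands in for my cleaned plain-text file); the resulting sentences would then be handed to the Stanford parser for the syntactic analysis I had to skip:

    import nltk
    nltk.download("punkt", quiet=True)
    from nltk.tokenize import sent_tokenize

    with open("swanns_way_enright.txt", encoding="utf-8") as f:  # hypothetical cleaned file
        text = f.read()

    sentences = sent_tokenize(text)
    print(len(sentences), "sentences")
    print(sentences[0][:80])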

Coming back home, I decided to see what I could do with Voyant.  So I used the upload function from the Voyant-tools landing page and selected both texts (separate plain text files), then clicked the circus-like “Reveal” button.

There were a lot of proper names and a few other terms (such as “swann”, “odette”, “mme”, etc.) that I didn’t think were useful, so I used the options button in the Cirrus tool to add “stop words” to filter them out and then applied that filter globally.

I used the Summary for a lot of my statistics because it was the clearest spot to see big picture comparisons between the texts.  I also explored every tool available, switching them into and out of display panes. For my purposes, the most useful were Document Terms, Trends, Summary, and Terms.  I think I could have used Contexts more if I were asking the right questions. The Reader did not help much at all because the corpora were listed one after the other rather than side by side.

I also consulted WordNet to come up with words that surrounded the concept of remembering and memory. I did not see a way to do this, or to invoke outside help, from within Voyant, so I just went to the source.

The Reveal:

One thing that jumped out is the total word count comparison. Moncrieff uses 196981 words to Enright’s 189615, a difference of 7366 words, or about 3.9% more words in Moncrieff than in Enright. At the same time, Enright’s vocabulary density is consistently higher: 0.4% higher over the course of the novel and 0.7% higher in the Overture itself. This suggests that Enright’s translation is more economical (dare I say “tighter”?). Of course, the semantic richness of the two texts and how closely they “capture” the original are not questions these stats answer.

Doing some simple math by dividing the total number of words by the average sentence length (why doesn’t Voyant just give the actual number of sentences?) shows that Moncrieff has about 5184 sentences to Enright’s 5016. Given that Enright’s average sentence is shorter, he has clearly not increased the length of his sentences to “fit in” the extra clauses/phrases that Moncrieff uses. Does this mean he is leaving things out? Or is it further evidence of Enright’s greater efficiency? If one is translating sentence for sentence from the original, how does one end up with such a discrepancy in sentence count (a difference of 168) without combining them? Unfortunately, I don’t immediately see a way of answering this question from a distance.
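Spelled out, the arithmetic looks like this (the words-per-sentence averages are approximations back-derived from the counts quoted above, standing in for what Voyant’s Summary pane reports, so treat the estimates as rough):

    words_moncrieff, words_enright = 196981, 189615
    avg_moncrieff, avg_enright = 38.0, 37.8  # approximate words per sentence

    sents_moncrieff = round(words_moncrieff / avg_moncrieff)  # ~5184
    sents_enright = round(words_enright / avg_enright)        # ~5016
    print(sents_moncrieff - sents_enright)                    # ~168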

Comparing the two texts for word frequency among the top words didn’t show much difference. However, the differences are magnified by the relative word counts of the corpora. For example, take “time,” a crucial word for the work. Looking for it in the Document Terms tool shows that it occurs 429 times in Enright to 420 in Moncrieff, but dividing by the approximate sentence counts shows “time” occurring at a rate of about 0.086 per sentence in Enright, whereas Moncrieff’s per-sentence frequency of “time” is about 0.081. Projecting Enright’s frequency onto the length of Moncrieff’s text, we would get about 446 instances of “time” in the later translation, which is 1.06 times as many as in the earlier translation.
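The per-sentence rates and the projection, again as a sketch using the figures quoted above (the projection scales Enright’s count by the ratio of total word counts):

    time_enright, time_moncrieff = 429, 420

    rate_enright = time_enright / 5016      # ~0.086 per sentence
    rate_moncrieff = time_moncrieff / 5184  # ~0.081 per sentence

    # Scale Enright's count to the length of Moncrieff's text
    projected = time_enright * (196981 / 189615)
    print(round(projected))                      # ~446
    print(round(projected / time_moncrieff, 2))  # ~1.06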

Let’s look then at the ways “time” can be used. There is some ambiguity to the word “time,” which can be fun but obfuscating. Given that ambiguity, I wondered if perhaps the additional uses of it indicated something less precise in Enright. So I looked at the text itself in a few spots.

I didn’t think I was going far afield to grab “time” and drill down with the word.  Not only does it have a lot of vectors of meaning and ways to be used, but it is central to the theme of the whole work (i.e. Remembrance of Things Past, In Search of Lost Time).  Seeing how it is invoked by the two translators could tell us a lot about how they see “time” operating in the work as a whole.

Comparing the first paragraphs of the two texts:

Enright:

For a long time I would go to bed early. Sometimes, the candle barely out, my eyes closed so quickly that I did not have time to tell myself: “I’m falling asleep.” And half an hour later the thought that it was time to look for sleep would awaken me; I would make as if to put away the book which I imagined was still in my hands, and to blow out the light; I had gone on thinking, while I was asleep, about what I had just been reading, but these thoughts had taken a rather peculiar turn; it seemed to me that I myself was the immediate subject of my book: a church, a quartet, the rivalry between François I and Charles V. This impression would persist for some moments after I awoke; it did not offend my reason, but lay like scales upon my eyes and prevented them from registering the fact that the candle was no longer burning. Then it would begin to seem unintelligible, as the thoughts of a previous existence must be after reincarnation; the subject of my book would separate itself from me, leaving me free to apply myself to it or not; and at the same time my sight would return and I would be astonished to find myself in a state of darkness, pleasant and restful enough for my eyes, but even more, perhaps, for my mind, to which it appeared incomprehensible, without a cause, something dark indeed.
I would ask myself what time it could be;

Moncrieff:

FOR A LONG time I used to go to bed early. Sometimes, when I had put out my candle, my eyes would close so quickly that I had not even time to say ‘I’m going to sleep.’ And half an hour later the thought that it was time to go to sleep would awaken me; I would try to put away the book which, I imagined, was still in my hands, and to blow out the light; I had been thinking all the time, while I was asleep, of what I had just been reading, but my thoughts had run into a channel of their own, until I myself seemed actually to have become the subject of my book: a church, a quartet, the rivalry between François I and Charles V. This impression would persist for some moments after I was awake; it did not disturb my mind, but it lay like scales upon my eyes and prevented them from registering the fact that the candle was no longer burning. Then it would begin to seem unintelligible, as the thoughts of a former existence must be to a reincarnate spirit; the subject of my book would separate itself from me, leaving me free to choose whether I would form part of it or no; and at the same time my sight would return and I would be astonished to find myself in a state of darkness, pleasant and restful enough for the eyes, and even more, perhaps, for my mind, to which it appeared incomprehensible, without a cause, a matter dark indeed. I would ask myself what o’clock it could be;

  • Time 1: “For a long time” — a temporal period with some vagueness
  • Time 2: “I did not have time” — time as a resource in which to accomplish a goal
  • Time 3: “the thought that it was time to go to/look for sleep” — a discrete moment suggesting a demand
  • Time 4: “I had been thinking all the time, while I was asleep” — only in Moncrieff, vague use of a temporal period, perhaps rendered better in Enright’s interpretation “I had gone on thinking, while I was asleep,” which feels more accurate to the process described.
  • Note: I’m leaving out “sometimes”

Nothing here indicates a lack of precision on Enright’s part. Indeed, his paragraph reads better: it is more modern and succinct.

Over and over, as I looked at statistics, contexts, and collocations, I found that being able to look at the texts side by side was crucial if I wanted to investigate something or test a theory. I didn’t feel that Voyant was showing me what I wanted to see so much as leading me to things that I might want to look at more closely. I’m sure that my close reading bias is an influence here, but Underwood’s provocation continues to loom.

Riffing on this, I thought perhaps I could look at the way the two translators approach the concept of memory, perhaps the most important theme of the entire work. This time (pun intended), I was working from a concept rather than a particular word. So, in lieu of having a way to look at a concept in Voyant directly, I pulled a set of terms from WordNet around the idea of “remembering,” such as: recall, remember, revisit, retrieve, recollect, think of, think back, etc. I then added “memory” to the mix and fed the set of terms into the Trends tool. I noticed that the search field would dynamically take my search term and return the option of, say, “remember” and “remember*”; “remember*” seemed to look for words with “remember” as a root. Phew. So I didn’t have to include past forms like “remembered.”
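The same term set can be pulled programmatically rather than by browsing WordNet’s site; a minimal sketch with NLTK’s WordNet interface, assuming the wordnet corpus is downloaded:

    import nltk
    nltk.download("wordnet", quiet=True)
    from nltk.corpus import wordnet as wn

    terms = {"memory"}
    for synset in wn.synsets("remember", pos=wn.VERB):
        for lemma in synset.lemmas():
            terms.add(lemma.name().replace("_", " "))

    print(sorted(terms))  # e.g. recall, recollect, retrieve, think back, ...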

I’ll include the graph, but the upshot is that the distribution of these words was very close in the two translations, as was the total use of the concepts. Again, I had to do some extra math to compare these; I didn’t see a way in Voyant to group them under a user-defined concept and count from there.

This is already getting very long, so I won’t show the side by side comparisons that “remember” led me to. It is clear, however, that some of the most interesting differences between the translations come at the layer of sentence construction: crucial sentences have significantly different syntactic constructions. By going to the texts themselves to confirm, I could get a sense of what the initial word count statistics led me to surmise, the economy of Enright’s translation over Moncrieff’s, and I was struck by what I would consider “improvements,” both stylistic and semantic.

Conclusions:

First off, a disclaimer: I need to read the stats books Patrick sent out to be better able to analyze the data I did get and assess its significance. To do a study of translations, I should be better versed in translation theory, and literary translation theory in particular.

Second, I’m not sure comparing translations of a single work is a good application for distant reading; as Underwood suggests, the human brain is better at comparing smaller data sets.  And there aren’t many data sets including translations of larger works that would generate enough volume to make sense. On the other hand, taking a larger view of the trends in translation over a longer span of time, with multiple texts, could be very interesting.  Further, I wonder what I could have unearthed, at a more distant view, if I had been able to drill down to syntactic analysis, sentence construction, etc.

The focus of this assignment was to get started with this sort of tool, to play around with it.  As a result of my “play” and the constraints of what I chose to play with, I found myself jumping back and forth between distant reading and close reading, finding more in the latter that seemed meaningful to me.  But I was struck by what some basic stats could tell me — giving me some areas to further investigate in close reading — such as the implications of the disparity in number of sentences and word count.

Now to Voyant. Once I had “cleaned” my texts, it was incredibly simple to get up and running in Voyant. Figuring out how to compare different texts in the same upload took some digging, however. I would suggest uploading files, because then Voyant knows you have separate corpora. The information is there, mostly, but sometimes hidden; for example, the Summary tool compares word count, vocabulary density, and average sentence length, but then combines the corpora to determine the most frequently used words. It was only in the Document Terms tool that I could see the word counts separated by corpus, and only for words I selected myself, either by clicking on them somewhere else or entering them in the search field. I would like to see a spread of the most commonly used words by corpus.

The Terms tool is fantastic for drilling down into a term’s collocations and correlations. Again, I wish it had separated the stats by corpus, or given us a way to do that without invoking other tools. The Topics tool confused me. I’m not sure what it was doing ultimately; I could tell it was clustering words, but I couldn’t tell on what principle. I will need to look at this more.

So, Voyant was very easy to get up and going, had a short learning curve for exploring the significance of the tools provided, and can really get at some excellent data on the macro level. It becomes harder to use it to customize data views to see exactly what you want on a more micro level. I don’t fault Voyant for that; its scope is different. As a learning tool, I am very grateful to have given it a shot. I might use it in the future in conjunction with, say, NLTK, to streamline my textual investigations.

Kavanaugh and Ford: A Mindful Exercise in Text-Mining

I’ve been fairly excited to utilize Voyant to do some textual analysis. I wanted to choose texts to analyze that would engage with structural political issues and draw attention to inequalities within our societal structures, as I’m particularly interested in engaging in discussions surrounding systems of power and privilege in modern America. This is why I’ve chosen to do a text analysis comparing and contrasting Brett Kavanaugh’s opening statement for the Senate Judiciary Committee with Dr. Christine Blasey Ford’s opening statement.

Before going any further, I would like to issue a trigger warning for topics of sexual violence and assault. I recognize that these past few weeks have been difficult and overwhelming for some (myself included), and I would like to be transparent as I move forward with my analysis on topics that may come up.

My previous research has centered on topics of feminist theory and rape culture, so the recent events regarding Supreme Court nominee Judge Brett Kavanaugh have been particularly significant to my past work and personal interests.

To provide some very brief context on current events: Kavanaugh was nominated by President Donald Trump on July 9, 2018 to replace retiring Associate Supreme Court Justice Anthony Kennedy. When it became known that Kavanaugh would most likely become the nominee, Dr. Christine Blasey Ford came forward with allegations that Kavanaugh had sexually assaulted her in the 1980s while she was in high school. In addition to Dr. Ford’s allegations, two other women came forward with allegations of sexual assault against Kavanaugh as well. To read a full timeline of the events that occurred (and continue to occur) surrounding Kavanaugh’s appointment, I suggest checking out the current New York Times politics section or BuzzFeed’s news tab regarding the Brett Kavanaugh vote.

The Senate Judiciary Committee invited both Kavanaugh and Ford to provide testimony about the allegation, and both agreed. Each gave a prepared opening statement (initial testimony) on September 27, 2018 and then took questions from the committee. For this project, I am only comparing the opening statements, not the questions asked and answers given after the statements were provided. Much could be learned from a fuller and more thorough analysis including both the statements and the questions and responses; however, for the scope of this assignment I am only very briefly looking at both individuals’ opening statements.

This research is primarily exploratory, in that I have no concrete hypothesis about what I will find. Rather, I am interested in engaging with each text to see if there are any conclusions that can be drawn from the language. Specifically, do either of the texts have implications regarding structurally oppressive systems of patriarchy and rape culture? Can the language of each speech tell us something about the ways in which sexual assault accusations are handled in the United States through the ways an accuser and the accused present themselves via issued statements? While this is only one example, I would be curious to see what type of questions can be raised from the text.

To begin, I googled both “Kavanaugh Opening Statement” and “Ford Opening Statement” to obtain the text for the analysis.

Here is the link I utilized to access Brett Kavanaugh’s opening statement.

Here is the link I utilized to access Dr. Christine Blasey Ford’s opening statement.

Next, I utilized Voyant, the open source web-based application for performing text analysis.

Here are my findings from Voyant

Kavanaugh’s Opening Statement:


Ford’s Opening Statement:

Comparison (instead of directly copying and pasting the text in as I had done separately above, I simply input both links to the text into Voyant)

There are several questions that can be raised from this data. In fact, an entire essay could be written on a variety of discussions and arguments that compare and contrast the texts and further look at how they compare with other cases like this one; however, for the scope of this short post I will only pull together a few key points that I noted, centering on how the texts potentially relate to the structural oppression of women in the United States.

First, I thought it was interesting that Kavanaugh’s opening statement was significantly longer (5,266 total words) than Ford’s (2,510 total words). Within a patriarchal society, women are traditionally taught (both directly and indirectly) to take up less space (physically and metaphorically), so I wondered if this could be relevant when considering the fact that Ford’s opening statement was significantly shorter than Kavanaugh’s. Does the internalized oppression of sexism in female-identified individuals contribute to the length of women’s responses to sexual violence, i.e., do women who experience sexual violence take up less space (potentially without even noticing or directly trying to) in their statements than the accused (in this case men)? Perhaps a larger body of research comparing accusers’ and accused individuals’ statements (specifically when the accuser is female and the accused is male) could provide more insight on this.

Additionally, another observation I had while comparing and contrasting the texts was the most-used words within each. Specifically, one of the most used words in Kavanaugh’s (the accused’s) speech was “women,” which I found interesting. Do other people (specifically men) who are accused of sexual violence often use the word “women” in statements regarding sexual violence? Is this repetitive use of the word meant to somehow prove that an individual would not harm women (even when they are being accused of just that)? It makes me consider an aspect of rape culture that is often seen when dealing with sexual violence: the justification that one could ultimately not commit crimes of sexual violence because one is a “good man” who has many healthy relationships (friendships or romantic) with women. There is no evidence that just because a man has some positive relationships with women he is less likely to commit sexual assault; however, there is data showing that people are more likely to be sexually assaulted by someone they know (RAINN). I would be curious to look into this further by utilizing tools like Voyant to consider the most used words in other statements from people accused of sexual violence.

Ultimately, this was a brief and interesting exercise in investigation and exploration. I think that there could be many different interesting and important research opportunities utilizing tools like Voyant to look at statements provided by sexual violence survivors and those who are accused of sexual violence. This was just a starting point and by no means the necessary and extensive research that must be done on this topic; rather, it is a beginning for further questions to be asked and analyzed. I’m eager to dive into more in-depth research on these topics in the future, possibly using Voyant or other text-mining web-based applications.

Text mining praxis: mining for evidence of course learning outcomes in student writing

I’ve been hearing more and more about building corpora of student writing of late, and while I haven’t actually consulted any of these, I was happy to have the opportunity to see what building a small corpus of student writing would be like in Voyant. I was particularly excited about using samples from ENGL 21007: Writing for Engineering which I taught at City College in Spring 2018, because I had a great time teaching that course and know the writing samples well.

Of the four essays written in ENGL 21007, I chose the first assignment, a memo, because it is all text (the subsequent assignments contain graphs, charts, and images, and I wasn’t sure how these would behave in Voyant). I downloaded the student essays from Blackboard as .docx and redacted them in Microsoft Word. This was a bad move because Microsoft Word 365 held on to the metadata, so student email accounts showed up when I uploaded my corpus to Voyant. I quickly removed my corpus from Voyant and googled how to remove the metadata, then decided that it would be faster to convert all the .docx files to .pdf and redact them with Acrobat Pro (I got a one-week free trial), so I did this, zipped it up, and voilà.
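In hindsight, the metadata could probably have been stripped from the .docx files in bulk with the python-docx library, skipping the PDF detour. A sketch only (“memos” is a hypothetical folder, and this clears the core document properties, not comments or tracked changes inside the body):

    from pathlib import Path
    from docx import Document

    for path in Path("memos").glob("*.docx"):
        doc = Document(str(path))
        props = doc.core_properties
        props.author = ""            # clear identifying core properties
        props.last_modified_by = ""
        props.comments = ""
        doc.save(str(path))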

22 Essays Written by Undergraduate Engineering Majors at City College of New York, Spring 2018

I love how Voyant automatically saves my corpus to the web. No registration, no logging in and out. There must be millions of corpora out there.

I was excited to see how the essays looked in Voyant and what I could do with them there. I decided to get a feel for Voyant by first asking a simple question: what did students choose to write about? The assignment was to locate something on the City College campus, in one’s community, or on one’s commute to college that could be improved with an engineering solution.

Cirrus view shows most frequently used words in 22 memos written by engineering majors in ENGL 21007

What strikes me as I look at the word cloud is that students’ concern with “time” (61 occurrences) was only slightly less marked than the reasonable, given that topics had to be related to the City College campus, concern with “students” (66 occurrences). I was interested to see that “escalators” (48 occurrences) got more attention than “windows” (40 occurrences), but I think we all felt strongly about both. “Subway” (56 occurrences) and “MTA” (50 occurrences), which refer to the same thing, were a major concern. Uploading samples of student writing and seeing them magically visualized in a word cloud summarizes the topics we addressed in ENGL 21007 in a useful and powerful way.

Secondly, and in a more pedagogical vein, I wanted to see how Voyant could be used to measure the achievement of course learning outcomes in a corpus of student writing. This turned out to be a far more difficult question than my first simple one of what students wrote about. The challenge lies in figuring out what query will tell me whether the eight ENGL 21007 course learning outcomes listed on the CCNY First Year Writing Program website were achieved through the essay assignment that produced the 22 samples I put in Voyant, and whether evidence of having achieved or not achieved these outcomes can be mined from student essays with Voyant. Two of the course learning outcomes seemed more congenial to the memo assignment than others. These are:

“Negotiate your own writing goals and audience expectations regarding conventions of genre, medium, and rhetorical situation.”

“Practice using various library resources, online databases, and the Internet to locate sources appropriate to your writing projects.”

To answer the question of whether students were successful in negotiating their writing goals would require knowing what their goals were. Not knowing this, I set that part of the question aside. Audience expectations were easier. In the assignment prompt I had told students that the memo had to be addressed to the department, office, or institution that had the power to approve the implementation of the proposed engineering solution, or the power to move the proposal on to the department, office, or institution that could eventually approve it. There are, however, many differently named addressees in the student memos I put in this corpus. Furthermore, addressing the memo to an official body does not by itself achieve the course learning outcome. My question therefore becomes: what general conventions of genre, medium, and rhetorical situation do all departments, offices, or institutions expect to see in a memorandum, and how do I identify these in a query? What words or combinations of words constitute memo-speak? To make things worse (or better :)!), I had told students that they could model their memos on the examples I gave them or, if they preferred, model them differently so long as they were coherent and good. I therefore cannot rely on form to measure conventions of genre. I’m sorry to say I have no answers to my questions as of yet; I’m still trying to figure out how to ask my corpus whether students negotiated audience expectations regarding conventions of genre, medium, and rhetorical situation (having said this, I think I can rule out medium, because I asked students to write the memo in Microsoft Word).
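One crude first pass at memo-speak, outside Voyant, would be to check each sample for the conventional To/From/Date/Subject header block. A sketch in Python, assuming the essays have been exported to plain text (the folder is hypothetical, and a header block is obviously a shallow proxy for negotiating conventions of genre):

    import re
    from pathlib import Path

    HEADER_FIELDS = ("to:", "from:", "date:", "subject:")

    for path in sorted(Path("memos_txt").glob("*.txt")):
        text = path.read_text(encoding="utf-8").lower()
        # A field "counts" if it opens a line, as in a standard memo header
        found = [f for f in HEADER_FIELDS if re.search(rf"^\s*{f}", text, re.MULTILINE)]
        print(path.name, f"{len(found)}/4 header fields:", found)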

The second course learning outcome I selected – that students practice using library resources, online databases and the internet – strikes me as more quantifiable than the first.

Only one of 22 memos contains the words “Works Cited”

Unfortunately, I hadn’t required students to do research for the memos I put in the corpus. When I looked for keywords that would indicate that students had done some research, Voyant came up with one instance of “Bibliography,” one instance of “Works Cited,” and no instances of “references” or “sources.” The second course learning outcome I selected is not as congenial to the memo assignment (or the memo assignment not as congenial to that course learning outcome) as I first thought.

I tried asking Veliza the bot for help in figuring out whether course learning outcomes had been achieved (Veliza is a sister of ELIZA, the chatbot psychotherapist, and says she isn’t too good at text analysis yet). Veliza answers questions with expressions of encouragement or more questions, but she’s not much help. The “from text” button in the lower right corner of the Veliza tool is kind of fun because it fetches sentences from the text (according to what criteria I haven’t a clue), but conversation quickly gets surreal because Veliza doesn’t really engage.

In conclusion, I am unsure how to use text mining to measure course learning outcomes in student writing done in 200-level courses. I think that Voyant may work better for measuring course learning outcomes in courses with more of an emphasis on vocabulary and grammar, such as, for example, EAL. It’s a bit of a dilemma for me, because I think that the achievement of at least some course learning outcomes should be measurable in the writing students produce in a course.

Jameson and His Theory Explored

Patrick Grady O’Malley


For the text analysis assignment, I used as my corpus two literary theory articles: Jameson’s “Third-World Literature in the Era of Multinational Capitalism” and a response to that very article by Ahmad, “Jameson’s Rhetoric of Otherness and the ‘National Allegory.’” I chose these two articles as my corpus because I wanted to play around with theoretical texts in response to my latest blog post. In the future, I would consider looking for importable text files of the literary examples discussed in these articles, to compare and contrast what could be found in the theoretical work versus the literary.


One weakness I noticed right off the bat (or maybe it is my own ignorance of the tool): I would have liked to be able to load the two articles separately but look through the results comparatively. In other words, I wanted separate results in the same window/screen. Right now, I am just operating with two different browser windows of Voyant and looking back and forth between them.


The first notable thing in the results is that the word “world” is by far the most frequent word in both texts. This is not surprising considering they are articles on world literature. Since Jameson’s article is theoretical in nature and Ahmad’s is a response to that work, the differences begin to pile up after that one similarity. In Jameson’s article, the next four most frequent words are “cultural,” “political,” “social,” and “new.” These seem to sum up the theme of most of what he was saying throughout the article, which argued for the necessity of considering first-, second-, and third-world literature through different lenses. Ahmad’s piece argues against that standpoint, and one of his major contentions is that world literature should be thought of as the literature of one world. So his next four most frequent words were less thematic in nature and more supportive of his argument: “texts,” “Jameson’s,” “experience,” and “theory.”

Jameson’s word cloud


Ahmad’s word cloud


As you can see in the word cloud images, there is some overlap in word distribution, namely “world” and “literature.” But what I find striking is how Jameson’s article (as per the word cloud) emphasizes the “social” and “political” in a broader sense, with these two words respectively appearing in the word cloud. In Ahmad’s critique, however, the social and political are less nuanced and more direct, with words such as “capitalist,” “colonialism,” and “Urdu.” This would make sense considering Ahmad’s piece is a critique, and his argument is more specific and less generalized than Jameson’s overarching work.


“World literature” is the most commonly appearing collocate in both articles. What is interesting is that Jameson’s second most frequent collocate is the writer Lu Xun (and the possessive Lu Xun’s), which, with both the name and the possessive counted together, actually appears more often than “world literature.” Jameson spends a good deal of time discussing the work of Lu Xun, but he also talks about other authors, so I am surprised the name and its possessive had such frequency.


One can also see through the collocates my point about Ahmad being more specific about particular types of theoretical or political issues. He has collocates near the top such as “experience/imperialism,” “experience/colonialism,” “capitalist/world,” and “Jameson’s/rhetoric.” Meanwhile, Jameson’s top collocates are more optimistic in nature: “world/culture,” “great/great,” and “world/intellectual.” I suppose it could be argued, given Jameson’s collocate “world/intellectual” and the appearance of his own name and “rhetoric” among Ahmad’s top collocates, that Jameson is personifying that intellectual.

Jameson’s word bubble


Ahmad’s word bubble


It is striking that it is Ahmad’s word bubble that includes “literature,” as that is the focus of both articles, really. “Social,” “cultural,” “political,” and “new” in Jameson’s attest to his work being more descriptive and larger in scope than Ahmad’s rebuttal, which takes a close reading of the article and responds very specifically.


Overall, I can see how mining the text of theoretical works can be very useful in building a corpus of theory that can help with my research goals. Distant reading of theoretical works shows the overall nature of the work at a surface level and the prevalence of concepts and themes at a closer level.

Some Resources with Indigenous Maps

Today is Indigenous People’s Day, so I thought I’d look around for some indigenous mapping projects, which certainly count as DH. I found a couple of resources that I wanted to share here.

Native Land seems to be a well-known resource. The first thing a user of the website sees is a disclaimer:

This map does not represent or intend to represent official or legal boundaries of any Indigenous nations. To learn about definitive boundaries, contact the nations in question.

Also, this map is not perfect — it is a work in progress with tons of contributions from the community. Please send us fixes if you find errors.

If you would like to read more about the ideas behind Native Land or where we are going, check out the blog. You can also see the roadmap.

…So this may not be an accurate source of information, and it’s definitely a work in progress. According to its “About” page, the website is run by Victor G. Temprano, who is not himself Native. However, I do think that being upfront about the potential flaws in the map is a good move. Additionally, the map links to a page about using its information critically. In particular, this page deals with some of the difficulties of using the colonial format of a map to illustrate overlapping indigenous territories. Interestingly, the map also doesn’t address the issue of time, so we can’t see how territories may have changed over time.

All that said, I really like the immediacy of this map and the way it shows those overlaps. According to the map, the Graduate Center is on Lenape land, and the Delaware and Montauk languages were spoken here. The map also includes links to information about these languages and the websites for the nations/tribes (both words seem to be used in the links?).

In any case, the other interesting resource I came across was the Indigenous Mapping Workshop, which provides “geospatial training and capacity building to bring culturally relevant and appropriate earth observation technologies to support Indigenous mapping.” This workshop has been offered annually since 2014. I poked around the website but didn’t see links to any of the projects people have created in the workshop. However, it reminds me of Miriam Posner’s piece, “What’s Next: The Radical, Unrealized Potential of Digital Humanities.” In that piece, she critiques the way that many DH projects have been built on existing, colonialist infrastructure. I’m interested in how the work done in this workshop breaks free of that.