Tag Archives: praxis

Project T.R.I.K.E – Principles and Origin Myths

Hannah’s already provided some use cases that I hope help to illustrate why we think that Project T.R.I.K.E will be useful, and to whom. I wanted to backtrack and give some context. As Hannah’s post suggests, it’s quite difficult to pinpoint a specific starting point for our thought processes, which have developed iteratively until we’re not sure whether we’re trapped in a time loop or not. However, I think I can trace through some of the things I think are important about the project.

We really wanted to do something that would be useful for pedagogy. Again, if you want to know how it’s useful for pedagogy, please see Hannah’s post! But we were specifically interested in a resource that would teach methodology, because all of us were methodological beginners who really felt the need for more tools and resources that would help us to develop in that respect. During our environmental scan, we were impressed by the efforts of the DH community to produce a number of useful guides to tools, methodologies, and processes (in particular, please see Alan Liu’s DH Toy Chest and devdh.org), although none of them did exactly what we wanted to do. There are plenty of dead resources out there, too, and we should take that as a warning.

We really wanted to take a critical stance on data by creating something that would highlight its contingent, contextual, constructed nature, acknowledging that datasets are selected and prepared by human researchers, and that the questions one can ask are inextricably connected to the process through which the dataset is constituted. Our emphasis on a critical approach does not originate in this class; I believe all of us had been exposed to theories about constructedness before this. What’s curious about our process is that we went out seeking datasets and tutorials with this in mind, thinking about what we hoped to do, and this conversation ranged far from the class readings, focusing on our own work and on Rawson and Muñoz’s “Against Cleaning,” but eventually brought us back to Posner, Bode, and Drucker. None of these authors, however, arrived at exactly the solution we did; we decided that the constructed nature of data is best represented by making transparent the process of construction itself! Project T.R.I.K.E. will provide snapshots of the data at different stages in the process, highlighting the decisions made by researchers and interrogating how those decisions are embodied in the data.

Finally, we really wanted to ensure that we could produce something that could be open to the community. Again, a lot of work in the DH community is openly available, but we also came across some datasets behind paywalls.  One repository aggregating these datasets not only made it difficult to access the databases but also had a series of stern lectures about copyright, occupying much the same space on their website that instruction in methodology would occupy on ours! While it is true that some humanities data may invoke copyright in a way that other kinds of data usually don’t, we’d much rather host datasets that we can make available to a wide variety of users with a wide variety of use cases. Limiting access to data limits research.

Think carefully, though. As part of the environmental scan, we came across an article that argues, on the basis of a report partially sponsored by Elsevier, that researchers seldom make their data available, even when they are required to do so. While I expect this is true, I am also suspicious of claims like this when they are made by major publishers, because their next step will probably be to offer a proprietary solution which will give them yet more control over the scholarly communication ecosystem. In a context in which major publishers are buying up repositories, contacting faculty directly, and co-opting the language of open access as they do so, I’d argue that it’s more and more important for academics to build out their (our) own infrastructure. Project T.R.I.K.E. has slightly humbler ambitions for the time being, but it’s an opportunity for us to begin building some infrastructure of our own.

Network Analysis of Wes Anderson’s Stable of Actors

I had initially planned to have my network analysis praxis build on the work I had started in my mapping praxis, which involved visualizing the avant-garde poets and presses represented in Craig Dworkin’s Eclipse, the free on-line archive focusing on digital facsimiles of the most radical small-press writing from the last quarter century. Having already mapped the location of presses that had published work in Eclipse’s “Black Radical Tradition” list, I thought that I might try to expand my dataset to include the names and addresses for those presses that had published works captured in other lists in the archive (e.g., periodicals, L=A=N=G=U=A=G=E poets). My working suspicion was that I would find through these mapping and networking visualizations unexpected connections among the disparate poets in Eclipse and (possibly, later) those featured in other similar archives like UbuWeb or PennSound, which could potentially yield new comparative and historical readings of these limited-run works by important poets.

The dataset I wanted and needed didn’t already exist, though, and the manual labor involved in creating it–I would have to open the facsimile for each of several dozen titles and read through its front and back matter hunting for press names and affiliated addresses–was more than I was able to offer this week. So I’ve tabled the Eclipse work only momentarily in favor of experimenting with a more or less already-ready dataset whose network analysis I could actually see through from beginning (collection) to end (interpretation).

Unapologetically twee, I built a quick dataset of all the credited actors and voice actors in each of Wes Anderson’s first nine feature-length films: Bottle Rocket (1996), Rushmore (1998), The Royal Tenenbaums (2001), The Life Aquatic with Steve Zissou (2004), The Darjeeling Limited (2007), Fantastic Mr. Fox (2009), Moonrise Kingdom (2012), The Grand Budapest Hotel (2014), and Isle of Dogs (2018). As anyone who has seen any of Anderson’s films knows, his aesthetic is markedly distinct and immediately recognizable by its right angles, symmetrical frames, unified color palettes, and object-work/tableaux. He also relies on the flat affective delivery of lines from a core stable of actors, many of whom return again and again to the worlds that Anderson creates. Because of the way these actors both confirm and surprise expectations–of course Adrien Brody would be an Anderson guy, but Bruce Willis?–I wanted to use this network analysis praxis to visualize the stable in relation to itself and to start to pick at interpreting the various patterns or anomalies therein.

Fortunately IMDB automated a significant portion of the necessary prep work by providing the full cast list for each film and formatting each cast member’s first and last name in a long column–a useful tip I picked up while digging around Miriam Posner’s page of DH101 network analysis resources–so I was able to easily copy and paste all of my actor data into a Google Sheet and manually add the individual film data after. (I couldn’t copy and paste actor names from IMDB without grabbing character names as well, so I kept them, not knowing if they would end up being useful. For this brief experiment, they weren’t.)
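That copy-and-paste workflow amounts to building a two-column edge list, one (actor, film) row per credited appearance. A minimal sketch in plain Python, using a couple of placeholder cast lists rather than the full dataset:

```python
# Hypothetical miniature of the Google Sheet: per-film cast lists are
# flattened into one (actor, film) row per credited appearance.
casts = {
    "Rushmore (1998)": ["Bill Murray", "Jason Schwartzman"],
    "Moonrise Kingdom (2012)": ["Bill Murray", "Bruce Willis"],
}

rows = [(actor, film) for film, cast in casts.items() for actor in cast]
```

Each row becomes one edge in the eventual network graph.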

I used Google’s Fusion Tables and its accompanying instructions to build a Network Graph of the Anderson stable, the final result of which you can access here. As far as other tools went, Palladio timed out on my initial upload, buffering forever, and Gephi had an intimidating interface for what I intended to be a light-hearted jaunt. Fusion Tables was familiar enough and seemed to have sufficient default options for analyzing my relatively small dataset (500-ish rows in three columns), so I took the path of least resistance, for now.

A quick upload of my Sheet and a + Add Chart later, my first (default) visualization looked taxonomical and useless, showing links between actor and character that, as you might expect, mapped pretty much one-to-one except in those instances where multiple actors played generic background roles with identical character names (e.g., Pirate, Villager).

A poorly organized periodic table of characters

I changed the visualization to instead show a link between actor and film, and was surprised to find that this still didn’t show me anything expected (only one film?) or intriguing. Then I noticed that only 113 of the 449 nodes were showing, so I upped the number to show all 449 nodes. Suddenly, the visualization became not only more robust and legible, but also quite beautiful! Something like a flower bloom, or simultaneous and overlapping fireworks.

Beautiful as the fireworks were, I felt like the visualization was still telling me too much information, with each of the semi-circles consisting primarily of actors who had one-off relationships to these films. Because I wanted to know more about the stable of actors and not the one-offs, I filtered my actor column to include only those who had appeared in more than one of Anderson’s films (i.e., names that showed up on the list two or more times). I also clicked a helpful button that automatically color-coded columns so that the films appeared in orange and the actors in blue. This resulted in a visualization just complex enough to be worth my interrogating and/or playing with, yet fixed or structured enough to keep my queries contained.
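The filtering step above (keeping only names that show up two or more times) can be sketched in a few lines of Python; the rows here are placeholder samples, not the real 500-ish-row sheet:

```python
from collections import Counter

# Placeholder rows standing in for the full (actor, film) sheet.
rows = [
    ("Bill Murray", "Rushmore (1998)"),
    ("Bill Murray", "Moonrise Kingdom (2012)"),
    ("Bruce Willis", "Moonrise Kingdom (2012)"),
]

appearances = Counter(actor for actor, _ in rows)
# Keep only actors who appear on the list more than once,
# i.e. the repeat members of the stable.
stable = [(actor, film) for actor, film in rows if appearances[actor] > 1]
```

One-off actors like the sample Bruce Willis row drop out, leaving only the repeat players to graph.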

As far as reading these visualizations goes, it’s something like this: Anderson’s first three films fall bottom-left; his next three films fall top-center; and his three most recent films fall bottom-right. Thus, the blue dots bottom-left are actors featured among the first three films only; blue dots bottom-center are actors who appear consistently throughout Anderson’s work; and blue dots bottom-right are actors included among his most recent films. As you can see by hovering over an individual actor node: the data suggests (e.g.) that Bill Murray is the most central (or at least, most frequently recurring) actor in the Anderson oeuvre, appearing in eight of the nine feature-length films; meanwhile, Tilda Swinton, along with fellow heavyweights Ed Norton and Harvey Keitel, appears to be a more recent Anderson favorite, surfacing in each of his last three films.

Also of interest: the name Eric Chase Anderson sits right next to Murray at the center of the network; Eric is the brother of Wes, the illustrator of much of what we associate with Wes Anderson’s aesthetic, and apparently also an actor in the vast majority of his brother’s films. (I’m not sure this find would have surfaced as quickly without the visualization.)

Elsewhere, the data suggests that Anderson’s first film Bottle Rocket was more of a boutique operation that consisted of a relatively small number of repeat actors (8), only two of whom–Kumar Pallana and Owen Wilson–appeared in films beyond the first three. Anderson’s seventh film The Grand Budapest Hotel, released nearly twenty years later, expanded to include a considerable number of repeat actors (22: the highest total on the list), nine of whom were first “introduced” to the Anderson universe here and subsequently appeared in the next film or two.

I wonder what we would see if we visualized nodes according to some sort of sliding scale from “lead actor” to “ensemble actor” in each of these films, perhaps by implementing darker/more vibrant edges depending on screen time or number of lines? Would Bill Murray be more or less central than he is now? Would Eric Chase Anderson materialize at all?

And I wonder what opportunities there are to further visualize nodes based on actor prestige (say, award nominations and wins get you a bigger circle) or to create “famous actor” heat maps (maybe actors within X number of years of a major award nomination or win get hot reds and others cool blues) that might show us how Anderson’s casting choices change over time to include more big names. Conversely, what could these theoretical large but cool-temperature circles tell us about Anderson’s use of repeat “no-name” character actors to flesh out his worlds?

Further, I wonder if there are ways of using machine learning to analyze these networks and to predict the likelihood of certain actors’ being cast in Anderson’s next film based on previous appearances (i.e., the “once you’re in, you’re in” phenomenon) or recent success. Could we compare the Anderson stable versus, say, the Sofia Coppola or Martin Scorsese stables, to learn about casting preferences or actor “types”?

A Network Analysis of our Initial Class Readings

This praxis project visualizes a network analysis of the bibliographies from the September 4th required readings in our class syllabus plus the recommended “Digital Humanities” piece by Professor Gold. My selection of topic was inspired by a feeling of being swamped by PDFs and links that were accumulating in my “readings” folder with little easy-to-reference surrounding context or differentiation. Some readings seemed to be in conversation with each other, but it was hard to keep track. I wanted a visualization to help clarify points of connection between the readings. This is inherently reductionist and (unless I’m misquoting here, in which case sorry!) it makes Professor Gold “shudder”, but charting things out need not replace the things themselves. To me, it’s about creating helpful new perspectives from which to consider material and ways to help it find purchase in my brain.

Data Prep
I copy/pasted author names from the bibliographies of each reading into a spreadsheet. Data cleaning (and a potential point for the introduction of error) consisted of manually editing names as needed to make all follow the same format (last name, first initial). For items with summarized “et al” authorship, I looked up and included all author names.
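A sketch of the normalization step, assuming names arrive as "First Last"; this naive version is exactly where the errors mentioned above can creep in (multi-part surnames, particles, and middle names all need hand-checking):

```python
# Naive normalizer for the "last name, first initial" convention
# used in the spreadsheet. Real bibliographic names often break it,
# which is why manual editing was still required.
def normalize(name: str) -> str:
    parts = name.strip().split()
    return f"{parts[-1]}, {parts[0][0]}."
```

For example, `normalize("Miriam Posner")` yields `"Posner, M."`, but a name like "Ludwig van Beethoven" would be mangled, so manual review remains necessary.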

I performed the network analysis in Cytoscape, aided by Miriam Posner’s clear and helpful tutorial. Visualizing helped me identify and fix errors in the data, such as an extra space causing two otherwise identical names to display separately.

The default Circular Layout option in the “default black” style rendered an attractive graph with the nodes arranged around two perfect circles, but unfortunately the labels overlapped and many were illegible. To fix the overlapping I individually adjusted the placement of the nodes, dragging alternating nodes either toward or away from the center to create room for each label to appear and be readable in its own space. I also changed the label color from gray to white for improved contrast and added yellow directional indicators, as discussed below. I think the result is beautiful.

Network Analysis Graph
Click the placeholder image below and a high-res version will open in a new tab. You can zoom in and read all labels on the high-res file.

An interactive version of my graph is available on CyNetShare, though unfortunately that platform is stripping out my styling. The un-styled, harder-to-read, but interactive version can be seen here.

Author nodes in this graph are white circles and connecting edges are green lines. This network analysis graph is directional. The class readings are depicted with in-bound connections from the works cited terminating in yellow diamond shapes. From the clustering of yellow diamonds around certain nodes, one can identify that our readings were authored by Kirschenbaum, Fitzpatrick, Gold, Klein, Spiro, Hockey, Alvarado, Ramsey, and (off in the lower left) Burke. Some of these authors cited each other, as can be seen by the green edges between yellow-diamond-cluster nodes. Loops at a node indicate the author citing themselves. Multiple lines connecting the same two nodes indicate citations of multiple pieces by the same author.

It is easy to see in this graph that all of the readings were connected in some way, with the exception of an isolated two-node constellation in the lower left of my graph. That constellation represents “The Humane Digital” by Burke, which had only one item (which was by J. Scott) in its bibliography. Neither Burke nor Scott authored nor were cited in any of the other readings, therefore they have no connections to the larger graph.

The vast majority of the nodes fall into two concentric circle forms. The outer circle contains the names of those who were cited in only one of the class readings. The inner circle contains those who were cited in more than one reading, including citations by readings-authors of other readings-authors. These inner circle authors have greater out-degree connectedness and therefore more influence in this graphed network than do the outer circle authors. The authors with the highest degree of total connections among the inner circle are Gold, Klein, Kirschenbaum, and Spiro. The inner circle is a hub of interconnected digital humanities activity.
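The degree counts behind that reading can be sketched in plain Python. The edges below are hypothetical examples (not the actual citation data), following the graph’s convention that an edge runs from a cited author to the reading-author who cites them, with a self-loop for self-citation:

```python
from collections import defaultdict

# Hypothetical (cited, citing) pairs following the graph's direction:
# cited author -> reading-author who cites them.
edges = [
    ("Spiro", "Gold"),
    ("Spiro", "Klein"),
    ("Kirschenbaum", "Gold"),
    ("Kirschenbaum", "Kirschenbaum"),  # a self-citation loop
]

out_degree, in_degree = defaultdict(int), defaultdict(int)
for cited, citing in edges:
    out_degree[cited] += 1
    in_degree[citing] += 1
```

An author cited by multiple readings accumulates out-degree; a reading with a long bibliography accumulates in-degree.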

We can see that Spiro and Hockey had comparatively extensive bibliographies, but that Spiro’s work has many more connections to the inner-circle digital humanities hub. This is likely at least partly because Hockey’s piece is from 2004, while the rest of the readings are from 2012 or 2016 (plus one to be published next year, in 2019). One possible factor: some of the other authors may not yet have been publishing related work when Hockey was writing her piece in the early 2000s. Six of our readings were from 2012, the year of Spiro’s piece. Perhaps a much richer and more interconnected conversation about the digital humanities developed at some point between 2004 and 2012.

This network analysis and visualization is useful for me as a mnemonic aide for keeping the readings straight. It can also serve to refer a student of the digital humanities to authors they may find it useful to read more of or follow on Twitter.

A Learning about Names
I have no indication that this is or isn’t occurring in my network analysis, but in the process of working on this I realized that any name change, such as one due to a change in marital status, would make an author appear as two different people. This predominantly affects women and, without a corrective in place, could make them appear less central in graphed networks.

There are instances where people may have published with different sets of initials. In the bibliography to Hockey’s ‘The History of Humanities Computing,’ an article by ‘Wisbey, R.’ is listed just above a collection edited by ‘Wisbey, R. A.’ These may be the same person but it cannot be determined with certainty from the bibliography data alone. Likewise, ‘Robinson, P.’ and ‘Robinson, P. M. W.’ are separately listed authors for works about Chaucer. These are likely the same person, but without further research I cannot be 100% certain. I chose to not manually intervene and so these entries remain separate. It is useful to be aware that changing how one lists oneself in authorship may affect how algorithms understand the networks to which you belong.
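A rough heuristic for flagging (not resolving) such cases is to check whether one listed name is a prefix of another; a sketch, using the names from Hockey’s bibliography discussed above:

```python
# Flag pairs where one name extends the other with extra initials,
# as with "Wisbey, R." / "Wisbey, R. A." These are only *suspects*;
# confirming identity requires research beyond the bibliography.
names = ["Wisbey, R.", "Wisbey, R. A.", "Robinson, P.", "Robinson, P. M. W."]

suspects = [
    (a, b)
    for i, a in enumerate(names)
    for b in names[i + 1:]
    if b.startswith(a.rstrip("."))
]
```

Such a check would surface candidates for manual review without automatically merging them, which matches the choice not to intervene.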

Potential Problems
I would like to learn to what extent the following are problematic and what remedies may exist. My network analysis graph:

  • Doesn’t distinguish between authors and editors
  • Required me to split collaborative works apart into individual authors
  • Doesn’t include works that had no author or editor listed

Postscript: Loose Ties to a Current Reading
In “How Not to Teach Digital Humanities,” Ryan Cordell suggests that introductory classes should not lead with “meta-discussions about the field” or “interminable discussions of what counts or does not count [as digital humanities]”. In his experience, undergraduate and graduate students alike find this unmooring and dispiriting.

He recommends that instructors “scaffold everything [emphasis in the original]” to foster student engagement. There is no one-size-fits-all in pedagogy. Even for the same student, learning may happen more quickly, or information may be stickier, if it is presented in context or in more than one way. Providing multiple ways into the information that a course covers can lead to good student learning outcomes. It can also be useful to provide scaffolding for next steps, or for going beyond the basics, for students who want to learn more. My network analysis graph is not perfect, but having something as a visual reference is useful to me and likely to other students as well.

Cordell also endorses teaching how the digital humanities are practiced locally and clearly communicating how courses will build on each other. This can help anchor students in where their institution and education fit in with the larger discussions about what the field is and isn’t. Having gone through the handful of assigned “what is DH” pieces, I look forward to learning more about the local CUNY GC flavor in my time as a student here. This is an exciting field!


Update 11/6/18:

As I mentioned in the comments, it was bothering me that certain authors who appeared in the inner circle rightly belonged in the outer circle. These were authors who were cited once in the introductions to the Debates in the Digital Humanities volumes co-authored by M. K. Gold and L. Klein. Due to a challenge in depicting co-authorship, M. K. Gold and L. Klein appear separately in the network graph, so these authors appeared to be cited twice (once each by Gold and Klein), rather than the one time they were actually cited in the pieces co-authored by Gold and Klein.

I have attempted to clarify the status of those authors in the new version of my visualization below by moving them into the outer ring. It’s not a perfect solution, as each author still shows two edges instead of one, but it does make the visualization somewhat less misleading and clarifies who are the inner circle authors.


Bibliographies, Networks, and CUNY Academic Works

I was really excited about doing a network analysis, even though I seem to have come all the way over here to DH just to do that most librarianly of research projects, a citation analysis.

I work heavily with our institutional repository, CUNY Academic Works, so I wanted to do a project having to do with that.  Institutional repositories are one of the many ways that scholarly works can be made openly available.  Ultimately, I’m interested in seeing whether the works that are made available through CAW are, themselves, using open access research, but for this project, I thought I’d start a little smaller.

CAW allows users to browse by discipline using this “Sunburst” image.

Each general subject is divided into smaller sub-disciplines.  Since I was hoping to find a network, I wanted to choose a sub-discipline that was narrow but fairly active. I navigated to “Arts and Humanities,” from there to “English Language and Literature,” and finally to “Literature in English, North America, Ethnic and Cultural Minority.” From there, I was able to look at works in chronological order. Like most of the repository, this subject area is dominated by dissertations and capstone papers; this is really great for my purposes because I am very happy to know which authors students are citing and from where.

The data cleaning process was laborious, and I think I got a little carried away with it. After I’d finished, I tweeted about it, and Hannah recommended pypdf as a tool I could have used to do this work much more quickly. Since I’d really love to do similar work on a larger scale, this is a really helpful recommendation, and I’m planning on playing with it some more in the future (thanks, Hannah!).
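For a future larger-scale pass, the shape of that workflow might look like this. In practice the text would come from pypdf (`PdfReader("thesis.pdf")` and `extract_text()` on each page); here a sample string stands in for one extracted bibliography entry, and the regex is a deliberately crude assumption about how entries are formatted:

```python
import re

# Sample string standing in for text extracted from a PDF with pypdf;
# real bibliography entries vary far more than this pattern assumes.
entry = "Butler, Judith. Gender Trouble. New York: Routledge, 1990."

m = re.match(r"([A-Z][a-z]+),\s+([A-Z])", entry)
author = f"{m.group(1)}, {m.group(2)}."
```

Even an imperfect automated pass like this would shrink the manual cleanup to the exceptions.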

I ended up looking at ten bibliographies in this subject, all of which were theses and dissertations from 2016 or later.  Specifically:

 Jarzemsky, John. “Exorcizing Power.”

Green, Ian F. P. “Providential Capitalism: Heavenly Intervention and the Atlantic’s Divine Economist”

La Furno, Anjelica. “‘Without Stopping to Write a Long Apology’: Spectacle, Anecdote, and Curated Identity in Running a Thousand Miles for Freedom”

Danraj, Andrea A. “The Representation of Fatherhood as a Declaration of Humanity in Nineteenth-Century Slave Narratives”

Kaval, Lizzy Tricano. “‘Open, and Always, Opening’: Trans- Poetics as a Methodology for (Re)Articulating Gender, the Body, and the Self ‘Beyond Language’”

Brown, Peter M. “Richard Wright’s and Chester Himes’s Treatment of the Concept of Emerging Black Masculinity in the 20th Century”

Brickley, Briana Grace. “‘Follow the Bodies’: (Re)Materializing Difference in the Era of Neoliberal Multiculturalism”

Eng, Christopher Allen. “Dislocating Camps: On State Power, Queer Aesthetics & Asian/Americanist Critique”

Skafidas, Michael P. “A Passage from Brooklyn to Ithaca: The Sea, the City and the Body in the Poetics of Walt Whitman and C. P. Cavafy”

Cranstoun, Annie M. “Ceasing to Run Underground: 20th-Century Women Writers and Hydro-Logical Thought”

Many other theses and dissertations are listed in Academic Works, but are still under embargo. For those members of the class who will one day include your own work in CAW, I’d like to ask on behalf of all researchers that you consider your embargo period carefully! You have the right to set a long embargo for your work if you wish, but the sooner it’s available, the more it will help people who are interested in your subject area.

In any case, I extracted the authors’ names from these ten bibliographies and put them into Gephi to make a graph.  I thought about using the titles of journals, which I think will be my next project, but when I saw that all the nodes on the graph have such a similar appearance graphically, I was reluctant to mix such different data points as authors and journals.

As I expected, each bibliography had its own little cluster of citations, but there were a few authors that connected them, and some networks were closer than others.
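Finding those connecting authors is straightforward to sketch in plain Python. The pairs below are a hand-picked sample drawn from connections discussed later in this post, not the full extracted dataset:

```python
from collections import defaultdict

# Sample (bibliography, cited_author) pairs; the real data covers
# ten bibliographies and many more citations.
cites = [
    ("Eng", "Butler"), ("Brickley", "Butler"), ("Kaval", "Butler"),
    ("Jarzemsky", "Butler"),
    ("Eng", "Sedgwick"), ("Cranstoun", "Sedgwick"),
]

cited_by = defaultdict(set)
for biblio, author in cites:
    cited_by[author].add(biblio)

# Authors cited in more than one bibliography are the connectors
# between otherwise separate citation clusters.
connectors = {a for a, bs in cited_by.items() if len(bs) > 1}
```

Betweenness centrality, as computed in Gephi, refines this idea by weighting authors by how often they sit on shortest paths between other nodes.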

Because I was especially interested in the authors that connected these different bibliographies, I used Betweenness Centrality to map these out, to produce a general shape like this:

This particular configuration of the data uses the Force Atlas layout. There were several available layouts, and I don’t know how they’re computed, but this one did a really nice job of rendering my data in a way that looked 3D and brought out some relationships among the ten bibliographies.

Some Limitations to My Data

Hannah discussed this in her post, and I’d run into a lot of the same issues and had forgotten to include it in my blog post!  Authors are not always easy entities to grasp. Sometimes a cited work may have several authors, and in some cases, dissertation authors cited edited volumes by editor, rather than the specific pieces by their authors. Some of the authors were groups rather than individuals (for instance, the US Supreme Court), and some pieces were cited anonymously.

In most cases, I just worked with what I had. If it was clear that an author was being cited in more than one way, I tried to collapse them, because there were so few points of contact that I wanted to be sure to bring them all out. There were a few misspellings of Michel Foucault’s name, but it was really important to me to know how relevant he was in this network.

Like Hannah, I pretended that editors were authors, for the sake of simplicity.  Unlike her, I didn’t break out the authors in collaborative ventures, although I would have in a more formal version of this work.  It simply added too much more data cleaning on top of what I’d already done.  So I counted all the co-authored works as the work of the first author — flawed, but it caught some connections that I would have missed otherwise.
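The simplification described above reduces to taking the first listed name from each author field; a sketch, assuming (hypothetically) that co-authors are separated by semicolons:

```python
# Credit a co-authored work to its first listed author only, as
# described above. The semicolon separator is an assumption about
# how the author field happens to be formatted.
def first_author(author_field: str) -> str:
    return author_field.split(";")[0].strip()
```

This keeps some connections that dropping co-authored works entirely would lose, at the cost of erasing every author after the first.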

Analyzing the Network

Even from this distance, we can get a sense of the network. For instance, there is only one “island bibliography,” unconnected to the rest.

Note, however, that another isolated node is somewhat obscured by its positioning: Jarzemsky, whose only connection to the other authors is through Judith Butler.

So, the two clearest conclusions were these:

  • There is no source common to all ten bibliographies, but nine of them share at least one source with at least one other bibliography!
  • However, no “essential” sources really stand out on the chart, either. A few sources were cited by three or four authors, but none of them were common to all or even a majority of bibliographies.

My general impression, then, is that there are a few sources that are important enough to be cited very commonly, but perhaps no group of authors that are so important that nearly everyone needs to cite them. This makes sense, since “Ethnic and Cultural Minority” lumps together many different groups, whose networks may be more visible with a more focused corpus.

There’s also a disparity among the bibliographies; some included many more sources than others (perhaps because some are PhD dissertations and others are master’s theses, so there’s a difference in length and scope). Eng built the biggest bibliography, so it’s not too surprising that his bibliography is near the center of the grid and has the most connections to other bibliographies; I suspect this is an inherent bias with this sort of study.

The triangle of Eng, Brickley and Kaval had some of the densest connections in the network.  I try to catch a little of it in this screenshot:

In the middle of this triangle, several authors are cited by each of these authors, including Judith Butler, Homi Bhabha, Sara Ahmed, and Gayle Salamon.  The connections between Brickley and Eng include some authors who speak to their shared interest in Asian-American writers, such as Karen Tei Yamashita, but also authors like Stuart Hall, who theorizes multiculturalism.  On the other side, Kaval and Eng both cite queer theorists like Jack Halberstam and Barbara Voss, but there are no connections between Brickley and Kaval that aren’t shared by Eng. There’s a similar triangle among Eng, Skafidas, and Green, but Skafidas has fewer connections to the four authors I’ve mentioned than they have to each other. This is interesting given the size of Skafidas’s bibliography; he cites many others that aren’t referred to in the other bibliographies.

(Don’t mind Jarzemsky; he ended up here but doesn’t share any citations with either Skafidas or Cranstoun.)

On the other hand, there is a stronger connection between Skafidas and Cranstoun. Skafidas writes on Cavafy and Cranstoun on Woolf, so they both cite modernist critics. However, because they are not engaging with multiculturalism as many of the other authors are, they have fewer connections to the others. In fact, Cranstoun’s only connection to an author besides Skafidas is to Eng, via Eve Kosofsky Sedgwick (which makes sense, as Cranstoun is interested in gender and Eng in queerness).  Similarly, La Furno and Danraj, who both write about slave narratives, are much more closely connected to each other than to any of the other authors – but not as closely as I’d have expected, with only two shared connections between them. The only thing linking them to the rest of the network is La Furno’s citation of Hortense Spillers, shared by Brickley.

My Thoughts

I’d love to do this work at a larger scale. Perhaps if I could get a larger sample of papers from this section of CAW, I’d start seeing the different areas that fall into this broad category of “Literature in English, North America, Ethnic and Cultural Minority.” I’m seeing some themes already – modernism, Asian-American literature, gender, and slave narratives seem to form their own clusters.  The most isolated author on my network wrote about twentieth-century African American literature and would surely have been more connected if I’d found more works dealing with the same subject matter. As important as intersectionality is, there are still networks based around specific literatures related to specific identity categories, with only a few  prominent authors that speak to overlapping identities. We may notice that Eng, who is interested in the overlap between ethnicity and queerness, is connected to Brickley on one side (because she is also interested in Asian-American literature) and Kaval on the other (because she is also interested in queerness and gender).

Of course, there are some flaws with doing this the way that I have; since I’m looking at recent works, they are unlikely to cite each other, so the citations are going in only one direction and not making what I think of as a “real” network. However, I do think it’s valuable to see what people at CUNY are doing!

But I guess I’m still wondering about that – are these unidirectional networks useful, or is there a better way of looking at those relationships? I suppose a more accurate depiction of the network would involve several layers of citations, but I worry about the complexity that would produce.
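For what it’s worth, the one-directional citations can still be projected into the kind of network I’m describing: connect two papers whenever their citation lists overlap. A quick Python sketch, with made-up citation sets standing in for my actual spreadsheet:

```python
# Hypothetical citation sets for three of the papers; the names of
# the cited critics here are placeholders, not my real data.
cites = {
    "Skafidas":  {"modernist_critic_a", "modernist_critic_b"},
    "Cranstoun": {"modernist_critic_a", "Sedgwick"},
    "Eng":       {"Sedgwick", "Spillers"},
}

# Project the one-directional paper -> source citations into an
# undirected "shared citation" network among the papers themselves.
papers = sorted(cites)
shared_edges = {}
for i, a in enumerate(papers):
    for b in papers[i + 1:]:
        common = cites[a] & cites[b]
        if common:
            shared_edges[(a, b)] = common

for (a, b), common in shared_edges.items():
    print(a, "--", b, "via", sorted(common))
```

Layering in what the cited critics themselves cite would produce a multi-layer version of this, at the cost of exactly the complexity I’m worried about.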

In any case, I still want to look at places of publication. It’s a slightly more complex approach, but I’d love to see which authors are publishing in which journals and then compare the open access policies of those journals. Which ones make published work available without a subscription? Which ones allow authors to post to repositories like this one?

Also: I wish I could post a link to the whole file! It makes a lot more sense when you can pan around it instead of just looking at screenshots.

Ten Things: Mapping the Eclipse Archive’s “Black Radical Tradition”

1 // Most of my reading and writing centers on poetic experiments. Usually the adjectives involved include at least one from a short list: computational, constraint-based, conceptual. Other common adjectives are avant-garde and radical, the latter of which appears twice in the source material for my mapping praxis.

2 // Constraint-based, conceptual poet Craig Dworkin manages Eclipse, the free on-line archive focusing on digital facsimiles of the most radical small-press writing from the last quarter century. I return to the Eclipse archive regularly to look at works from poets like Clark Coolidge, Lyn Hejinian, Bernadette Mayer, and Michael Palmer. These are the poets with whom I am most familiar. There are many poets in this particular archive with whom I am not familiar at all. In fact, I would say most. These are the poets with whom I want to get familiar. My sense is that I would say most of the poets with whom I am not familiar at all, given their proximity in this particular archive to those poets with whom I am familiar, deserve to have I would say most of their work looked at regularly alongside the others’.

3 // “Given their proximity in this particular archive…”: I am jumping ahead and have one eye on our third dataset/network praxis assignment, wondering to what extent spatial, temporal, racial, gendered, and influential proximity manifests in this particular network of poetic experiments. Conceptual poetry is notoriously white and male, but where isn’t it that way? Where are the radical and avant-garde titles that aren’t being looked at? Where are they? With one eye on our third praxis assignment, I start building a dataset to use for the second. I start with the Black Radical Tradition.

4 // As a rule, for each title in the archive, Eclipse offers: a graf on the title’s publication and material history, a facsimile view of each page, and a PDF download. With lousy Amtrak wifi, I let the facsimiles of each of the 39 titles in the Black Radical Tradition slowly drip down my screen. I don’t yet know what I’ll want for my dataset down the line, but to get started I try to snag from Dworkin’s notes and the first three/last three pages the most obvious data points: author, title, publisher, publication date. Because Eclipse features both authored titles and edited volumes, I learn to add a column to distinguish between the two. I soon add another column to capture notes on the edition, usually to reflect whether the title is part of a series or is significantly different in a subsequent printing. Because I aim to map these spatially–I’m guessing these will cluster on the coasts, but I don’t know this for sure–I snag addresses (street, city, state, zip, country) for each of the publishers. Except for Russell Atkins’s Juxtapositions, which Dworkin notes is self-published and for which I can find no address.
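In case it’s useful to anyone building a similar dataset, here is a sketch of that schema using Python’s csv module. The column names are my reconstruction from the description above, and the single row shown is the self-published outlier, with its address fields (and any values I couldn’t confirm) left blank:

```python
import csv

fields = ["author", "title", "publisher", "year", "volume_type",
          "edition_notes", "street", "city", "state", "zip", "country"]

rows = [
    # Dworkin notes this title is self-published; no address found.
    {"author": "Russell Atkins", "title": "Juxtapositions",
     "publisher": "self-published", "year": "",
     "volume_type": "authored", "edition_notes": "",
     "street": "", "city": "", "state": "", "zip": "", "country": ""},
]

with open("black_radical_tradition.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields)
    writer.writeheader()
    writer.writerows(rows)
```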

5 // I start my map with ArcGIS’s simplest template, noting two other available templates–the Story Map Shortlist, which allows you to curate sets of places like Great Places in America‘s three “neighborhoods,” “public spaces,” and “streets” maps, and the Story Map Swipe, which allows you to swipe between two contiguous maps like in the Hurricane Florence Damage Viewer–that I might return to in the future if I want to, say, provide curated maps by individual poet, or else compare “publisher maps” of the Black Radical Tradition and the L=A=N=G=U=A=G=E poets (another set of titles in the Eclipse archive).

6 // Even with the basic template, I experience four early issues with ArcGIS:

First, the map doesn’t recognize, and therefore can’t map, the addresses for each of my three United Kingdom-based publishers. This seems to be a limit of the free version of ArcGIS or possibly the specific template I am working with. It’s a problem because it rules out any international analysis or comparison, should I want one.

Second, as I click ahead without a lot of customization, the default visualization assigns each author a different colored circle (fine). The problem is that it, for some reason, lumps four of the poets into a single grey color as “Other,” making it impossible to distinguish Bob Kaufman in San Francisco from Joseph Jarman in Chicago. Those in the grey “Other” category each have one title to their name, but, confusingly, so do several “named” authors, including Fred Moten in green and Gwendolyn Brooks in purple.

Third, beyond placing a dot on each location (fine), the map suggests, and kind of defaults to, confusing aesthetic labels/styles, such as making the size of the dot correspond to its publication year. In my first map, the big dots signal the most recently published title, which, worse than telling me nothing, appears to tell me something it doesn’t, like how many titles were published out of a single city or zip code. The correlation between year and dot size seems irrelevant, and ArcGIS is unable to read my data in such a way as to offer me any other categories to filter on (e.g., number of titles by a single author in the dataset, so that more prolific authors look bigger, or smaller, I’m not sure).

Once I make all the dots equally sized, a fourth problem appears: from a fully scoped-out view, multiple authors published in the same city (e.g. San Francisco) vanish under whichever colored circle (here: grey) sits “on top.” This masks the fact that San Francisco houses three publishers, not just one. You don’t know it until you drill down nearly all the way (and, even then, you can barely see it: I had to draw arrows for you).
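This stacking problem can be caught before the map stage at all: a quick per-city tally of the dataset (placeholder cities below, standing in for my real rows) flags any location where dots will pile up.

```python
from collections import Counter

# Placeholder publisher cities standing in for the dataset's rows.
cities = ["San Francisco", "San Francisco", "San Francisco",
          "Chicago", "New York", "New York"]

# Any city with a count above 1 will hide markers under each other.
for city, n in Counter(cities).most_common():
    print(f"{city}: {n} publisher(s)")
```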

7 // I test out the same dataset in Google Maps, just to compare. I find the upload both faster and more intuitive. Google Maps is also able to handle all three of my UK addresses, better than ArcGIS’s zero. Unlike in ArcGIS, though, Google Maps is unable to map one of my P.O. boxes in Chicago, despite having a working zip code; this is almost certainly a problem with my formatting of the data set, but Google Maps does virtually nothing to let me know what the actual problem is or how I can fix it. Nevertheless, Google Maps proves to be more responsive and easier to see (big pins rather than small circles), so I continue my mapping exploration there.

8 // A sample case study: my dataset tells me that New York in 1970 saw the publication of Lloyd Addison’s Beau-Cocoa Volume 3 Numbers 1 and 2 in Harlem; Tom Weatherly’s Mau Mau American Cantos from Corinth Press in the West Village; and N. H. Pritchard’s The Matrix from Doubleday in Garden City, Long Island. When I look on the map, the triangulation of these 1970 titles “uptown,” “downtown,” and “out of town” roughly corresponds to the distribution of other titles in the following decade. Is there any correlation between the spatial placement of publishers and the qualities of the individual literary titles? Do downtown titles resemble each other in some ways, out of town titles in other ways? Is the location of the publisher as important as, say, the location of the author–and even then, would I want the hometown, the known residence(s) at the time of writing, the city or the neighborhood?

9 // And what about this “around the corner” phenomenon I see in New York, where clusters of titles are published on the same block as one another? My dataset is small–a larger one would tell me more–but, as a gathering hypothesis, perhaps there’s something to having a single author’s titles “walk up the street,” moving through both space and time. What, or who, motivates this walk? There’s a narrative to it. What might the narrative be in, say, Harlem, where after publishing the first two instances (Volume 1 and Volume 2 Number 1) of the periodical Beau-Cocoa from (his home?) 100 East 123 Street, editor/poet/publisher Lloyd Addison moves (in the middle of 1969) Beau Cocoa, Inc. to a P.O. box at the post office around the corner? Did an increased national or international demand for this periodical require more firepower than Addison’s personal mailbox?

And what might the narrative be in the West Village, where Tom Weatherly publishes his 1970 Mau Mau American Cantos and his 1971 Thumbprint with two publishers in a four block radius? A larger dataset might show me a network of poets publishing within this neighborhood. Could it lead me to finding information about poetry readings, salons, collaborative projects? (I’m making a leap without evidence here to evoke a possible trajectory.)

10 // Future steps could have me expand this dataset to include data from the rest of the titles in the Eclipse archive (see #5 // above). It could also go the other direction and have me double down on collecting bibliographic data for these authors in the Black Radical Tradition: the material details and individual printings of their titles (some of which Dworkin provides in an unstructured way, but I skipped over during my first pass through my emerging dataset), perhaps performances of individual poems from these titles that have been documented in poetry/sound archives like PennSound, maybe related titles (by these authors, by others) in other “little databases” like UbuWeb. Stay tuned.


Text-Mining Praxis: Poetry Portfolios Over Time

For this praxis assignment I assembled a corpus of three documents, each produced over a comparable three-year period:

  1. The poetry I wrote before my first poetry workshop (2004-07);
  2. The final portfolios for each of my undergraduate poetry workshops (2007-10); and
  3. My MFA thesis (2010-13).

A few preliminary takeaways:

I used to be more prolific, though much less discriminating. Before I took my first college poetry workshop, I had already written over 20,500 words, equivalent to a 180-page PDF. During undergrad, that number halved, dropping to about 10,300 words, or an 80-page PDF. My MFA thesis topped out at 6,700 words in a 68-page PDF. I have no way of quantifying “hours spent writing” during these three intervals, but anecdotally that time at least doubled at each new stage. This double movement toward more writing (time) and away from more writing (stuff) suggests a growing commitment to revision as well as a more discriminating eye for what “makes it into” the final manuscript in the end.

Undergrad taught me to compress; grad school to expand. In terms of words-per-sentence (wps), my pre-workshop poetry was coming in at about 26wps. My poetry instructor in college herself wrote densely-packed lyric verse, so it’s not surprising to see my own undergraduate poems tightening up to 20wps as images came to the forefront and exposition fell to the wayside. We were also writing in and out of a number of poetic forms–sonnet, villanelle, pantoum, terza rima–which likely further compresses the sentences making up these poems. When I brought to my first graduate workshop one of these sonnet-ish things that went halfway down the page and halfway across it, I was immediately told the next poem needed to fill the page, with lines twice as long and twice as many of them. In my second year, I took a semester-long hybrid seminar/workshop on the long poem, which positioned poetry as a time art and held up more poetic modes of thinking such as digression, association, and meandering as models for reading and producing this kind of poem. I obviously internalized this advice, as, by the time I submitted my MFA thesis, my sentences were nearly twice as long as they’d ever been before, sprawling out to a feverish and ecstatic 47wps.
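Words-per-sentence is easy enough to compute outside Voyant, too. A naive sketch, approximating sentence boundaries with terminal punctuation (shaky for poetry, but consistent enough to compare my own three manuscripts against each other):

```python
import re

def words_per_sentence(text):
    # Split on sentence-final punctuation; anything left non-empty
    # counts as a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return len(text.split()) / len(sentences)

early = "Short image. Tight line. Forms compress the sentence."
late = ("The long poem sprawls and digresses and associates, "
        "meandering across the page without stopping for breath")
print(words_per_sentence(early))  # low wps
print(words_per_sentence(late))   # high wps
```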

Things suddenly stopped “being like” other things. Across the full corpus, “like” turns out to be my most commonly-used word, appearing 223 different times. Curiously, only 13 of these are in my MFA thesis, 4 of which appear together in a single stanza of one poem. Which isn’t to say the figurative language stopped, but that it became more coded: things just started “being” (rather than “being like”) other things. For example:

Tiny errors in the Latin Vulgate
have grown horns from the head of Moses.

It is radiant. The deer has seen the face of God

spent a summer living in his house sleeping on his floor.

This one I like. But earlier figurative language was, at best, the worst, always either heavy-handed or confused–and often both. In my pre-MFA days, these were things that were allowed to be “like” other things:

  • “loose leaves sprinkled like finely chopped snow” (chopped snow?)
  • “lips that pull back like wrapping paper around her teeth” (what? no.)
  • “lights of a distant airplane flickering like fireflies on a heavy playhouse curtain” (ugh.)
  • “tossing my wrapper along the road like fast silver ash out a casual window” (double ugh.)

Other stray observations. I was still writing love poems in college, but individual names no longer appeared (Voyant shows that most of the “distinctive words” in the pre-workshop documents were names or initials of ex-girlfriends). “Love” appears only twice in the later poems.

Black, white, and red are among the top-15 terms used across the corpus, and their usage was remarkably similar from document to document (black is ominous; white is ecstatic or otherworldly; red calls attention to something out of place). The “Left-Term-Right” feature in Voyant is really tremendous in this regard.

And night-time conjures different figures over time: in the pre-workshop poems, people walk around alone at night (“I stand exposed / naked as my hand / beneath the night’s skylight moon”); in the college workshop poems, people come together at night for a party or rendezvous (“laughs around each bend bouncing like vectors across the night”); and, in the MFA thesis, night is the time for prophetic animals to arrive (“That night a deer chirped not itself by the thing so small I could not see it that was on top of it near it or inside of it & and how long had it been there?”).

Text Mining Game Comments (Probably Too Many at Once!)

To tell the truth, I’ve been playing with Voyant a lot, trying to figure out what the most interesting thing is that I could do with it! Tenen could critique my analysis on the grounds that it’s definitely doing some things I don’t fully understand; Underwood would probably quibble with my construction of a corpus and my method of selecting words to consider.  Multiple authors could very reasonably take issue with the lack of political engagement in my choice. However, if the purpose here is to get my feet wet, I think it’s a good idea to start with a very familiar subject matter, and in my case, that means board games.

Risk Legacy was published in 2011. This game reimagined the classic Risk as a series of scenarios, played by the same group, in which players would make changes to the board between (or during!) scenarios. Several years later,* the popularity and prevalence of legacy-style, campaign-style, and scenario-based board games has skyrocketed.  Two such games, Gloomhaven and Pandemic Legacy, are the top two games on BoardGameGeek as of this writing.

I was interested in learning more about the reception of this type of game in the board gaming community. The most obvious source for such information is BoardGameGeek (BGG).  I could have looked at detailed reviews, but since I preferred to look at reactions from a broader section of the community, I chose to look at the comments for each game.  BGG allows users to rate games and comment on them, and since all the games I had in mind were quite popular, there was ample data for each.  Additionally, BGG has an API that made extracting this data relatively easy.**
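For anyone who wants to replicate this: BGG’s XML API2 serves comments from its `thing` endpoint, up to 100 per page. A standard-library sketch (error handling and BGG’s rate-limiting courtesies omitted; the id below should be Gloomhaven’s):

```python
import urllib.request
import xml.etree.ElementTree as ET

def parse_comments(xml_bytes):
    # Each comment's text lives in its "value" attribute.
    root = ET.fromstring(xml_bytes)
    return [c.get("value") for c in root.iter("comment")]

def fetch_comments(game_id, page=1):
    url = ("https://boardgamegeek.com/xmlapi2/thing"
           f"?id={game_id}&comments=1&page={page}")
    with urllib.request.urlopen(url) as resp:
        return parse_comments(resp.read())

# Usage: comments = fetch_comments(174430)  # Gloomhaven
```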

As I was only able to download the most recent 100 comments for each game, this is where I started.  I listed all the games of this style that I could think of, created a file for each set of comments, and loaded them into Voyant. Note that I personally have only played five of these nine games. The games in question are:

  • The 7th Continent, a cooperative exploration game
  • Charterstone, a worker-placement strategy game
  • Gloomhaven, a cooperative dungeon crawl
  • Star Wars: Imperial Assault, a game based on the second edition of the older dungeon crawl, Descent, but with a Star Wars theme. It’s cooperative, but with the equivalent of a dungeon master.
  • Near and Far, a strategy game with “adventures” which involve reading paragraphs from a book. This is a sequel to Above and Below, an earlier, simpler game by the same designer
  • Pandemic Legacy Season One, a legacy-style adaptation of the popular cooperative game, Pandemic
  • Pandemic Legacy Season Two, a sequel to Pandemic Legacy Season One
  • Risk Legacy, described above
  • Seafall, a competitive nautical-themed game with an exploration element

The 7th Continent is a slightly controversial inclusion on this list; I have it here because it is often discussed with the others. I excluded Descent because it isn’t often considered part of this genealogy (although perhaps it should be). Both these decisions felt a little arbitrary; I can certainly understand why building a corpus is such an important and difficult part of the text-mining process!

These comments included 4,535 unique word forms, with the length of each document varying from 4,059 words (Risk Legacy) to 2,615 (7th Continent).  Voyant found the most frequent words across this corpus, but also the most distinctive words for each game. The most frequent words weren’t very interesting: game, play, games, like, campaign.*** Most of these words would probably be the most frequent for any set of game comments I loaded into Voyant! However, I noticed some interesting patterns among the distinctive words. These included:

Game Jargon referring to scenarios. That includes: “curse” for The 7th Continent (7 instances), “month” for Pandemic Legacy (15 instances), and “skirmish” for Imperial Assault (15 instances). “Prologue” was mentioned 8 times for Pandemic Legacy Season 2, in reference to the practice scenario included in the game.

References to related games or other editions. “Legacy” was mentioned 15 times for Charterstone, although it is not officially a legacy game. “Descent” was mentioned 15 times for Imperial Assault, which is based on Descent. “Below” was mentioned 19 times for Near and Far, which is a sequel to the game Above and Below. “Above” was also mentioned much more often for Near and Far than for other games; I’m not sure why it didn’t show up among the distinctive words.

References to game mechanics or game genres. Charterstone, a worker placement game, had 20 mentions of “worker” and 17 of “placement.” The word “worker” was also used 9 times for Near and Far, which also has a worker placement element; “threats” (another mechanic in the game) were mentioned 8 times. For Gloomhaven, a dungeon crawl, the word “dungeon” turned up 20 times.  Risk Legacy had four mentions of “packets” in which the new materials were kept. The comments about Seafall included 6 references to “vp” (victory points).  Near and Far and Charterstone also use victory points, but for some reason they were mentioned far less often in reference to those games.

The means by which the game was published. Kickstarter, a crowdfunding website, is very frequently used to publish board games these days. In this group, The 7th Continent, Gloomhaven, and Near and Far were all published via Kickstarter. Curiously, both the name “Kickstarter” and the abbreviation “KS” appeared with much higher frequency in the comments on the 7th Continent and Near and Far than in the comments for Gloomhaven. 7th Continent players were also much more likely to use the abbreviation than to type out the full word; I have no idea why this might be.

Thus, it appears that most of the words that stand out statistically (in this automated analysis) in the comments refer to facts about the game, rather than directly expressing an opinion. The exception to this rule was Seafall, which is by far the lowest-ranked of these games and which received some strongly negative reviews when it was first published. The distinctive words for Seafall included two very ominous ones: “willing” and “faq” (each used five times).

In any case, I suspected I could find more interesting information outside the selected terms. Here, again, Underwood worries me; if I select terms out of my own head, I risk biasing my results. However, I decided to risk it, because I wanted to see what aspects of the campaign game experience commenters found important or at least noteworthy. If I had more time to work on this, it would be a good idea to read through some reviews for good words describing various aspects of this style of game, or perhaps go back to a podcast where this was discussed, and see how the terms used there were (or weren’t) reflected in the comments. Without taking this step, I’m likely to miss things; for instance, the fact that the word “runaway” (as in, runaway leader) constitutes 0.0008 of the words used to describe Seafall, and is never used in the comments of any of the other games except Charterstone, where it appears at a much lower rate.**** As it is, however, I took the unscientific step of searching for the words that I thought seemed likely to matter. My results were interesting:
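The hand-picked search boils down to comparing a term’s relative frequency across documents, something like the following (with toy comment strings in place of the real files):

```python
# Toy stand-ins for the per-game comment files.
docs = {
    "seafall": "runaway leader problem ruins the game "
               "runaway leader problem ruins the game",
    "gloomhaven": "great cooperative dungeon crawl",
}

def rel_freq(term, text):
    # Occurrences of the term divided by total words in the document.
    words = text.lower().split()
    return words.count(term) / len(words)

for name, text in docs.items():
    print(name, rel_freq("runaway", text))
```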

(Please note that, because of how I named the files, Pandemic Legacy Season Two is the first of the two Pandemics listed!)

It’s very striking to me how different each of these bars looks. Some characteristics are hugely important to some of the games but not at all mentioned in the others! “Story*” (including both story and storytelling) is mentioned unsurprisingly often when discussing Near and Far; one important part of that game involves reading story paragraphs from a book. It’s interesting, though, that story features so much more heavily in the first season of Pandemic Legacy than the second. Of course, the mere mention of a story doesn’t mean that the story of a game met with approval; most of the comments on Pandemic Legacy’s story are positive, while the comments on Charterstone’s are a bit more mixed.

Gloomhaven comments mention characters far more often than they mention any of the other terms I searched; one of the distinguishing characteristics of this game is the way that characters change over time. Many of the comments also mentioned that the characters do not conform to common dungeon crawl tropes. However, the fact that characters come up in the comments on every game except two suggests that characters are important to players of campaign-style games.

I also experimented with some of the words that appeared in the word cloud, but since this post is already quite long, I won’t detail everything I noticed! It was interesting, for instance, to note how the use of words like “experience” and “campaign” varied strongly among these games.  (For instance: “experience” turned out to be a strongly positive word in this corpus, and applied mainly to Pandemic Legacy.)

In any case, I had several takeaways from this experience:

  • Selecting an appropriate corpus is difficult. Familiarity with the subject matter was helpful, but someone less familiar may have selected a less biased corpus.
  • The more games I included, the more difficult this analysis became!
  • My knowledge of the subject area allowed me to more easily interpret the prevalence of certain words, particularly those that constituted some kind of game jargon.
  • Words often have a particularly positive or negative connotation throughout a corpus, though they may not have that connotation outside that corpus. (For instance: rulebook. If a comment brings up the rulebook of a game, it is never to compliment it.)
  • Even a simple tool like this includes some math that isn’t totally transparent to me. I can appreciate the general concept of “distinctive words,” but I don’t know exactly how they are calculated. (I’m reading through the help files now to figure it out!)
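On that last point: one common formula for distinctiveness is tf-idf, where a term scores high when it is frequent in one document and rare in the rest. I can’t confirm this is exactly what Voyant computes, but a toy version looks like:

```python
import math

# Two tiny stand-in "documents" of comment words.
docs = {
    "charterstone": "worker placement worker game".split(),
    "gloomhaven": "dungeon crawl dungeon game".split(),
}

def tf_idf(term, doc_name):
    doc = docs[doc_name]
    tf = doc.count(term) / len(doc)                  # term frequency
    df = sum(1 for d in docs.values() if term in d)  # document frequency
    return tf * math.log(len(docs) / df)             # weight by rarity

print(tf_idf("worker", "charterstone"))  # high: frequent and exclusive
print(tf_idf("game", "charterstone"))    # zero: appears in every document
```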

I also consolidated all the comments on each game into a single file, which was very convenient for this analysis, but prevented me from distinguishing among the commenters.  This could be important if, for example, all five instances of a word were by the same author.

*Note that there was a lag of several years due to the immense amount of playtesting and design work required for this type of game.

**Thanks to Olivia Ildefonso who helped me with this during Digital Fellows’ office hours!

***Note that “like” and “game” are both ambiguous terms. “Like” is used both to express approval and to compare one game to another. “Game” could refer to the overall game or to one session of it (e.g. “I didn’t enjoy my first game of this, but later I came to like it.”).

****To be fair, it is unlikely anyone would complain of a runaway leader in 7th Continent, Gloomhaven, Imperial Assault, or either of the Pandemics, as they are all cooperative games.

Text mining the Billboard Country Top 10

My apologies to anyone who read this before the evening of October 8. I set this to post automatically, but for the wrong date and without all that I wanted to include.

I’m a big fan of music but as I’ve gotten further away from my undergrad years, I’ve become less familiar with what is currently playing on the radio. Thanks to my brother’s children, I have some semblance of a grasp on certain musical genres, but I have absolutely no idea what’s happening in the world of country music (I did at one point, as I went to undergrad in Virginia).

I decided to use Voyant Tools to do a text analysis of the first 10 songs on the Billboard Country chart from the week of September 8, 2018. The joke about country music is that it’s about dogs, trucks, and your wife leaving you. When I was more familiar with country music, I found it to be more complex than this, but a lot could have changed since I last paid attention. Will a look at the country songs with the most sales/airplay during this week support these assumptions? For the sake of uniformity, I accepted the lyrics on Genius.com as being correct and removed all extraneous words from the lyrics (chorus, bridge, etc.).
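I stripped those section labels by hand, but it could be scripted: Genius wraps them in square brackets (e.g. [Chorus], [Verse 1: Bebe Rexha]), so a small regex pass clears them out.

```python
import re

def strip_section_labels(lyrics):
    # Remove bracketed headers, then drop any lines left empty.
    cleaned = re.sub(r"\[.*?\]", "", lyrics)
    return "\n".join(line for line in cleaned.splitlines() if line.strip())

sample = "[Chorus]\nIt's meant to be\n[Verse 1]\nSome line here"
print(strip_section_labels(sample))
```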

The songs in the top 10 are as follows:

  1. Meant to Be – Bebe Rexha & Florida Georgia Line
  2. Tequila – Dan + Shay
  3. Simple – Florida Georgia Line
  4. Drowns the Whiskey – Jason Aldean featuring Miranda Lambert
  5. Sunrise, Sunburn, Sunset – Luke Bryan
  6. Life Changes – Thomas Rhett
  7. Heaven – Kane Brown
  8. Mercy – Brett Young
  9. Get Along – Kenny Chesney
  10. Hotel Key – Old Dominion

If you would like to view these lyrics for yourself, I’ve left the files in a google folder.

As we can see, the words “truck,” “dog,” “wife,” and “left” were not among the most frequently used, although it may not be entirely surprising that “ain’t” was.

The most frequently used word in the corpus, “it’s,” appeared only 19 times, showing that there is quite a bit of diversity in these lyrics. I looked for other patterns, such as whether vocabulary density or average words per sentence had an effect on a song’s position on the chart, but found no correlation.
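I checked for correlation by eyeballing the numbers; a more formal version would compute a coefficient per statistic. A sketch of Pearson’s r (the positions and vocabulary densities below are placeholders, not my actual measurements):

```python
positions = [1, 2, 3, 4, 5]                 # chart position per song
densities = [0.62, 0.55, 0.70, 0.48, 0.66]  # vocabulary density per song

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Values near 0 mean no linear relationship between the two.
print(pearson(positions, densities))
```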

Text-Mining the MTA Annual Report

After some failed attempts at text-mining other sources [1], I settled on examining the New York Metropolitan Transportation Authority’s annual reports. The MTA offers online access to its annual reports going back to the year 2000 [2]. As a daily rider and occasional critic of the MTA, I thought this might provide insight to its sometimes murky motivations.

I decided to compare the 2017, 2009, and 2001 annual reports. I chose these because 2017 was the most current, 2009 was the first annual report after the Great Recession became a steady factor in New York life, and 2001 was the annual report after the 9/11 attacks on the World Trade Center. I thought there might be interesting differences between the most recent annual report and the annual reports written during periods of intense social and financial stress.

Because the formats of the annual reports vary from year to year, I was worried that some differences emerging from text-mining might be due to those formatting changes rather than operational changes. So at first I tried to minimize this by finding sections of the annual reports that seemed analogous in all three years. After a few tries, though, I finally realized that dissecting the annual reports in this manner had too much risk of leaving out important information. It would therefore be better to simply use the entirety of the text in each annual report for comparison, since any formatting changes to particular sections would probably not change the overall tone of the annual report (and the MTA in general).

I downloaded the PDFs of the annual reports [3], copied the full text within, and ran that text through Voyant’s online text-mining tool (https://voyant-tools.org/).
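One wrinkle with copying text out of PDFs: hard line breaks and end-of-line hyphenation can split words before Voyant ever tokenizes them. A small cleanup pass like this (my own addition, not necessarily a step these particular reports required) guards against that:

```python
import re

def clean_pdf_text(raw):
    text = re.sub(r"-\n", "", raw)    # rejoin words hyphenated at line ends
    text = re.sub(r"\s+", " ", text)  # collapse breaks and runs of spaces
    return text.strip()

sample = "Metropolitan Transporta-\ntion Authority\nannual   report"
print(clean_pdf_text(sample))
```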

The 20 most frequent words for each annual report are listed below. It is important to note that these lists track specific spellings of words, but it is sometimes more important to track all related words (words with the same root, like “complete” and “completion”). Voyant allows users to search for roots instead of specific spellings, but the user needs to already know which root to search for.

2001 Top 20:
mta (313); new (216); capital (176); service (154); financial (146); transit (144); year (138); operating (135); december (127); tbta (125); percent (121); authority (120); york (120); bonds (112); statements (110); total (105); million (104); long (103); nycta (93); revenue (93)

2009 Top 20:
new (73); bus (61); station (50); mta (49); island (42); street (41); service (39); transit (35); annual (31); long (31); report (31); completed (30); target (30); page (29); avenue (27); york (24); line (23); performance (23); bridge (22); city (22)

2017 Top 20:
mta (421); new (277); million (198); project (147); bus (146); program (140); report (136); station (125); annual (121); service (110); total (109); safety (105); pal (100); 2800 (98); page (97); capital (94); completed (89); metro (85); north (82); work (80)

One of the most striking differences to me was the use of the word “safety” and other words sharing the root “safe.” Before text-mining, I would have thought that “safe” words would be most common in the 2001 annual report, reflecting a desire to soothe public fears of terrorist attacks after 9/11. Yet the most frequent use by far of “safe” words was in 2017. This was not simply a matter of raw volume, but also the frequency rate. “Safe” words were mentioned almost four times as often in 2017 (frequency rate: 0.0038) as in 2001 (0.001). “Secure” words might at first seem more comparable in 2001 (0.0017) and 2017 (0.0022). However, these results are skewed, because in 2001, many of the references to “secure” words were in their financial meaning, not their public-safety meaning. (e.g. “Authority’s investment policy states that securities underlying repurchase agreements must have a market value…”)
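The frequency rate here is just occurrences over total words, extended to every word sharing a root. A sketch (the word list is made up, and, as the “securities” example shows, simple root-matching can’t distinguish the financial sense from the public-safety one):

```python
def root_rate(root, words):
    # Count every token that begins with the root, e.g. "safe",
    # "safety", "safely" -- but not "unsafe".
    return sum(1 for w in words if w.lower().startswith(root)) / len(words)

report_2017 = "safety safe unsafe service station".split()
print(root_rate("safe", report_2017))  # 0.4
```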

The much stronger recent focus on safety might be due to the fact that the 9/11 attacks were not the fault of the MTA, so any disruptions to safety at the time could be seen as understandable. The 2001 annual report mentioned that the agency was mostly continuing to follow the “MTA all-agency safety initiative, launched in 1996.” By 2017, however, a series of train and bus crashes (one of which happened just one day ago), along with heavy media coverage of the MTA’s financial corruption and faulty equipment, may have shifted blame for safety issues onto the MTA’s own internal problems. The MTA might therefore now feel a greater need to emphasize its commitment to safety, whereas that commitment was previously taken for granted.

In a similar vein, “replace” words were five times more frequent in 2017 (0.0022) than in 2001 (0.0004). “Repair” words were also much more frequent in 2017 (0.0014) than 2001 (0.00033). In 2001, the few mentions of “repair” were often in terms of maintaining “a state of good repair,” which might indicate that the MTA thought the system was already working pretty well. By 2017, public awareness of the system’s dilapidation might have changed that. Many mentions of repair and replacement in the 2017 annual report are also in reference to damage done by Hurricane Sandy (which happened in 2012).

In contrast to 2017’s focus on safety and repair, the 2001 annual report is more concerned with financial information than the later reports are. Many of its top twenty words are economic terms, such as “capital,” “revenue,” and “bonds.” In fact, as mentioned above, the 2001 annual report often uses the word “security” in its financial sense.

The 2009 annual report was far shorter (6,272 words) than those of 2001 (36,126 words) and 2017 (29,706 words). Perhaps the Great Recession put such a freeze on projects that there simply wasn’t as much to discuss. However, even after accounting for the prevalence of “New York,” 2009 still had a much higher frequency rate for the word “new.” (The prevalence of “new” every year at first made me think the MTA was obsessed with promoting new projects, but the Links tool in Voyant reminded me that this was largely because of “New York.”) Maybe even though there weren’t many new projects to trumpet, the report tried particularly hard to highlight what there was.

The recession might also be why “rehabilitate” and its related words were barely used in 2001 and 2017 but appeared heavily in 2009 (0.0043). Rehabilitating current infrastructure might be less costly than completely new projects, yet still allow the word “new” to be used. “Rehabilitate” words were used even more frequently in 2009 than the word “York.”

One significant flaw in Voyant is that it doesn’t seem to provide the frequency rate of a word across the entire document; it only provides the rate within each segment of the document, and the lowest number of segments a user can select is two. This means users have to calculate the document-level frequency rate themselves by dividing the number of instances by the number of words in the document. If the document-level rate is available somewhere in the Voyant results, it isn’t intuitive to find and it isn’t explained in the Voyant instructions.
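The manual calculation is simple division, sketched below. The word count of 29,706 is the 2017 report's length as given above; the instance count of 113 is a hypothetical figure chosen for illustration.

```python
def frequency_rate(count, total_words):
    """Document-level frequency rate: occurrences of a word
    (or word family) divided by total words in the document."""
    return count / total_words

# Hypothetical count of 113 instances in the 29,706-word 2017 report:
print(round(frequency_rate(113, 29706), 4))  # → 0.0038
```

Keeping the rate rather than the raw count is what makes reports of very different lengths (6,272 vs. 36,126 words) comparable at all.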

Although I generally found Voyant to be an interesting and useful tool, it always needs to be combined with traditional analysis of the text. Without keeping an eye on the context of the results, it would be easy to make false assumptions about why particular words are being used. Helpfully, Voyant has “Contexts” and “Reader” windows that allow users to quickly check for themselves how a word is being used in the text.

[1] I first ran Charles Darwin’s “Origin of Species” and “Descent of Man” through Voyant, but the results were not particularly surprising. The most common words were ones like “male,” “female,” “species,” “bird,” etc.

In a crassly narcissistic decision, I then pasted one of my own unpublished novels into Voyant. This revealed a few surprises about my writing style (the fifth most common word was “like,” which either means I love similes or being raised in Southern California during the 1980s left a stronger mark than I thought). I also apparently swear a lot. However, this didn’t seem socially relevant enough to center an entire report around.

Then I thought it might be very relevant to text-mine the recent Supreme Court confirmation hearings of Brett Kavanaugh and compare them to his confirmation hearings when he was nominated to the D.C. Circuit Court of Appeals. Unfortunately, there are no full transcripts of the Supreme Court hearings available yet. The closest approximation I found was the C-Span website, which has limited closed-caption transcripts, but its user interface doesn’t allow copying the full text of a hearing. The transcripts for Kavanaugh’s 2003 and 2006 Circuit Court hearings were available from the U.S. Congress’s website, but the website warned that transcripts of hearings can take years to be made available. Since the deadline for this assignment is October 9, I decided that was too much of a gamble. I then tried running Kavanaugh’s opening statements through Voyant, but that seemed like too small a sample to draw any significant conclusions. (Although it’s interesting that he used the word “love” a lot more in 2018 than he did back in 2003.)

[2] 2017: http://web.mta.info/mta/compliance/pdf/2017_annual/SectionA-2017-Annual-Report.pdf
2009: http://web.mta.info/mta/compliance/pdf/2009%20Annual%20Report%20Narrative.pdf
2001: http://web.mta.info/mta/investor/pdf/annualreport2001.pdf

[3] It’s important to download the PDFs before copying text. Copying directly from websites can result in text that has a lot of formatting errors, which then requires data-cleaning and can lead to misleading results.
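Even text copied from a downloaded PDF can carry stray line breaks and hyphenation. A minimal cleanup pass, using made-up sample input, might look like this (real cleaning needs will vary by source):

```python
import re

def normalize_pasted_text(raw):
    """Clean up common artifacts of text copied from PDFs or web pages."""
    # Rejoin words hyphenated across line breaks (e.g. "re-\nport")
    text = re.sub(r"-\s*\n\s*", "", raw)
    # Collapse remaining runs of whitespace into single spaces
    return re.sub(r"\s+", " ", text).strip()

print(normalize_pasted_text("annual  re-\nport\n\ntext"))  # → "annual report text"
```

One caveat: blindly removing hyphens will also mangle genuinely hyphenated words that happen to break across lines, so spot-checking the output is still necessary.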