
Project T.R.I.K.E – Principles and Origin Myths

Hannah’s already provided some use cases that I hope help to illustrate why we think Project T.R.I.K.E. will be useful, and to whom. I wanted to backtrack and give some context. As Hannah’s post suggests, it’s quite difficult to pinpoint a specific starting point for our thought processes, which have developed iteratively until we’re not sure whether or not we’re trapped in a time loop. However, I think I can trace through some of the things that strike me as important about the project.

We really wanted to do something that would be useful for pedagogy. Again, if you want to know how it’s useful for pedagogy, please see Hannah’s post! But we were specifically interested in a resource that would teach methodology, because all of us were methodological beginners who really felt the need for more tools and resources that would help us to develop in that respect. During our environmental scan, we were impressed by the efforts of the DH community to produce a number of useful guides to tools, methodologies, and processes (in particular, please see Alan Liu’s DH Toy Chest and devdh.org), although none of them does exactly what we want to do. There are plenty of dead resources out there, too, and we should take that as a warning.

We really wanted to take a critical stance on data by creating something that would highlight its contingent, contextual, constructed nature, acknowledging that datasets are selected and prepared by human researchers, and that the questions one can ask are inextricably connected to the process through which the dataset is constituted. Our emphasis on a critical approach does not originate in this class; I believe all of us had been exposed to theories about constructedness before this. What’s curious about our process is that we went out seeking datasets and tutorials with this in mind, thinking about what we hoped to do, and this conversation ranged far from the class readings, focusing on our own work and on Rawson and Muñoz’s “Against Cleaning,” but it eventually brought us back to Posner, Bode, and Drucker. None of them, however, arrived at exactly the solution we did; we decided that the constructed nature of data is best represented by making the process of construction itself transparent! Project T.R.I.K.E. will provide snapshots of the data at different stages in the process, highlighting the decisions made by researchers and interrogating how these decisions are embodied in the data.

Finally, we really wanted to ensure that we could produce something that could be open to the community. Again, a lot of work in the DH community is openly available, but we also came across some datasets behind paywalls.  One repository aggregating these datasets not only made it difficult to access the databases but also had a series of stern lectures about copyright, occupying much the same space on their website that instruction in methodology would occupy on ours! While it is true that some humanities data may invoke copyright in a way that other kinds of data usually don’t, we’d much rather host datasets that we can make available to a wide variety of users with a wide variety of use cases. Limiting access to data limits research.

Think carefully, though. As part of the environmental scan, we came across an article that argues, on the basis of a report partially sponsored by Elsevier, that researchers seldom make their data available, even when they are required to do so. While I expect this is true, I am also suspicious of claims like this when they are made by major publishers, because their next step will probably be to offer a proprietary solution which will give them yet more control over the scholarly communication ecosystem. In a context in which major publishers are buying up repositories, contacting faculty directly, and co-opting the language of open access as they do so, I’d argue that it’s more and more important for academics to build out their (our) own infrastructure. Project T.R.I.K.E. has slightly humbler ambitions, for the time being, but it’s an opportunity for us to begin building some infrastructure of our own.

A Network Analysis of our Initial Class Readings

Introduction
This praxis project visualizes a network analysis of the bibliographies from the September 4th required readings in our class syllabus plus the recommended “Digital Humanities” piece by Professor Gold. My selection of topic was inspired by a feeling of being swamped by PDFs and links that were accumulating in my “readings” folder with little easy-to-reference surrounding context or differentiation. Some readings seemed to be in conversation with each other, but it was hard to keep track. I wanted a visualization to help clarify points of connection between the readings. This is inherently reductionist and (unless I’m misquoting here, in which case sorry!) it makes Professor Gold “shudder”, but charting things out need not replace the things themselves. To me, it’s about creating helpful new perspectives from which to consider material and ways to help it find purchase in my brain.

Data Prep
I copy/pasted author names from the bibliographies of each reading into a spreadsheet. Data cleaning (and a potential point for the introduction of error) consisted of manually editing names as needed to make all follow the same format (last name, first initial). For items with summarized “et al” authorship, I looked up and included all author names.
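If I ever need to do this at a larger scale, the same normalization could be scripted. Here is a minimal sketch in Python of the kind of cleaning I did by hand; the file names, and the assumption that the raw names sit in a one-column CSV, are mine and not part of the actual project.

    import csv
    import re

    def normalize(name):
        """Convert 'First M. Last' or 'Last, First Middle' to 'Last, F.' format."""
        name = re.sub(r"\s+", " ", name).strip()   # collapse stray spaces
        if "," in name:                            # already 'Last, First ...'
            last, first = [p.strip() for p in name.split(",", 1)]
        else:                                      # 'First ... Last'
            parts = name.split(" ")
            last, first = parts[-1], " ".join(parts[:-1])
        initial = first[0].upper() + "." if first else ""
        return f"{last}, {initial}".strip().rstrip(",")

    with open("authors_raw.csv", newline="", encoding="utf-8") as f:
        raw = [row[0] for row in csv.reader(f) if row]

    cleaned = sorted({normalize(n) for n in raw})  # dedupe while we're at it
    with open("authors_clean.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows([[n] for n in cleaned])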

I performed the network analysis in Cytoscape, aided by Miriam Posner’s clear and helpful tutorial. Visualizing helped me identify and fix errors in the data, such as an extra space causing two otherwise identical names to display separately.

The default Circular Layout option in the “default black” style rendered an attractive graph with the nodes arranged around two perfect circles, but unfortunately the labels overlapped and many were illegible. To fix the overlap, I individually adjusted the placement of the nodes, dragging alternating nodes toward or away from the center to give each label room to appear and be readable in its own space. I also changed the label color from gray to white for improved contrast and added yellow directional indicators, as discussed below. I think the result is beautiful.

Network Analysis Graph
Click the placeholder image below and a high-res version will open in a new tab. You can zoom in and read all labels on the high-res file.

An interactive version of my graph is available on CyNetShare, though unfortunately that platform is stripping out my styling. The un-styled, harder-to-read, but interactive version can be seen here.

Discussion
Author nodes in this graph are white circles and connecting edges are green lines. This network analysis graph is directional. The class readings are depicted with in-bound connections from the works cited terminating in yellow diamond shapes. From the clustering of yellow diamonds around certain nodes, one can identify that our readings were authored by Kirschenbaum, Fitzpatrick, Gold, Klein, Spiro, Hockey, Alvarado, Ramsay, and (off in the lower left) Burke. Some of these authors cited each other, as can be seen by the green edges between yellow-diamond-cluster nodes. Loops at a node indicate the author citing themselves. Multiple lines connecting the same two nodes indicate citations of multiple pieces by the same author.

It is easy to see in this graph that all of the readings were connected in some way, with the exception of an isolated two-node constellation in the lower left of my graph. That constellation represents “The Humane Digital” by Burke, which had only one item (which was by J. Scott) in its bibliography. Neither Burke nor Scott authored nor were cited in any of the other readings, therefore they have no connections to the larger graph.

The vast majority of the nodes fall into two concentric circle forms. The outer circle contains the names of those who were cited in only one of the class readings. The inner circle contains those who were cited in more than one reading, including citations by readings-authors of other readings-authors. These inner circle authors have greater out-degree connectedness and therefore more influence in this graphed network than do the outer circle authors. The authors with the highest degree of total connections among the inner circle are Gold, Klein, Kirschenbaum, and Spiro. The inner circle is a hub of interconnected digital humanities activity.

We can see that Spiro and Hockey had comparatively extensive bibliographies, but that Spiro’s work has many more connections to the inner circle digital humanities hub. This is likely at least partly because Hockey’s piece is from 2004, while the rest of the readings are from 2012 or 2016 (plus one due to be published in 2019). One possible factor is that some of the other authors may not yet have been publishing related work when Hockey was writing her piece in the early 2000s. Six of our readings were from 2012, the year of Spiro’s piece. Perhaps a much richer and more interconnected conversation about the digital humanities developed at some point between 2004 and 2012.

This network analysis and visualization is useful to me as a mnemonic aid for keeping the readings straight. It can also point a student of the digital humanities toward authors they may find it useful to read more of or follow on Twitter.

A Learning about Names
I have no indication that this is or isn’t occurring in my network analysis, but in the process of working on it I realized that any name change, such as one due to a change in marital status, would make an author appear as two different people. This predominantly affects women and, without a corrective in place, could make them appear less central in graphed networks.

There are instances where people may have published with different sets of initials. In the bibliography to Hockey’s ‘The History of Humanities Computing,’ an article by ‘Wisbey, R.’ is listed just above a collection edited by ‘Wisbey, R. A.’ These may be the same person, but it cannot be determined with certainty from the bibliography data alone. Likewise, ‘Robinson, P.’ and ‘Robinson, P. M. W.’ are separately listed authors for works about Chaucer. These are likely the same person, but without further research I cannot be 100% certain. I chose not to intervene manually, so these entries remain separate. It is useful to be aware that changing how one lists oneself in authorship may affect how algorithms understand the networks to which one belongs.

Potential Problems
I would like to learn to what extent the following are problematic and what remedies may exist. My network analysis graph:

  • Doesn’t distinguish between authors and editors
  • Splits apart collaborative works into individual authors (I had to do this manually)
  • Doesn’t include works that had no author or editor listed

Postscript: Loose Ties to a Current Reading
In “How Not to Teach Digital Humanities,” Ryan Cordell suggests that introductory classes should not lead with “meta-discussions about the field” or “interminable discussions of what counts or does not count [as digital humanities]”. In his experience, undergraduate and graduate students alike find this unmooring and dispiriting.

He recommends that instructors “scaffold everything [emphasis in the original]” to foster student engagement. There is no one-size-fits-all in pedagogy. Even for the same student, learning may happen more quickly, or information may be stickier, if it is presented in context or in more than one way. Providing multiple ways into the information that a course covers can lead to good student learning outcomes. It can also be useful to provide scaffolding for next steps or going beyond the basics for students who want to learn more. My network analysis graph is not perfect, but having something as a visual reference is useful to me and likely other students as well.

Cordell also endorses teaching how the digital humanities are practiced locally and clearly communicating how courses will build on each other. This can help anchor students in where their institution and education fit in with the larger discussions about what the field is and isn’t. Having gone through the handful of assigned “what is DH” pieces, I look forward to learning more about the local CUNY GC flavor in my time as a student here. This is an exciting field!

 

Update 11/6/18:

As I mentioned in the comments, it was bothering me that certain authors who appeared in the inner circle rightly belonged in the outer circle. These were authors who were cited once in the introductions to Debates in the Digital Humanities co-authored by M. K. Gold and L. Klein. Due to a challenge in depicting co-authorship, M. K. Gold and L. Klein appear separately in the network analysis, so those authors appeared to be cited twice (once each by Gold and Klein) rather than the one time they were cited in the pieces co-authored by Gold and Klein.

I have attempted to clarify the status of those authors in the new version of my visualization below by moving them into the outer ring. It’s not a perfect solution, as each author still shows two edges instead of one, but it does make the visualization somewhat less misleading and clarifies who are the inner circle authors.

 

Bibliographies, Networks, and CUNY Academic Works

I was really excited about doing a network analysis, even though I seem to have come all the way over here to DH just to do that most librarianly of research projects, a citation analysis.

I work heavily with our institutional repository, CUNY Academic Works, so I wanted to do a project having to do with that.  Institutional repositories are one of the many ways that scholarly works can be made openly available.  Ultimately, I’m interested in seeing whether the works that are made available through CAW are, themselves, using open access research, but for this project, I thought I’d start a little smaller.

CAW allows users to browse by discipline using this “Sunburst” image.

Each general subject is divided into smaller sub-disciplines.  Since I was hoping to find a network, I wanted to choose a sub-discipline that was narrow but fairly active. I navigated to “Arts and Humanities,” from there to “English Language and Literature,” and finally to “Literature in English, North America, Ethnic and Cultural Minority.” From there, I was able to look at works in chronological order. Like most of the repository, this subject area is dominated by dissertations and capstone papers; this is really great for my purposes because I am very happy to know which authors students are citing and from where.

The data cleaning process was laborious, and I think I got a little carried away with it. After I’d finished, I tweeted about it, and Hannah recommended pypdf as a tool I could have used to do this work much more quickly. Since I’d really love to do similar work on a larger scale, this is a really helpful recommendation, and I’m planning on playing with it some more in the future (thanks, Hannah!).
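For anyone curious what that would look like, here’s a rough sketch of pulling text out of a PDF with pypdf. The file name is a placeholder, and the crude bibliography-finding step is my own assumption about where one might start; the real parsing would take more care.

    from pypdf import PdfReader   # pip install pypdf

    reader = PdfReader("dissertation.pdf")
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    # Crude first pass: keep everything after the last bibliography heading.
    for marker in ("Bibliography", "BIBLIOGRAPHY", "Works Cited", "WORKS CITED"):
        if marker in text:
            text = text.rsplit(marker, 1)[-1]
            break

    print(text[:2000])   # eyeball the extraction before parsing author names out of it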

I ended up looking at ten bibliographies in this subject, all of which were theses and dissertations from 2016 or later.  Specifically:

Jarzemsky, John. “Exorcizing Power.”

Green, Ian F. P. “Providential Capitalism: Heavenly Intervention and the Atlantic’s Divine Economist”

La Furno, Anjelica. “‘Without Stopping to Write a Long Apology’: Spectacle, Anecdote, and Curated Identity in Running a Thousand Miles for Freedom”

Danraj, Andrea A. “The Representation of Fatherhood as a Declaration of Humanity in Nineteenth-Century Slave Narratives”

Kaval, Lizzy Tricano. “‘Open, and Always, Opening’: Trans- Poetics as a Methodology for (Re)Articulating Gender, the Body, and the Self ‘Beyond Language’”

Brown, Peter M. “Richard Wright’s and Chester Himes’s Treatment of the Concept of Emerging Black Masculinity in the 20th Century”

Brickley, Briana Grace. “‘Follow the Bodies’: (Re)Materializing Difference in the Era of Neoliberal Multiculturalism”

Eng, Christopher Allen. “Dislocating Camps: On State Power, Queer Aesthetics & Asian/Americanist Critique”

Skafidas, Michael P. “A Passage from Brooklyn to Ithaca: The Sea, the City and the Body in the Poetics of Walt Whitman and C. P. Cavafy”

Cranstoun, Annie M. “Ceasing to Run Underground: 20th-Century Women Writers and Hydro-Logical Thought”

Many other theses and dissertations are listed in Academic Works, but are still under embargo. For those members of the class who will one day include your own work in CAW, I’d like to ask on behalf of all researchers that you consider your embargo period carefully! You have a right to make a long embargo for your work if you wish, but the sooner it’s available, the more it will help people who are interested in your subject area.

In any case, I extracted the authors’ names from these ten bibliographies and put them into Gephi to make a graph. I thought about using the titles of journals instead, which I think will be my next project, but since all the nodes on the graph have such a similar visual appearance, I was reluctant to mix such different data points as authors and journals.

As I expected, each bibliography had its own little cluster of citations, but there were a few authors that connected them, and some networks were closer than others.

Because I was especially interested in the authors that connected these different bibliographies, I used Betweenness Centrality to map these out, to produce a general shape like this:

This particular configuration of the data uses the Force Atlas layout. There were several available layouts, and I don’t know how they’re generated, but this one did a really nice job of rendering my data in a way that looked 3D and brought out some relationships among the ten bibliographies.
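Gephi handled all of this through its interface, but the same measure can be computed in Python with networkx. This is only a sketch under my own assumptions (a hypothetical CSV with one citing-author/cited-author pair per row), not the workflow I actually used:

    import csv
    import networkx as nx

    G = nx.DiGraph()
    with open("citations.csv", newline="", encoding="utf-8") as f:
        for citing, cited in csv.reader(f):   # one row per edge: citing author, cited author
            G.add_edge(citing.strip(), cited.strip())

    # Betweenness centrality highlights authors who sit "between" the bibliographies.
    bc = nx.betweenness_centrality(G)
    for author, score in sorted(bc.items(), key=lambda kv: kv[1], reverse=True)[:10]:
        print(f"{author}: {score:.4f}")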

Some Limitations to My Data

Hannah discussed this in her post, and I’d run into a lot of the same issues and had forgotten to include it in my blog post!  Authors are not always easy entities to grasp. Sometimes a cited work may have several authors, and in some cases, dissertation authors cited edited volumes by editor, rather than the specific pieces by their authors. Some of the authors were groups rather than individuals (for instance, the US Supreme Court), and some pieces were cited anonymously.

In most cases, I just worked with what I had. If it was clear that an author was being cited in more than one way, I tried to collapse them, because there were so few points of contact that I wanted to be sure to bring them all out. There were a few misspellings of Michel Foucault’s name, but it was really important to me to know how relevant he was in this network.

Like Hannah, I pretended that editors were authors, for the sake of simplicity.  Unlike her, I didn’t break out the authors in collaborative ventures, although I would have in a more formal version of this work.  It simply added too much more data cleaning on top of what I’d already done.  So I counted all the co-authored works as the work of the first author — flawed, but it caught some connections that I would have missed otherwise.

Analyzing the Network

Even from this distance, we can get a sense of the network. For instance, there is only one “island bibliography,” unconnected to the rest.

Note, however, that another isolated node is somewhat obscured by its positioning: Jarzemsky, whose only connection to the other authors is through Judith Butler.

So, the two clearest conclusions were these:

  • There is no source common to all ten bibliographies, but nine of them share at least one source with at least one other bibliography!
  • However, no “essential” sources really stand out on the chart, either. A few sources were cited by three or four authors, but none of them were common to all or even a majority of bibliographies.

My general impression, then, is that there are a few sources that are important enough to be cited very commonly, but perhaps no group of authors that are so important that nearly everyone needs to cite them. This makes sense, since “Ethnic and Cultural Minority” lumps together many different groups, whose networks may be more visible with a more focused corpus.

There’s also a disparity among the bibliographies; some included many more sources than others (perhaps because some are PhD dissertations and others are master’s theses, so there’s a difference in length and scope). Eng built the biggest bibliography, so it’s not too surprising that his bibliography is near the center of the graph and has the most connections to other bibliographies; I suspect this is an inherent bias with this sort of study.

The triangle of Eng, Brickley, and Kaval had some of the densest connections in the network. I’ve tried to capture a little of it in this screenshot:

In the middle of this triangle, several authors are cited by each of these authors, including Judith Butler, Homi Bhabha, Sara Ahmed, and Gayle Salamon. The connections between Brickley and Eng include some authors who speak to their shared interest in Asian-American writers, such as Karen Tei Yamashita, but also authors like Stuart Hall, who theorizes multiculturalism. On the other side, Kaval and Eng both cite queer theorists like Jack Halberstam and Barbara Voss, but there are no connections between Brickley and Kaval that aren’t shared by Eng. There’s a similar triangle among Eng, Skafidas, and Green, but Skafidas has fewer connections to the four authors I’ve mentioned than they have to each other. This is interesting given the size of Skafidas’s bibliography; he cites many others that aren’t referred to in the other bibliographies.

(Don’t mind Jarzemsky; he ended up here but doesn’t share any citations with either Skafidas or Cranstoun.)

On the other hand, there is a stronger connection between Skafidas and Cranstoun. Skafidas writes on Cavafy and Cranstoun on Woolf, so they both cite modernist critics. However, because they are not engaging with multiculturalism as many of the other authors are, they have fewer connections to the others. In fact, Cranstoun’s only connection to an author besides Skafidas is to Eng, via Eve Kosofsky Sedgwick (which makes sense, as Cranstoun is interested in gender and Eng in queerness).  Similarly, La Furno and Danraj, who both write about slave narratives, are much more closely connected to each other than to any of the other authors – but not as closely as I’d have expected, with only two shared connections between them. The only thing linking them to the rest of the network is La Furno’s citation of Hortense Spillers, shared by Brickley.

My Thoughts

I’d love to do this work at a larger scale. Perhaps if I could get a larger sample of papers from this section of CAW, I’d start seeing the different areas that fall into this broad category of “Literature in English, North America, Ethnic and Cultural Minority.” I’m seeing some themes already – modernism, Asian-American literature, gender, and slave narratives seem to form their own clusters.  The most isolated author on my network wrote about twentieth-century African American literature and would surely have been more connected if I’d found more works dealing with the same subject matter. As important as intersectionality is, there are still networks based around specific literatures related to specific identity categories, with only a few  prominent authors that speak to overlapping identities. We may notice that Eng, who is interested in the overlap between ethnicity and queerness, is connected to Brickley on one side (because she is also interested in Asian-American literature) and Kaval on the other (because she is also interested in queerness and gender).

Of course, there are some flaws with doing this the way that I have; since I’m looking at recent works, they are unlikely to cite each other, so the citations are going in only one direction and not making what I think of as a “real” network. However, I do think it’s valuable to see what people at CUNY are doing!

But I guess I’m still wondering about that – are these unidirectional networks useful, or is there a better way of looking at those relationships? I suppose a more accurate depiction of the network would involve several layers of citations, but I worry about the complexity that would produce.

In any case, I still want to look at places of publication. It’s a slightly more complex approach, but I’d love to see which authors are publishing in which journals and then compare the open access policies of those journals. Which ones make published work available without a subscription? Which ones allow authors to post to repositories like this one?

Also: I wish I could post a link to the whole file! It makes a lot more sense when you can pan around it instead of just looking at screenshots.

Text Mining Game Comments (Probably Too Many at Once!)

To tell the truth, I’ve been playing with Voyant a lot, trying to figure out what the most interesting thing is that I could do with it! Tenen could critique my analysis on the grounds that it’s definitely doing some things I don’t fully understand; Underwood would probably quibble with my construction of a corpus and my method of selecting words to consider.  Multiple authors could very reasonably take issue with the lack of political engagement in my choice. However, if the purpose here is to get my feet wet, I think it’s a good idea to start with a very familiar subject matter, and in my case, that means board games.

Risk Legacy was published in 2011. This game reimagined the classic Risk as a series of scenarios, played by the same group, in which players would make changes to the board between (or during!) scenarios. Several years later,* the popularity and prevalence of legacy-style, campaign-style, and scenario-based board games have skyrocketed. Two such games, Gloomhaven and Pandemic Legacy, are the top two games on BoardGameGeek as of this writing.

I was interested in learning more about the reception of this type of game in the board gaming community. The most obvious source for such information is BoardGameGeek (BGG).  I could have looked at detailed reviews, but since I preferred to look at reactions from a broader section of the community, I chose to look at the comments for each game.  BGG allows users to rate games and comment on them, and since all the games I had in mind were quite popular, there was ample data for each.  Additionally, BGG has an API that made extracting this data relatively easy.**
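In case it’s useful to anyone else, the request looks roughly like the sketch below. I’m writing the endpoint and parameters from memory of the BGG XML API, so treat them (and the example game id, which comes from the game’s URL on the site) as assumptions to verify rather than as exactly what I ran.

    import requests
    import xml.etree.ElementTree as ET

    # BGG's XML API returns up to 100 comments per page for a given game id;
    # the id is the number in the game's boardgamegeek.com URL.
    GAME_ID = 174430   # e.g. Gloomhaven (double-check against the URL)
    resp = requests.get(
        "https://boardgamegeek.com/xmlapi2/thing",
        params={"id": GAME_ID, "comments": 1, "page": 1},
    )
    resp.raise_for_status()

    root = ET.fromstring(resp.content)
    comments = [c.get("value") for c in root.iter("comment") if c.get("value")]

    with open("gloomhaven_comments.txt", "w", encoding="utf-8") as f:
        f.write("\n\n".join(comments))
    print(f"Saved {len(comments)} comments")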

As I was only able to download the most recent 100 comments for each game, this is where I started.  I listed all the games of this style that I could think of, created a file for each set of comments, and loaded them into Voyant. Note that I personally have only played five of these nine games. The games in question are:

  • The 7th Continent, a cooperative exploration game
  • Charterstone, a worker-placement strategy game
  • Gloomhaven, a cooperative dungeon crawl
  • Star Wars: Imperial Assault, a game based on the second edition of the older dungeon crawl, Descent, but with a Star Wars theme. It’s cooperative, but with the equivalent of a dungeon master.
  • Near and Far, a strategy game with “adventures” which involve reading paragraphs from a book. This is a sequel to Above and Below, an earlier, simpler game by the same designer
  • Pandemic Legacy Season One, a legacy-style adaptation of the popular cooperative game, Pandemic
  • Pandemic Legacy Season Two, a sequel to Pandemic Legacy Season One
  • Risk Legacy, described above
  • Seafall, a competitive nautical-themed game with an exploration element

The 7th Continent is a slightly controversial inclusion to this list; I have it here because it is often discussed with the others. I excluded Descent because it isn’t often considered as part of this genealogy (although perhaps it should be). Both these decisions felt a little arbitrary; I can certainly understand why building a corpus is such an important and difficult part of the text-mining process!

These comments included 4,535 unique word forms, with the length of each document varying from 4,059 words (Risk Legacy) to 2,615 (7th Continent).  Voyant found the most frequent words across this corpus, but also the most distinctive words for each game. The most frequent words weren’t very interesting: game, play, games, like, campaign.*** Most of these words would probably be the most frequent for any set of game comments I loaded into Voyant! However, I noticed some interesting patterns among the distinctive words. These included:

Game Jargon referring to scenarios. That includes: “curse” for The 7th Continent (7 instances), “month” for Pandemic Legacy (15 instances), and “skirmish” for Imperial Assault (15 instances). “Prologue” was mentioned 8 times for Pandemic Legacy Season 2, in reference to the practice scenario included in the game.

References to related games or other editions. “Legacy” was mentioned 15 times for Charterstone, although it is not officially a legacy game. “Descent” was mentioned 15 times for Imperial Assault, which is based on Descent. “Below” was mentioned 19 times for Near and Far, which is a sequel to the game Above and Below. “Above” was also mentioned much more often for Near and Far than for other games; I’m not sure why it didn’t show up among the distinctive words.

References to game mechanics or game genres. Charterstone, a worker placement game, had 20 mentions of “worker” and 17 of “placement.” The word “worker” was also used 9 times for Near and Far, which also has a worker placement element; “threats” (another mechanic in the game) were mentioned 8 times. For Gloomhaven, a dungeon crawl, the word “dungeon” turned up 20 times.  Risk Legacy had four mentions of “packets” in which the new materials were kept. The comments about Seafall included 6 references to “vp” (victory points).  Near and Far and Charterstone also use victory points, but for some reason they were mentioned far less often in reference to those games.

The means by which the game was published. Kickstarter, a crowdfunding website, is very frequently used to publish board games these days. In this group, The 7th Continent, Gloomhaven, and Near and Far were all published via Kickstarter. Curiously, both the name “Kickstarter” and the abbreviation “KS” appeared with much higher frequency in the comments on the 7th Continent and Near and Far than in the comments for Gloomhaven. 7th Continent players were also much more likely to use the abbreviation than to type out the full word; I have no idea why this might be.

Thus, it appears that most of the words that stand out statistically (in this automated analysis) in the comments refer to facts about the game, rather than directly expressing an opinion. The exception to this rule was Seafall, which is by far the lowest-ranked of these games and which received some strongly negative reviews when it was first published. The distinctive words for Seafall included two very ominous ones: “willing” and “faq” (each used five times).

In any case, I suspected I could find more interesting information outside the selected terms. Here, again, Underwood worries me; if I select terms out of my own head, I risk biasing my results. However, I decided to risk it, because I wanted to see what aspects of the campaign game experience commenters found important or at least noteworthy. If I had more time to work on this, it would be a good idea to read through some reviews for good words describing various aspects of this style of game, or perhaps go back to a podcast where this was discussed, and see how the terms used there were (or weren’t) reflected in the comments. Without taking this step, I’m likely to miss things; for instance, the fact that the word “runaway” (as in, runaway leader) constitutes 0.0008 of the words used to describe Seafall, and is never used in the comments of any of the other games except Charterstone, where it appears at a much lower rate.**** As it is, however, I took the unscientific step of searching for the words that I thought seemed likely to matter. My results were interesting:

(Please note that, because of how I named the files, Pandemic Legacy Season Two is the first of the two Pandemics listed!)

It’s very striking to me how different each of these bars looks. Some characteristics are hugely important to some of the games but not at all mentioned in the others! “Story*” (including both story and storytelling) is mentioned unsurprisingly often when discussing Near and Far; one important part of that game involves reading story paragraphs from a book. It’s interesting, though, that story features so much more heavily in the first season of Pandemic Legacy than the second. Of course, the mere mention of a story doesn’t mean that the story of a game met with approval; most of the comments on Pandemic Legacy’s story are positive, while the comments on Charterstone’s are a bit more mixed.

The Gloomhaven comments are much more about characters than about any of the other terms I searched for; one of the distinguishing characteristics of this game is the way that characters change over time. Many of the comments also mentioned that the characters do not conform to common dungeon crawl tropes. However, the fact that characters are mentioned in the comments for every game except two suggests that characters are important to players of campaign-style games.

I also experimented with some of the words that appeared in the word cloud, but since this post is already quite long, I won’t detail everything I noticed! It was interesting, for instance, to note how the use of words like “experience” and “campaign” varied strongly among these games.  (For instance: “experience” turned out to be a strongly positive word in this corpus, and applied mainly to Pandemic Legacy.)

In any case, I had several takeaways from this experience:

  • Selecting an appropriate corpus is difficult. Familiarity with the subject matter was helpful, but someone less familiar may have selected a less biased corpus.
  • The more games I included, the more difficult this analysis became!
  • My knowledge of the subject area allowed me to more easily interpret the prevalence of certain words, particularly those that constituted some kind of game jargon.
  • Words often have a particularly positive or negative connotation throughout a corpus, though they may not have that connotation outside that corpus. (For instance: rulebook. If a comment brings up the rulebook of a game, it is never to compliment it.)
  • Even a simple tool like this includes some math that isn’t totally transparent to me. I can appreciate the general concept of “distinctive words,” but I don’t know exactly how they are calculated. (I’m reading through the help files now to figure it out!)
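From what I’ve gathered so far, one common way to surface “distinctive” words is TF-IDF, which rewards words that are frequent in one document but rare across the rest of the corpus. I don’t know that this is exactly what Voyant computes, but a rough sketch of the idea, using scikit-learn and my one-file-per-game setup (file names are placeholders), would look like this:

    from pathlib import Path
    from sklearn.feature_extraction.text import TfidfVectorizer

    # One text file of comments per game, e.g. comments/gloomhaven.txt, comments/seafall.txt, ...
    files = sorted(Path("comments").glob("*.txt"))
    docs = [f.read_text(encoding="utf-8") for f in files]

    vectorizer = TfidfVectorizer(stop_words="english")
    tfidf = vectorizer.fit_transform(docs)      # rows are games, columns are words
    words = vectorizer.get_feature_names_out()

    for f, row in zip(files, tfidf.toarray()):
        top = row.argsort()[::-1][:5]           # five highest-scoring words per game
        print(f.stem, [words[i] for i in top])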

I also consolidated all the comments on each game into a single file, which was very convenient for this analysis, but prevented me from distinguishing among the commenters.  This could be important if, for example, all five instances of a word were by the same author.

*Note that there was a lag of several years due to the immense amount of playtesting and design work required for this type of game.

**Thanks to Olivia Ildefonso who helped me with this during Digital Fellows’ office hours!

***Note that “like” and “game” are both ambiguous terms. “Like” is used both to express approval and to compare one game to another. “Game” could refer to the overall game or to one session of it (e.g. “I didn’t enjoy my first game of this, but later I came to like it.”).

****To be fair, it is unlikely anyone would complain of a runaway leader in 7th Continent, Gloomhaven, Imperial Assault, or either of the Pandemics, as they are all cooperative games.

Text mining the Billboard Country Top 10

My apologies to anyone who read this before the evening of October 8. I set this to post automatically, but for the wrong date and without all that I wanted to include.

I’m a big fan of music but as I’ve gotten further away from my undergrad years, I’ve become less familiar with what is currently playing on the radio. Thanks to my brother’s children, I have some semblance of a grasp on certain musical genres, but I have absolutely no idea what’s happening in the world of country music (I did at one point, as I went to undergrad in Virginia).

I decided to use Voyant Tools to do a text analysis of the first 10 songs on the Billboard Country chart from the week of September 8, 2018. The joke about country music is that it’s about dogs, trucks, and your wife leaving you. When I was more familiar with country music, I found it to be more complex than this, but a lot could have changed since I last paid attention. Will a look at the country songs with the most sales/airplay during this week support these assumptions? For the sake of uniformity, I accepted the lyrics on Genius.com as being correct and removed all extraneous words from the lyrics (chorus, bridge, etc.).
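I did that cleanup by hand, but for a bigger corpus the same step could be scripted. Here is a minimal sketch, assuming the Genius-style section labels appear in square brackets and that each song’s lyrics live in their own text file (both assumptions on my part):

    import re
    from pathlib import Path

    def strip_section_labels(lyrics):
        """Remove Genius-style markers such as [Chorus] or [Verse 1]."""
        cleaned = re.sub(r"\[[^\]]*\]", "", lyrics)
        return re.sub(r"\n{3,}", "\n\n", cleaned).strip()   # tidy leftover blank lines

    out_dir = Path("lyrics_clean")
    out_dir.mkdir(exist_ok=True)
    for path in Path("lyrics_raw").glob("*.txt"):
        raw = path.read_text(encoding="utf-8")
        (out_dir / path.name).write_text(strip_section_labels(raw), encoding="utf-8")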

The songs in the top 10 are as follows:

  1. Meant to Be – Bebe Rexha & Florida Georgia Line
  2. Tequila – Dan + Shay
  3. Simple – Florida Georgia Line
  4. Drowns the Whiskey – Jason Aldean featuring Miranda Lambert
  5. Sunrise, Sunburn, Sunset – Luke Bryan
  6. Life Changes – Thomas Rhett
  7. Heaven – Kane Brown
  8. Mercy – Brett Young
  9. Get Along – Kenny Chesney
  10. Hotel Key – Old Dominion

If you would like to view these lyrics for yourself, I’ve left the files in a google folder.

As we can see, the words “truck,” “dog,” “wife,” and “left” were not among the most frequently used, although it may not be entirely surprising that “ain’t” was.

The most frequently used word in the corpus, “it’s,” appeared only 19 times, showing that there is quite a bit of diversity in these lyrics. I looked for other patterns, such as whether vocabulary density or average words per sentence had an effect on a song’s position on the chart, but there was no correlation.
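For anyone who wants to check a correlation like that outside Voyant, it only takes a few lines. Here is a sketch with scipy; the density numbers are placeholders standing in for whatever Voyant reports for each song, not my actual data.

    from scipy.stats import spearmanr

    # Chart positions 1-10 and each song's vocabulary density from Voyant's summary.
    # The density values below are placeholders, not the real figures.
    positions = list(range(1, 11))
    vocab_density = [0.62, 0.55, 0.48, 0.60, 0.51, 0.58, 0.47, 0.53, 0.66, 0.50]

    rho, p = spearmanr(positions, vocab_density)
    print(f"Spearman rho = {rho:.2f}, p = {p:.2f}")   # rho near zero means no correlation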

Text-Mining the MTA Annual Report

After some failed attempts at text-mining other sources [1], I settled on examining the New York Metropolitan Transportation Authority’s annual reports. The MTA offers online access to its annual reports going back to the year 2000 [2]. As a daily rider and occasional critic of the MTA, I thought this might provide insight to its sometimes murky motivations.

I decided to compare the 2017, 2009, and 2001 annual reports. I chose these because 2017 was the most current, 2009 was the first annual report after the Great Recession became a steady factor in New York life, and 2001 was the annual report after the 9/11 attacks on the World Trade Center. I thought there might be interesting differences between the most recent annual report and the annual reports written during periods of intense social and financial stress.

Because the formats of the annual reports vary from year to year, I was worried that some differences emerging from text-mining might be due to those formatting changes rather than operational changes. So at first I tried to minimize this by finding sections of the annual reports that seemed analogous in all three years. After a few tries, though, I finally realized that dissecting the annual reports in this manner had too much risk of leaving out important information. It would therefore be better to simply use the entirety of the text in each annual report for comparison, since any formatting changes to particular sections would probably not change the overall tone of the annual report (and the MTA in general).

I downloaded the PDFs of the annual reports [3], copied the full text within, and ran that text through Voyant’s online text-mining tool (https://voyant-tools.org/).

The 20 most frequent words for each annual report are listed below. It is important to note that these lists track specific spellings of words, but it is sometimes more important to track all related words (words with the same root, like “complete” and “completion”). Voyant allows users to search for roots instead of specific spellings, but the user needs to already know which root to search for.

2001 Top 20:
mta (313); new (216); capital (176); service (154); financial (146); transit (144); year (138); operating (135); december (127); tbta (125); percent (121); authority (120); york (120); bonds (112); statements (110); total (105); million (104); long (103); nycta (93); revenue (93)

2009 Top 20:
new (73); bus (61); station (50); mta (49); island (42); street (41); service (39); transit (35); annual (31); long (31); report (31); completed (30); target (30); page (29); avenue (27); york (24); line (23); performance (23); bridge (22); city (22)

2017 Top 20:
mta (421); new (277); million (198); project (147); bus (146); program (140); report (136); station (125); annual (121); service (110); total (109); safety (105); pal (100); 2800 (98); page (97); capital (94); completed (89); metro (85); north (82); work (80)

One of the most striking differences to me was the use of the word “safety” and other words sharing the root “safe.” Before text-mining, I would have thought that “safe” words would be most common in the 2001 annual report, reflecting a desire to soothe public fears of terrorist attacks after 9/11. Yet the most frequent use by far of “safe” words was in 2017. This was not simply a matter of raw volume, but also of frequency rate. “Safe” words were mentioned almost four times as often in 2017 (frequency rate: 0.0038) as in 2001 (0.001). “Secure” words might at first seem more comparable between 2001 (0.0017) and 2017 (0.0022). However, these results are skewed, because in 2001, many of the references to “secure” words were in their financial meaning, not their public-safety meaning (e.g., “Authority’s investment policy states that securities underlying repurchase agreements must have a market value…”).

This much higher recent focus on safety might be because the 9/11 attacks were not the fault of the MTA, so any disruptions to safety at that time could generally be seen as understandable. The 2001 annual report mentioned that the agency was mostly continuing to follow the “MTA all-agency safety initiative, launched in 1996.” However, by 2017, a series of train and bus crashes (one of which happened just one day ago), and heavy media coverage of the MTA’s financial corruption and faulty equipment, were possibly shifting blame for safety issues to the MTA’s own internal problems. Therefore, the MTA might now feel a greater need to emphasize its commitment to safety, whereas before it was more assumed.

In a similar vein, “replace” words were five times more frequent in 2017 (0.0022) than in 2001 (0.0004). “Repair” words were also much more frequent in 2017 (0.0014) than 2001 (0.00033). In 2001, the few mentions of “repair” were often in terms of maintaining “a state of good repair,” which might indicate that the MTA thought the system was already working pretty well. By 2017, public awareness of the system’s dilapidation might have changed that. Many mentions of repair and replacement in the 2017 annual report are also in reference to damage done by Hurricane Sandy (which happened in 2012).

In contrast to 2017’s focus on safety and repair, the 2001 annual report is more concerned with financial information than later years. Many of the top twenty words are related to economics, such as “capital,” “revenue,” and “bonds.” In fact, as mentioned above, the 2001 annual report often uses the word “security” with its financial meaning.

The 2009 annual report was dramatically shorter (6,272 words) than the 2001 (36,126 words) and 2017 (29,706 words) reports. Perhaps the Great Recession put such a freeze on projects that there simply wasn’t as much to discuss. However, even after accounting for the prevalence of “New York,” 2009 still had a much higher frequency rate of the word “new.” (The prevalence of “new” every year at first made me think that the MTA was obsessed with promoting new projects, but the Links tool in Voyant reminded me that this was largely because of “New York.”) Maybe even though there weren’t many new projects to trumpet, the report tried particularly hard to highlight what there was.

The recession might also be why “rehabilitate” and its relative words were used almost zero times in 2001 and 2017, but were used heavily in 2009 (0.0043). Rehabilitating current infrastructure might be less costly than completely new projects, yet still allow for the word “new” to be used. “Rehabilitate” words were used even more frequently in 2009 than the word “York.”

One significant flaw in Voyant is that it doesn’t seem to provide the frequency rate of a word for the entire document. Instead, it only provides the frequency rate for each segment of the document. The lowest possible number of segments that a user can search is two. This means that users have to calculate the document-length frequency rate themselves by dividing the number of instances by the number of words in the document. If the document-length frequency rate is available somewhere in the Voyant results, it doesn’t seem intuitive and it isn’t explained in the Voyant instructions.
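Since that figure isn’t surfaced directly, I ended up doing the division myself; the same calculation is also trivial to script. A sketch, with a placeholder file name and a simple “starts with the root” match:

    import re

    with open("mta_annual_report_2017.txt", encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())

    root = "safe"   # counts "safe", "safety", "safely", ...
    hits = sum(1 for w in words if w.startswith(root))
    print(f"{root}*: {hits} of {len(words)} words, rate {hits / len(words):.4f}")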

Although I generally found Voyant to be an interesting and useful tool, it always needs to be combined with traditional analysis of the text. Without keeping an eye on the context of the results, it would be easy to make false assumptions about why particular words are being used. Helpfully, Voyant has “Contexts” and “Reader” windows that allow for users to quickly personally analyze how a word is being used in the text.

[1] I first ran Charles Darwin’s “Origin of Species” and “Descent of Man” through Voyant, but the results were not particularly surprising. The most common words were ones like “male,” “female,” “species,” “bird,” etc.

In a crassly narcissistic decision, I then pasted one of my own unpublished novels into Voyant. This revealed a few surprises about my writing style (the fifth most common word was “like,” which either means I love similes or being raised in Southern California during the 1980s left a stronger mark than I thought). I also apparently swear a lot. However, this didn’t seem socially relevant enough to center an entire report around.

Then I thought it might be very relevant to text-mine the recent Supreme Court confirmation hearings of Brett Kavanaugh and compare them to his confirmation hearings when he was nominated to the D.C. Circuit Court of Appeals. Unfortunately, there are no full transcripts available yet of the Supreme Court hearings. The closest approximation that I found was the C-Span website, which has limited closed-caption transcripts, but their user interface doesn’t allow for copying the full text of the hearing. The transcripts for Kavanaugh’s 2003 and 2006 Circuit Court hearings were available from the U.S. Congress’s website, but the website warned that transcripts of hearings can take years to be made available. Since the deadline for this assignment is October 9, I decided that was too much of a gamble. I then tried running Kavanaugh’s opening statements through Voyant, but that seemed like too small of a sample to draw any significant conclusions. (Although it’s interesting that he used the word “love” a lot more in 2018 than he did back in 2003.)

[2] 2017: http://web.mta.info/mta/compliance/pdf/2017_annual/SectionA-2017-Annual-Report.pdf
2009: http://web.mta.info/mta/compliance/pdf/2009%20Annual%20Report%20Narrative.pdf
2001: http://web.mta.info/mta/investor/pdf/annualreport2001.pdf

[3] It’s important to download the PDFs before copying text. Copying directly from websites can result in text that has a lot of formatting errors, which then requires data-cleaning and can lead to misleading results.