Author Archives: Sean Patrick Palmer

Text mining of Native American Speeches with Voyant

Analysis of Short Well-known Speeches by Native Americans using Voyant

My students have to recite these speeches when I teach Voice and Diction, so I have copies on my desktop.[1] These are not long speeches. They vary in length from 75 to 100 seconds, or two or three paragraphs.[2] The speeches are all about how badly the White Man has treated the speakers’ peoples. Most are defiant speeches, calling for resistance. A few are speeches of peace or surrender, and one in particular is a cry of pain (Standing Bear). I thought these would be interesting speeches to analyze.

At first, I decided to do a test run with three of them, so I copied and pasted three of the speeches into the text box.  This did not provide the results I wanted because Voyant treated them as one document, not as three short ones. This was my fault. I saw that Voyant let users upload individual files, but I didn’t do it.

Starting over, I uploaded all eleven speeches as Word documents. That worked. It gave me an analysis of all the speeches, as well as information about the individual speeches. Overall, this corpus has 2,180 total words and 757 unique word forms. The longest speech was Sitting Bull’s, with 301 words. The shortest was Chikataubat’s at 153.[3] Interestingly, Chikataubat’s speech was the shortest, but it had the most words per sentence (30.6) and the highest vocabulary density (0.758).[4][5]

Voyant did not calculate the overall vocabulary density of the corpus, but based on its formula, it’s 0.347. That seems low, but, at the same time, most of these speeches are calls to action, and the arguments presented in speeches like these might be more straightforward.
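The 0.347 figure can be checked by hand from the totals above. The following is a minimal Python sketch, not Voyant's own code: the function names and the very simple tokenizer (lowercase letters and apostrophes only) are assumptions, and Voyant may split words differently.

```python
import re

def vocabulary_density(unique_words: int, total_words: int) -> float:
    """Unique word forms divided by total words (the formula in note 4)."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return unique_words / total_words

def density_of_text(text: str) -> float:
    """Density of a raw text, using a deliberately simple tokenizer (an assumption)."""
    tokens = re.findall(r"[a-z'’]+", text.lower())
    return vocabulary_density(len(set(tokens)), len(tokens))

# Whole-corpus totals reported above: 757 unique word forms, 2,180 words.
print(round(vocabulary_density(757, 2180), 3))  # 0.347
```

The second function is only there to show how the same ratio could be taken from pasted text; results on real speeches would depend on how punctuation and hyphenated words are tokenized.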

The most common words were “people” (15 times), “man” (12), “shall” (9), “white” (9), and “away” (8). Voyant also provided the most common words in the individual pieces. For instance, “tells” occurs four times in Osceola’s speech and “neighbors” occurs three times in Sitting Bull’s speech.
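Counts like these come down to tallying words after common function words have been set aside. The sketch below shows the idea with Python's `collections.Counter`. The stop list is a tiny placeholder and the sample sentence is invented for illustration; neither comes from Voyant or from the speeches.

```python
import re
from collections import Counter

# Tiny placeholder stop list (an assumption); a real stop list is much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "our", "we"}

def top_words(text: str, n: int = 5) -> list[tuple[str, int]]:
    """Return the n most frequent words that are not stop words."""
    tokens = re.findall(r"[a-z'’]+", text.lower())
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

# Invented sample sentence, not a quotation from any of the speeches.
sample = "The people are strong and the people are many; our people shall stand."
print(top_words(sample, 1))  # [('people', 3)]
```

Which words make the list depends heavily on the stop list, so two tools can report different "most common words" for the same text.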

Considering that these are all speeches that arose out of conflict, the most common words are unsurprising: these words are used to call out to the community, and to tell them not just to resist, but whom to resist. The frequency of “tells” in Osceola’s speech is not surprising. Osceola is calling on the Seminole to resist relocation to what is now Oklahoma. “Neighbors” in Sitting Bull’s speech refers to White Americans who are continually encroaching on Dakota territory.

I noticed that Voyant didn’t make links between related words: “died” and “dead” are treated as separate words, and they aren’t linked in any way in the analysis.

Sometimes, though, the link is more subtle: Sitting Bull’s “neighbors” clearly refers to white people, just as “white” does, but Voyant doesn’t link them either, which isn’t really a surprise.
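One way to link related forms outside Voyant is to map them to a shared label before counting. The sketch below uses a hand-written mapping; the entries and the function name are assumptions for illustration only. An interpretive link such as “neighbors” standing in for white people would have to be added to the mapping by hand in the same way.

```python
from collections import Counter

# Hand-made groupings (assumptions for illustration only).
RELATED = {
    "died": "die",
    "dead": "die",
    "dies": "die",
}

def grouped_counts(tokens: list[str]) -> Counter:
    """Count tokens after replacing each one with its group label, if it has one."""
    return Counter(RELATED.get(t, t) for t in tokens)

# Invented token list, not taken from the speeches.
tokens = ["many", "died", "and", "more", "are", "dead"]
print(grouped_counts(tokens)["die"])  # 2
```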

Voyant has a function called “links”, which showed the three most common words in the corpus and the words “in close proximity” to them. It also has a “context” function, where you can click on a word, and it will show you all the sentences that word appears in. It also marks which speeches those sentences come from.
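The “context” idea can be sketched in a few lines of Python: search each document for the word and keep the sentences that contain it, along with the name of the document they came from. This is only an illustration of the concept, not how Voyant implements it. The document names and texts are invented placeholders, and the sentence splitting is deliberately naive.

```python
import re

def contexts(corpus: dict[str, str], word: str) -> list[tuple[str, str]]:
    """Return (document name, sentence) pairs for sentences containing word."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    hits = []
    for name, text in corpus.items():
        # Naive sentence split: break after ., !, or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
            if pattern.search(sentence):
                hits.append((name, sentence))
    return hits

# Placeholder documents with invented text, not the actual speeches.
corpus = {
    "speech_a": "The river is wide. Our people crossed the river at night.",
    "speech_b": "We will not leave. The land is ours.",
}
for name, sentence in contexts(corpus, "river"):
    print(name, "|", sentence)
```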

Next, I decided to split the works up into categories to see what, if anything, changed. I chose to focus on word count, most common words, and vocabulary density for this.

First, I divided the speeches into two groups: those that were given before the Civil War, and those given after. The pre-Civil War speeches were by Metacom, Chikataubat, James Logan, Pushmataha, Wabashaw, and Osceola. The post-Civil War speeches were by Red Cloud, Spotted Tail, Sitting Bull, Chief Joseph, and Standing Bear.

The pre-Civil War speeches had a total of 1,095 words, with 480 unique words. The most common words were “people” (10), “father” (7), “man” (7), “English” (6), “White” (6), and “Logan” (6).

This makes sense: these speeches are all calls to resist, so they’d be making appeals and talking about their enemy. The interesting one is “Logan”. It is one of the most frequently encountered words in the entire corpus, yet it appears only in James Logan’s speech.

The longest speech here is James Logan’s at 202 words. The shortest is Chikataubat’s at 153.

The overall vocabulary density is 0.438 (480 unique words out of 1,095). I’m not sure why it is higher than the 0.347 for the full corpus. If, as I said above, calls to action tend to be less complicated than other kinds of speeches, the density should be lower, not higher. My initial hypothesis is either wrong or too simplistic.

As above, Chikataubat’s speech is the most dense, at 0.758, while the least dense speech is Osceola’s at 0.456. The longest speech is James Logan’s (202 words), and the shortest is Chikataubat’s (153). This is interesting because they both spend time describing individual situations in their speeches: the murder of Logan’s family at the hands of the Whites, and the desecration of the graves of Chikataubat’s family. The other speeches in this category are more generalized calls to resist.

The post-Civil War speeches (Red Cloud, Spotted Tail, Sitting Bull, Chief Joseph, and Standing Bear) are more of a mix. Chief Joseph is surrendering[6]; Standing Bear is asking for help[7]; Spotted Tail is saying resistance is futile; and Red Cloud and Sitting Bull are calling for war.

The most frequently appearing words were “shall” (7 appearances), “children” (6), “died” (6), “men” (6), and “things” (6).

Standing Bear’s speech accounted for all the appearances of “died” and four of the appearances of “children”. The most common word in Spotted Tail’s speech is “alas” (3). Both of these make sense. Standing Bear is describing the state of his people: many died on the road to the new reservation, and more died once they got there. Spotted Tail has been defeated, and his speech reflects that. Meanwhile, the most common word in Red Cloud’s speech is “brought”, which appears three times. Again, context matters. Red Cloud is listing the things the White Man has done to his people, so the usage of “brought” makes sense.

The longest speech was Sitting Bull’s, and the shortest was Chief Joseph’s at 161 words. Chief Joseph’s speech, coming after weeks of flight and retreat, may be so short because he was exhausted and demoralized.

In terms of vocabulary density, the densest speech of this set is Spotted Tail’s at 0.655; Standing Bear’s is the least dense at 0.545. I don’t know why. I guess that linking vocabulary density to the theme of a speech doesn’t work, or at least doesn’t work with this corpus.

Finally, I decided to try to analyze these speeches by language family. This was difficult because the speakers’ languages came from five different language families. Only one language family, the Siouan, had more than two representatives. Since it had five (Standing Bear, Spotted Tail, Sitting Bull, Red Cloud, and Wabashaw), I decided to try it to see if there were similarities.

Overall, this corpus contains 1,117 total words and 454 unique words, for a vocabulary density of 0.406.

Sitting Bull’s speech was the longest, at 301 words, while Wabashaw’s speech was the shortest at 193. Wabashaw’s speech had the highest vocabulary density, however, at 0.665 while Standing Bear’s had the lowest at 0.545. Again, the shortest speech is the densest (at least in terms of vocabulary).

The most common words overall were “died” (6), “man” (6), “shall” (6), “things” (6), and “children” (5). We can see Standing Bear’s influence here again, since, as mentioned above, all the occurrences of “died” and four of the occurrences of “children” were from his speech.

“Father” was the most common word in Pushmataha’s speech. Again, in context, this makes sense: Pushmataha was calling his people to war, extolling their bravery in the name of their father.

Overall, I thought this was interesting. I can see how almost all these tools can be useful. I’m not sure about vocabulary density, though: I can see that it has descriptive value. The argument can be made that a speech with higher vocabulary density might be more complex, but I don’t know that I saw that. I’d have to work with longer speeches to see if that bears out.


[1] These speeches are by Chief Joseph of the Nez Perce, Chikataubat of the Massachusett, James Logan of the Cayuga, Metacom (or King Philip) of the Wampanoag, Osceola of the Seminole, Pushmataha of the Choctaw, Red Cloud of the Oglala Lakota, Sitting Bull of the Hunkpapa Lakota, Spotted Tail of the Brulé Lakota, Standing Bear of the Ponca, and Wabashaw of the Dakota.

[2] Because I teach speaking skills, I think of length more as a function of time (how long it takes to recite) than of word count.

[3] This makes me wonder if I shouldn’t replace this one with something a little longer.

[4] Voyant calculates vocabulary density like this: unique words / total number of words = vocabulary density.

[5] Chikataubat’s speech is essentially five run-on sentences. Maybe that’s why I’ve kept it in. It’s short, but it’s more difficult than the others.

[6] After being forcibly removed from their lands in Oregon to a reservation in Idaho, Chief Joseph and his people fled, attempting to get to Canada while being pursued by the U.S. Cavalry. They had to surrender forty miles from the border. Look up his story. It’s worth a read.

[7] This is probably the saddest of the speeches.

I attended the Digital Academic Identity and WordPress 1 workshop this week.

The Digital Identity discussion fascinated me because, between message boards, blogs, Facebook, Tumblr, etc., I have had an online existence for at least twenty years. Honestly, there have been times in my life when my online existence was better than my “real world” one.  As a result, I wondered how the people running the seminar would approach the topic in an Academic context.

The folks running the seminar had us all google ourselves. In my case, the first two results were for an artist from California. The third was for a blank page on CUNY Commons that I started when I was in a Center for Teaching and Learning seminar at my college.

I was surprised. I have presented at many conferences, most of which have published their programs online, so I figured maybe they would be there. Not on the first page, they weren’t.

This is… not optimal. Clearly, I have to work on my professional digital identity. However, I’d prefer to keep my personal and professional lives separate, so I have to make some choices: do I establish a separate, professional Facebook profile, for example? Or is it enough that I have a LinkedIn and a Twitter that I don’t use for personal stuff? Does this mean I’ll have to actually USE Twitter? (I’m not Twitter’s biggest fan: I’m just too wordy for it and I know people who have been harassed by the Trolls that Twitter refuses to do anything about.)

Fortunately, part of this Identity Crisis can be solved with WordPress and CUNY Commons. The second part of this seminar was an introduction to WordPress: how to set up a page, and the various things that can be done to personalize it (using templates, adding menus) and upload information, to build a professional website.

I can upload my CV there. That would be a start. Though, again, as the people running the seminar pointed out, it makes sense to upload a CV (as a PDF file) and then break it down into categories like “conferences and publications”, “courses taught”, and “Academic Service”. I could also, if I wanted to, do blog entries there. I can also link to other sites I use professionally (my professional organizations or my LinkedIn, for example).

Downside? WordPress has a bit of a learning curve. It takes time to figure out. The seminar gave me a start: I can navigate through the basics of putting a WordPress site together, but to fully build it will take me some time.  If they offer this workshop again, I think you should consider attending for the advice but especially for the basics of WordPress. It’s not (for me, anyway) very intuitive.

Overall, this was a great seminar. We need to have some control over our online identity, and building your own Academic website on CUNY Commons (which is powered by WordPress) can help with that.