Category: Shakespeare

Love’s Labour’s Lost: The History

This passage from the Open Source Shakespeare’s Love’s Labour’s Lost shows language patterns that push the play into the area where the Histories cluster, something visible in the scatterplot discussed below. Returning to the taxonomy of Docuscope, this passage has a lot of Description strings combined with a relative lack of Interaction and First Person strings, both of which can be seen in the Docuscope screen shot below. We are looking at something slightly more complicated in this visualization of the text, however, because I have “turned on” the First Person and Interaction strings in the Docuscope Single Text Viewer. I did this because I want to show what cannot be shown: a relative lack of blue (Interaction) and red (First Person) strings combined with a relative abundance of yellow (Description) ones. To really “see” this in the wild, you would have to consult a completely color tagged text of the complete Folio Works and — while reading — keep track of the relative differences in quantities of blue, red and yellow in the different containers (the plays themselves). Only an Argus-eyed text tagger and a statistical analysis can do this. The results are heuristic in that they lead us toward certain areas of the text for continued interpretation. In this case, I have used the color coding facility in Docuscope to scan the entire play (once I knew the categories I was interested in) in order to find a passage like the one above: one that has lots of yellow and very little blue or red.
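The scan described above was done by eye, but the same search could be sketched in code. The following is only an illustration, not the procedure used here: it assumes each token of a play already carries one category label (or none), and the label names, the window size and the scoring rule (Description count minus Interaction and First Person counts) are all hypothetical choices.

```python
from collections import Counter

def best_window(tags, size):
    """Return (start, score) for the window richest in Description tags and
    poorest in Interaction and First Person tags. `tags` holds one label
    (or None) per token. Ties go to the earliest window."""
    def window_score(window):
        counts = Counter(window)
        return counts["Description"] - counts["Interaction"] - counts["FirstPerson"]

    best_start, best_score = 0, window_score(tags[:size])
    for start in range(1, len(tags) - size + 1):
        current = window_score(tags[start:start + size])
        if current > best_score:
            best_start, best_score = start, current
    return best_start, best_score

# Hypothetical tag sequence for fourteen tokens (None = untagged token).
D, I, F = "Description", "Interaction", "FirstPerson"
tags = [F, I, None, D, None, I, D, D, None, D, D, None, F, I]
start, top_score = best_window(tags, size=5)
print(start, top_score)
```

On this toy sequence the five-token window beginning at position 6 wins, since it holds four Description tags and no Interaction or First Person tags.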

Inspection of such candidate “History” passages reveals a number of pedantic exchanges like this one between Sir Nathaniel, Holofernes and Dull. This scene is a hilarious sendup of rhetorical display and vacuous learning, and it burlesques the famous Renaissance idea of verbal variety or copia that was recommended by Erasmus. It makes sense that this kind of passage would withdraw the play from the type of comic verbal interaction analyzed in the previous post. Because these characters are not speaking with one another, but rather are addressing an invisible audience of discerning rhetorical literates, there is not much interaction in the form of second person pronouns or corresponding first person singular pronouns: the very strings one would tend to find in Comic exchanges about acts or actions taken by characters themselves.

We expect this kind of thing from the pedants, but the analysis reveals a continuity of this History-like pattern among the French nobles who have vowed to live a life of Platonic study, characters like Biron who can never resist plumping their own rhetorical plumage. Don Armado, another parodic figure with an almost Quixotic appreciation for his own courtly expression, is also linguistically self-indulgent, and his passages would look similar to the one I have excerpted here and shown color coded below (although with Don Armado, there are marginally more interactions with his Page). The point of this analysis is to show that there are reasons why Docuscope would place Love’s Labour’s Lost with the Histories, and these reasons make sense to us once we begin to think about how the play is put together. This is a world of narcissists, something the Princess and her ladies point out when they defer the proposed courtship that is offered at the end of the play. That narcissism shows itself as a tendency to monologue, which cuts out the interaction that is characteristic of comedy and highlights instead a description-rich kind of oration that pushes these plays into the realm of History.

Would it be fair then to call Love’s Labour’s Lost a History? It depends. I would be comfortable saying that on the level of plot it has the elements of a Comedy, but on the level of its language, it is a History.

What, then, do we make of the historical decision made by Heminges and Condell to call this play a Comedy? Unsupervised statistical analysis has shown us (1) a pattern of groupings among the plays that roughly approximates at least two of H&C’s generic groupings of 1623 but also (2) exceptions to those classifications that make a certain amount of critical sense when we look at the construction of those plays. I would argue that we need an ontology here to sort out what elements of the analysis are fundamental as opposed to derivative. We could have an ontology of levels, for example, which says that “on the linguistic level, the play belongs with one group,” but “on the level of plot, the play belongs with another group.”

But eventually we would have to decide how the levels go together. That is also part of the point of this kind of work, since the overlap-with-divergence of linguistic and historical groupings of the plays introduces the possibility that there are levels of coherence here whose interaction needs to be explained. The language of levels needs a complement in a theory of objects: what are the things that are being compared here? Aren’t the tagged texts themselves a kind of hypothetical or abstracted version of the text itself? And what is the relationship between this hypothetical object and those that are arrayed into a generic group by, say, the historical editors of the First Folio? I will try, in future posts, to show why these are not trivial metaphysical questions.

By way of preview, however, I think the most fundamental “level” here is the one on which individuals or groups make decisions and act. So I would say that Heminges and Condell’s decision about how to order the plays in the First Folio is the most real thing in the analysis, while the statistical objects (tagged texts, Principal Components, regions of a scatterplot) are derivative. How else could we be “surprised” to find LLL clustering with the Histories, unless we were already enticed by the idea (as I was) that the initial clusterings themselves coincided with the classes stipulated by Shakespeare’s editors? More interesting: what is the abstract recipe of family resemblances or species traits that human beings like Heminges and Condell are carrying around in their heads? Their decision to sort the plays a certain way is real. It is a historical fact. But the “sensibility” or “weightings” that led them to take this empirical action must itself be hypothesized or modeled. We might be able to reconstruct this model, but even H&C may not have had direct access to it. This detour might change the way we think about the status of our statistical model, since that model may be only an approximation of something far more comprehensive (the capacity for literary judgment in historical actors) whose dynamic, differential powers of comparison are suggestively approximated by things like “principal components.”

The latitude in linguistic practice that makes Love’s Labour’s Lost look like a History is evidently something that Heminges and Condell did not notice, and I’m not sure why they should have. But once we have noticed it, this latitude in terms of linguistic practice may make sense to us. Why couldn’t there be a filiation of Love’s Labour’s Lost with Histories on the level of stance and language that does not “show up” on the level of plot? Surely this filiation is real too. The question is, where and on what level?

July 20, 2009
An Untimely Piece of Richard II

The passage from Richard II 3.3 above, taken from the Open Source Shakespeare, is statistically speaking particularly illustrative of some of the things that Shakespeare does when he writes History plays. While it is tempting to go straight to Docuscope and the statistics, it is better to post the passage without any markup, since we don’t want to prejudge what is going on in the language here. So, we read the passage.

One of the things I have been saying about working with statistics and texts is that statistics can tell you that there is a pattern, but only a human being can tell you what that pattern is. The what/that distinction is crucial if we want to be precise about the division of labor that occurs in this kind of work. It is easy to think that statistics have “found” something from out of nowhere in the text, when what is really going on is that they have found “something,” and this something is a reflection of prior decisions we have made about what is worth counting. Perhaps it will become clearer as postings on this blog continue why this distinction is important. For now I’ll try to be as accurate as I can about why this is “History writing:” this is a piece of History because a certain class of words that Docuscope counts are more abundantly present here than in plays of other genres. Now let’s go to the same passage as marked up by Docuscope. (Click on the image to get a good look.)

The underlined words here and in other passages of Richard II are the ones that are responsible for “driving” this and other history plays to the right hand side of the scatterplot from the previous post. These words all belong to a cluster called “Description” in Docuscope’s taxonomy, one that contains various subcategories that are not visible to us at this level of the analysis. (Short story why: with only 36 plays to look at, we need for statistical reasons to be looking for fewer classes of things-to-count than items-in-which-to-count-them. In statistics-speak: we can’t have the number of variables exceed the number of observations.) What kind of words does Docuscope count as “Description?” The best answer is: “the words that are underlined in yellow here.” Jonathan Hope and I have tried to remain agnostic about the explanatory power of the names that have been given to Docuscope’s categories and just look at what the words, as a collection, are doing on the page. But you can begin to guess what the rationale for the cluster is here: they are words that describe the properties of objects (“little,” “small”), objects themselves (“dish,” “wood”), spatial relations (“buried in the”), and verbs showing changes of physical state (“sighs,” “lodge”). Notice that some strings here are contiguous word segments, such as “buried in the.” Docuscope can count strings from 1-10 words in length, and counts 200 million of them, classifying them into up to 101 categories at its finest level of resolution.
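To make the idea of counting strings of one to ten words concrete, here is a minimal sketch of a dictionary tagger. It is not Docuscope's implementation: the greedy longest-match rule is an assumption made for illustration, and the tiny lexicon contains only examples quoted in these posts, with hypothetical category labels as keys.

```python
import re
from collections import Counter

# Toy lexicon built from strings quoted in the posts; the real dictionary is vastly larger.
LEXICON = {
    "little": "Description",
    "small": "Description",
    "dish": "Description",
    "wood": "Description",
    "buried in the": "Description",
    "to me": "FirstPerson",
    "mayst thou": "Interaction",
}
MAX_WORDS = 10  # strings run from 1 to 10 words in length

def tag(text):
    """Tokenize, then at each position take the longest lexicon string that
    starts there (an assumed matching rule, chosen for simplicity)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(MAX_WORDS, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in LEXICON:
                found.append((candidate, LEXICON[candidate]))
                i += n
                break
        else:
            i += 1  # no string of any length matched at this position
    return tokens, found

passage = ("And my large kingdom for a little grave, A little little grave, "
           "an obscure grave; Or I'll be buried in the king's highway")
tokens, found = tag(passage)
counts = Counter(category for _, category in found)
rate = 100 * counts["Description"] / len(tokens)  # Description strings per 100 tokens
print(found, round(rate, 2))
```

A per-play rate of this kind, computed for each category, is the sort of frequency score that can then be handed to a statistical procedure.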

In this case I have consulted the results of the Principal Component Analysis (see last post), in particular the component loadings shown below, which tell me what cluster does the most work in pulling plays that score highly on PC1 to the right in the scatterplot we were looking at. On the left hand of the loadings chart below, we are looking at the various clusters that Docuscope counts by their cluster names. In the columns to the right, we are seeing the “loadings” of each of these clusters on the different components (PC1-PC5) that carve up the variation within Shakespeare’s writings into underlying patterns. As you can see from the bold item under PC1, Description is overwhelmingly powerful in the first component, scoring 0.913 on a scale from 0 to 1. The yellow underlined items above are those that were tagged as Description by Docuscope, which I know from having pulled up the play in Docuscope’s single text viewer and “turning on” only the items in this cluster (the yellow one on the left) so that I could find a passage that had a lot of yellow in it. (I picked this passage by eyeballing the play for yellowness. We are developing an algorithm to identify exemplary passages of different lengths using a hands-off statistical method; but for now we can use this.) So these words or tokens tell us why the pink dots in the biplot are moved to the right of the origin. But why do they move down? That is the work of PC2, and to understand that, we must look once again at the component loadings:

Component Loadings from Principal Component Analysis of Shakespeare Plays in R

Now components can be combinations of correlated high and low items, a bit like a trend in a fixed deck of cards which has lots of face cards but very few low numbered cards. The loadings on a principal component can work in a similar way: a component, that is, can pull out a pattern in which Docuscope finds that plays containing lots of Emotion strings (as in PC3) also tend to have a lack of “Special Referencing” strings. This tells us that when Shakespeare does one thing, he is constrained (by genre, expectation, the limits of his actors, taste, style) not to do others. Explaining why this must be the case is for me perhaps the most interesting aspect of working quantitatively with the plays. Now, for PC2: it shows a correlation of high amounts of “First Person” strings with another group of strings called “Interaction.” Because the history plays cluster at the bottom right of the scatterplot, they score low on the second factor, and so lack the items that are highly loaded on this component. (Here we are focusing on boldfaced loadings that are greater than + or – 0.4, a significant statistical threshold.) So PC2 is really describing something that History plays lack: something that you probably wouldn’t look for when reading these plays, but which nevertheless is important to their construction and your experience of them. It is now time to find an un-Historical (untimely?) piece of Richard II that has these First Person and Interaction strings: this item may show us, by negative example, what History plays and this play do not generally do in comparison to other plays.
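Reading a loadings chart in this way amounts to a simple filter: for each component, keep the clusters whose loading exceeds the threshold in either direction and note the sign. The sketch below illustrates that step only. The single value taken from this post is Description's 0.913 on PC1; every other number is invented so that the signs match the patterns described in the text, and the cluster keys are hypothetical labels.

```python
# Only Description on PC1 (0.913) is quoted in the post.
# All other loadings are INVENTED for illustration.
LOADINGS = {
    "PC1": {"Description": 0.913, "FirstPerson": -0.10, "Interaction": -0.15, "Emotion": 0.05},
    "PC2": {"Description": -0.05, "FirstPerson": 0.70, "Interaction": 0.65, "Emotion": 0.20},
    "PC3": {"Description": 0.10, "FirstPerson": 0.05, "SpecialReferencing": -0.55, "Emotion": 0.60},
}
THRESHOLD = 0.4  # the cutoff for boldfaced loadings mentioned in the post

def salient(component):
    """Return (high, low): clusters loading above +THRESHOLD and below -THRESHOLD."""
    items = LOADINGS[component].items()
    high = sorted(name for name, value in items if value > THRESHOLD)
    low = sorted(name for name, value in items if value < -THRESHOLD)
    return high, low

for name in LOADINGS:
    print(name, salient(name))
```

With these made-up numbers, PC3 comes out as a "high and low" component (Emotion up, Special Referencing down), while PC2 pairs First Person with Interaction on the same side.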

Having read through the passage above, we can now look at how it was marked up by Docuscope. Remember that speech prefixes and stage directions have been stripped off of the Moby Shakespeare (which is the same one used by Open Source Shakespeare displayed above).

I have highlighted the Interaction and First Person strings from this passage in Richard II 4.1 that are atypical for history plays, and I think this is an interesting result. First Person includes the first person singular pronouns, first person possessive pronouns, but also references that relate actions or events to a speaker who is marking his or her relationship to those actions or events (“Make me,” “to me”). (Another post will deal with the question of the perspective from which utterances appear as marked; I suspect that Docuscope treats all terms as if they are being “mentioned” according to J.L. Austin’s criteria: use would be a far more complicated thing to tag.) Interaction includes several items, but here the ones that are shown are second person pronouns and possessive pronouns and verbs attached to such pronouns indicating something like recognition of a social relation or mediation (“mayst thou,” “Your care”). Notice that Docuscope is picking up a few archaic forms (“thee” and “mayst”). So what is it that Histories in general lack, but that this passage in particular has in an atypically high degree? The most accurate answer is: the underlined words. Principal Component Analysis tells us that there is a lower proportion of these strings in the Histories than in plays of other genres and says, in a mathematically defensible way, that this “lower” proportion is probably not-accidental. But it is our job to say what is going on, and perhaps why, not simply that something is the case. And so my provisional description (which is always a shorthand form of analysis) of this trend would be the following: History plays lack the verbal back and forth over personal matters and fortunes that is more common in other Shakespearean dramatic texts, a back and forth which seems to correlate — in its absence — with a high degree of concrete language about things and events. That’s what Docuscope “sees” when it sees History plays. 
I see stories about groups of people rather than individuals, stories whose action revolves around physical rather than emotional conflicts and so requires the description of concrete objects and events. The interpersonal or you-me back and forth style, on the other hand, is reserved for another of Shakespeare’s genres, one that lacks extensive descriptions of objects and things: his Comedies. Shakespeare’s Comedies will be the subject of another post. The next one, however, will treat an entire play that is high in what Histories have and low in what Histories lack, but is not itself a History: Shakespeare’s early comedy, Love’s Labour’s Lost.

July 8, 2009
A Genre Map of Shakespeare’s Plays from the First Folio (1623)

The following image, produced using the statistical package R, represents the position of the plays from Shakespeare’s First Folio according to patterns discerned using inferential statistics. The plays themselves were “tagged” by Docuscope and then the frequency scores of each play were analyzed with standard unsupervised statistical procedures (prcomp, center=TRUE, scale=FALSE), producing two principal components that define underlying patterns of correlation and opposition in Shakespeare’s use of certain kinds of words in different genres. In this post, the first of several, I want to discuss what can be learned from such a map and how it either confirms or diverges from current critical understandings of Shakespeare’s genres as understood by Shakespeare’s contemporaries or later critics (who became interested in the so called “Late Plays”).

The first thing to notice is that these two components are doing a reasonable job of separating out at least two types of plays (Comedies and Histories) by placing them in opposite corners of the scatterplot. The two scatterplots you see here are really the same scatter viewed from two positions, so we will concentrate on the one in the upper right, which rates the plays on Principal Component 1 (PC1) on the horizontal axis and Principal Component 2 (PC2) on the vertical axis. In Principal Component Analysis (PCA), the earlier components tend to “suck up” more variation than later ones, which means that later components define progressively less powerful avenues for simplifying relationships among the variables: here, the types of words Shakespeare uses (or doesn’t use) in different types of plays. So PC1 does an excellent job of pushing the pink circles, the Histories, to the right of the scatter, which means they all use a proportionally higher degree of the types of words that are either favored or shunted by this factor. (More anon.) Note that the pink dots are plays classed as Histories in the First Folio, except for one green dot which is Henry VIII (we’ve pulled out four “late plays” in green: Henry VIII, Cymbeline, Winter’s Tale and Tempest). PC2 finishes the job, pushing these histories down on the vertical axis and so partitioning them (roughly) in the lower right-hand corner of the scatter. Here we would say that the Histories or pink dots have comparatively fewer of the types of words favored by PC2 (and more of the words that it discriminates against). Think of components, then, as representing correlations of things that Shakespeare does and doesn’t do over the course of all his writing published in F: later we will see how his choice of one type of word (for example, first person singular pronouns) almost always “goes with” a lack of other types of words. 
Here the point is to show that a basic pattern emerges with so called unsupervised statistical techniques, the most powerful because they do not use any groupings provided by us, but rather look for latent patterns among the words classed by Docuscope.
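For readers who want to see the mechanics, here is a minimal sketch of a centered, unscaled principal component analysis, the same setting as R's `prcomp(x, center = TRUE, scale. = FALSE)`. It is not the analysis behind the map: the six "plays" and their three frequency columns are hypothetical numbers, and the components are found by power iteration on the covariance matrix (a swapped-in technique, chosen so that the example needs nothing beyond the standard library).

```python
import math

def center(rows):
    """Subtract each column mean; columns are deliberately not rescaled."""
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def leading_component(matrix, steps=500):
    """Power iteration: (eigenvalue, unit eigenvector) of a symmetric matrix."""
    p = len(matrix)
    v = [1.0 / (k + 1) for k in range(p)]  # arbitrary non-uniform starting vector
    for _ in range(steps):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    value = sum(v[i] * matrix[i][j] * v[j] for i in range(p) for j in range(p))
    if v[0] < 0:  # a component's sign is arbitrary; fix it for readability
        v = [-x for x in v]
    return value, v

def deflate(matrix, value, v):
    """Remove a found component so the next power iteration finds the following one."""
    p = len(matrix)
    return [[matrix[i][j] - value * v[i] * v[j] for j in range(p)] for i in range(p)]

# HYPOTHETICAL rates per 1,000 words: Description, First Person, Interaction.
plays = {
    "History A": [60, 20, 15], "History B": [58, 22, 17],
    "Comedy A": [40, 35, 30], "Comedy B": [42, 33, 32],
    "Tragedy A": [50, 30, 20], "Tragedy B": [48, 28, 26],
}
centered = center(list(plays.values()))
cov = covariance(centered)
value1, pc1 = leading_component(cov)
value2, pc2 = leading_component(deflate(cov, value1, pc1))
# Each play's coordinates on the two components: its position in the scatterplot.
scores = {name: (sum(x * a for x, a in zip(row, pc1)),
                 sum(x * a for x, a in zip(row, pc2)))
          for name, row in zip(plays, centered)}
for name, (s1, s2) in scores.items():
    print(f"{name}: PC1={s1:.2f}, PC2={s2:.2f}")
```

In this toy data the two invented histories land on the positive side of the first component and the two comedies on the negative side, because the first component opposes Description to First Person and Interaction. With only three made-up variables that opposition collapses into a single component, whereas the real analysis spreads it over PC1 and PC2.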

Let’s “believe” what we’re seeing here and assume that these linguistic patterns correspond to something like the genre distinctions that Shakespeare’s editors, Heminges and Condell, saw when they classed the plays into three genres on the contents page of the First Folio. (Hope and I have argued for this assumption elsewhere.) Now there are two ways to proceed at this point. We could provide a spreadsheet detailing the loadings of the variables on each of the components, which would be interesting to you if you already knew Docuscope well and had a decent sense of how PCA works. But a more immediately useful thing to do would be to show passages that contain lots of the tagged words that are pulling the plays in these different directions, the ones that “magnetize” the plays into groups, so to speak, so that we can have a sense of what is an “exemplary” History passage according to PCA, or an exemplary Comedy passage. (Note that the comedies are also being situated in the opposite quadrant, the upper left, from the histories: this is a pattern we see again and again at different levels of analysis.) So what does a historical passage look like from Richard II? What, more interestingly, does a “historical” passage look like from Love’s Labour’s Lost, which is “out of its quadrant” here? What does a typically comic passage from Twelfth Night look like? And what kind of passages in Othello are pushing it up into the comic quadrant? Do these “correct” and “incorrect” classings make sense to us from a literary critical standpoint? It has been claimed before that Othello contains many elements of Shakespeare’s comedy writing, so this would be a good place to start thinking about the value of statistically assisted linguistic analysis of genre, which I will do in the next post. (Jonathan may join in with comments, which will make this more of a back and forth.)

July 7, 2009
King or no [King]

I wanted to say a little about a problem we encountered early on when we began counting things in the plays, a problem that gets us into the question of what might be a trivial versus a non-trivial indicator of genre on the microlinguistic level. Several years ago Hope and I began a series of experiments with the plays contained in Shakespeare’s First Folio, feeding them into Docuscope, a text-tagger created at Carnegie Mellon, to see if we could find any ordered groupings in them. The results of that early work were published in the Journal for Early Modern Literary Studies in an article called “The Very Large Textual Object: A Prosthetic Reading of Shakespeare.” I will say more about Docuscope in subsequent posts, but suffice it to say here that it differs from other text-taggers in that it embodies a phenomenological approach to texts. (For the creator’s explanation of how it works, see an early online precis here.) Docuscope, that is, codes words and “strings” of words based on the ways in which they render a world experientially for a reader or listener. The theory behind how texts do this, and thus the rationale for Docuscope’s coding strategy, is derived from Michael Halliday’s systemic functional grammar. But what is particularly interesting about Docuscope is the human element involved in its creation. The main architect of the system, a rhetorician named David Kaufer, spent 8 years hand-tagging several million pieces of English according to their rhetorical function, and then expanded out this initial tagging spread with wild-card operators so that Docuscope now classes over 200 million strings of English (1 to 10 words in length) into over 100 distinct categories of use or function.

Obviously there is a lot to say about the program itself, which represents a “built rhetoric” of sorts, one that has emerged through the interplay of one architect, his reading, and the texts he was interested in classifying. In any event, when Hope and I fed the plays into Docuscope, we had to make some initial decisions, and the first was whether to strip anything out of the plays we had obtained from the Moby online version. (We were already thinking about the shortcomings of this conflated, edited corpus as opposed to the text of the plays as it exists in various states in the First Folio, but we had to make do since we were not yet ready to modernize the spelling of F and decide among its internal variants.) So with the Moby text, we had things like Titles, Act and Scene Numbers, and Speech Prefixes (Othello, King Henry, Miranda, etc.). The speech prefixes created the greatest difficulty, because in the history plays the word “King” is, as you can imagine, used an awful lot — it appears in the speech prefixes of characters over and over. And because Docuscope tagged “King” as one of its visible tokens (assigning it to the “bucket” named “Common Authority”), this particular category was off the charts in terms of frequency when it came time to do unsupervised factor analysis on the frequency counts obtained from the plays. (I’ll post more on factor analysis in the future as well.)

Here’s the issue. In the end, we decided that it was “cheating” to let Docuscope count “King” in the speech prefixes, since this was a dead giveaway for History plays, and we wanted something more structural, something more buried in the coordination of word choices and exclusions, to serve as the basis of our linguistic “recipes” for Shakespeare’s genres. As the article shows, we were able to find such a recipe without relying on “King” in the speech prefixes. Indeed, subsequent research has shown that plural first person pronouns combined with a profusion of concrete, sense objects are really the giveaway for Shakespeare’s histories. (They are also “missing” certain things that other genres have: this combination makes histories the most “visible” genre, statistically speaking, that he wrote.) But is it really fair to decide that certain types of tokens (King in the speech prefix, for example) are superficial marks of history as a genre, and so not worth using in an analysis? Isn’t there a certain interpretive bias here, one that I have and in a sense want to argue for, against the apparatus of the play in favor of something like a deeper set of patterns or stances? To argue for such an exclusion, I would begin by pointing out that speech prefixes are an artifact of print and are not “said” (even if they are used) in performance, but there is still something to think about here.
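The effect of that decision on a frequency count is easy to see in miniature. The sketch below assumes a hypothetical layout in which each speech opens with an all-caps prefix ending in a period (the Moby files may be laid out differently), and the `PREFIX` pattern and `count_word` helper are illustrative names, not part of any tool mentioned here.

```python
import re

# ASSUMED layout: an all-caps speech prefix, a period, then the spoken line.
PREFIX = re.compile(r"^[A-Z][A-Z ]*\.\s+")

def count_word(lines, word, strip_prefixes):
    """Count occurrences of `word`, optionally removing speech prefixes first."""
    total = 0
    for line in lines:
        if strip_prefixes:
            line = PREFIX.sub("", line)
        total += re.findall(r"[a-z']+", line.lower()).count(word)
    return total

# Three lines from Richard II 3.3, with prefixes added in the assumed layout.
lines = [
    "KING RICHARD. We are amazed; and thus long have we stood",
    "NORTHUMBERLAND. The king of heaven forbid our lord the king",
    "KING RICHARD. What must the king do now?",
]
with_prefixes = count_word(lines, "king", strip_prefixes=False)
without_prefixes = count_word(lines, "king", strip_prefixes=True)
print(with_prefixes, without_prefixes)
```

Even in three lines the prefixes account for two of the five occurrences of “king”; across a whole history play the apparatus would inflate the count far more.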

A Google search algorithm looks for the “shortest vector” or easiest “tell” that identifies a text as this kind or that — even if it is one of a kind. But those of us who are interested in genre must by definition not be interested in the shortest vector or the easiest tell. We are looking for the longer path. The book historian in me, however, says that apparatus is important, and that “accidental” features never really are. So this is something I want to think more about.

July 2, 2009
The Plunge

I have created Wine Dark Sea as a point of interchange for collaborative work I am doing on the relationship between statistics, texts and history, as well as the images that might communicate this relationship. Since much of this activity has been experimental — I produce images, graphs, and data — I thought I needed a place where I could present this material. Currently I am researching the statistical profile of Shakespeare’s genres with my colleague Jonathan Hope (Strathclyde University, Glasgow) and have begun another on the nature of English prose genres in the Early English Books Online corpus with Eric Raimy (U.W. Madison) and Suguru Ishizaki (Carnegie Mellon). I am also collaborating with Matt Jockers and Franco Moretti at Stanford on a computational and historical study of genre change in the nineteenth century novel. Some of this work is conducted through the Working Group for Digital Inquiry, which is housed in the Memorial Library at the University of Wisconsin, where I teach Renaissance Studies. In the next few months I will be offering several images produced in the course of my collaborations with some ideas about what they mean.

June 22, 2009