The following image, produced using the statistical package R, represents the position of the plays from Shakespeare’s First Folio according to patterns discerned using inferential statistics. The plays themselves were “tagged” by Docuscope, and the frequency scores of each play were then analyzed with a standard unsupervised statistical procedure (R’s prcomp with center = TRUE and scale. = FALSE, that is, centered but not rescaled), producing principal components, the first two of which define underlying patterns of correlation and opposition in Shakespeare’s use of certain kinds of words in different genres. In this post, the first of several, I want to discuss what can be learned from such a map and how it either confirms or diverges from critical understandings of Shakespeare’s genres, whether those of Shakespeare’s contemporaries or those of later critics (who became interested in the so-called “Late Plays”).
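For readers who want to see what a centered, unscaled PCA does mechanically, here is a minimal sketch. It is written in Python with NumPy rather than R, and it is not the script that produced the image: the function name `pca_centered` and the tiny frequency table are hypothetical stand-ins for the real Docuscope output. The sketch centers each column, leaves the scale alone, and takes a singular value decomposition, which is the same general route prcomp takes.

```python
import numpy as np

def pca_centered(X, n_components=2):
    """PCA on column-centered, unscaled data via SVD.

    Intended as an analogue of R's prcomp(X, center = TRUE, scale. = FALSE).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each variable, no rescaling
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # position of each row on each PC
    loadings = Vt[:n_components].T           # weight of each variable on each PC
    var_share = s**2 / np.sum(s**2)          # proportion of variance per component
    return scores, loadings, var_share

# Hypothetical toy data: rows are "plays", columns are word-category frequencies.
toy = np.array([
    [4.0, 1.0, 2.0],
    [3.5, 1.2, 2.1],
    [1.0, 3.8, 2.0],
    [1.2, 4.1, 1.9],
])

scores, loadings, var_share = pca_centered(toy)
print(scores)      # one (PC1, PC2) pair per row: the coordinates plotted in a scatter
print(var_share)   # earlier components account for at least as much variance as later ones
```

The sign of a component is arbitrary, so a different implementation may flip an axis without changing which plays sit near which.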
The first thing to notice is that these two components are doing a reasonable job of separating out at least two types of plays, Comedies and Histories, by placing them in opposite corners of the scatterplot. The two scatterplots you see here are really the same scatter viewed from two positions, so we will concentrate on the one in the upper right, which rates the plays on Principal Component 1 (PC1) on the horizontal axis and Principal Component 2 (PC2) on the vertical axis. In Principal Component Analysis (PCA), the earlier components “suck up” more variation than later ones, which means that later components define progressively less powerful avenues for simplifying relationships among the variables: here, the types of words Shakespeare uses (or doesn’t use) in different types of plays. So PC1 does an excellent job of pushing the pink circles, the Histories, to the right of the scatter, which means they all use proportionally more of the types of words favored by this component (and fewer of the types it shuns). (More anon.) Note that the pink dots are the plays classed as Histories in the First Folio; the one History not shown in pink is Henry VIII, which appears as a green dot because we’ve pulled out four “late plays” in green: Henry VIII, Cymbeline, Winter’s Tale and Tempest. PC2 finishes the job, pushing these Histories down on the vertical axis and so partitioning them (roughly) in the lower right-hand corner of the scatter. Here we would say that the Histories, or pink dots, have comparatively fewer of the types of words favored by PC2 (and more of the words that it discriminates against). Think of components, then, as representing correlations of things that Shakespeare does and doesn’t do over the course of all his writing published in the First Folio (F): later we will see how his choice of one type of word (for example, first person singular pronouns) almost always “goes with” a lack of other types of words.
Here the point is to show that a basic pattern emerges with so-called unsupervised statistical techniques, which are powerful precisely because they do not use any groupings provided by us, but rather look for latent patterns among the words classed by Docuscope.
Let’s “believe” what we’re seeing here and assume that these linguistic patterns correspond to something like the genre distinctions that Shakespeare’s editors, Heminges and Condell, saw when they classed the plays into three genres on the contents page of the First Folio. (Hope and I have argued for this assumption elsewhere.) Now there are two ways to proceed at this point. We could provide a spreadsheet detailing the loadings of the variables on each of the components, which would be interesting to you if you already knew Docuscope well and had a decent sense of how PCA works. But a more immediately useful thing to do would be to show passages that contain lots of the tagged words that are pulling the plays in these different directions (the ones that “magnetize” the plays into groups, so to speak) so that we can have a sense of what is an “exemplary” History passage according to PCA, or an exemplary Comedy passage. (Note that the Comedies are situated in the upper left, the quadrant opposite the Histories: this is a pattern we see again and again at different levels of analysis.) So what does a historical passage look like from Richard II? What, more interestingly, does a “historical” passage look like from Love’s Labour’s Lost, which is “out of its quadrant” here? What does a typically comic passage from Twelfth Night look like? And what kinds of passages in Othello are pushing it up into the comic quadrant? Do these “correct” and “incorrect” classings make sense to us from a literary critical standpoint? It has been claimed before that Othello contains many elements of Shakespeare’s comedy writing, so this would be a good place to start thinking about the value of statistically assisted linguistic analysis of genre, which I will do in the next post. (Jonathan may join in with comments, which will make this more of a back and forth.)
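To make the two options above a little more concrete, here is a rough Python sketch of what a loadings “spreadsheet” and a passage-level reading might involve. Everything in it is assumed for illustration: the category names other than first person pronouns, the loading values, the corpus means, and the helper functions `rank_loadings` and `passage_pull` are hypothetical, not Docuscope’s real categories or our actual numbers. The idea is simply that a passage rich in positively loaded word types, and poor in negatively loaded ones, gets pulled toward the positive end of a component.

```python
import numpy as np

# Hypothetical word categories and PC1 loadings (illustrative values only).
categories = ["first_person_pronouns", "category_b", "category_c", "category_d"]
pc1_loadings = np.array([-0.6, 0.5, -0.3, 0.4])

# Hypothetical mean frequency of each category across the whole corpus.
corpus_means = np.array([2.0, 2.0, 2.0, 2.0])

def rank_loadings(names, loadings):
    """List categories from most negative to most positive loading."""
    order = np.argsort(loadings)
    return [(names[i], float(loadings[i])) for i in order]

def passage_pull(freqs, means, loadings):
    """Project a passage's centered category frequencies onto one component."""
    return float((np.asarray(freqs, dtype=float) - means) @ loadings)

ranked = rank_loadings(categories, pc1_loadings)
for name, weight in ranked:
    print(f"{name:>24s}  {weight:+.2f}")

# Two hypothetical passages described by their category frequencies.
passage_a = [0.5, 4.0, 1.0, 3.0]   # few first person pronouns, much category_b
passage_b = [4.0, 0.5, 3.0, 1.0]   # many first person pronouns, little category_b

pull_a = passage_pull(passage_a, corpus_means, pc1_loadings)
pull_b = passage_pull(passage_b, corpus_means, pc1_loadings)
print(pull_a, pull_b)   # opposite signs: the passages pull in opposite directions
```

Ranking passages by a score like this is one plausible way to go looking for “exemplary” passages; it would not replace actually reading them.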