A Genre Map of Shakespeare’s Plays from the First Folio (1623)

[Image: PCquadrantsCH]

The following image, produced using the statistical package R, represents the position of the plays from Shakespeare’s First Folio according to patterns discerned using inferential statistics. The plays themselves were “tagged” by Docuscope, and the frequency scores of each play were then analyzed with a standard unsupervised statistical procedure (prcomp, center=TRUE, scale=FALSE), yielding principal components, the first two of which define underlying patterns of correlation and opposition in Shakespeare’s use of certain kinds of words in different genres.  In this post, the first of several, I want to discuss what can be learned from such a map and how it either confirms or diverges from current critical understandings of Shakespeare’s genres, whether those of Shakespeare’s contemporaries or of later critics (who became interested in the so-called “Late Plays”).
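The prcomp settings mentioned above center each variable but do not rescale it. As a rough illustration of what that computation involves, here is a minimal Python sketch; it is not the R code used for the plot. The frequency table is made up, the function names are hypothetical, and power iteration with deflation stands in for the singular value decomposition that prcomp itself uses. Component signs are arbitrary in any PCA, so only the relative signs of the loadings are meaningful.

```python
import math

def center_columns(rows):
    """Subtract each column's mean (the centering step); no rescaling."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix (n - 1 divisor) of already-centered rows."""
    n, p = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
            for i in range(p)]

def top_eigenpair(matrix, iterations=500):
    """Power iteration: dominant eigenvalue and unit eigenvector of a
    symmetric matrix. The start vector is an arbitrary assumption."""
    p = len(matrix)
    vec = [float(i + 1) for i in range(p)]
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm < 1e-12:  # nothing left to extract
            break
        vec = [x / norm for x in nxt]
    value = sum(vec[i] * matrix[i][j] * vec[j]
                for i in range(p) for j in range(p))
    return value, vec

def pca(rows, n_components=2):
    """Return (loadings, variances, scores) for the leading components."""
    centered = center_columns(rows)
    matrix = covariance(centered)
    p = len(matrix)
    components, variances = [], []
    for _ in range(n_components):
        value, vec = top_eigenpair(matrix)
        components.append(vec)
        variances.append(value)
        # Deflate: remove the direction just found so the next pass
        # picks up the next-largest source of variation.
        matrix = [[matrix[i][j] - value * vec[i] * vec[j] for j in range(p)]
                  for i in range(p)]
    scores = [[sum(c * x for c, x in zip(comp, row)) for comp in components]
              for row in centered]
    return components, variances, scores

# Made-up frequency scores: one row per play, one column per word category.
toy_rows = [
    [2.0, 1.0, 4.0], [2.5, 0.8, 3.9], [1.0, 3.0, 4.1],
    [1.2, 2.8, 4.0], [3.0, 0.5, 3.8], [0.8, 3.2, 4.2],
]
loadings, variances, scores = pca(toy_rows, n_components=2)
```

Because the columns are centered but not scaled, categories with larger raw variance weigh more heavily on the components; rescaling each column to unit variance first would change the picture.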

The first thing to notice is that these two components do a reasonable job of separating out at least two types of plays, Comedies and Histories, by placing them in opposite corners of the scatterplot.  The two scatterplots you see here are really the same scatter viewed from two positions, so we will concentrate on the one in the upper right, which rates the plays on Principal Component 1 (PC1) on the horizontal axis and Principal Component 2 (PC2) on the vertical axis.  In Principal Component Analysis (PCA), the earlier components “suck up” more variation than later ones, which means that later components define progressively less powerful avenues for simplifying relationships among the variables (here, the types of words Shakespeare uses, or doesn’t use, in different types of plays).  So PC1 does an excellent job of pushing the pink circles, the Histories, to the right of the scatter, which means they all have a proportionally higher share of the types of words favored by this component and a lower share of those it shuns.  (More anon.)  Note that the pink dots are the plays classed as Histories in the First Folio; the one exception is Henry VIII, a Folio History shown here in green (we’ve pulled out four “late plays” in green: Henry VIII, Cymbeline, Winter’s Tale and Tempest).  PC2 finishes the job, pushing these histories down on the vertical axis and so partitioning them (roughly) in the lower right-hand corner of the scatter.  Here we would say that the Histories, or pink dots, have comparatively fewer of the types of words favored by PC2 (and more of the words that it discriminates against).  Think of components, then, as representing correlations of things that Shakespeare does and doesn’t do over the course of all his writing published in F: later we will see how his choice of one type of word (for example, first person singular pronouns) almost always “goes with” a lack of other types of words.
Here the point is to show that a basic pattern emerges with so-called unsupervised statistical techniques, which are the most compelling here because they do not use any groupings provided by us, but rather look for latent patterns among the words classed by Docuscope.
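To make the idea of plays landing in a corner of the scatter concrete, here is a small Python sketch of how component scores translate into quadrants. Everything in it is assumed for illustration: the play labels, the three category names, the rates, and the two loading vectors are invented and are not taken from the Docuscope analysis.

```python
def project(rows, components):
    """Center each column, then take the dot product with each component."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[v - m for v, m in zip(row, means)] for row in rows]
    return [[sum(c * x for c, x in zip(comp, row)) for comp in components]
            for row in centered]

def quadrant(pc1, pc2):
    """Name the corner of the scatter for a (PC1, PC2) score pair."""
    vertical = "upper" if pc2 >= 0 else "lower"
    horizontal = "right" if pc1 >= 0 else "left"
    return vertical + " " + horizontal

# Invented rates; columns: first_person, description, interaction.
plays = {
    "made-up history": [1.0, 5.0, 4.0],
    "made-up comedy": [5.0, 1.0, 1.0],
    "made-up play C": [2.0, 2.0, 1.0],
    "made-up play D": [4.0, 4.0, 4.0],
}

# Invented unit-length, mutually orthogonal loadings for PC1 and PC2.
# A positive loading means the component "favors" that category;
# a negative loading means it discriminates against it.
loadings = [
    [-0.6, 0.8, 0.0],
    [0.48, 0.36, -0.8],
]

names = list(plays)
scores = project([plays[name] for name in names], loadings)
placement = {name: quadrant(pc1, pc2)
             for name, (pc1, pc2) in zip(names, scores)}
```

A play scores high on a component when it has more than the average of the favored categories and less than the average of the disfavored ones, which is why a single score can reflect both what a play does and what it avoids.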

Let’s “believe” what we’re seeing here and assume that these linguistic patterns correspond to something like the genre distinctions that Shakespeare’s editors, Heminges and Condell, saw when they classed the plays into three genres on the contents page of the First Folio.  (Hope and I have argued for this assumption elsewhere.)  There are two ways to proceed at this point.  We could provide a spreadsheet detailing the loadings of the variables on each of the components, which would be interesting to you if you already knew Docuscope well and had a decent sense of how PCA works.  But a more immediately useful thing to do would be to show passages that contain lots of the tagged words that are pulling the plays in these different directions (the ones that “magnetize” the plays into groups, so to speak) so that we can have a sense of what is an “exemplary” History passage according to PCA, or an exemplary Comedy passage.  (Note that the comedies are also being situated in the opposite quadrant, the upper left, from the histories: this is a pattern we see again and again at different levels of analysis.)  So what does a historical passage look like from Richard II?  What, more interestingly, does a “historical” passage look like from Love’s Labour’s Lost, which is “out of its quadrant” here? What does a typically comic passage from Twelfth Night look like?  And what kind of passages in Othello are pushing it up into the comic quadrant?  Do these “correct” and “incorrect” classings make sense to us from a literary critical standpoint?  It has been claimed before that Othello contains many elements of Shakespeare’s comedy writing, so this would be a good place to start thinking about the value of statistically assisted linguistic analysis of genre, which I will do in the next post.  (Jonathan may join in with comments, which will make this more of a back and forth.)
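One simple way to hunt for an “exemplary” passage, sketched here in Python, is to slide a fixed-size window over a play’s sequence of tags and keep the window whose tags add up to the highest total loading on a component. This is only a heuristic under stated assumptions, not the procedure used for these posts: the tag names and loading values are invented, and the function name is hypothetical.

```python
def best_window(tags, loadings, size):
    """Return (start, score) for the window of `size` consecutive tags
    whose summed loadings are highest. Tags without a loading count as 0.
    Ties go to the earliest window."""
    weights = [loadings.get(tag, 0.0) for tag in tags]
    if size <= 0 or size > len(weights):
        raise ValueError("window size must be between 1 and len(tags)")
    score = sum(weights[:size])
    best_start, best_score = 0, score
    for start in range(1, len(weights) - size + 1):
        # Slide by one: add the tag entering the window, drop the one leaving.
        score += weights[start + size - 1] - weights[start - 1]
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score

# Invented tag sequence for a short stretch of text, one tag per token.
tags = ["first_person", "description", "description",
        "other", "description", "first_person"]

# Invented loadings for one component.
loadings = {"description": 0.5, "first_person": -0.25}

start, score = best_window(tags, loadings, size=3)
```

Using the most negative window instead would surface passages typical of the opposite end of the component. A fuller version would probably normalize by passage length and compare rates against the corpus averages rather than summing raw loadings.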



  1. Posted July 9, 2009 at 4:54 pm

    Good stuff here, Mike, and the PCA results appear to correlate well with the hierarchical clustering I did back in February (see Machine-Classifying Novels and Plays by Genre). The big question for me, as a non-Shakespearean, is how to explain Othello’s placement with the comedies and Love’s Labour’s Lost with the tragedies. I’m sure there is an argument to be made here, but is it a reasonable one? Is there a way to argue that Othello is (at least stylistically) more like a comedy than a tragedy? Maybe this argument has already been made and this data is just confirming old news. Still, I’d like to hear the argument. Matt

  2. admin
    Posted July 10, 2009 at 8:39 am

    Thanks, Matt. I’m actually going to explore all of the outliers on this first map, looking at why plays are where we think they ought to be and why some are not. By coincidence, I happen to be writing this post in the Folger Shakespeare Library, where the scholar Susan Snyder was a Scholar in Residence after her retirement from the University of Delaware. Snyder published a fascinating essay in 1979 called “Othello and the Conventions of Romantic Comedy” which claims that “the tragedy is generated and heightened through the relation to comedy rather than in spite of it” (see excerpt). I think she was perhaps the first Shakespearean to correctly identify how and why this play moves in relation to comedy, followed by Stephen Orgel, who made other arguments along these lines in “Othello and the End of Comedy.” I will be exploring these claims in depth once I move to the upper left-hand portion of the scatterplot. At this point, however, I am comfortable saying that statistical analysis is providing interesting exfoliation, on the linguistic level, of an aspect of Othello that was first explicated 30 years ago in terms of plot and scenario. What is interesting is that there are other plays that show this linguistic “counter-typing” as well, such as Love’s Labour’s Lost, but which have not yet been discussed by scholars in these ways (as far as I know).

  3. Martin Mueller
    Posted September 29, 2009 at 9:49 pm

    Douglas Stewart, a classicist colleague of mine at Brandeis in the mid-sixties, published “Othello as Roman Comedy Turned Nightmare” in the Emory Quarterly (1967).

    More recently Steve Ramsay developed a piece of software that translates the scenic progression of a play into graphs and seeks to infer genre from those graphs. This also ‘misclassifies’ Othello as a comedy.

    The interesting thing is that different algorithmic and critical methods point to very similar conclusions.

  4. admin
    Posted October 5, 2009 at 10:48 am

    I would be very interested in seeing Ramsay’s results. I agree that convergence is the thing to be explained here, since it suggests that genre takes place in a sensory-social manifold that becomes available to us in a number of ways (phenomenologically, quantitatively, etc.).

  5. David
    Posted January 27, 2010 at 7:34 pm

    Hi everyone: I’m definitely an interloper here, so please forgive me if I ask a boorish question! I’ve enjoyed reading Shakespeare and also use statistics in my job, so I was curious when this site showed up as I googled for resources on cluster analysis. But I cannot for the life of me see why anyone should care what a statistical algorithm has to ‘say’ about Shakespeare. It seems to me the importance of the plays is what they have to say to us. No? Regards. David

  6. Michael Witmore
    Posted January 28, 2010 at 10:14 pm

    Thanks for your question, David: it is the same one that got me interested in doing this kind of analysis around five years ago. Because I teach Shakespeare to hundreds of students at the University of Wisconsin every year, I get the chance to read the plays over and over again. This is a great pleasure, since each time I read or teach a play I appreciate more of its verbal complexity, historical significance, or emotional power.

    So I was shocked (scandalized would be a better word) to find out that something very subtle about the plays (the way they become comic or tragic, for example) was “legible” statistically, through the tagging of words or strings of words into categories. This just did not seem possible to me. I find this convergence of human literary judgments with the results of statistical clustering interesting for two reasons, then:

    1) It tells us something about the subtlety of literary genres and conventions which human readers are picking up all the time (by recognizing a certain type of plot, character, or incident), but which is recognizable for completely different reasons at the level of the sentence. Why would these two things converge? Because PCA and cluster analysis imperfectly model the lightning fast series of comparisons that human beings make all the time when they read complicated literary/dramatic texts like those written by Shakespeare. They give us another way of appreciating the complexity of both literature and interpretation.

    2) It may tell us something new about how language works. Are there certain things that you can and must do to tell a certain type of story, or to get your story recognized as belonging to a certain type? Are these things (patterns of word use) contagious in a particular cultural or social environment, like mannerisms or conventions? Do they travel, and if so, along what pathways? If dialects, accents, or phonological habits move culturally from speaker to speaker, are there comparable mechanisms of transmission for types of stories in written (or transcribed) texts, and can statistical work with digital collections of texts help us identify such mechanisms? As someone who studies this body of literature, I find these questions really intriguing, which is why I started the blog.

