I’ve expanded the labels here on our PCA scatterplot in order to see a few more items. Several things worth thinking about here:
• Late Plays are clustering in neither the Comedy nor the History quadrants explored in the other posts. The three that we see here (The Winter’s Tale, Cymbeline, and Henry VIII) thus lack the dialogic interactivity we saw in comedy and the profusion of concrete nouns and description in history. This is an interesting way of thinking of the Late Plays: as lacking something that is a defining presence in the two most linguistically “obvious” genres of Shakespeare’s writing (comedy and history). We might think of genres that show up as diagonally opposed in PCA as “linguistic primes” in that they seem to be composed of nothing simpler than themselves. Those that are caught in the remaining corners (themselves lacking any opposite partner) would then be called “secondary,” since they cohere indirectly on a set of differences that are more comprehensively ordering a different part of the field. Note too that Romeo and Juliet is virtually identical with The Tempest, our last Late Play, in this plot. Both plays break the most obvious “rule” that Shakespeare seems to honor in his writing of plays, that of choosing between either First Person + Interaction strings or Description strings, but not both, and they break this “rule” in exactly the same way. Instead of choosing one of these two linguistic “forks in the road,” Romeo and Juliet and The Tempest take both at the same time, combining lots of the dialogical element we saw in Twelfth Night with the profusion of concrete descriptions (nouns, adjectives) that characterized Richard II.
• In almost every visualization I have used of these data (Factor Analysis with various rotations, PCA), I find that A Midsummer Night’s Dream is unusual among the comedies. Sometimes it is grouped with the histories because it contains so much description in the passages dealing with the fairy landscape. Linguistically, this feature sets A Midsummer Night’s Dream apart from other comedies. For an illustration of what is unusual about MSND, which scores unusually high on the history component (Description) but also scores reasonably high on the comedy one (First Person/Interaction), click here. I also find that Henry VIII is often placed away from the pack, which in this case is due to its relative lack of all three string types tracked in this exercise: Description somewhat, but very obviously First Person and Interaction. (For a sample passage where few of these are present, click here.) There are many reasons why this play might be distinctive (it is co-written with Fletcher, and it is written at the very end of Shakespeare’s career), but the only way to really know is to look at individual passages like the one I’ve posted and see what’s going on. Seeing what an absence of something is making possible, of course, is often more difficult than seeing what the presence of something makes possible.
• Two very unusual Comedies are showing up in the lower left-hand quadrant, where three of the four Late Plays are located. This makes a certain kind of sense, as Measure for Measure and All’s Well That Ends Well are regularly described by critics as “problem comedies.” From a critical standpoint, this means that they lack the buoyant tone of plays like Much Ado or As You Like It, or that they veer into emotions or problems that cannot really be solved by a few marriages at the end of the play (e.g., Angelo’s redemption or Bertram’s romantic rehabilitation). Of course, from a statistical-linguistic standpoint, the description of what makes these plays “unusual” would be different: they lack the First Person and Interaction strings of the high comedies while simultaneously lacking the Description strings that characterize histories. This description could be more nuanced, since there are more subtle ways of characterizing these patterns if we break the plays down into smaller parts (and so can use more refined categories), but we will do this later.
• Tragedies are evenly spread out over the plot. This is in and of itself a significant finding; it does not mean that tragedies don’t have distinguishing traits, but that those traits aren’t tracked by the most obvious forms of coordinated variation that we can track in this corpus using Docuscope. I suspect that Matt Jockers’ most-frequent-word analysis would produce a similar result, as he and I have been finding very similar patterns in primary and secondary genre divisions using our different means. In fact, a combination of two other components (PC3 and PC5) does corner the tragedies in their own quadrant, and this will be the subject of a future post.
So what are the rest of these dots? Below is an R biplot which shows the items plotted in the PCA scatterplot above, but instead of distinguishing them by color, it lists them by item number. (The numbers correspond to play titles, which I have also posted on the left-hand side of the image; please click on the image below to open in another screen, then click again to resize to your window.) The biplot is helpful because, in addition to plotting the plays in PCA space, it shows the component loadings, which means that it illustrates the relationship between the variables counted as they vary across this corpus. The magnitude of trackable variation in each individual variable (First Person, Interaction, etc.) is represented by a line in space (a vector), and its relationship to the other variables is registered geometrically by the angles between these vectors, each labeled with its variable name (X. [Variable Name]) and arranged around the origin. I have numbered the plays in order of composition, using the dating scheme provided by the Oxford editors. It makes for an interesting connect-the-dots, which represents Shakespeare’s stylistic progress throughout his career. (Note: he leaps.)
Variables that extend opposite one another at an angle of 180 degrees are inversely correlated, while those that line up on top of one another vary together. Vectors that sit at right angles to one another have an interesting feature: because they are orthogonal, their variation is unrelated. So from the biplot below, we can see quite quickly that First Person and Interaction strings tend to be found together in individual items (plays), whereas Description strings (which vary inversely with the amount of Topical Flow strings) tend to be present or absent in ways that have nothing to do with the presence or absence of First Person and Interaction strings. Another way of expressing this orthogonal relationship: behaviors among First Person and Interaction strings are (for whatever reason) indifferent to those of Description and Topical Flow strings, and vice versa. This doesn’t mean they aren’t connected on some other component (we are only looking at the first two here), but when we are thinking about the most statistically powerful description of variance in the corpus (which is captured in the early principal components), this is how all of the quantities of counted things relate.
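For readers who want to see this geometry in action, here is a minimal Python sketch. It is not the R code behind the biplot, and it does not use the Docuscope counts: the data are synthetic, generated under the assumption that one hidden factor drives First Person and Interaction together while a second, independent factor pushes Description up and Topical Flow down. The helper names (`pca_loadings`, `cosine`) and the column labels are illustrative only. The sketch computes the loading vectors for the first two components and then the cosine of the angle between pairs of them.

```python
import numpy as np

def pca_loadings(data, n_components=2):
    """Return (scores, loadings) for the first n_components principal components."""
    # Standardize each column so the PCA works on correlations, not raw counts.
    z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)
    u, s, vt = np.linalg.svd(z, full_matrices=False)
    # Scores: where each item (row) lands in component space.
    scores = u[:, :n_components] * s[:n_components]
    # Loadings: one vector per variable, scaled by each component's standard deviation.
    loadings = vt[:n_components].T * (s[:n_components] / np.sqrt(len(z) - 1))
    return scores, loadings

def cosine(a, b):
    """Cosine of the angle between two loading vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic stand-in for the corpus (an assumption, not real play data).
rng = np.random.default_rng(0)
n_items = 200
dialogue = rng.normal(size=n_items)   # hidden factor behind First Person + Interaction
scene = rng.normal(size=n_items)      # hidden factor behind Description vs. Topical Flow

def noise():
    return 0.3 * rng.normal(size=n_items)

names = ["FirstPerson", "Interaction", "Description", "TopicalFlow"]
data = np.column_stack([
    dialogue + noise(),
    dialogue + noise(),
    scene + noise(),
    -scene + noise(),
])

scores, loadings = pca_loadings(data)
vec = dict(zip(names, loadings))

cos_together = cosine(vec["FirstPerson"], vec["Interaction"])     # vectors nearly overlap
cos_opposed = cosine(vec["Description"], vec["TopicalFlow"])      # vectors nearly 180 degrees apart
cos_unrelated = cosine(vec["FirstPerson"], vec["Description"])    # vectors nearly at right angles

print(f"FirstPerson vs Interaction: {cos_together:+.2f}")
print(f"Description vs TopicalFlow: {cos_opposed:+.2f}")
print(f"FirstPerson vs Description: {cos_unrelated:+.2f}")
```

A cosine near +1 corresponds to vectors lying on top of one another, near -1 to vectors pointing in opposite directions, and near 0 to vectors at right angles. Because only two components are kept, these angles approximate the underlying correlations rather than reproduce them exactly.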
A parting thought: what two plays are the most opposite in terms of style, based on what Docuscope sees and PCA can find in terms of variation patterns? Two obvious candidates would be Henry V and The Comedy of Errors, number 19 at the bottom and number 8 at the top; and A Midsummer Night’s Dream and Measure for Measure, numbers 12 and 25 on the left and right. If you’ve been following the discussion and this diagram makes sense to you, or if you’ve just read both pairs of plays, you know why they are so different.
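One way to put the "most opposite" question to the numbers is to look for the pair of items that sit farthest apart in the plane of the first two components. The sketch below assumes plain Euclidean distance as the measure of "opposite", which is only one possible reading, and it uses made-up placeholder coordinates (`toy_scores`, `item_a` through `item_d`) rather than the actual scores of any play.

```python
from itertools import combinations
import math

def most_distant_pair(points):
    """Return the two labels whose coordinates are farthest apart, plus that distance."""
    first, second = max(
        combinations(points, 2),
        key=lambda pair: math.dist(points[pair[0]], points[pair[1]]),
    )
    return first, second, math.dist(points[first], points[second])

# Hypothetical (PC1, PC2) coordinates, invented purely for illustration.
toy_scores = {
    "item_a": (0.0, 3.0),    # near the top of the plot
    "item_b": (0.5, -3.5),   # near the bottom
    "item_c": (-2.0, 0.0),   # far left
    "item_d": (2.5, 0.5),    # far right
}

first, second, gap = most_distant_pair(toy_scores)
print(first, second, round(gap, 2))
```

With real component scores in place of the placeholders, the same function would name the pair of plays that the first two components pull furthest apart.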