There is another way to visualize the degrees of similarity or dissimilarity among the items we’ve been analyzing in Shakespeare’s Folio plays. A dendrogram, which looks like a family tree, is a two-dimensional representation of similarities derived from a statistical technique known as hierarchical clustering. I will say something about this technique in a moment, but first, a review of what we have here.
At the top of this post is the scatterplot we have been working with throughout our analysis of the full texts of the Folio plays. This graph plots the principal components (1 and 2) derived from an analysis of the plays in R using the command “prcomp” (where we have centered but not scaled the data). This analysis took place at the Cluster level: Docuscope can group its most basic token types (the LATs) into seventeen meta-types called Clusters, and we chose this level because we want fewer variables than observations. Thus, because we have 36 plays in the Folio, we chose to group the items we have counted into seventeen buckets or containers, the Clusters. In previous posts, we tried to explain how the tokens collected in these clusters explain the filiations among the plays that have been discovered by unsupervised methods of statistical analysis, that is, methods requiring no prior knowledge of which genres the plays are “supposed” to fit into.
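For readers who want to see what “centered but not scaled” amounts to, here is a minimal sketch in Python with NumPy. It is not the analysis behind the scatterplot: the 36 × 17 matrix is filled with random numbers standing in for the Cluster counts, and the function name `pca_centered_unscaled` is my own invention for illustration. The idea is simply to subtract each column's mean, leave the column variances alone, and read the components off a singular value decomposition, which is also how prcomp proceeds.

```python
import numpy as np

def pca_centered_unscaled(X):
    """PCA on column-centered, unscaled data.

    Intended to mirror the choice described above (center, do not scale).
    Returns (scores, loadings, sdev).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # center each variable; no rescaling
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                           # coordinates of each row on each component
    loadings = Vt.T                          # one column of variable weights per component
    sdev = s / np.sqrt(X.shape[0] - 1)       # standard deviation along each component
    return scores, loadings, sdev

# Synthetic stand-in for the real table: 36 "plays" by 17 "Clusters".
rng = np.random.default_rng(0)
X = rng.random((36, 17))

scores, loadings, sdev = pca_centered_unscaled(X)
pc12 = scores[:, :2]                         # the two columns a PC1/PC2 scatterplot would use
```

Because the data are not scaled, variables with larger raw variance carry more weight in the components; that is a consequence of the choice, not a flaw in the sketch.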
There are all sorts of subtle trends that can be gleaned with the aid of statistics, and we will be exploring some of them in the future. I have begun with trends that are visible without much fancy footwork. Looking at unscaled findings of the first two principal components is a fairly vanilla procedure, statistically speaking, which means that its faults and virtues are well-known. It is a first glance at the nature of linked variation in the corpus. And when we look at this plot from above, we see the characteristic opposition of history and comedy, which employ different and, on the whole, opposed linguistic and rhetorical strategies in telling their stories on stage. But is there a way to quantify the degree to which Shakespeare’s plays are like or unlike one another when they are rated on these principal components?
The second and third illustrations provide this information. A dendrogram is a visual representation of a statistical process of agglomeration, the process of finding items that are closely related and then pairing them with other items that are also closely related. There are a number of different techniques for performing the agglomeration — they are variations on the beginning of a square dance, where the people who want to dance with each other most pair up first, and the foot draggers are added to the mix as the dance continues — but the one I have used here is Ward’s minimum variance method. In the work I have done so far with the plays, I have found that Ward’s produces groupings most consonant with the genres of Shakespeare’s plays as we understand them critically. In this dendrogram, the different genres of Shakespeare’s Folio plays are color coded: comedies are red, histories are green, tragedies are brown and late plays are blue. The third item, a table, shows the sequence in which items were paired, listing pairs in order from the most similar to the least.
We learn a couple of things from this analysis. First, using the first two principal components was a good but not terrific way of grouping the plays into genres. If we were to be bowled over by this analysis, we would expect to see less intermixing of different colored items within the clusters and subclusters of the dendrogram. Using more than two components will provide better accuracy, as we will see below. Second, there is a reasonably intuitive path to be travelled from the patterns in the dendrogram to some explanation at the level of the language in the plays. I can, for example, ask why Love’s Labour’s Lost and King John look so similar in this analysis, and find an answer in the components themselves. This is what we did in the previous post, where I looked at a passage that expressed the first and second components in a typical, history play sort of way. Because the scatterplot is a graph and not a map, we need to interpret proximity of items correctly: items that collocate with one another possess and lack the two components being graphed in the same way. The same is true of The Tempest and Romeo and Juliet, but in this case, these plays possess a high degree of both principal component 1 and principal component 2: they combine lots of description with lots of first person and interaction.
Now look at the tradeoff we make when we take advantage of all the components that can be extracted using PCA. Using all seventeen components, we get the following dendrogram, again using Ward’s minimum variance method. (In JMP 8, I am performing Ward’s procedure without standardization on the PCs derived using the covariance matrix, the latter components being identical, as far as I can tell, to those I would have derived in R using prcomp with center = TRUE and scale. = FALSE.) The first dendrogram is color coded by genre, like the one above. The second color codes the plays from lighter to darker, the lighter ones being those composed earlier in Shakespeare’s career (according to the Oxford editors) while the darker ones are from later.
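The parenthetical claim about the two routes to the components can be checked numerically, at least on made-up data. The sketch below, in Python with NumPy, is an assumption-laden stand-in: the matrix is random rather than the Docuscope counts, and it does not reproduce JMP's internals. It compares components taken from an eigendecomposition of the covariance matrix with components taken from a singular value decomposition of the centered data, and then checks that distances between items are unchanged when all seventeen unstandardized components are used.

```python
import numpy as np

# Synthetic stand-in for the real table: 36 "plays" by 17 "Clusters".
rng = np.random.default_rng(1)
X = rng.random((36, 17))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores_cov = Xc @ eigvecs

# Route 2: singular value decomposition of the centered data.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_svd = U * s

# The two sets of scores agree up to the arbitrary sign of each component.
same_up_to_sign = np.allclose(np.abs(scores_cov), np.abs(scores_svd))
same_variances = np.allclose(eigvals, s ** 2 / (X.shape[0] - 1))

def pairwise_sq_dists(M):
    """Squared Euclidean distance between every pair of rows."""
    diff = M[:, None, :] - M[None, :, :]
    return (diff ** 2).sum(axis=-1)

# With every component kept and none standardized, distances between items are unchanged.
dists_preserved = np.allclose(pairwise_sq_dists(scores_svd), pairwise_sq_dists(X))

print(same_up_to_sign, same_variances, dists_preserved)
```

If the last check holds, a clustering based on Euclidean distances over all the unstandardized components is working from the same distances as one based on the centered counts themselves; the components rotate the space without stretching it.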
Use of all the components provides much better separation into intelligible genres: notice how well the histories are clustering at the top of the dendrogram, while we find nice clusters of both tragedies and comedies below. And if we look at the two largest clusters — those linked by the right-most line in the dendrogram — we see that they are broadly separated into earlier and later plays (with the earlier on top, later below).
Nice result, and we’ll see some even nicer ones produced with different techniques in subsequent posts. But how do you work back from these images to a concrete discussion of what puts the plays together? Because we are dealing with seventeen components simultaneously, it is nearly impossible to make this properly interpretive leap. This is the fundamental paradox you encounter when you apply statistics to categorical data derived from texts: the more intensively you make use of mathematics in order to render intelligible groups, the less you know about what qualifies items for membership in those groups. If we were running a search engine company like Google, we wouldn’t worry about explaining why items are grouped together, since our business would only concern producing groupings that make sense to an “end user.” But we do not work for Google, at least not yet. There will be times when it makes more sense to visualize and explore manageable, critically intelligible levels of complexity instead of seeking the “perfect model” of literary experience.