Month: November 2009

  • Shakespearean Dendrograms

    PCA Scatterplot in R of the First Folio Plays

    Dendrogram produced in JMP on PC1 and PC2 using covariance matrix and Ward’s method

    Table showing the order in which the plays were paired (JMP, PC1 and PC2, Ward’s method)

    There is another way to visualize the degrees of similarity or dissimilarity among the items we’ve been analyzing in Shakespeare’s Folio plays. A dendrogram, which looks like a family tree, is a two-dimensional representation of similarities derived from a statistical technique known as hierarchical clustering. I will say something about this technique in a moment, but first, a review of what we have here.

    At the top of this post is the scatterplot we have been working with throughout our analysis of the full texts of the Folio plays. This graph plots the first two principal components derived from an analysis of the plays in R using the command “prcomp” (where we have centered but not scaled the data). This analysis took place at the Cluster level — Docuscope can group its most basic token types (the LATs) into seventeen meta-types called Clusters — a level we chose because we want fewer variables than observations. Thus, because we have 36 plays in the Folio, we chose to group the items we have counted into seventeen buckets or containers, the Clusters. In previous posts, we tried to explain how the tokens collected in these Clusters account for the filiations among the plays that have been discovered by unsupervised methods of statistical analysis — methods requiring no prior knowledge of which genres the plays are “supposed” to fit into.
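    The computation behind the scatterplot can be sketched in Python. This is not the original R session — the data below are invented stand-ins for the real play-by-Cluster counts — but the centering-without-scaling step and the extraction of the first two components mirror what prcomp does:

```python
import numpy as np

# Invented stand-in for the real data: 36 Folio plays (rows) by
# seventeen Docuscope Cluster frequencies (columns). Values are random.
rng = np.random.default_rng(0)
X = rng.random((36, 17))

# Center each column but do not scale -- the equivalent of
# R's prcomp(X, center = TRUE, scale. = FALSE).
Xc = X - X.mean(axis=0)

# Principal components via SVD of the centered matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                      # PC scores for each play
pc1, pc2 = scores[:, 0], scores[:, 1]   # the two axes of the scatterplot
```

    Plotting pc1 against pc2 for the real counts would reproduce the scatterplot above; by construction, pc1 captures the largest share of variance, pc2 the next largest.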

    There are all sorts of subtle trends that can be gleaned with the aid of statistics, and we will be exploring some of them in the future. I have begun with trends that are visible without much fancy footwork. Looking at the first two principal components of unscaled data is a fairly vanilla procedure, statistically speaking, which means that its faults and virtues are well-known. It is a first glance at the nature of linked variation in the corpus. And when we look at this plot from above, we see the characteristic opposition of history and comedy, which employ different and — on the whole — opposed linguistic and rhetorical strategies in telling their stories on stage. But is there a way to quantify the degree to which Shakespeare’s plays are like or unlike one another when they are rated on these principal components?

    The second and third illustrations provide this information. A dendrogram is a visual representation of a statistical process of agglomeration, the process of finding items that are closely related and then pairing them with other items that are also closely related. There are a number of different techniques for performing the agglomeration — they are variations on the beginning of a square dance, where the people who want to dance with each other most pair up first, and the foot draggers are added to the mix as the dance continues — but the one I have used here is Ward’s minimum variance method. In the work I have done so far with the plays, I have found that Ward’s produces groupings most consonant with the genres of Shakespeare’s plays as we understand them critically. In this dendrogram, the different genres of Shakespeare’s Folio plays are color coded: comedies are red, histories are green, tragedies are brown, and late plays are blue. The third item, a table, shows the sequence in which items were paired, listing pairs in order from the most similar to the least.
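    The agglomeration itself can be sketched with scipy’s implementation of Ward’s minimum variance method. The play labels and PC scores below are hypothetical, not the real JMP output; the point is the shape of the procedure — at each step, merge the two clusters whose fusion least increases total within-cluster variance:

```python
import numpy as np
from scipy.cluster.hierarchy import ward

# Hypothetical PC1/PC2 scores for a handful of plays (values invented).
plays = ["LLL", "King John", "Tempest", "R&J", "1H4"]
scores = np.array([
    [-2.1, -0.8],
    [-1.9, -0.6],
    [ 1.4,  1.2],
    [ 1.1,  1.0],
    [-0.3,  0.9],
])

# Ward's minimum-variance linkage on the observation matrix.
Z = ward(scores)

# Each row of Z records one merge: (cluster i, cluster j, distance, size).
# The rows run from the most similar pair to the least -- exactly the
# "pairings" table above. scipy.cluster.hierarchy.dendrogram(Z,
# labels=plays) would draw the corresponding tree.
for i, j, dist, size in Z:
    print(int(i), int(j), round(dist, 2), int(size))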

    We learn a couple of things from this analysis. First, using the first two principal components was a good but not terrific way of grouping the plays into genres. If we were to be bowled over by this analysis, we would expect to see less intermixing of different colored items within the clusters and subclusters of the dendrogram. Using more than two components will provide better accuracy, as we will see below. Second, there is a reasonably intuitive path to be travelled from the patterns in the dendrogram to some explanation at the level of the language in the plays. I can, for example, ask why Love’s Labour’s Lost and King John look so similar in this analysis, and find an answer in the components themselves. This is what we did in the previous post, where I looked at a passage that expressed the first and second components in a typical, history play sort of way. Because the scatterplot is a graph and not a map, we need to interpret proximity of items correctly: plays that sit near one another possess, or lack, the two graphed components in similar measure. The same is true of The Tempest and Romeo and Juliet, but in this case, these plays possess a high degree of both principal component 1 and principal component 2: they combine lots of description with lots of first person and interaction.

    Now look at the tradeoff we make when we take advantage of all the components that can be extracted using PCA. Using all seventeen components, we get the following dendrogram, again using Ward’s minimum variance method. (In JMP 8, I am performing Ward’s procedure without standardization on the PCs derived using the covariance matrix, the latter components being identical — as far as I can tell — to those I would have derived in R using prcomp with center = TRUE and scale. = FALSE.) The first dendrogram is color coded by genre, like the one above. The second color codes the plays from lighter to darker, the lighter ones being those composed earlier in Shakespeare’s career (according to the Oxford editors) while the darker ones are from later.

    Dendrogram produced in JMP using all principal components, clustered with Ward
    Dendrogram produced in JMP using all principal components, color coded according to time of composition (Oxford order)

    Use of all the components provides much better separation into intelligible genres: notice how well the histories are clustering at the top of the dendrogram, while we find nice clusters of both tragedies and comedies below. And if we look at the two largest clusters — those linked by the right-most line in the dendrogram — we see that they are broadly separated into earlier and later plays (with the earlier on top, later below).
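    The parenthetical claim above — that the covariance-matrix PCs from JMP should match what prcomp produces with centering but no scaling — can be checked numerically. A minimal Python sketch on invented data (the two routes agree up to the sign of each component):

```python
import numpy as np

# Invented stand-in for the play-by-Cluster counts.
rng = np.random.default_rng(1)
X = rng.random((36, 17))
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the covariance matrix (JMP-style).
cov = np.cov(Xc, rowvar=False)      # 17 x 17 covariance matrix
evals, evecs = np.linalg.eigh(cov)
order = np.argsort(evals)[::-1]     # eigh returns ascending order
evecs = evecs[:, order]

# Route 2: SVD of the centered data, which is what prcomp does
# with center = TRUE, scale. = FALSE.
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)

# The loadings agree up to sign, so the PC scores do too.
same = np.allclose(np.abs(evecs), np.abs(Vt.T), atol=1e-6)
print(same)  # expect True
```

    Since the two sets of loadings are the same (up to sign flips, which do not affect distances), clustering on covariance-matrix PCs in JMP and on prcomp output in R should give identical dendrograms.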

    Nice result, and we’ll see some even nicer ones produced with different techniques in subsequent posts. But how do you work back from these images to a concrete discussion of what puts the plays together? Because we are dealing with seventeen components simultaneously, it is nearly impossible to make this interpretive leap properly. This is the fundamental paradox you encounter when you apply statistics to categorical data derived from texts: the more intensively you make use of mathematics in order to render intelligible groups, the less you know about what qualifies items for membership in those groups. If we were running a search engine company like Google, we wouldn’t worry about explaining why items are grouped together, since our business would only concern producing groupings that make sense to an “end user.” But we do not work for Google — at least, not yet. There will be times when it makes more sense to visualize and explore manageable, critically intelligible levels of complexity instead of seeking the “perfect model” of literary experience.

  • Local Versus Diffused Variation; the Hinman Collator

    Two images of the Hinman Collator and its controls

    Above are two images of the Hinman Collator currently residing in the Memorial Library at the University of Wisconsin, another optical collating device that uses visual comparisons to highlight minute differences between seemingly identical versions of the same text. Hinman used the device in his landmark survey of Shakespeare’s First Folio; it allows the user — here, paper conservator Theresa Smith of Harvard — to “see” differences between two items by merging them in a single image. Areas of difference appear as a kind of grey blur — a more subtle effect, perhaps, than the hovering text that is produced by the Lindstrand Comparator discussed in the previous post. The device also allows you to toggle between the two editions you are looking at, making subtle differences stand out immediately. Prior to creating this device, Hinman had worked in military intelligence comparing pre- and post-bombing aerial photographs: the collator is thus one of several adaptations of military skills and technologies for literary analysis.

    Smith and her collaborator, Daniel Selcer (Duquesne), gave a fascinating paper at Wisconsin two weeks ago which dealt with differences among facsimile editions of Copernicus’ De revolutionibus. Here Smith is examining two facsimiles of De revolutionibus with divergent diagrams of the center of Copernicus’ universe (the sun), which in one edition appears as a circular outline and in another as a solid sphere. The Hinman revealed the differences in the center of the diagram immediately, but it also revealed other areas of discoloration and streaking which suggested that the underlying manuscript (closely guarded in Poland) might not be adequately represented by the facsimile edition. One of the interesting points of Smith and Selcer’s paper was that you must treat facsimiles as artifacts in their own right; they are not always indexical transcriptions of an original, but in certain crucial ways iconic: the technologies used in producing the facsimile introduce artifactual effects that make the facsimile a likeness rather than an absolute trace-copy of the original.

    This device and the one I wrote about in the previous post are technologies for the identification of local variants, which is to say, variants that occur in one place on the page. The search for such variants has been crucial in the history of textual scholarship. Hinman, for example, was able to deduce from variants among surviving First Folios the order in which the formes were printed and their various states of correction. This, in turn, led him to reconstruct an “ideal” Folio which he hypothesized contained the latest or most corrected state of the book. (No single Folio contained all of the corrections, since as Hinman argued, “every copy of the finished book shows a mixture of early and late states of the text that is peculiar to it alone.”) While I was presenting some findings from the work I have been doing with Docuscope at Loyola earlier this month, Peter Shillingsburg made a very important point in connection with this idea: when you are interested in broad patterns, it appears that it does not matter what edition you are working with. This may not seem like a big deal to some readers, but for Shakespeareans, it matters quite a lot which edition of the plays you are using to argue your case. There are substantive differences between different printed editions of the plays, and in some cases, individual words or phrases — for example, Hamlet’s “dram of eale” — can be emended to produce significantly different readings of a particular passage.

    So, would the findings I come up with using Docuscope change significantly if I switched from the Moby Shakespeare (a nineteenth-century edition produced at Cambridge) to the Oxford or Norton? Yes and no. No in the sense that I am interested in a form of variation that is not accounted for in the kind of textual scholarship practiced by Hinman. In looking for genre at the level of the sentence, I am looking for diffused rather than local variation: a kind of patterned deviation from the mean that occurs across the entire body of the text rather than at one crucial intersection. So if Docuscope were able suddenly to read the word “evil” once an editor had emended “eale” in Hamlet’s speech, there would be a slight uptick in one of its counting categories (“Negative Values”). But that uptick would probably not fundamentally alter the patterns being discriminated across the entire corpus of Shakespeare’s works. There are some statistical procedures which could register slight upticks in categories that are not used frequently, however, correlating them with others that are exercised all the time. What if there is a correlation between even slight changes in Shakespeare’s use of “Negative Values” tokens and the much more common “Description” tokens we explored in the histories? What if, in other words, a “dash of x” matters sometimes?

    I think it is important to recognize this category of the “dash” or “pinch” in looking for broader patterns of variation in large populations of texts, because it sits somewhere between “crux” local variants like “eale” and the global variation we see in uses of “the” or concrete descriptive nouns. Because Docuscope is looking at things sub specie aeternitatis, as it were, we cannot say that it matters when such dashwords are used. Time sensitivity in use, immediate context: these are crucial features that help us understand local variants. (And we are quite attracted to local variants, as the history of literary criticism and close reading shows.) Dashwords are different: rare, like an eclipse, but nevertheless part of a globally diffused pattern.

  • Pre-Digital Iteration: The Lindstrand Comparator

    Photo of the Lindstrand Comparator

    I’ve just finished a terrific conference at Loyola organized by Suzanne Gosset on “The Future of Shakespeare’s Text(s).” This photo shows a device, used by one of the conference organizers, Peter Shillingsburg, to perform manual collation of printed editions of texts. There is a long tradition of using optical collators to find and identify differences in printed editions of texts; this one, the Lindstrand Comparator, works on a deviously clever principle. Exploiting a feature of the human visual system, the Lindstrand Comparator allows you to view two different pages simultaneously, with each image fed to only one eye through a series of mirrors. When the brain tries to reconcile the two disparate images — a divergence caused by print differences on the page — the textual differences will levitate on the surface of the page or, conversely, sink into it. What is in actuality a spatial disparity becomes a disparity of depth via contributions of the brain (which is clearly an integral part of the apparatus).

    In this photo, Shakespeare scholar Gabriel Egan compares two variant printed editions of a modern novel. The device is an excellent example of mechanical-optical technology being used to assist in the iterative tasks of scholarship — iterations we now perform with computers. It is also the only technology I know of that lets you see depth in a page, something you cannot do with hypertexts or e-readers. Maybe we should stop writing code for fancy web-pages and start working with wood and mirrors?