Tag: dendrograms

  • Shakespeare Out of Place?

    When Jonathan Hope and I did our initial Docuscope study of over 300 Renaissance plays, we found Shakespeare’s plays clustering together for the most part. One explanation for this clustering was that it was caused by something distinctive in Shakespeare’s writing, and that this authorial signature becomes visible in the same way genre does: at the level of the sentence. Indeed, in our first approach to this larger dataset (one we’d assembled from the Globe Shakespeare and Martin Mueller’s semi-algorithmically modernized TCP plays), we thought that authorship was overriding genre as a source of patterned variance.

    But everything which goes into the dataset also comes out. And in this case, it was editorial difference that was helping to isolate Shakespeare’s plays. When we did a further study of the clusters containing works by Shakespeare, we noticed that their elevated levels of two different LATs that dealt with punctuation – TimeDate and LanguageReference – were an artifact of hand modernization.

    Several contracted items from the Globe/Moby Shakespeare edition, tagged as Language Reference Strings by Docuscope

    The variability in early modern orthography is well known, and we also know that there were many ways of punctuating early modern texts. (In the case of Shakespeare’s plays, we assume that most of the punctuation originated with the compositors who set up the text in the printhouse rather than Shakespeare himself.) But when the Globe editors modernized their sources in the nineteenth century, they consistently applied certain rules of punctuation that skewed Docuscope’s counts when these texts (as a group) were compared with the more varied punctuation to be found in the TCP texts. Sequences that were dealt with consistently in the Globe texts – for example, contractions such as [’tis] or [’twas] or [o’clock] – were being handled much more variously in the original-spelling texts that Martin Mueller was modernizing. (He was only modernizing words in his procedure.)

    So, the punctuation was a tip-off, increasing the chances that Shakespeare’s plays would cluster together.

    We now have the ability to skip or blacklist certain word strings, thanks to a newly updated version of Docuscope created by Suguru Ishizaki. At some point, we will open this can of worms–actually modifying Docuscope’s original tagging protocols–but not yet. There is still more to be learned from the results from an unmodified Docuscope: when we don’t touch the contents of its internal dictionaries, we have the ability to compare results across periods or corpora.

    In this case, we learn that Docuscope is sensitive to human editorial intervention in texts. So sensitive, in fact, that it produced an almost complete clustering of Shakespeare’s plays in the larger group of 320 that we profiled in the online draft of our “Hundredth Psalm” article.

    The large cluster of Shakespeare plays that resulted from our initial comparison of Globe texts with Mueller's semi-algorithmically modernized TCP plays

    Once we realized that this grouping was at least partly artifactual–a product of different editorial procedures applied to our combined corpus–we eliminated the LATs that were registering this difference (TimeDate and LangReference). Of course, by eliminating these, we lost their sorting power on the rest of the corpus, so there was a tradeoff. But we felt that it was not fair to give Docuscope this kind of advantage in sorting text when it was the result of modern editorial intervention. In the future, we might blacklist a word like [’tis] so that we can retain the rest of the category, but I don’t think this is necessary. What really needs to happen is that, in our editorial preparation of texts and corpora, we must ensure that no set of texts is isolated from the others through special editorial preparation. The fact that “anything goes” in the current TCP collection – it is full of various compositorial and printhouse styles and conventions – is probably a good thing. And in any event, we still see authors’ works and genres clustering together even where printers are multiple. Here, now, is one of the new Shakespeare clusters once the editorial “tell” of certain types of punctuation was removed:

    New clustering of Shakespeare's plays with TimeDate and LangRef eliminated from analysis

    Now we see that plays by Munday, Heywood, Marlowe, Shirley, Rowley, Webster, Middleton, and Massinger are showing greater similarities with Shakespeare: the variability of their punctuation is not being used against them. Within the Shakespeare plays that do cluster together, we see some of the same similarities–Coriolanus with Cymbeline, for example. But the terms on which Shakespeare’s plays are related to each other are now more limited–we have eliminated two categories of LATs that may have been sorting Shakespeare’s plays with respect to each other. This relative loss of sorting power within Shakespeare’s works seems tolerable to us, however, because it allows for a more meaningful portrait of Shakespeare’s relationship to other dramatists of the period. What excited us about this large diagram was that it says something about 150 years of early modern drama as a whole, inasmuch as that whole could be represented by over 300 works.
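    As a rough illustration of the elimination step discussed above, the sketch below drops two LAT columns from a small table of per-play frequencies before any clustering is attempted. Only the names TimeDate and LangReference come from our analysis; the other column names, the play rows, and every number are invented for illustration, and this is not how Docuscope itself stores its counts.

```python
# Hypothetical per-play LAT frequencies. All numbers are invented;
# "FirstPerson" and "Description" are placeholder column names.
counts = {
    "Coriolanus": {"TimeDate": 0.9, "LangReference": 1.4, "FirstPerson": 3.1, "Description": 2.2},
    "Cymbeline": {"TimeDate": 1.0, "LangReference": 1.5, "FirstPerson": 3.4, "Description": 2.0},
    "Edward II": {"TimeDate": 0.3, "LangReference": 0.6, "FirstPerson": 3.3, "Description": 2.1},
}

# The two LATs that were registering the editorial difference.
EXCLUDED = {"TimeDate", "LangReference"}


def drop_lats(table, excluded):
    """Return a copy of the table without the excluded LAT columns."""
    return {
        play: {lat: value for lat, value in row.items() if lat not in excluded}
        for play, row in table.items()
    }


filtered = drop_lats(counts, EXCLUDED)

for play, row in filtered.items():
    print(play, row)
```

    Because the function returns a copy, the original table stays available if we later decide to blacklist individual strings instead of whole categories.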

    Here is the entire diagram, then, constructed without the LATs that capture the nineteenth-century modernization of the Shakespearean texts. (Many thanks to Kate Fedewa for helping us create this large image.)

    Revised dendrogram comparing early modern plays from the TCP collection and the Globe Shakespeare
  • Shakespearean Dendrograms

    PCA Scatterplot in R of the First Folio Plays

    Dendrogram produced in JMP on PC1 and PC2 using the covariance matrix and Ward’s method

    Table of pairings from Ward’s clustering on PC1 and PC2

    There is another way to visualize the degrees of similarity or dissimilarity among the items we’ve been analyzing in Shakespeare’s Folio plays. A dendrogram, which looks like a family tree, is a two-dimensional representation of similarities derived from a statistical technique known as hierarchical clustering. I will say something about this technique in a moment, but first, a review of what we have here.

    At the top of this post is the scatterplot we have been working with throughout our analysis of the full texts of the Folio plays. This graph plots the principal components (1 and 2) derived from an analysis of the plays in R using the command “prcomp” (where we have centered but not scaled the data). This analysis took place at the Cluster level: Docuscope can group its most basic token types (the LATs) into seventeen meta-types called Clusters, which we choose because we want fewer variables than observations. Thus, because we have 36 plays in the Folio, we choose to group the items we have counted into seventeen buckets or containers, the Clusters. In previous posts, we tried to explain how the tokens collected in these clusters explain the filiations among the plays that have been discovered by unsupervised methods of statistical analysis, that is, methods requiring no prior knowledge of which genres the plays are “supposed” to fit into.
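    For readers who want to see what “centered but not scaled” amounts to, here is a minimal sketch in plain Python. It is not the prcomp routine itself: the data are four invented plays measured on three invented Clusters, and power iteration on the covariance matrix is a stand-in for the decomposition a statistics package would perform. It recovers only the first principal component.

```python
import math

# Hypothetical Cluster frequencies: one row per play, one column per Cluster.
rows = [
    [5.0, 1.0, 2.0],
    [4.0, 2.0, 2.5],
    [1.0, 5.0, 2.0],
    [2.0, 4.0, 1.5],
]


def center_columns(data):
    """Subtract each column mean; columns are NOT divided by their spread."""
    n = len(data)
    means = [sum(col) / n for col in zip(*data)]
    return [[x - m for x, m in zip(row, means)] for row in data]


def covariance_matrix(centered):
    """Sample covariance (divisor n - 1) of already-centered data."""
    n = len(centered)
    p = len(centered[0])
    return [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def leading_component(cov, iterations=200):
    """Approximate the top eigenvector by power iteration.

    Assumes the all-ones start vector is not orthogonal to that eigenvector.
    """
    v = [1.0] * len(cov)
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


centered = center_columns(rows)
cov = covariance_matrix(centered)
pc1 = leading_component(cov)

# Each play's score on PC1 is the projection of its centered row onto pc1.
pc1_scores = [sum(x * d for x, d in zip(row, pc1)) for row in centered]
print(pc1)
print(pc1_scores)
```

    Because nothing is rescaled, a Cluster with large raw variance pulls the first component toward itself, which is one of the well-known properties of the unscaled procedure.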

    There are all sorts of subtle trends that can be gleaned with the aid of statistics, and we will be exploring some of them in the future. I have begun with trends that are visible without much fancy footwork. Looking at unscaled findings of the first two principal components is a fairly vanilla procedure, statistically speaking, which means that its faults and virtues are well-known. It is a first glance at the nature of linked variation in the corpus. And when we look at this plot from above, we see the characteristic opposition of history and comedy, which employ different and, on the whole, opposed linguistic and rhetorical strategies in telling their stories on stage. But is there a way to quantify the degree to which Shakespeare’s plays are like or unlike one another when they are rated on these principal components?

    The second and third illustrations provide this information. A dendrogram is a visual representation of a statistical process of agglomeration, the process of finding items that are closely related and then pairing them with other items that are also closely related. There are a number of different techniques for performing the agglomeration (they are variations on the beginning of a square dance, where the people who want to dance with each other most pair up first, and the foot draggers are added to the mix as the dance continues), but the one I have used here is Ward’s minimum variance method. In the work I have done so far with the plays, I have found that Ward’s produces groupings most consonant with the genres of Shakespeare’s plays as we understand them critically. In this dendrogram, the different genres of Shakespeare’s Folio plays are color coded: comedies are red, histories are green, tragedies are brown and late plays are blue. The third item, a table, shows the sequence in which items were paired, listing pairs in order from the most similar to the least.
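    The square dance can be sketched in a few lines. The following is a naive illustration of Ward’s criterion, not the JMP routine used for the figures: at each step it merges the two groups whose union adds the least to the total within-group sum of squares. The labels “Play A” through “Play E” and their two-component scores are invented.

```python
from itertools import combinations

# Hypothetical (PC1, PC2) scores for five unnamed plays.
scores = {
    "Play A": (0.0, 0.0),
    "Play B": (1.0, 0.0),
    "Play C": (5.0, 5.0),
    "Play D": (5.0, 6.5),
    "Play E": (10.0, 0.0),
}


def centroid(points):
    """Mean position of a group of points."""
    n = len(points)
    return [sum(coord) / n for coord in zip(*points)]


def ward_cost(a, b):
    """Increase in within-group sum of squares if groups a and b merge."""
    ca, cb = centroid(a), centroid(b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def ward_merges(labelled_points):
    """Return the merges in order as (members_a, members_b, cost)."""
    clusters = {(name,): [pt] for name, pt in labelled_points.items()}
    merges = []
    while len(clusters) > 1:
        # Pick the cheapest pair to merge under Ward's criterion.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((a, b, ward_cost(clusters[a], clusters[b])))
        clusters[a + b] = clusters.pop(a) + clusters.pop(b)
    return merges


merges = ward_merges(scores)
for left, right, cost in merges:
    print(left, "+", right, "cost:", round(cost, 3))
```

    The printed list plays the role of the pairings table: the closest pairs come first, and the foot draggers join late at a much higher cost. This brute-force version re-examines every pair at every step, which is fine for a few dozen plays but would be slow on a large corpus.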

    We learn a couple of things from this analysis. First, using the first two principal components was a good but not terrific way of grouping the plays into genres. If we were to be bowled over by this analysis, we would expect to see less intermixing of different colored items within the clusters and subclusters of the dendrogram. Using more than two components will provide better accuracy, as we will see below. Second, there is a reasonably intuitive path to be travelled from the patterns in the dendrogram to some explanation at the level of the language in the plays. I can, for example, ask why Love’s Labour’s Lost and King John look so similar in this analysis, and find an answer in the components themselves. This is what we did in the previous post, where I looked at a passage that expressed the first and second components in a typical, history play sort of way. Because the scatterplot is a graph and not a map, we need to interpret proximity of items correctly: items that collocate with one another possess and lack the two components being graphed in the same way. The same is true of The Tempest and Romeo and Juliet, but in this case, these plays possess a high degree of both principal component 1 and principal component 2: they combine lots of description with lots of first person and interaction.

    Now look at the tradeoff we make when we take advantage of all the components that can be extracted using PCA. Using all seventeen components, we get the following dendrogram, again using Ward’s minimum variance method. (In JMP 8, I am performing Ward’s procedure without standardization on the PCs derived using the covariance matrix, the latter components being identical, as far as I can tell, to those I would have derived in R using prcomp with center = TRUE and scale. = FALSE.) The first dendrogram is color coded by genre, like the one above. The second color codes the plays from lighter to darker, the lighter ones being those composed earlier in Shakespeare’s career (according to the Oxford editors) while the darker ones are from later.

    Dendrogram produced in JMP using all principal components, clustered with Ward
    Dendrogram produced in JMP using all principal components, color coded according to time of composition (Oxford order)

    Use of all the components provides much better separation into intelligible genres: notice how well the histories are clustering at the top of the dendrogram, while we find nice clusters of both tragedies and comedies below. And if we look at the two largest clusters — those linked by the right-most line in the dendrogram — we see that they are broadly separated into earlier and later plays (with the earlier on top, later below).

    Nice result, and we’ll see some even nicer ones produced with different techniques in subsequent posts. But how do you work back from these images to a concrete discussion of what puts the plays together? Because we are dealing with seventeen components simultaneously, it is nearly impossible to make this properly interpretive leap. This is the fundamental paradox you encounter when you apply statistics to categorical data derived from texts: the more intensively you make use of mathematics in order to render intelligible groups, the less you know about what qualifies items for membership in those groups. If we were running a search engine company like Google, we wouldn’t worry about explaining why items are grouped together, since our business would only concern producing groupings that make sense to an “end user.” But we do not work for Google, at least not yet. There will be times when it makes more sense to visualize and explore manageable, critically intelligible levels of complexity instead of seeking the “perfect model” of literary experience.