I’ve just received word that our piece in Shakespeare Quarterly has gone online via JSTOR. The illustrations for the paper copy of the article are monochrome (with the exception of the cover), while there are color illustrations in the online version. The most complete illustrations, however, including the entire Figure 9 (the “very large dendrogram”), can be found at the post below.
Blog
-
Adding Knobs to the Analysis
A year ago I had a conversation with Miron Livny (UW Computer Science, Morgridge Institute) about the work we’ve been doing with Docuscope, and he asked an interesting question. “Are there any knobs that can be twisted?” he asked. Livny was alluding to the fact that tagging is a static procedure: once you’ve decided what tokens will be classified as a particular type, you will always get the results you are going to get through counting. Findings are determined the instant you decide what to count, since you are only counting these things. But what about any incremental variables or procedures that might allow us to see what happens when there is more or less of something – a more or less that we, rather than the author, control?
One idea was to systematically begin perturbing the dictionaries used by Docuscope, migrating, say, every nth word from one LAT type to the next, and doing this sequentially until one began to find results that were “more” interpretable. This would be computationally quite demanding, and so would push our techniques further in the direction of high-throughput computing. But it might also raise basic questions about the nature of the dictionaries, their susceptibility to random or arbitrary re-disposition, and the sensitivity of our results to such dispositions. One might think of such an experiment as a variation on the Oulipian “N + 7” rule, and there is definitely some connection between the type of computational approach that Hope and I have been calling “iterative criticism” and the exploitation of the arbitrary one finds in Oulipo poetics, or even Burroughsian cut-up and collage.
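The perturbation idea can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the function name `perturb_dictionary`, the cyclic ordering of categories ("the next" LAT is simply the next key in the dictionary), and the tiny word lists are all hypothetical, and none of it reflects Docuscope's actual dictionaries or internals.

```python
from typing import Dict, List


def perturb_dictionary(lat_words: Dict[str, List[str]], n: int) -> Dict[str, List[str]]:
    """Move every nth word of each category into the next category.

    Categories are taken in dictionary order, and the last category wraps
    around to the first (an assumption made for this sketch).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    names = list(lat_words)
    perturbed: Dict[str, List[str]] = {name: [] for name in names}
    for i, name in enumerate(names):
        target = names[(i + 1) % len(names)]
        for position, word in enumerate(lat_words[name], start=1):
            if position % n == 0:
                perturbed[target].append(word)  # every nth word migrates
            else:
                perturbed[name].append(word)    # all other words stay put
    return perturbed


# Hypothetical toy dictionary; real LAT dictionaries are vastly larger.
toy = {
    "FirstPerson": ["I", "me", "my", "mine"],
    "DirectAddress": ["you", "thou", "thee", "sir"],
    "Uncertainty": ["perhaps", "maybe", "perchance", "haply"],
}
shifted = perturb_dictionary(toy, n=2)
```

Turning the knob here means changing `n`: a small `n` scrambles the categories heavily, a large `n` barely touches them, and one would re-run the counts after each setting to see how sensitive the results are.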
Mike Stumpf has found a knob to turn – the amount of a particular character’s lines in a play – and has been turning it, with some interesting results. I’ve posted some questions on the post itself, but I think it is an interesting and provocative extension of some of the techniques we have been exploring.
-
Shakespeare Quarterly 61.3 Figures
The color figures below correspond to those published in black and white in the print edition of Jonathan Hope and Michael Witmore, “‘The Hundredth Psalm to the Tune of “Green Sleeves”’: Digital Approaches to Shakespeare’s Language of Genre,” Shakespeare Quarterly 61.3 (fall 2010), Special Issue: New Media Approaches to Shakespeare, edited by Katherine Rowe. These figures can also be found at the University of Wisconsin-Madison Memorial Library server.
Figure 1 (above): A total of 776 pieces of Shakespeare’s plays from the First Folio, each piece consisting of 1,000 words, rated on two scaled PCs (1 and 4). The cumulative proportion of variation accounted for by the first four principal components is 12.33 percent, with component 1 accounting for 3.83 percent and component 4 accounting for 2.35 percent.
Figure 2 (above): Loadings biplot for scaled PCs 1 and 4 used to create the scatterplot in Figure 1.
Figure 3 (above): Docuscope screenshot of exemplary Comic strings from Twelfth Night, 3.1. “SelfDisclosure” and “FirstPerson” strings appear in red; “Uncertainty” in orange; “DenyDisclaim” in cyan; “DirectAddress” in blue; “LanguageReference” in violet.
Figure 4 (above): A total of 767 1,000-word pieces of the Folio plays rated on scaled PCs 1 and 4. This image is the same as Figure 1, except that all the plays are displayed as red dots with the exception of Othello, which is displayed as blue dots and collects mostly in the upper-right-hand quadrant where the Comedies tend to cluster.
Figure 5 (above): Docuscope screenshot illustrating exemplary Comic strings from Othello, 3.3. “SelfDisclosure” and “FirstPerson” strings appear in red; “Uncertainty” in orange; “DenyDisclaim” in cyan; “DirectAddress” in blue.
Figure 6 (above): Docuscope screenshot illustrating exemplary Comic strings from Othello, 4.2. “SelfDisclosure” and “FirstPerson” strings appear in red; “Uncertainty” in orange; “DenyDisclaim” in cyan; “DirectAddress” in blue; “LanguageReference” in violet.
Figure 7 (above): Docuscope screenshot illustrating exemplary History strings from Richard II, 1.3. “CommonplaceAuthority” strings appear in lime green; “Inclusiveness” in olive; “SenseProperties,” “Sense Objects,” and “Motion” in yellow.
Figure 8 (above): Docuscope screenshot illustrating exemplary History strings from Romeo and Juliet, 1.1. “CommonplaceAuthority” strings appear in lime green; “Inclusiveness” in olive; “SenseProperties,” “Sense Objects,” and “Motion” in yellow.
Figure 9 (above): Dendrogram produced by Ward’s clustering method on scaled data using ninety-eight LATs to profile 320 plays written between 1519 and 1659. Individual items are colored according to genre, with the exception of plays written by Shakespeare, which all appear in yellow. To examine the diagram, click on it. We are grateful to Martin Mueller at Northwestern University for providing us with the modernized versions of these texts from the TCP collection.
The authors would like to thank the University of Wisconsin Memorial Library for its assistance in hosting these illustrations on a permanent basis. Thanks also go to Kate Fedewa for her assistance in preparing Figure 9.
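For readers curious about what "Ward’s clustering method on scaled data" involves, here is a minimal pure-Python sketch. It is not the software or pipeline used for Figure 9; the function names and the four invented "plays" with two made-up LAT frequencies are assumptions for illustration only. Each column is first scaled to mean 0 and standard deviation 1, and then the two clusters whose merger adds the least within-cluster variance are joined, repeatedly, until one cluster remains.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def standardize(rows: List[List[float]]) -> List[List[float]]:
    """Scale each column to mean 0 and (population) standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def ward_merges(rows: List[List[float]], labels: List[str]) -> List[Tuple[List[str], List[str], float]]:
    """Return the sequence of merges chosen by Ward's criterion.

    The cost of merging clusters A and B is
    |A||B| / (|A| + |B|) * squared distance between their centroids,
    which equals the increase in within-cluster sum of squares.
    """
    clusters = [([label], list(row)) for label, row in zip(labels, rows)]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                (a_members, a_centroid), (b_members, b_centroid) = clusters[i], clusters[j]
                na, nb = len(a_members), len(b_members)
                squared = sum((x - y) ** 2 for x, y in zip(a_centroid, b_centroid))
                cost = na * nb / (na + nb) * squared
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        (a_members, a_centroid), (b_members, b_centroid) = clusters[i], clusters[j]
        na, nb = len(a_members), len(b_members)
        # The centroid of the merged cluster is the size-weighted mean.
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(a_centroid, b_centroid)]
        merges.append((a_members, b_members, cost))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((a_members + b_members, centroid))
    return merges


# Invented numbers: two "LAT frequencies" for four hypothetical plays.
labels = ["Comedy A", "Comedy B", "History A", "History B"]
rows = [[5.0, 1.0], [5.2, 1.1], [1.0, 4.0], [1.2, 4.3]]
merges = ward_merges(standardize(rows), labels)
```

In this toy case the two comedies merge first and the two histories second, with the two pairs joined only at the last and costliest step; a dendrogram is a drawing of exactly this merge sequence, with branch height given by the cost.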
-
Shakespeare Out of Place?
When Jonathan Hope and I did our initial Docuscope study of over 300 Renaissance plays, we found Shakespeare’s plays clustering together for the most part. One explanation for this clustering was that it was caused by something distinctive in Shakespeare’s writing, and that this authorial signature becomes visible in the same way genre does—at the level of the sentence. Indeed, in our first approach to this larger dataset (one we’d assembled from the Globe Shakespeare and Martin Mueller’s semi-algorithmically modernized TCP plays), we thought that authorship was overriding genre as source of patterned variance.
But everything which goes into the dataset also comes out. And in this case, it was editorial difference that was helping to isolate Shakespeare’s plays. When we did a further study of the clusters containing works by Shakespeare, we noticed that their elevated levels of two different LATs that dealt with punctuation – TimeDate and LanguageReference – were an artifact of hand modernization.
Several contracted items from the Globe/Moby Shakespeare edition, tagged as Language Reference Strings by Docuscope
The variability in early modern orthography is well known, and we also know that there were many ways of punctuating early modern texts. (In the case of Shakespeare’s plays, we assume that most of the punctuation originated with the compositors who set up the text in the printhouse rather than Shakespeare himself.) But when the Globe editors modernized their sources in the nineteenth century, they consistently applied certain rules of punctuation that skewed Docuscope’s counts when these texts (as a group) were compared with the more varied punctuation to be found in the TCP texts. Sequences that were dealt with consistently in the Globe texts – for example, contractions such as [’tis] or [’twas] or [o’clock] – were being handled much more variously in the original-spelling texts that Martin Mueller was modernizing. (He was only modernizing words in his procedure.)
So, the punctuation was a tip-off, increasing the chances that Shakespeare’s plays would cluster together.
We now have the ability to skip or blacklist certain word strings, thanks to a newly updated version of Docuscope created by Suguru Ishizaki. At some point, we will open this can of worms–actually modifying Docuscope’s original tagging protocols–but not yet. There is still more to be learned from the results from an unmodified Docuscope: when we don’t touch the contents of its internal dictionaries, we have the ability to compare results across periods or corpora.
In this case, we learn that Docuscope is sensitive to human editorial intervention in texts. So sensitive, in fact, that it produced an almost complete clustering of Shakespeare’s plays in the larger group of 320 that we profiled in the online draft of our “Hundredth Psalm” article.
Once we realized that this grouping was at least partly artifactual–a product of different editorial procedures applied to our combined corpus–we eliminated the LATs that were registering this difference (TimeDate and LangReference). Of course, by eliminating these, we lost their sorting power on the rest of the corpus, so there was a tradeoff. But we felt that it was not fair to give Docuscope this kind of advantage in sorting text when it was the result of modern editorial intervention. In the future, we might blacklist a word like [’tis] so that we can retain the rest of the category, but I don’t think this is necessary. What really needs to happen is that, in our editorial preparation of texts and corpora, we must ensure that no set of texts is isolated from the others through special editorial preparation. The fact that “anything goes” in the current TCP collection – it is full of various compositorial and printhouse styles and conventions – is probably a good thing. And in any event, we still see authors’ works and genres clustering together even where printers are multiple. Here, now, is one of the new Shakespeare clusters once the editorial “tell” of certain types of punctuation was removed:
New clustering of Shakespeare's plays with TimeDate and LangRef eliminated from analysis
Now we see that plays by Munday, Heywood, Marlowe, Shirley, Rowley, Webster, Middleton, and Massinger are showing greater similarities with Shakespeare: the variability of their punctuation is not being used against them. Within the Shakespeare plays that do cluster together, we see some of the same similarities–Coriolanus with Cymbeline, for example. But the terms on which Shakespeare’s plays are related to each other are now more limited–we have eliminated two categories of LATs that may have been sorting Shakespeare’s plays with respect to each other. This relative loss of sorting power within Shakespeare’s works seems tolerable to us, however, because it allows for a more meaningful portrait of Shakespeare’s relationship to other dramatists of the period. What excited us about this large diagram was that it says something about 150 years of early modern drama as a whole, inasmuch as that whole could be represented by over 300 works.
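The effect of removing an editorially sensitive category can be illustrated with a small Python sketch. Everything in it is assumed for the sake of illustration: the helper names `drop_lats` and `distance`, the two invented play profiles, and their made-up frequencies. The point is only that two texts which look far apart because of editorially driven categories can look close once those categories are left out of the comparison.

```python
from math import sqrt
from typing import Dict, Set

Profile = Dict[str, float]


def drop_lats(profiles: Dict[str, Profile], excluded: Set[str]) -> Dict[str, Profile]:
    """Return copies of the profiles without the excluded LAT categories."""
    return {
        play: {lat: value for lat, value in counts.items() if lat not in excluded}
        for play, counts in profiles.items()
    }


def distance(a: Profile, b: Profile) -> float:
    """Euclidean distance over the categories the two profiles share."""
    return sqrt(sum((a[lat] - b[lat]) ** 2 for lat in a if lat in b))


# Invented frequencies: the two plays differ mainly in the categories
# that respond to how contractions and punctuation were modernized.
profiles = {
    "Globe play": {"FirstPerson": 3.0, "DirectAddress": 2.0, "LangReference": 4.0, "TimeDate": 2.5},
    "TCP play": {"FirstPerson": 3.1, "DirectAddress": 2.2, "LangReference": 1.0, "TimeDate": 0.5},
}

before = distance(profiles["Globe play"], profiles["TCP play"])
trimmed = drop_lats(profiles, {"LangReference", "TimeDate"})
after = distance(trimmed["Globe play"], trimmed["TCP play"])
```

The tradeoff described above is visible here too: dropping the two categories removes the editorial "tell," but it also discards whatever genuine sorting power those categories had for every other pair of texts.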
Here is the entire diagram, then, constructed without the LATs that capture the nineteenth-century modernization of the Shakespearean texts. (Many thanks to Kate Fedewa for helping us create this large image.)
Revised dendrogram comparing early modern plays from the TCP collection and the Globe Shakespeare (click on image in new screen to zoom)
-
Crowdsourced Peer Review in NY Times
The Times this morning did a piece on the Shakespeare Quarterly New Media issue that Jonathan Hope and I participated in. We received some terrific feedback, mostly from Shakespeareans, on the article that was posted to Media Commons–feedback that helped us rewrite the essay for the print edition, which will be appearing this fall. There was also a piece on the process by Jennifer Howard in the Chronicle of Higher Education, itself the topic of an opinion piece in the Chronicle’s Brainstorm section.
The idea of open peer review in the humanities raises basic questions about the “specialized” nature of our knowledge in the humanities. Could simply anyone weigh in on a debate about a particular text and its interpretation? Wouldn’t that, in principle, be a good thing? I assume that knowledge in the humanities is in principle available to all. But it is also clearly specialized. The word “allegory,” for example, has a deep history and set of contextual meanings that you just couldn’t pick up from a good dictionary. Our research does expand what is known about certain literatures, cultures and writers, and in this sense, we look like a science that aims to extend the range of objects that are understood. We also refine our terms of art and build communities around these terms (e.g., différance, queering, hegemony, subaltern, hybridity, racialization). One could learn to throw these terms around, as Alan Sokal did in his famous hoax and as graduate students do every day in their seminars, but a good critic or editor should be able to say whether or not the writer really understands the terms. (This is where the editors of Social Text failed.) Perhaps if the paper Sokal submitted to Social Text had been vetted through crowdsourced open peer review, the article would have been rejected. In any event, the hoax itself provides an interesting limit case with which to evaluate the promise of open peer review: a writer acting in bad faith, either as author of the article or peer reviewer.
One last thought: the trajectory of learning in the humanities is intensive rather than cumulative. This is what differentiates us from, say, molecular biology, where you must learn certain things first (organic chemistry, cell physiology) in order to understand other things later (gene transcription). Within the humanities, acquiring expertise might mean re-orienting our approach to existing works rather than expanding the range of objects that can be known, although the latter is always possible. But the underlying assumption – that in the humanities one can make qualitative advances in knowledge that do not necessarily fit into a progressive sequence – makes any comparison between the humanities and sciences difficult.