To what extent can modern text analysis software handle historical data? This is a perennial question for those working with digitized historical texts who want to know how cutting-edge technologies can facilitate the analysis of such texts. No doubt the best way to answer it is to test the software on two versions of the same text, one of which is older than the other and noticeably different from it.
Enter the Folio and Globe editions of Shakespeare’s plays. The latter was published in 1867 and contains modernized spelling throughout, whereas the former was published in 1623 and maintains the original spelling of Shakespeare’s Early Modern English. Using DocuScope for text analysis and JMP for statistical visualizations, I created the following dendrogram:
The texts highlighted in red are from the Folio edition, whereas the texts highlighted in blue come from the Globe edition. One would expect each of Shakespeare’s Folio plays to cluster with its Globe counterpart here. Much Ado About Nothing is Much Ado About Nothing, after all, regardless of which edition it appears in. But for the most part, this neat pairing off is not what happens: instead, most of the Folio plays group with other Folio plays, and the same is true for the Globe plays. Only a few plays, at the top of the dendrogram, actually group with themselves. Methinks we have a problem.
Upon closer inspection, I found that 13,667 items were tagged by DocuScope in the Globe edition of Much Ado, but only 11,382 items were tagged in the Folio edition of the same play: 16.7% fewer. An inspection of eleven other Shakespeare plays gives an overall mean difference of 17.8%, a gap far too large to count as accurate tagging.
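For anyone who wants to check the arithmetic, here is a minimal sketch of the calculation. It assumes the Globe count serves as the baseline, which is the reading that reproduces the 16.7% figure; the function name `tagging_gap` is my own and is not part of DocuScope or JMP.

```python
def tagging_gap(modern_count: int, original_count: int) -> float:
    """Percentage of items tagged in the modernized text that go
    untagged in the original-spelling text (modernized text as baseline)."""
    return 100.0 * (modern_count - original_count) / modern_count


# Counts reported above for Much Ado About Nothing (Globe vs. Folio).
gap = tagging_gap(13667, 11382)
print(f"{gap:.1f}%")  # 16.7%
```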
But why the disparity? Maybe a closer look at DocuScope can give us an idea.
First the Folio version of the opening scene in Much Ado About Nothing (with the “Interior Thought” and “Public Values” clusters turned on):
And the Globe version of the same scene with the same clusters turned on:
One need not read far to discover what’s (not) being tagged in the older, Folio edition of Much Ado: Learne versus learn. It appears the orthographic rendering of the unstressed final –e is causing DocuScope to overlook this word altogether. We find the same mistake later on with indeede/indeed, kindnesse/kindness, helpe/help, and kinde/kind. Another common problem is the Early Modern use of u where modern orthography has v: deseru’d vs. deserved, seruice vs. service, and ouerflow vs. overflow. There are also a few punctuation issues causing problems: the use of the apostrophe (as we see in deseru’d) and the use of | (con | flict vs. conflict), which probably results from some sort of scanning or other computer error. In other plays, the hyphen also turned out to be a possible reason for DocuScope overlooking certain items (ouer-charg’d vs. overcharged).
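To give a sense of how few rules these error types amount to, here is a rough sketch of a spelling normalizer that could run before tagging. It is not how DocuScope works, and everything in it is an assumption of mine: the rule list, the names `RULES`, `candidates`, and `modernize`, and the tiny `MODERN_WORDS` set, which merely stands in for a full modern word list. Each rule is tried optionally, and a rewritten form is accepted only if it appears in the word list, so that words like quiet are not mangled by the u-to-v rule.

```python
import re

# Tiny stand-in for a modern word list (hypothetical); a real run would
# load a full lexicon instead.
MODERN_WORDS = {
    "learn", "indeed", "kindness", "help", "kind",
    "deserved", "service", "overflow", "conflict", "overcharged",
}

# One rewrite rule per error type described above.
RULES = [
    lambda w: w.replace("-", ""),                         # ouer-charg'd: drop hyphen
    lambda w: re.sub(r"[’']d$", "ed", w),                 # deseru'd: expand elided -ed
    lambda w: re.sub(r"(?<=[a-z])u(?=[aeiou])", "v", w),  # seruice: medial u before a vowel
    lambda w: w[:-1] if w.endswith("e") else w,           # learne: drop final -e
]


def candidates(token: str) -> set[str]:
    """Return the lowercased token plus every form reachable by applying
    any subset of RULES in order. Stray '|' marks are always removed."""
    base = re.sub(r"\s*\|\s*", "", token.lower())
    forms = {base}
    for rule in RULES:
        forms |= {rule(w) for w in forms}
    return forms


def modernize(token: str, lexicon: set[str] = MODERN_WORDS) -> str:
    """Return a lowercased modern spelling if one is found in the lexicon;
    otherwise return the token unchanged."""
    plain = token.lower()
    if plain in lexicon:
        return plain
    matches = sorted(candidates(token) & lexicon)
    return matches[0] if matches else token


for word in ["Learne", "deseru’d", "ouer-charg’d", "con | flict"]:
    print(word, "->", modernize(word))
```

The lexicon check is the important design choice here: the rules on their own overgenerate badly, so they only propose spellings and the word list decides.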
Although the overall number of DocuScope omissions on a Folio play is rather large, the actual number of error types is quite small. This gives us hope that, with a bit of modification, it may well be possible to train DocuScope to read non-modern(ized) texts.