To what extent is modern text analysis software capable of dealing with historical data? This is a perennial question for those working with digitized historical texts who want to know how far cutting-edge technologies can facilitate their analysis. Perhaps the best way to answer it is to test the software on two versions of the same text, where one version is markedly older than, and orthographically different from, the other.
Enter the Folio and Globe editions of Shakespeare’s plays. The Folio was published in 1623 and preserves the original spelling of Shakespeare’s Early Modern English, whereas the Globe was published in 1867 with modernized spelling throughout. Using DocuScope for text analysis and JMP for statistical visualization, I created the following dendrogram:
The texts highlighted in red are from the Folio edition, whereas the texts highlighted in blue come from the Globe edition. One would expect each of Shakespeare’s Folio plays to cluster with its Globe counterpart here. Much Ado About Nothing is Much Ado About Nothing, after all, regardless of which edition it appears in. But for the most part, this neat pairing off is not what happens: most of the Folio plays group with other Folio plays, and the same is true for the Globe plays. Only a few plays at the top of the dendrogram actually pair off with themselves. Methinks we have a problem.
Upon closer inspection, I found that DocuScope tagged 13,667 items in the Globe edition of Much Ado but only 11,382 items in the Folio edition of the same play: 16.7% fewer. An inspection of eleven other Shakespeare plays yields an overall mean difference of 17.8%: far too large a gap to consider the tagging accurate.
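For anyone who wants to check the arithmetic, here is the calculation, taking the larger Globe count as the baseline:

```python
globe_tags = 13667   # items tagged in the Globe edition of Much Ado
folio_tags = 11382   # items tagged in the Folio edition of the same play

# Relative difference, with the larger Globe count as the baseline
diff = (globe_tags - folio_tags) / globe_tags
print(f"{diff:.1%}")  # -> 16.7%
```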
But why the disparity? Maybe a closer look at DocuScope can give us an idea.
First, the Folio version of the opening scene of Much Ado About Nothing (with the “Interior Thought” and “Public Values” clusters turned on):
And the Globe version of the same scene with the same clusters turned on:
One need not read far to discover what’s (not) being tagged in the older, Folio edition of Much Ado: Learne versus learn. It appears the orthographic rendering of the unstressed final -e is causing DocuScope to overlook this word altogether. We find the same mistake later on with indeede/indeed, kindnesse/kindness, helpe/help, and kinde/kind. Another common problem is the Early Modern use of u where modern orthography has v: deseru’d vs. deserved, seruice vs. service, and ouerflow vs. overflow. A few punctuation issues cause problems as well: the apostrophe (as we see in deseru’d) and the pipe character | (con | flict vs. conflict), which probably results from some sort of scanning or other computer error. In other plays, the hyphen also appeared to cause DocuScope to overlook certain items (ouer-charg’d vs. overcharged).
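To make these error types concrete, here is a minimal sketch of the kind of rule-based cleanup that would catch the examples above. To be clear, this is my own illustration, not anything DocuScope does internally; the tiny word list is a stand-in for a real modern lexicon, and rules like the u/v swap overgenerate on words like languish, which is exactly why the dictionary-based approaches discussed in the comments below are safer:

```python
import re

# Tiny stand-in lexicon; a real pipeline would need a full modern word list.
MODERN_WORDS = {"learn", "indeed", "kindness", "help", "kind", "deserved",
                "service", "overflow", "overcharged", "conflict"}

def normalize(word: str) -> str:
    """Illustrative cleanup for the error types named above; not exhaustive."""
    w = word.lower()
    w = w.replace("|", "")                      # scanning artifact: con|flict
    w = w.replace("-", "")                      # ouer-charg'd -> ouercharg'd
    w = re.sub(r"'d$", "ed", w)                 # deseru'd -> deserued
    w = re.sub(r"(?<!q)u(?=[aeiou])", "v", w)   # seruice -> service, ouer -> over
    # Drop an unstressed final -e only when that yields a known modern word.
    if w not in MODERN_WORDS and w.endswith("e") and w[:-1] in MODERN_WORDS:
        w = w[:-1]                              # learne -> learn
    return w

for em in ["Learne", "indeede", "kindnesse", "deseru'd",
           "seruice", "ouerflow", "con|flict", "ouer-charg'd"]:
    print(em, "->", normalize(em))
```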
Although the overall number of DocuScope omissions on a Folio play is rather large, the actual number of error types is quite small. This gives us hope that, with a bit of modification, it may well be possible to train DocuScope to read non-modern(ized) texts.
3 Comments
Hi Jason, I believe the current DocuScope dictionary is almost entirely based on modern spelling, and since DocuScope doesn’t attempt any NLP parsing, relying instead on n-gram matching, spelling variations will not be recognized. But rather than updating the DocuScope dictionary with EM spellings, a better way might be to modernize the texts before running them through DocuScope. We already have a great resource in the Monk dictionary files, which contain more than 350,000 regularized spellings. It should be pretty easy to modernize the Folio version with this dictionary. If you want to send me a couple of the Folio files you’re using, I can quickly write a script to modernize them, and then we’d be able to compare how much DocuScope’s accuracy goes up.
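To give a sense of what I have in mind, a script along these lines would probably suffice. I’m assuming here that the Monk data can be exported as a two-column tab-separated file of variant and regularized spellings (the actual file format may differ), and the tokenization is deliberately crude:

```python
import re
import sys

def load_mappings(path):
    """Load variant -> modern spellings from a two-column TSV (assumed format)."""
    mappings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip("\n").split("\t")
            if len(parts) == 2:
                mappings[parts[0].lower()] = parts[1].lower()
    return mappings

def modernize(text, mappings):
    """Replace each word with its regularized form, leaving punctuation alone."""
    def repl(match):
        word = match.group(0)
        modern = mappings.get(word.lower(), word)
        # Crudely preserve a leading capital; fine for a first comparison.
        return modern.capitalize() if word[:1].isupper() else modern
    return re.sub(r"[A-Za-z']+", repl, text)

if __name__ == "__main__":
    mappings = load_mappings(sys.argv[1])   # e.g. monk_spellings.tsv (hypothetical)
    with open(sys.argv[2], encoding="utf-8") as f:
        print(modernize(f.read(), mappings))
```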
This is exactly what we’ve done, asking Martin Mueller to lemmatize and then semi-algorithmically modernize our texts before tagging them with DocuScope. This modernization process gives us roughly the coverage, in terms of tagging, that we get with a modern corpus. So, in terms of transcription, it may be best to produce *original spelling* versions of early modern texts (as in the TCP archive) and then “push” them forward for tagging, with an intermediate modernization step. Given that the results of this process are good enough, I’m not sure it would be worthwhile to re-edit all of the EM texts we have from scratch.
Hello Jason & Mike,
I also found a similar problem when running an EEBO transcription of Shakespeare’s First Folio straight through DocuScope. However, in May I used Alistair Baron’s VARD II program and got surprisingly similar results to Martin Mueller’s prepared texts. More information about the process can be found in my post here:
http://allistrue.org/2011/11/16/presume-not-that-i-am-the-thing-i-was/
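For anyone who wants to put a number on “surprisingly similar,” a rough token-level comparison of two modernized versions takes only a few lines; the file names below are just placeholders:

```python
import difflib

def tokens(path):
    with open(path, encoding="utf-8") as f:
        return f.read().lower().split()

# Hypothetical file names standing in for the two modernized versions.
vard = tokens("much_ado_vard.txt")
mueller = tokens("much_ado_mueller.txt")

ratio = difflib.SequenceMatcher(None, vard, mueller).ratio()
print(f"Token-level similarity: {ratio:.1%}")
```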
But I think training DocuScope to modernize these texts in addition to tagging them might be taking too active a stance, particularly with preexisting software available. By taking a collective approach to these texts, rather than coding the dictionary ourselves, I feel we can avoid interposing our own biases and ultimately arrive at a more solid understanding of the literature. In the end, I think it comes down to personal preference which programs or tactics to use, but a set of standards for these cases might be something to think about as DocuScope’s user base grows.
Also, do you feel that semi-algorithmic modernization encroaches on the traditional work of editors, for whom rendering a word one way rather than another can affect the interpretation of an entire play?
All the best, Mike