Edward III, Shakespearean Trigrams, and Trillin’s Derivatives

A bustling day for Shakespeare scholars and for those who follow computer-assisted work in the humanities. The Times reported yesterday that Sir Brian Vickers has used a plagiarism detection software package to demonstrate that Shakespeare wrote Edward III with Thomas Kyd. Edward III was published anonymously in 1596 and has, since the eighteenth century, been associated with Shakespeare on the basis of stylistic and stylometric analyses. I haven’t seen any substantive writeup of Vickers’ conclusions, so I don’t want to pass judgment on the results. But the description we have of his proof so far is intriguing. Apparently the program, developed at the law school at Maastricht, produces a concordance of all three-word sequences (trigrams) in a target text and then looks for these trigrams in other documents. According to Vickers, some of these trigrams are quite common and so shared by many language users. They represent common grammatical runs. But other trigrams are unique to a single writer and so are, in his words, a type of “fingerprint.” These author-generated trigrams (as opposed to those generated by grammatical constraints) tend to be “metaphors or unusual parts of speech.” Edward III, he claims, is seen to contain the author-fingerprints of both Shakespeare and Thomas Kyd when the play is compared on these terms with other writings by these authors published before 1596. The article goes on to explain that usually only around 20 grammar-based trigrams (the common ones) are shared among texts, but that 200 unique Shakespearean trigrams appeared in Edward III (in about 40% of the scenes), and “around 200” unique Kydian trigrams appeared in the rest of the play.
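
To make the procedure concrete, here is how I imagine the basic step might look in Python. This is only a rough sketch under my own assumptions, not the Maastricht program: the crude tokenizer, the function names (`tokenize`, `trigrams`, `shared_trigrams`) and the two toy sentences are all invented for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out word tokens (a crude assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def trigrams(text):
    """Count every three-word sequence in the text (a simple concordance)."""
    words = tokenize(text)
    return Counter(zip(words, words[1:], words[2:]))

def shared_trigrams(target, reference):
    """Return the trigrams of the target text that also occur in the reference."""
    return set(trigrams(target)) & set(trigrams(reference))

# Invented toy sentences, not quotations from any play.
target = "the king rides out at dawn with all his men"
reference = "at dawn with all the court the king rides out again"

for gram in sorted(shared_trigrams(target, reference)):
    print(" ".join(gram))
```

Note that nothing in this step distinguishes a "grammatical" trigram from an "authorial" one; it only reports overlap.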

This analysis prompted two thoughts. First, I was skeptical, since my work with Docuscope — which uses a mix of grammatical and semantic categories to class populations of texts — has not indicated that the program has any ability to “see” authorship in the texts it analyses. This may be due to the nature of Docuscope: it was designed to track genres, so its phenomenological categories (which span the range of semantic, grammatical and rhetorical effects) are not particularly good for fingerprinting. But let’s say that Vickers is right, and that something like authorship can be described in terms of unique three-word sequences. Exactly what kind of linguistic independence — and surely some independence from grammatical and genre constraints is presumed here — do authors possess? According to this study, it would be the ability to sequence short runs of language in a unique way, an ability that is confirmed by the fact that those three-word sequences which are duplicated in the works of other writers are few and merely grammatical. My first question, then, is this: since the machine itself is not tasked with sifting what is merely grammatical from what is metaphoric and so authorial (and remember, some of the unique passages are “unusual parts of speech”), on what basis do we discount certain shared runs as merely grammatical and therefore non-authorial?

Now I know the obvious reason why Vickers sees authorship at work here: there are only 10 to 20 trigrams shared between (a) the known pre-1596 works of Shakespeare, (b) the target text (Edward III), and (c) the known pre-1596 works of Kyd, whereas there are 200 that belong to Shakespeare and certain sub-portions of the target text exclusively (a and b). The much higher count of these exclusive trigrams relative to the shared ones (roughly 200 to 20) suggests Shakespeare is the author of those portions of the play with the abundance of non-Kydian trigrams. But it is worth thinking about the difference between the two types of shared trigrams, since the (interpretive) characterization of this difference supports an underlying theory of authorship which is itself worth debating. In the world of contemporary literary criticism, the author can now just as easily be described as an institution, effect or persona as he or she can as an empirical person. So do we now have empirical proof that the author as person (rather than institution, persona or effect) shows up on the page in the form of unique combinations of metaphorical words? Sounds like the old, romantic fashioner of images to me, although we have no comparative data on the number of times particular writers use unique trigrams and so exercise their combinatory, poetic imagination. Is Shakespeare more inventive (and so authorial) than, say, Kyd, but less than Jonson? We’ll have to wait and see.
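
The three-way comparison of (a), (b) and (c) amounts to a few set operations. The sketch below is my own illustration, assuming each body of work has already been reduced to a set of trigrams; the function name `partition_trigrams` and the toy trigrams are invented, and the sketch says nothing about how Vickers actually tallied his figures.

```python
def partition_trigrams(target, author_a, author_b):
    """Split the target's trigrams by which reference corpus also contains them.

    Every argument is a set of trigrams (tuples of three words).
    """
    in_a = target & author_a
    in_b = target & author_b
    return {
        "common": in_a & in_b,      # found in all three: the "grammatical" runs
        "only_a": in_a - author_b,  # shared with author A alone
        "only_b": in_b - author_a,  # shared with author B alone
    }

# Invented toy trigrams, not drawn from any real play.
target_text = {
    ("of", "the", "king"),
    ("my", "gracious", "lord"),
    ("iron", "hearted", "night"),
    ("wolfish", "winter", "sea"),
}
author_a_works = {
    ("of", "the", "king"),
    ("my", "gracious", "lord"),
    ("iron", "hearted", "night"),
}
author_b_works = {("of", "the", "king"), ("wolfish", "winter", "sea")}

result = partition_trigrams(target_text, author_a_works, author_b_works)
for label, grams in result.items():
    print(label, len(grams))
```

The labels "common" and "only_a" are bookkeeping, not interpretation: a trigram lands in "common" because both reference sets happen to contain it, not because anyone has shown it to be merely grammatical. A claim of uniqueness would also need a far larger comparison corpus than two authors.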

My second thought is prompted by an editorial by Calvin Trillin in the NYTimes this morning. Trillin tells a story about a man he met in a bar in Manhattan who explained the financial meltdown to him. The man, a late middle-aged Ivy Leaguer who has done well for himself in life, says that Wall Street melted down because “the smart guys had started working on Wall Street.” (I stumbled over the pluperfect in this sentence, but as a Trillin fan, I can only study and learn.) He tells Trillin that in the old days, the really smart kids went on to prestige jobs as judges or professors, where their minds were exercised but they didn’t make a lot of money. The lower third of the class, whose academic performance was undistinguished, went on to careers in finance, where they made oodles of money and could afford homes in Greenwich. As the cost of an elite education grew, however, the really smart ones decided they ought to go make a pile of money before doing something more rewarding, paying off their college loans on their way out the door. “That’s when you started reading about these geniuses from M.I.T. and Caltech who instead of going to graduate school in physics went to Wall Street to calculate arbitrage odds,” the man at the bar says. The problem, he goes on to explain, is that these high-flyers were quants or math geniuses and, in addition to knowing how to manipulate a data set, realized fairly quickly that they could make even more money than the “lower third” had been making in their Wall Street careers. It didn’t take long for the quants to invent derivatives and credit default swaps: from here on out, risk would be “quantified” in ways that no one could really understand, and Wall Street executives would now be free to pursue profits and risks with reckless abandon.

I have often wondered if the application of statistics to the study of literature is a bit like the creation of literary “derivatives.” There are millions and millions of patterns in texts (perhaps as many as there are neurons in the brain) which could be grouped into bundles and used to class texts according to this or that taxonomy. (“I’d like to bundle Shakespeare and Jonson futures today, please.”) The pitfalls in this kind of analysis are the same as those facing Wall Street investors who look for new ways to describe (and then commoditize) the phenomena they trade in. You can attach meaning to patterns in the data, but there is no guarantee that the pattern you are looking at represents something that you really understand. That is why I have tried to work with known quantities, such as the decisions of these people to call these texts comedies, and then link them to mechanisms or rhetorical effects that I recognize (retrospective narration, two-person dialogue, seduction). I say this because there are a lot of ways, millions of them, to measure similarities between texts. This recent attribution of co-authorship by Vickers is ultimately based on a measure of similarity across passages in known and unknown works. But one always has to ask, similar in what respect?

Do we know that individuals use three-word sequences in ways that are so unique that they cannot be imitated, thus ensuring that patterns among these sequences are an unconscious signature of authorship? A similar theory has been advanced about painters, suggesting that individual artists can be identified by acts that they do not consciously attend to, such as the sketching of ears and hands. Carlo Ginzburg has written interestingly about this, and there is a fascinating discussion of a similar technique (the “courtesan method”) in Orhan Pamuk’s novel, My Name Is Red. But of course, there might be other ways of measuring similarities and differences (other things to count) that would not pick out these two authors as decisively. My point is that the presumption that trigrams are the things that need to be counted, as opposed to the wealth of other countable things, is based on interpretation rather than observation. Such a choice implies a theory of authorial behavior that should itself be tested over a broader range of texts, suggesting as it does that authorship can be “conclusively” measured and assessed as a behavior (i.e., it is using language in unique ways) rather than as a feeling in the reader. And it implies a distinction between discountable overlaps in such behaviors (because they are grammatical) and significant ones (because they are poetic or metaphoric), a distinction that should itself be explored further. For contrast, I would point out the work of Matt Jockers at Stanford, who has shown that combinations of extremely common words can provide clues about an author’s identity. Vickers’ author-tracking trigrams are unique, and it is their uniqueness that gives them value, whereas Jockers’ “most frequent word” analyses (of words such as I, me, of, the, etc.) assume that it is the common things that really “take” the imprint of the author’s individual nature.
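
The contrast between the two approaches is easy to see in code. What follows is a generic sketch of a "most frequent word" comparison, not a reconstruction of Jockers' method: the four-word vocabulary is just the list of examples above (a real study would use many more words), and the function names and the choice of a summed absolute difference as the distance are my own assumptions.

```python
import re
from collections import Counter

# Only the example words mentioned above; a real word list would be far longer.
FUNCTION_WORDS = ["i", "me", "of", "the"]

def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words)
    return [counts[w] / total if total else 0.0 for w in vocabulary]

def profile_distance(p, q):
    """Sum of absolute differences between two profiles (an assumed measure)."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Invented toy sentences, not quotations.
sample_a = "the cat of the house"
sample_b = "give me the key and i will open the door of the house"

print(function_word_profile(sample_a))
print(profile_distance(function_word_profile(sample_a),
                       function_word_profile(sample_b)))
```

Here every text gets a score on every word, so nothing is discarded as "merely grammatical"; the common words carry all of the signal.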

All of this is good, I think, for the statistical study of literature, and for the study of literature more generally, since it forces us to ask basic questions about what authors do. Who knows, maybe Vickers is right. The more interesting question is, “Why?”



  1. admin
    Posted October 15, 2009 at 9:22 am

    Martin Mueller posted an excellent analysis of the diffusion of unique trigrams in early modern plays in August, which provides a statistical context for the claims Vickers is making. He zeroes in on the crucial issue right away: to what degree is uniqueness statistically significant when we are talking about three-word sequences in a body of works. I recommend it highly.

  2. Dennis Purcell
    Posted October 16, 2009 at 11:46 am

    Of course the magic of the “three” in trigram study is just for statistical “meatiness”: two-word sequences are too common, and “quadgrams” too sparse (and you lose too much at the edges). And of course such a “fixed aperture” is very easy to compute.

    Question: have you seen any studies that use variable-length sequences, or any discussion of what a “phrase” usually consists of? (philosophically or linguistically or between different languages)

  3. admin
    Posted October 16, 2009 at 12:30 pm

    Actually, Docuscope parses and tags sequences from 2 to 9 words in length, but only the ones that David Kaufer deemed worth counting because they made sense to him as functional rhetorical units. Docuscope does not attempt to mix any of these n-gram methods with its semantic/rhetorical/grammatical analyses.

    Martin Mueller over at DATA probably knows something about phrase lengths in different languages. I will ask him about it when I see him in Chicago in a few weeks. I have been dealing with a different “aperture” problem — and aperture is a great metaphor — namely, what is the ideal “length” of passage to analyze in a work?

    We chunk the Shakespeare plays into subsections of varying lengths — I’m working on a post about 776 pieces of Shakespeare now — but to really get what constitutes an exemplary “passage” you would have to decide how long a passage is for the purposes of comparison.
