Author: Michael Witmore

  • Edward III, Shakespearean Trigrams, and Trillin’s Derivatives

    A bustling day for Shakespeare scholars, and those who follow computer-assisted work in the humanities. The Times reported yesterday that Sir Brian Vickers has used a plagiarism detection software package to demonstrate that Shakespeare wrote Edward III with Thomas Kyd. Edward III was published anonymously in 1596 and has, since the eighteenth century, been associated with Shakespeare on the basis of stylistic and stylometric analyses. I haven’t seen any substantive writeup of Vickers’ conclusions, so I don’t want to pass judgment on the results. But the description we have of his proof so far is intriguing. Apparently the program, developed at the law school at Maastricht, produces a concordance of all three-word sequences (trigrams) in a target text and then looks for these trigrams in other documents. According to Vickers, some of these trigrams are quite common and so shared by many language users. They represent common grammatical runs. But other trigrams are unique to a single writer and so are, in his words, a type of “fingerprint.” These author-generated trigrams (as opposed to those generated by grammatical constraints) tend to be “metaphors or unusual parts of speech.” Edward III, he claims, is seen to contain the author-fingerprints of both Shakespeare and Thomas Kyd when the play is compared on these terms with other writings by these authors published before 1596. The article goes on to explain that it’s usually around 20 grammar-based trigrams that are shared among texts (the common ones), but that 200 unique Shakespearean trigrams appeared in Edward III (in about 40% of the scenes), and “around 200” unique Kydian trigrams appeared in the rest of the play.
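Vickers’ actual software is not public, but the core move the article describes — collecting every three-word sequence in a target text and checking which candidate author shares them exclusively — is easy to sketch. The miniature “corpora” below are toy quotations standing in for the real pre-1596 canons; this is an illustration of the idea, not the Maastricht program:

```python
def trigrams(text):
    """All lowercase word trigrams of a text, as a set."""
    words = text.lower().split()
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}

# Toy stand-ins for the pre-1596 Shakespeare and Kyd corpora.
shakespeare = trigrams("to be or not to be that is the question")
kyd = trigrams("what outcries pluck me from my naked bed")
target = trigrams("to be or not to be pluck me from my naked bed")

# Trigrams the target shares with exactly one candidate author;
# Vickers' "fingerprints" would be drawn from sets like these.
shak_only = (target & shakespeare) - kyd
kyd_only = (target & kyd) - shakespeare
print(len(shak_only), len(kyd_only))
```

Even this toy version makes the open question visible: the set operations find shared strings, but nothing in them distinguishes a grammatical run from a metaphorical one.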

    This analysis prompted two thoughts. First, I was skeptical, since my work with Docuscope — which uses a mix of grammatical and semantic categories to class populations of texts — has not indicated that the program has any ability to “see” authorship in the texts it analyses. This may be due to the nature of Docuscope: it was designed to track genres, so its phenomenological categories (which span the range of semantic, grammatical and rhetorical effects) are not particularly good for fingerprinting. But let’s say that Vickers is right, and that something like authorship can be described in terms of unique three-word sequences. Exactly what kind of linguistic independence — and surely some independence from grammatical and genre constraints is presumed here — do authors possess? According to this study, it would be the ability to sequence short runs of language in a unique way, an ability that is confirmed by the fact that those three-word sequences which are duplicated in the works of other writers are few and merely grammatical. My first question, then, is this: since the machine itself is not tasked with sifting what is merely grammatical from what is metaphoric and so authorial (and remember, some of the unique passages are “unusual parts of speech”), on what basis do we discount certain shared runs as merely grammatical and therefore non-authorial?

    Now I know the obvious reason why Vickers sees authorship at work here — there are only 10 to 20 trigrams shared between (a) the known pre-1596 works of Shakespeare, (b) the target text (Edward III), and (c) the known pre-1596 works of Kyd, whereas there are 200 that belong to Shakespeare and certain sub-portions of the target text exclusively (a and b). The comparatively higher frequency of unshared trigrams to shared trigrams (200/20) suggests Shakespeare is the author of those portions of the play with the abundance of non-Kydian trigrams. But it is worth thinking about the difference between the two types of shared trigrams, since the (interpretive) characterization of this difference supports an underlying theory of authorship which is itself worth debating. In the world of contemporary literary criticism, the author can now just as easily be described as an institution, effect or persona as he or she can an empirical person. So do we now have empirical proof that the author as person (rather than institution, persona or effect) shows up on the page in the form of unique combinations of metaphorical words? Sounds like the old, romantic fashioner of images to me, although we have no comparative data on the number of times particular writers use unique trigrams and so exercise their combinatory, poetic imagination. Is Shakespeare more inventive (and so authorial) than say Kyd, but less than Jonson? We’ll have to wait and see.

    My second thought is prompted by an editorial by Calvin Trillin in the NYTimes this morning. Trillin tells a story about a man he met in a bar in Manhattan who explained the financial meltdown to him. The man, a late middle-aged Ivy Leaguer who has done well for himself in life, says that Wall Street melted down because “the smart guys had started working on Wall Street.” (I stumbled over the pluperfect in this sentence, but as a Trillin fan, I can only study and learn.) He tells Trillin that in the old days, the really smart kids went on to prestige jobs as judges and professors, where their minds were exercised but they didn’t make a lot of money. The lower third of the class, whose academic performance was undistinguished, went on to careers in finance, where they made oodles of money and could afford homes in Greenwich. As the cost of an elite education grew, however, the really smart ones decided they ought to go make a pile of money before doing something more rewarding, paying off their college loans on their way out the door. “That’s when you started reading about these geniuses from M.I.T. and Caltech who instead of going to graduate school in physics went to Wall Street to calculate arbitrage odds,” the man at the bar says. The problem, he goes on to explain, is that these high-flyers were quants or math geniuses and — in addition to knowing how to manipulate a data set — realized fairly quickly that they could make even more money than the “lower third” had been doing in their Wall Street careers. It didn’t take long for the quants to invent derivatives and credit default swaps: from here on out, risk would be “quantified” in ways that no one could really understand, and Wall Street executives would now be free to pursue profits and risks with reckless abandon.

    I have often wondered if the application of statistics to the study of literature is a bit like the creation of literary “derivatives.” There are millions and millions of patterns in texts — perhaps as many as there are neurons in the brain — which could be grouped into bundles and used to class texts according to this or that taxonomy. (“I’d like to bundle Shakespeare and Jonson futures today, please.”) The pitfalls in this kind of analysis are the same as those facing Wall Street investors who look for new ways to describe (and then commoditize) the phenomena they trade in. You can attach meaning to patterns in the data, but there is no guarantee that the pattern you are looking at represents something that you really understand. That is why I have tried to work with known quantities — the decisions of these people to call these texts comedies, for example — and then link them to mechanisms or rhetorical effects that I recognize (retrospective narration, two-person dialogue, seduction). I say this because there are a lot of ways, millions of them, to measure similarities between texts. This recent attribution of co-authorship by Vickers is ultimately based on a measure of similarity across passages in known and unknown works. But one always has to ask, similar in what respect?

    Do we know that individuals use three-word sequences in ways that are so unique that they cannot be imitated, thus ensuring that patterns among these sequences are an unconscious signature of authorship? A similar theory has been advanced about painters, suggesting that individual artists can be identified by acts that they do not consciously attend to, such as the sketching of ears and hands. Carlo Ginzburg has written interestingly about this, and there is a fascinating discussion of a similar technique (the “courtesan method”) in Orhan Pamuk’s novel, My Name Is Red. But of course, there might be other ways of measuring similarities and differences — other things to count — that would not pick out these two authors as decisively. My point is that the presumption that these are the things that need to be counted — trigrams — as opposed to the wealth of other countable things is based on interpretation rather than observation. Such a choice implies a theory of authorial behavior that should itself be tested over a broader range of texts, suggesting as it does that authorship can be “conclusively” measured and assessed as a behavior (i.e., using language in unique ways) rather than as a feeling in the reader. And it implies a distinction between discountable overlaps in such behaviors (because they are grammatical) as opposed to significant ones (because they are poetic or metaphoric) that should itself be explored further. For contrast, I would point out the work of Matt Jockers at Stanford, who has shown that combinations of extremely common words can provide clues about an author’s identity. Vickers’ author-tracking trigrams are unique, and it is their uniqueness that gives them value, whereas Jockers’ “most frequent word” analyses (of words such as I, me, of, the, etc.) assume that it is the common things that really “take” the imprint of the author’s individual nature.
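The most-frequent-word approach is, in spirit, the inverse of the trigram approach: instead of hunting for unique strings, it profiles each text by the relative frequencies of very common words and compares the profiles. The sketch below is a crude stand-in, not Jockers’ actual method — the eight-word function-word list and the simple Manhattan distance (a rough gesture toward measures like Burrows’s Delta) are both illustrative assumptions:

```python
from collections import Counter

# An illustrative function-word list; real studies use hundreds of words.
FUNCTION_WORDS = ["i", "me", "of", "the", "and", "to", "a", "in"]

def mfw_profile(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    return [counts[w] / len(words) for w in FUNCTION_WORDS]

def profile_distance(p, q):
    # Manhattan distance between two frequency profiles: a crude
    # stand-in for the distance measures used in real stylometry.
    return sum(abs(a - b) for a, b in zip(p, q))
```

The wager here is exactly the opposite of Vickers’: the authorial imprint sits not in rare combinations but in the unconscious rhythm of the commonest words.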

    All of this is good, I think, for the statistical study of literature, and for the study of literature more generally, since it forces us to ask basic questions about what authors do. Who knows, maybe Vickers is right. The more interesting question is, “Why?”

  • Rhythm Quants: Burial, Click Tracks, Genre Tempo

    Graham has posted a new video by one of my favorite artists over at Object Oriented Philosophy. Burial is a London DJ whose work often gets filed under the label “dubstep,” a variety of post-house electronica that appeared several years ago. I like dubstep a lot, and this video actually captures something of its unsteady, city-worn appeal:

    One of the greatest things about Burial is that his beats are asymmetrical. That is, in a world where you can loop beats in such a way that the “ictus” (the ideal musical point where the beat falls) is evenly distributed across the entire snippet, Burial’s beats sway a bit from tempo and then rejoin when the loop starts over again. I tend to hear this because I am a drummer, and was trained to play in the 1980s, just when drum machines were becoming more common in live performance and studio recording. Those of us who learned to play in this period were forced to synch our bodies (and eventually, minds) to a mathematically precise representation of the ictus — one produced by a machine — so that our own playing would match up with that of others who were similarly keyed into this “reference beat.” Most often, that reference beat would be calling the changes in synthesizer parts (which were electronically triggered by that reference): so the whole band, or the band in the recording studio, would ideally be vibrating to the same periodic oscillation, one that never changed unless the beat frequency was altered by the programmer or producer.

    But of course, drumming is more fluid than this kind of matching to the mathematical ictus. Most dance music — music that people actually dance to — has subtle movements ahead of and behind the beat. This occurs in part to create musical tension, but also to whip dancers around in the right way. (Our bodies may exhibit symmetry, but our dance steps do not.) The most extreme versions of this kind of dance-wobble that I have witnessed, although not directly related to drumming, occur in European music. Hearing an orchestra play Strauss in Vienna, I was initiated into something that the Viennese take for granted: Strauss rushes the 1-2 in the 1-2-3 of waltz tempo, which means that you get a one-two…three, one-two….three in which the second beat does not evenly divide the first from the third. Hungarian and Romanian folk music has some of this as well. I remember being at a dancehouse in Budapest in the eighties, hearing a Roma folk band play, and being amazed at the quick surges and ritards in the tempo, occurring at every measure. This variation, I was told, helped the dancers whip each other around so that their bodies could lean at the appropriate moment: a really beautiful idea, since it suggests that the music itself was conforming to the movements and weightings of the dance — even at the level of tempo.

    If you look at the beginning of the Burial video, you can see the idea of symmetry taken apart on the screen, as the diagonals display action in a kind of dance-box. Movements and pans in and out of the paired boxes do not occur at the same speed, which means that you get the same kind of staggered synchrony that often occurs in Burial’s musical beats, but here it occurs visually.

    I suspect that a good studio engineer could actually quantify the ways in which Burial’s beats redistribute the ictus on a measure by measure basis, something that was once done by drummers who were not playing to a “click” or mechanically measured metronome, but perhaps more intuitively and communally. That’s not to say that Burial has recaptured the “fluid” nature of the beat or that the electronic metronome killed the beat (and that Burial is bringing it back). It’s not that simple. Rather, drummers have always had a good sense of what the “ictus” is and have manipulated it implicitly by speeding up and slowing down before the beginnings and endings of measures. In a pre-click track world — listen, for example, to some of the beats by The Meters — you wouldn’t necessarily notice the manipulations, because the world had not yet learned to “hear” the absent click, which happens once music everywhere is keyed to an inaudible metrical yardstick. I would say that this was the case by the early nineties. But once this implicit beat becomes part of the music — part of the bodies and ears of drummers and listeners alike — the tempo pushes and pulls are audible as deliberate. The drummers Manu Katché and Omar Hakim have made an art form out of this over the last two decades. I’m sure both of them can play to a click track (or not) in their sleep.
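The quantification I have in mind is simple in principle: given the onset times of the recorded hits and the nominal tempo, measure how far each hit falls from the nearest point on the implied click-track grid. A minimal sketch — the onset times below are invented, a drummer pushing ahead of one beat and dragging behind another:

```python
def ictus_deviation(onsets_ms, bpm):
    """Deviation (ms) of each recorded hit from the nearest point
    on the ideal click-track grid implied by the tempo."""
    period = 60000.0 / bpm  # milliseconds between clicks
    return [t - round(t / period) * period for t in onsets_ms]

# A hypothetical bar of hits at 120 bpm (500 ms grid): the drummer
# pushes beat 2 by 20 ms and drags beat 4 by 30 ms.
print(ictus_deviation([0, 480, 1000, 1530], 120))
```

A drummer locked to the click would produce a list of near-zeros; Burial-style sway would show up as a pattern of signed deviations that repeats with the loop.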

    The point here is that human beings are exquisitely sensitive to quantitative phenomena like rhythm, and they can also have their background perceptions of what “proper rhythm” is shaped by the music they encounter. There is a backbeat or hidden track to music that is cultural, but that is confirmed or shifted with each performance. I suspect genre works in the same way — as a set of constantly shaped expectations — and that in some cases tempo has been keyed to certain arbitrary or regular standards in order to create particular effects. Serialization might be one version of this (something my colleague Susan Bernstein and I are working on), or the partitioning of plot around commercials.

  • Keeping the Game in Your Head: David Ortiz

    I’m not a huge baseball fan, but I did grow up in the suburbs of Boston and so like the Red Sox. Over the weekend I saw a story in the Times about David Ortiz, who went from being a fabulous home run hitter to someone who couldn’t really connect with the ball and so lost his place at the top of the Red Sox batting order. Baseball is now loaded with information, as anyone who has followed the career of Nate Silver will know. (Silver established his reputation as a baseball statistician but then went on to predict congressional and presidential elections at fivethirtyeight.com.) Apparently Ortiz was drawn into the game of studying his own performance “by the numbers,” and eventually it got to his game. Only when he decided to play for the “fun” of it did his hitting power return. As a story about a player’s encounter with statistics, this one has four parts: talented hitter does well; talented hitter attempts to improve performance with statistics (reported in the Times here); talented hitter suffers from overthinking his game; talented hitter learns to play the game again by forgetting about the numbers.

    Perhaps this story is useful for thinking about the nature of statistically assisted reading. I’m not saying that using statistics to explore textual patterns drains the joy out of reading: it doesn’t, because the statistical re-description of texts is not reading in the sense that you or I would practice it. But I have had interesting experiences reading texts after I have learned something about the underlying linguistic patterns that they express. For example, when I learned that Shakespeare’s late plays contain a linguistic structure in the form of “, which” [comma, which] that distinguished them from all other Shakespeare plays, I really started to pay attention to these in my reading. I wouldn’t say that this detracted from my ability to read the text; rather it drew my attention to something else that was going on. But I also noticed that it was nearly impossible to pay attention to the linguistic patterns and to experience the meaning of that pattern at the same time. That is, I could either notice linguistic features of a play (presence of pronouns, concrete nouns, verbs in past tense, etc.) and ask why they were being used in a particular scene, or I could float along with the spoken line, feeling different ideas or emotions eddy and build as the speaker developed an image or theme. But I couldn’t do both.
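The “, which” pattern itself is trivially countable; the only care required is normalizing by text length so that plays of different sizes can be compared. A minimal sketch (the sample sentence is mine, not Shakespeare’s):

```python
import re

def comma_which_rate(text, per=1000):
    """Occurrences of ', which' per `per` words of running text."""
    hits = len(re.findall(r",\s+which\b", text, flags=re.IGNORECASE))
    return hits * per / len(text.split())

sample = "The crown, which he seized, was heavy, which he knew."
print(comma_which_rate(sample))
```

The counting is the easy part; the hard part, as the paragraph above suggests, is holding the count and the felt effect of the construction in mind at the same time.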

    Why should there be this “Ortiz effect” in reading? Is there some kind of fundamental scarcity of attention that forbids one’s reading as a (statistically assisted) linguist and as “any reader whatever” at the same time? I’m interested in this division, but skeptical of the idea — advanced in the article about Ortiz’ return to greatness — that you can forget what you know and “just do it.” The Times article says that Ortiz became a better hitter when he learned simply to “play…as if he were a boy.” But reading is never this simple: you can’t completely forget what you know, even if you learned it through the apparently foreign procedures of statistical analysis. Perhaps you can read “as if” you didn’t know it, and then re-engage that knowledge to examine how the linguistic patterns produce the effects you’ve just experienced? My point here is that readers who are assisted by statistics must simultaneously be both versions of Ortiz described in the different articles: both the hitter and the thinker. It would be a mistake to think that “natural” reading is accomplished in a state of child-like absorption in the game, since even children are brimming with strategies and inferences. I am glad to know certain things about Shakespeare that I couldn’t have known without the assistance of statistics — like the fact that the Histories are full of concrete description and short on first and second person pronouns. This doesn’t interfere with my game (I hope), but shows me that the game can be played on another, as yet unknown, verbal plane.

  • Four-Syllable Rock n’ Roll

    Certain things can be counted without a parsing device, for example four-syllable words in rock n’ roll songs. I have often wondered why there are so many one-syllable words in rock songs, and have a pet theory for this. Rock lyrics favor Anglo-Saxon words rather than Latinate words — the former have a more direct, less fussy sound — and since Latinate words tend to be multi-syllabic compounds, words of more than three syllables tend to be very rare in rock music. Why exactly the monosyllable is appropriate to rock is something I cannot explain, although it may be related to another pattern I have observed: countries that underwent the Protestant Reformation seem to be the most adept at producing (not necessarily consuming) rock music, particularly heavy metal. Perhaps there is a connection here between Northern European linguistic practices (and the persistence of Anglo-Saxon forms) and the predisposition to religious violence in the sixteenth and seventeenth centuries, one that prepares these countries for immersion in a subsequent musical form like rock n’ roll.
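Counting syllables exactly requires a pronunciation dictionary, but a vowel-group heuristic is close enough to flush out the rare four-syllable words in a lyric. A rough sketch, nothing more:

```python
import re

def syllables(word):
    """Rough syllable count: runs of vowels, minus a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1 and not word.endswith("le"):
        count -= 1  # 'chrome' has one syllable, not two
    return max(count, 1)

lyric = "I can't get no satisfaction"
long_words = [w for w in lyric.split() if syllables(w) >= 4]
print(long_words)
```

Run over a pile of lyrics, a counter like this would let the pet theory above be tested rather than merely asserted.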

    In any event, I’d like to know what the longest Latinate word is that has been successfully used in a rock song. My candidate (based on popularity, not length) would be “satisfaction,” as in, “I can’t get no satisfaction.”

  • Texts as Objects II: Object Oriented Philosophy. And Criticism?

    A text as it is represented statistically, in object form
    In the previous post I laid out several questions about the nature of texts, objects and interpretation that arise when we subject texts — for example, the Folio plays of Shakespeare — to statistical analysis. Above is a sketch of two texts, T1 and T2 (forgive the hand-drawn visuals), that exist as documents we might read. This is our point of contact as scholars, and we know where to take it from here. But for machine analysis, these texts are transformed into objects — relational, formalized mathematical entities — which means that they are containers of containers of things. So let’s think this way about texts for a moment.
    T1 and T2 are both texts of 1000 words in length. We can think of these texts as a set of tokens drawn from a larger set of tokens that represents the totality of English words at a given moment. (Such a totality is an abstraction, just as Saussure’s langue was an abstraction; let’s leave that aside for now.) Now a mathematically minded critic might say the following: Table 1 is a topologically flat representation of all possible words in English, arrayed in a two-dimensional matrix. The text T1 is a vector through that table, a needle that carries the “thread” through various squares on the surface, like someone embroidering a quilt. One possible way of describing the text, then, would be to chart its movement through this space, like a series of stitches.
    Generalizations about the syntax and meaning of that continuously threading line would be generalizations about two things: the sequence of stitches and the significance of different regions in the underlying quilt matrix. I have arranged the words alphabetically in this table, which means that a “stitch history” of movements around the table would not be very revealing. But the table could be rendered in many other ways (it could be rendered three- or multi-dimensionally, for example). What if I put all of the verbs in the lower left-hand corner (southwest) of the table and all of the pronouns in the upper right (northeast)? Based on this act of spatial classification, you could then come up with statements like: “I see many threads passing between the northeast and southwest,” a meaningless descriptive statement unless you add: “this is because verbs are here and pronouns are there, and they tend to follow one another in written and spoken English.” So this spatializing approach to textual analysis would require three things: (1) arrangement of the matrix in a meaningful way; (2) description of the movement through the matrix; and (3) analysis of patterns in that movement. Based on (1) you might have something interesting to say about (3), and as the note says, a text is a “vector through a hypothetical Table” and “a theory of rhetoric, grammar, semantics is an attempt to rationalize this vector — as sequence — by regrouping the words in the table by region.” In effect, any mathematical or container-based analysis of a text must ultimately be some kind of mapping of a vector-space (semantic, ideological, grammatical, generic, etc).
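The stitching metaphor can be made concrete in a few lines: lay the vocabulary out on a square grid and record the sequence of cells a text visits. This toy sketch uses an alphabetical layout, which — as noted above — makes the path itself unrevealing; the interest would come from regrouping the grid by word class:

```python
import math

def word_grid(vocab):
    """Lay a vocabulary out on a square grid, one word per cell
    (alphabetical order, i.e. the 'unrevealing' arrangement)."""
    side = math.ceil(math.sqrt(len(vocab)))
    return {w: divmod(i, side) for i, w in enumerate(sorted(vocab))}

def stitch_path(text):
    """The text as a 'thread': the sequence of grid cells its words visit."""
    words = text.lower().split()
    grid = word_grid(set(words))
    return [grid[w] for w in words]

print(stitch_path("the quick fox sees the slow fox"))
```

Steps (2) and (3) above would then be analyses of this coordinate sequence; step (1) is the interpretive act of deciding what the regions of the grid mean.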
    Now, Docuscope is itself a built form of this type of container-based analysis, one that eliminates the temporal dimension of “stitching” described above by transforming the hypothetical table into buckets or classes of words and then decanting the text into those buckets. Instead of regional movement, we get inclusion or exclusion of words (strings) from classes of words. The architecture of the classes matters, of course, since only if that architecture is good will we find patterns that we recognize and understand, understanding being the ultimate goal here. (It is also possible to simply look for correlated patterns among documents that might allow someone to find an entire class of objects based on a few tokens they already know (a very small “class”), as Google does; but finding is not criticism.) So what is a text in the eyes of Docuscope, or, for that matter, any device that tags documents? One answer is that the text “is” the items circled above M1 and M2: words or sequences of words that have been classed into buckets. At the level of M1 and M2, the text becomes a set of local subsets, each of which contains a number of tokens. Statistical analysis of this partitioned object yields quantitative relations — R1, R2 and R3 — which differentiate one text from another.
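The decanting operation is just classification plus counting. The word classes below are hypothetical toys, nothing like Docuscope’s actual hand-built categories, but they show the shape of the transformation: a text goes in, a vector of class proportions comes out:

```python
from collections import Counter

# Toy word classes; stand-ins for Docuscope's real categories.
CLASSES = {
    "first_person": {"i", "me", "my", "we", "our"},
    "past_tense": {"was", "were", "had", "did"},
    "concrete": {"sword", "crown", "blood", "stone"},
}

def decant(text):
    """Pour a text's tokens into class buckets; report each class's
    share of the text's total word count."""
    words = text.lower().split()
    counts = Counter()
    for w in words:
        for name, members in CLASSES.items():
            if w in members:
                counts[name] += 1
    return {name: counts[name] / len(words) for name in CLASSES}
```

Everything the downstream statistics can “see” of the text is this vector; the sequence of the words, the stitching, has been poured away.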
    Now for the philosophical question, the one where object oriented philosophy might be useful: when asked to describe the nature of the statistical entity undergoing analysis here (the data object rendered by Docuscope and then explored within R), do we say that it is simply the local contents (M1, M2) of the containers (T1 and T2)? If I begin by saying that the being of this object is, rather, the structure of these elements in their containers — a better answer, I think — then I probably mean that T1 and T2 are really the sum of all relations that can be posited (R1, R2, R3) among rendered elements (M1, M2). This rather Leibnizian sounding answer suggests that a text’s existence is ultimately differential: it is the sum of that object’s relations with all other objects. The statistical analysis of texts would be the quantitative description of this totality of relations given a set of classes — classes that we, as humanists, want to debate because they may be the source of any meaning in the result (because a certain kind of meaning or “purpose in pattern” is distributed into the classes).
    But here is where I think Harman adds something crucial. If the argument he has been developing in Tool Being, Prince of Networks and elsewhere is correct, then an object of this or any other kind would not be the sum of its relations with other objects, as is the case in Latour’s analysis. To this relational model, Harman opposes the metaphysical integrity of the object over and beyond its relations, an integrity which holds that object together in its “domestic” being over and above its relational “alliances.” In Prince of Networks, he writes:
    I hold that there is an absolute distinction between the domestic relations a thing needs to some extent in order to exist [see above, M1, M2] and the external alliances that it does not need [above, R1, R2, R3]. But the actor itself [i.e., object of analysis] cannot be identified with either. An object cannot be exhausted by a set of alliances. But neither is it exhausted by a summary of its pieces, since any genuine object will be an emergent reality over and above its components, oversimplifying those components and able to withstand a certain degree of turbulent change in them. (135)
    What I find fascinating and important about Harman’s idea here is that he is providing a rationale for (1) accommodating the kind of container analysis I have outlined above while (2) arguing that this type of analysis is not the end of the story. Now, Harman and the Speculative Realists have been reluctant to discuss what constitutes a text and how language might itself be an object, a reluctance that stems — understandably, I think — from fatigue with the post-Heideggerian “language is everything” trend in Continental philosophy and cultural studies. But language is definitely something, and it is as real as anything else I can think of. So too are our encounters (in the theater, the library, the cinema) with things like genre, style, ideology and pleasure.
    Object oriented philosophy should have something to say about texts, since they too provide a particularly good example of why the purely relational criterion for an object’s identity (whether it is a text, a word, a thought, feeling, or piece of wood) is insufficient. As literary critics and theorists, we may have something to add to Harman’s account of the inexhaustibility of an object’s relations and its emergent reality over and above its components. In fact, this is what many of us have been arguing is wrong about the kinds of reductive claims that can be made about texts on the grounds that they yield statistical regularities.
    What does it mean for the reality of an object to “simplify” its “components”? Perhaps the process that Harman refers to as simplification is what we as literary critics refer to as interpretation: the contingent coming into being of a portion of an object’s reality — here, a text — through that object’s interrelation with other objects and the subtractive unveiling of its inexhaustible contents. (Whitehead describes this as the process of “objectification.”) Harman would argue that such emergent realities don’t just take hold between texts and readers, but between sunlight and plant leaves or fire and cotton. All objects can be oversimplified, all of them can survive (and resist) some degree of turbulent change.
    If objects are really this universal, then the process of “pattern recognition” that I describe as object oriented criticism is really something more involved than the collating of sets and relations among sets. Clearly, if a text is understood as a container of relations, then statistics can model the complexity of that object and its relations — even the immense complexity of a textual object. But that model, like the map of relations above, will always be just an approximation. As Harman insists, the inner reality of the object — itself alluring with the promise of something more — is never fully available, whether that object is a piece of wood or a piece of writing. As literary critics, I think we can find plenty to work with when objects are defined in this way.