Month: September 2009

  • Keeping the Game in Your Head: David Ortiz

    I’m not a huge baseball fan, but I did grow up in the suburbs of Boston and so like the Red Sox. Over the weekend I saw a story in the Times about David Ortiz, who went from being a fabulous home run hitter to someone who couldn’t really connect with the ball and so lost his place at the top of the Red Sox batting order. Baseball is now loaded with information, as anyone who has followed the career of Nate Silver will be aware. (Silver established his reputation as a baseball statistician but then went on to predict congressional and presidential elections at fivethirtyeight.com.) Apparently Ortiz was drawn into the game of studying his own performance “by the numbers,” and eventually it got to his game. Only when he decided to play for the “fun” of it did his hitting power return. As a story about a player’s encounter with statistics, this one has four parts: talented hitter does well; talented hitter attempts to improve performance with statistics (reported in the Times here); talented hitter suffers from overthinking his game; talented hitter learns to play the game again by forgetting about the numbers.

    Perhaps this story is useful for thinking about the nature of statistically assisted reading. I’m not saying that using statistics to explore textual patterns drains the joy out of reading: it doesn’t, because the statistical re-description of texts is not reading in the sense that you or I would practice it. But I have had interesting experiences reading texts after I have learned something about the underlying linguistic patterns that they express. For example, when I learned that Shakespeare’s late plays contain a linguistic structure in the form of “, which” [comma, which] that distinguished them from all other Shakespeare plays, I really started to pay attention to these constructions in my reading. I wouldn’t say that this detracted from my ability to read the text; rather it drew my attention to something else that was going on. But I also noticed that it was nearly impossible to pay attention to the linguistic patterns and to experience the meaning of that pattern at the same time. That is, I could either notice linguistic features of a play (presence of pronouns, concrete nouns, verbs in past tense, etc.) and ask why they were being used in a particular scene, or I could float along with the spoken line, feeling different ideas or emotions eddy and build as the speaker developed an image or theme. But I couldn’t do both.

    Why should there be this “Ortiz effect” in reading? Is there some kind of fundamental scarcity of attention that forbids one’s reading as a (statistically assisted) linguist and as “any reader whatever” at the same time? I’m interested in this division, but skeptical of the idea — advanced in the article about Ortiz’s return to greatness — that you can forget what you know and “just do it.” The Times article says that Ortiz became a better hitter when he learned simply to “play…as if he were a boy.” But reading is never this simple: you can’t completely forget what you know, even if you learned it through the apparently foreign procedures of statistical analysis. Perhaps you can read “as if” you didn’t know it, and then re-engage that knowledge to examine how the linguistic patterns produce the effects you’ve just experienced? My point here is that readers who are assisted by statistics must simultaneously be both versions of Ortiz described in the different articles: both the hitter and the thinker. It would be a mistake to think that “natural” reading is accomplished in a state of child-like absorption in the game, since even children are brimming with strategies and inferences. I am glad to know certain things about Shakespeare that I couldn’t have known without the assistance of statistics — like the fact that the Histories are full of concrete description and a lack of first and second person pronouns. This doesn’t interfere with my game (I hope), but shows me that the game can be played on another, as yet unknown, verbal plane.

  • Four-Syllable Rock n’ Roll

    Certain things can be counted without a parsing device, for example four-syllable words in rock n’ roll songs. I have often wondered why there are so many one-syllable words in rock songs, and have a pet theory for this. Rock lyrics favor Anglo-Saxon words rather than Latinate words — the former have a more direct, less fussy sound — and since the Latinate words tend to be multi-syllabic compounds, multi-syllabic words (say, more than three syllables) tend to be very rare in rock music. Why exactly the monosyllable is appropriate to rock is something I cannot explain, although it may be related to another pattern I have observed: countries that underwent the Protestant Reformation seem to be the most adept at producing (not necessarily consuming) rock music, particularly heavy metal. Perhaps there is a connection here between Northern European linguistic practices (and the persistence of Anglo-Saxon forms) and the predisposition to religious violence in the sixteenth and seventeenth centuries, one that prepares these countries for immersion in a subsequent musical form like rock n’ roll.

    In any event, I’d like to know what the longest Latinate word is that has been successfully used in a rock song. My candidate (based on popularity, not length) would be “satisfaction,” as in, “I can’t get no satisfaction.”
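    For the curious, a count of this kind really can be automated without any parsing device. Here is a minimal Python sketch using a crude vowel-group heuristic; the silent-e rule and the regular expressions are my own simplifications, not a real phonetic dictionary, so it will miscount some words:

    ```python
    import re

    def count_syllables(word: str) -> int:
        """Rough syllable count: runs of vowels, with a crude silent-e adjustment."""
        word = word.lower()
        count = len(re.findall(r"[aeiouy]+", word))
        # drop a final silent e ("face" -> 1), but never go below one syllable
        if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
            count -= 1
        return max(count, 1)

    def long_words(lyric: str, min_syllables: int = 4) -> list[str]:
        """Return the words in a lyric with at least min_syllables syllables."""
        return [w for w in re.findall(r"[a-zA-Z']+", lyric)
                if count_syllables(w) >= min_syllables]

    print(long_words("I can't get no satisfaction"))  # ['satisfaction']
    ```

    A pronouncing dictionary (the CMU Pronouncing Dictionary, for instance) would do this far more accurately, but even the crude version confirms that “satisfaction” clears the four-syllable bar.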

  • Texts as Objects II: Object Oriented Philosophy. And Criticism?

    A text as it is represented statistically, in object form
    In the previous post I laid out several questions about the nature of texts, objects and interpretation that arise when we subject texts — for example, the Folio plays of Shakespeare — to statistical analysis. Above is a sketch of two texts, T1 and T2 (forgive the hand-drawn visuals), that exist as documents we might read. This is our point of contact as scholars, and we know where to take it from here. But for machine analysis, these texts are transformed into objects — relational, formalized mathematical entities — which means that they are containers of containers of things. So let’s think this way about texts for a moment.
    T1 and T2 are both texts of 1000 words in length. We can think of these texts as a set of tokens drawn from a larger set of tokens that represents the totality of English words at a given moment. (Such a totality is an abstraction, just as Saussure’s langue was an abstraction; let’s leave that aside for now.) Now a mathematically minded critic might say the following: Table 1 is a topologically flat representation of all possible words in English, arrayed in a two-dimensional matrix. The text T1 is a vector through that table, a needle that carries the “thread” through various squares on the surface, like someone embroidering a quilt. One possible way of describing the text, then, would be to chart its movement through this space, like a series of stitches.
    Generalizations about the syntax and meaning of that continuously threading line would be generalizations about two things: the sequence of stitches and the significance of different regions in the underlying quilt matrix. I have arranged the words alphabetically in this table, which means that a “stitch history” of movements around the table would not be very revealing. But the table could be rendered in many other ways (it could be rendered three- or multi-dimensionally, for example). What if I put all of the verbs in the lower left-hand corner (southwest) of the table and all of the pronouns in the upper right (northeast)? Based on this act of spatial classification, you could then come up with statements like: “I see many threads passing between the northeast and southwest,” a meaningless descriptive statement unless you add: “this is because verbs are here and pronouns are there, and they tend to follow one another in written and spoken English.” So this spatializing approach to textual analysis would require three things: (1) arrangement of the matrix in a meaningful way; (2) description of the movement through the matrix; and (3) analysis of patterns in that movement. Based on (1) you might have something interesting to say about (3), and as the note says, a text is a “vector through a hypothetical Table” and “a theory of rhetoric, grammar, semantics is an attempt to rationalize this vector — as sequence — by regrouping the words in the table by region.” In effect, any mathematical or container-based analysis of a text must ultimately be some kind of mapping of a vector-space (semantic, ideological, grammatical, generic, etc).
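    To make the quilt metaphor concrete, here is a toy Python sketch of a “stitch history”: a tiny invented vocabulary arranged alphabetically in a square grid, with a text rendered as the sequence of grid coordinates its needle visits. Both the vocabulary and the sentence are placeholders of my own, for illustration only:

    ```python
    import math

    # A toy vocabulary laid out alphabetically in a square grid (the "quilt").
    vocab = sorted({"a", "and", "cat", "dog", "mat", "on", "ran", "sat", "the"})
    side = math.ceil(math.sqrt(len(vocab)))
    position = {w: divmod(i, side) for i, w in enumerate(vocab)}  # word -> (row, col)

    def stitch_history(text: str) -> list[tuple[int, int]]:
        """Describe a text as the sequence of grid squares its 'needle' visits."""
        return [position[w] for w in text.lower().split()]

    path = stitch_history("the cat sat on the mat")
    print(path)  # [(2, 2), (0, 2), (2, 1), (1, 2), (2, 2), (1, 1)]
    ```

    With an alphabetical layout the path is arbitrary, as noted above; rearrange the grid so that verbs occupy one corner and pronouns another, and recurring moves between regions would begin to show up as repeated stitches.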
    Now, Docuscope is itself a built form of this type of container-based analysis, one that eliminates the temporal dimension of “stitching” described above by transforming the hypothetical table into buckets or classes of words and then decanting the text into those buckets. Instead of regional movement, we get inclusion or exclusion of words (strings) from classes of words. The architecture of the classes matters, of course, since only if that architecture is good will we find patterns that we recognize and understand, understanding being the ultimate goal here. (It is also possible to simply look for correlated patterns among documents that might allow someone to find an entire class of objects based on a few tokens they already know (a very small “class”), as Google does; but finding is not criticism.) So what is a text in the eyes of Docuscope, or, for that matter, any device that tags documents? One answer is that the text “is” the items circled above M1 and M2: words or sequences of words that have been classed into buckets. At the level of M1 and M2, the text becomes a set of local subsets, each of which contains a number of tokens. Statistical analysis of this partitioned object yields quantitative relations — R1, R2 and R3 — which differentiate one text from another.
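    Since Docuscope’s actual categories are proprietary and not reproduced here, the bucket-and-decant idea can be sketched with invented classes. The class names echo the ones discussed in these posts, but the word lists are placeholders of my own:

    ```python
    import re
    from collections import Counter

    # Hypothetical string classes standing in for Docuscope's categories;
    # the names echo this discussion but the word lists are invented.
    CLASSES = {
        "FirstPerson": {"i", "me", "my", "mine", "we", "us", "our"},
        "Interaction": {"you", "your", "thou", "thee", "?"},
        "Description": {"green", "stone", "river", "cold", "tall"},
    }

    def decant(text: str) -> Counter:
        """Pour each token into the first bucket that contains it; count inclusions."""
        counts = Counter()
        for token in re.findall(r"[a-z']+|\?", text.lower()):
            for name, bucket in CLASSES.items():
                if token in bucket:
                    counts[name] += 1
                    break
        return counts

    counts = decant("I see you by the cold green river. Do you hear me?")
    print(counts)  # Counter({'Interaction': 3, 'Description': 3, 'FirstPerson': 2})
    ```

    The result is exactly the “set of local subsets” described above: the temporal thread of the text is gone, and what remains are counts of inclusion, ready for statistical comparison across documents.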
    Now for the philosophical question, the one where object oriented philosophy might be useful: when asked to describe the nature of the statistical entity undergoing analysis here (the data object rendered by Docuscope and then explored within R), do we say that it is simply the local contents (M1, M2) of the containers (T1 and T2)? If I begin by saying that the being of this object is, rather, the structure of these elements in their containers — a better answer, I think — then I probably mean that T1 and T2 are really the sum of all relations that can be posited (R1, R2, R3) among rendered elements (M1, M2). This rather Leibnizian sounding answer suggests that a text’s existence is ultimately differential: it is the sum of that object’s relations with all other objects. The statistical analysis of texts would be the quantitative description of this totality of relations given a set of classes — classes that we, as humanists, want to debate because they may be the source of any meaning in the result (because a certain kind of meaning or “purpose in pattern” is distributed into the classes).
    But here is where I think Harman adds something crucial. If the argument he has been developing in Tool Being, Prince of Networks and elsewhere is correct, then an object of this or any other kind would not be the sum of its relations with other objects, as is the case in Latour’s analysis. To this relational model, Harman opposes the metaphysical integrity of the object over and beyond its relations, an integrity which holds that object together in its “domestic” being over and above its relational “alliances.” In Prince of Networks, he writes:
    I hold that there is an absolute distinction between the domestic relations a thing needs to some extent in order to exist [see above, M1, M2] and the external alliances that it does not need [above, R1, R2, R3]. But the actor itself [i.e., object of analysis] cannot be identified with either. An object cannot be exhausted by a set of alliances. But neither is it exhausted by a summary of its pieces, since any genuine object will be an emergent reality over and above its components, oversimplifying those components and able to withstand a certain degree of turbulent change in them. (135)
    What I find fascinating and important about Harman’s idea here is that he is providing a rationale for (1) accommodating the kind of container analysis I have outlined above while (2) arguing that this type of analysis is not the end of the story. Now, Harman and the Speculative Realists have been reluctant to discuss what constitutes a text and how language might itself be an object, a reluctance that stems — understandably, I think — from fatigue with the post-Heideggerian “language is everything” trend in Continental philosophy and cultural studies. But language is definitely something, and it is as real as anything else I can think of. So too are our encounters (in the theater, the library, the cinema) with things like genre, style, ideology and pleasure.
    Object oriented philosophy should have something to say about texts, since they too provide a particularly good example of why the purely relational criterion for an object’s identity (whether it is a text, a word, a thought, feeling, or piece of wood) is insufficient. As literary critics and theorists, we may have something to add to Harman’s account of the inexhaustibility of an object’s relations and its emergent reality over and above its components. In fact, this is what many of us have been arguing is wrong about the kinds of reductive claims that can be made about texts on the grounds that they yield statistical regularities.
    What does it mean for the reality of an object to “simplify” its “components”? Perhaps the process that Harman refers to as simplification is what we as literary critics refer to as interpretation: the contingent coming into being of a portion of an object’s reality — here, a text — through that object’s interrelation with other objects and the subtractive unveiling of its inexhaustible contents. (Whitehead describes this as the process of “objectification.”) Harman would argue that such emergent realities don’t just take hold between texts and readers, but between sunlight and plant leaves or fire and cotton. All objects can be oversimplified, all of them can survive (and resist) some degree of turbulent change.
    If objects are really this universal, then the process of “pattern recognition” that I describe as object oriented criticism is really something more involved than the collating of sets and relations among sets. Clearly, if a text is understood as a container of relations, then statistics can model the complexity of that object and its relations — even the immense complexity of a textual object. But that model, like the map of relations above, will always be just an approximation. As Harman insists, the inner reality of the object — itself alluring with the promise of something more — is never fully available, whether that object is a piece of wood or a piece of writing. As literary critics, I think we can find plenty to work with when objects are defined in this way.
  • Texts as Objects I: Object Oriented Philosophy. And Criticism?

    In the work I have been doing on Shakespeare with my colleague Jonathan Hope (see previous posts under Shakespeare category), we have approached the plays as two kinds of objects simultaneously: as historical documents of theater history and as objects of statistical analysis. We have emphasized their theatrical foundations because we believe this is the reality of what is being studied: real people on stage saying these words (or something like them) in a real situation. The forces at work in this situation shaped the final result, and the meaning of what we find there — when we find it — is most significant as a reflection of that time and place. This makes us historicists, and in my case there is also a certain sympathy for materialist rather than idealist approaches to literature (although these terms are not very nuanced).

    But what does it mean to say that a text is an object of statistical analysis, and how might this “object status” be related to our broader account of what texts are in general? Is there anything to be learned from thinking in this way about texts and interpretation that might alter the basic conceptual distinctions we use to think about texts, culture, experience, and language? This post represents a first attempt at answering some of these questions.

    We need to start with a frame of analysis, and for this, I’ll use recent debates in philosophy and sociology about networks, actors and objects. Some of you may be familiar with the Actor Network Theory of Bruno Latour, which provides what you might call a flat ontology of actors in the world, one that makes no distinction in kind between natural, human made (technological), animate and inanimate “actors” in any given domain of analysis. Graham Harman, who is one of the leaders of a group of philosophers now known as the Speculative Realist school, has provided a fascinating summary and critique of Latour’s work, one that I was present for at a symposium on Latour held last year at the London School of Economics. During this event, I asked Harman and Latour if this kind of flat ontology limited the kinds of things one can claim in any causal explanation of a given scene of change or transformation (a revolution in a government, a reconfiguration of a bureaucracy, a change of state in a gas, a change in emotions). The problem — which Harman expertly delineates in his recent book, Bruno Latour: Prince of Networks — is that if no metaphysical priority is given to any particular type of actor; and if, further, all actors exhaust all of their potential at every moment because they possess no metaphysically privileged “special stuff” that will carry their powers through to the exclusion of other powers; then it becomes impossible to account for change. If you accept these consequences, then what we call “explanation” in any kind of critical work becomes interchangeable with description, and the activity of analysis becomes — as I argued at the LSE symposium — the “serial redescription” of each new state of the world. Harman agreed that this was unsatisfactory. Latour, to my surprise, said that this was exactly what he is trying to do in his sociological work. (A book about the symposium will be published next year.)

    Now, in literary criticism, we do not think of our work as being that of “description.” And yet, we are not really analyzing causal patterns either, at least not in the way that an epidemiologist would be when she links the presence of a given microbe to the development of a particular illness in a population. Somewhere in the middle of this continuum, between description on the one hand and causal explanation on the other, lies meaning — which is what my colleagues and I in the humanities are probably most interested in. There are lots of ways to think about meaning, but perhaps one way we can do so is to think of it as “purpose in pattern,” something more akin to Aristotle’s final cause than the efficient cause that brings things about causally. (I realize that there are problems with Aristotle, but I believe the distinction is useful for the present discussion.) One of the hallmarks of European modernity, arguably, is the tendency to believe that discussions of final causes, purposes (and later, meaning) ought to be kept separate from discussions of how things work (efficient causation). For the most part, I think that has been a good idea, although it has aided and abetted the creation of the “two cultures” of science and the humanities. Stephen Jay Gould’s notion of two non-overlapping magisteria with different protocols of explanation seems like a fine truce to me. But where do humanists (i.e., members of the humanities disciplines) fit in? In literary studies, we are very much interested in patterns, and the history of literary criticism is — among other things — the history of pattern recognition among readers and users of language.

    Literary genre is a pattern that human readers since Aristotle have discerned in drama, poetry and prose. This pattern is also picked out by unsupervised statistical analysis, both on the basis of the frequency of individual words (see Jockers et al.) and on the basis of groupings of words that have been tagged by a device like Docuscope. So where does that pattern exist? In the text or performance itself? In the mind that recognizes it? What is it made of? A set of relationships? A series of comparisons undertaken by the creators of texts and their interpreters? Do we learn anything new about genre when we say that it can be given multiple descriptions — either a plot formula (an amusing story ending in marriage) or a multivariate, statistical recipe (a story containing lots of I, me, my, you but very little concretely descriptive language)? Let’s take seriously the idea that genre is a formal or mathematical object, and see where it leads us.
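    As a toy illustration of the “multivariate recipe” view of genre, here is a Python sketch that assigns plays to genre-like groups by nearest centroid (a single assignment step of k-means, the sort of move an unsupervised analysis makes many times over). All of the numbers, and the centroids themselves, are invented for illustration; they are not real counts from any corpus:

    ```python
    # Toy feature vectors per play: (first/second-person pronoun rate,
    # concrete-description rate) per thousand words. Invented numbers.
    plays = {
        "Twelfth Night": (62.0, 18.0),
        "As You Like It": (58.0, 20.0),
        "Richard II": (21.0, 55.0),
        "Henry V": (19.0, 60.0),
    }

    # Seed centroids for the two hypothesized recipes.
    centroids = {"comedy-like": (60.0, 19.0), "history-like": (20.0, 57.0)}

    def nearest(point: tuple[float, float]) -> str:
        """Assign a play to the centroid at the smallest squared distance."""
        return min(centroids,
                   key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))

    grouping = {title: nearest(vec) for title, vec in plays.items()}
    print(grouping)
    ```

    On this view the “amusing story ending in marriage” and the “lots of I, me, my, you but little concrete description” are two descriptions of one pattern: the second is simply the pattern restated as a region of feature space.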

  • More Shakespeare Outliers

    PCA Scatterplot in R of the First Folio Plays

    I’ve expanded the labels here on our PCA scatterplot in order to see a few more items. Several things worth thinking about here:

    • Late Plays are clustering in neither the Comedy nor the History quadrants explored in the other posts. The three that we see here — Winter’s Tale, Cymbeline, and Henry VIII — thus lack the dialogic interactivity we saw in comedy and the profusion of concrete nouns and description in history. This is an interesting way of thinking of the Late Plays: as lacking something that is a defining presence in the two most linguistically “obvious” genres of Shakespeare’s writing (comedy and history). We might think of genres that show up as diagonally opposed in PCA as “linguistic primes” in that they seem to be composed of nothing simpler than themselves. Those that are caught in the remaining corners (themselves lacking any opposite partner) would then be called “secondary,” since they cohere indirectly on a set of differences that are more comprehensively ordering a different part of the field. Note too that Romeo and Juliet is virtually identical with The Tempest, our last Late Play, in this plot. Both plays break the most obvious “rule” that Shakespeare seems to honor in his writing of plays — that of choosing between either First Person + Interaction strings or Description strings, but not both — and they break this “rule” in exactly the same way. Instead of choosing one of these two linguistic “forks in the road,” Romeo and Juliet and The Tempest take both at the same time, combining lots of the dialogical element we saw in Twelfth Night with the profusion of concrete descriptions (nouns, adjectives) that characterized Richard II.

    • In almost every visualization I have used of these data — Factor Analysis with various rotations, PCA — I find that A Midsummer Night’s Dream is unusual among the comedies. Sometimes it is grouped with the histories because it contains so much description in the passages dealing with the fairy landscape. Linguistically, this feature sets A Midsummer Night’s Dream apart from other comedies. For an illustration of what is unusual about MSND, which scores unusually high on the history component (Description) but also scores reasonably high on the comedy one (First Person/Interaction), click here. I also find that Henry VIII is often placed away from the pack, which in this case is due to its relative lack of all three of the string types tracked in this exercise — Description somewhat, but very obviously First Person and Interaction. (For a sample passage where few of these are present, click here.) There are many reasons why this play might be distinctive — it is co-written with Fletcher, it comes at the very end of Shakespeare’s career — but the only way to really know is to look at individual passages like the one I’ve posted and see what’s going on. Seeing what an absence of something is making possible, of course, is often more difficult than seeing what the presence of something makes possible.

    • Two very unusual Comedies are showing up in the lower left-hand quadrant, where three of the four Late Plays are located. This makes a certain kind of sense, as Measure for Measure and All’s Well That Ends Well are regularly described by critics as “problem comedies.” From a critical standpoint, this means that they lack the buoyant tone of plays like Much Ado or As You Like It or that they veer into emotions or problems that cannot really be solved by a few marriages at the end of the play (e.g., Angelo’s redemption or Bertram’s romantic rehabilitation). Of course, from a statistical-linguistic standpoint, the description of what makes these plays “unusual” would be different: they lack the First Person and Interaction strings of the high comedies while simultaneously lacking the Description strings that characterize histories. This description could be more nuanced — there are more subtle ways of characterizing these patterns if we break the plays down into smaller parts (and so can use more refined categories) — but we will do this later.

    • Tragedies are evenly spread out over the plot. This is in and of itself a significant finding; it does not mean that tragedies don’t have distinguishing traits, but that those traits aren’t tracked by the most obvious forms of coordinated variation that we can track in this corpus using Docuscope. I suspect that Matt Jockers’ most-frequent-word analysis would produce a similar result, as he and I have been finding very similar patterns in primary and secondary genre divisions using our different means. In fact, a combination of two other components (PC3 and PC5) does corner the tragedies in their own quadrant, and this will be the subject of a future post.

    So what are the rest of these dots? Below is an R biplot which shows the items plotted in the PCA scatterplot above, but instead of distinguishing them by color, it lists them by item number. (The numbers correspond to play titles, which I have also posted on the left-hand side of the image; please click on the image below to open in another screen, then click again to resize to your window.) The biplot is helpful because, in addition to plotting the plays in PCA space, it shows the component loadings, which means that it illustrates the relationship between the variables counted as they vary across this corpus. The magnitude of trackable variation in individual variables (First Person, Interaction, etc.) is represented by a line in space — a vector — and its variation with respect to other vectors (other variables) is registered geometrically by the variable names (X. [Variable Name]) when they are suitably arranged around the origin. I have numbered the plays in order of composition, using the dating scheme provided by the Oxford editors. It makes for an interesting game of connect-the-dots, one that traces Shakespeare’s stylistic progress throughout his career. (Note: he leaps.)

    Variables that extend opposite one another at an angle of 180 degrees are inversely correlated, while those that line up on top of one another vary with one another. Vectors that sit at right angles to one another have an interesting feature: because they are orthogonal, their variance is unrelated. So from the biplot below, we can see quite quickly that First Person and Interaction strings tend to be found together in individual items (plays), whereas Description strings (which vary inversely with the amount of Topical Flow strings) tend to be present or absent in ways that have nothing to do with the presence or absence of First Person and Interaction. Another way of expressing this orthogonal relationship: behaviors among First Person and Interaction strings are (for whatever reason) indifferent to those of Description and Topical Flow strings, and vice versa. This doesn’t mean they aren’t connected on some other component (we are only looking at the first two here), but when we are thinking about the most statistically powerful description of variance in the corpus (which is captured in early principal components), this is how all of the quantities of counted things relate.
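    The geometry here can be checked directly: the correlation between two variables is the cosine of the angle between their mean-centered vectors, so inversely correlated variables point in opposite directions and orthogonal (uncorrelated) ones sit at right angles. A small Python sketch, with invented per-play scores rather than real Docuscope counts:

    ```python
    import math

    def pearson(xs: list[float], ys: list[float]) -> float:
        """Correlation coefficient: the cosine of the angle between the
        two mean-centered variable vectors."""
        n = len(xs)
        mx, my = sum(xs) / n, sum(ys) / n
        cx = [x - mx for x in xs]
        cy = [y - my for y in ys]
        dot = sum(a * b for a, b in zip(cx, cy))
        norms = math.sqrt(sum(a * a for a in cx)) * math.sqrt(sum(b * b for b in cy))
        return dot / norms

    # Invented per-play scores for three string types (not real Docuscope counts).
    first_person = [5.0, 6.1, 4.8, 6.4, 5.5]
    interaction  = [7.2, 8.0, 6.9, 8.3, 7.5]
    description  = [3.1, 1.9, 3.4, 1.8, 2.6]

    print(pearson(first_person, interaction))  # near +1: loading vectors align
    print(pearson(first_person, description))  # near -1: vectors point opposite ways
    ```

    A pair of variables with correlation near zero would sit at roughly 90 degrees in the biplot, which is the orthogonality described above.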

    Click on chart to enlarge; click again in new screen to resize.

    A parting thought: what two plays are the most opposite in terms of style, based on what Docuscope sees and PCA can find in terms of variation patterns? Two obvious candidates would be Henry V and A Comedy of Errors, number 19 at the bottom and number 8 at the top; and A Midsummer Night’s Dream and Measure for Measure, numbers 12 and 25 on the left and right. If you’ve been following the discussion and this diagram makes sense to you — or if you’ve just read both pairs of plays — you know why they are so different.