Tag: Docuscope

  • Auerbach Was Right: A Computational Study of the Odyssey and the Gospels

    Rembrandt, The Denial of St. Peter (1660), Rijksmuseum

    In the “Fortunata” chapter of his landmark study, Mimesis: The Representation of Reality, Erich Auerbach contrasts two representations of reality, one found in the New Testament Gospels, the other in texts by Homer and a few other classical writers. As with much of Auerbach’s writing, the sweep of his generalizations is broad. Long excerpts are chosen from representative texts. Contrasts and arguments are made as these excerpts are glossed and related to a broader field of texts. Often Auerbach only gestures toward the larger pattern: readers of Mimesis must then generate their own (hopefully congruent) understanding of what the example represents.

    So many have praised Auerbach’s powers of observation and close reading. At the very least, his status as a “domain expert” makes his judgments worth paying attention to in a computational context. In this post, I want to see how a machine would parse the difference between the two types of texts Auerbach analyzes, stacking the iterative model against the perceptions of a master critic. This is a variation on the experiments I have performed with Jonathan Hope, where we take a critical judgment (e.g., someone’s division of Shakespeare’s corpus of plays into genres) and then attempt to reconstruct, at the level of linguistic features, the perception which underlies that judgment. We ask, Can we describe what this person is seeing or reacting to in another way?

    Now, Auerbach never fully states what makes his texts different from one another, which makes this task harder. Readers must infer both the larger field of texts that exemplify the difference Auerbach alludes to, and the difference itself as adumbrated by that larger field. Sharon Marcus is writing an important piece on this allusive play between scales — between reference to an extended excerpt and reference to a much larger literary field. Because so much goes unstated in this game of stand-ins and implied contrasts, the prospect of re-describing Auerbach’s difference in other terms seems particularly daunting. The added difficulty makes for a more interesting experiment.

    Getting at Auerbach’s Distinction by Counting Linguistic Features

    I want to offer a few caveats before outlining what we can learn from a computational comparison of the kinds of works Auerbach refers to in his study. For any of what follows to be relevant or interesting, you must take for granted that the individual books of the Odyssey and the New Testament Gospels (as they exist in translation from Project Gutenberg) represent adequately the texts Auerbach was thinking about in the “Fortunata” chapter. You must grant, too, that the linguistic features identified by Docuscope are useful in elucidating some kind of underlying judgments, even when it is used on texts in translation. (More on the latter and very important point below.) You must further accept that Docuscope, here version 3.91, has all the flaws of a humanly curated tag set. (Docuscope annotates all texts tirelessly and consistently according to procedures defined by its creators.) Finally, you must already agree that Auerbach is a perceptive reader, a point I will discuss at greater length below.

    I begin with a number of excerpts that I hope will give a feel for the contrast in question, if it is a single contrast. This is Auerbach writing in the English translation of Mimesis:

    [on Petronius] As in Homer, a clear and equal light floods the persons and things with which he deals; like Homer, he has leisure enough to make his presentation explicit; what he says can have but one meaning, nothing is left mysteriously in the background, everything is expressed. (26-27)

    [on the Acts of the Apostles and Paul’s Epistles] It goes without saying that the stylistic convention of antiquity fails here, for the reaction of the casually involved person can only be presented with the highest seriousness. The random fisherman or publican or rich youth, the random Samaritan or adulteress, come from their random everyday circumstances to be immediately confronted with the personality of Jesus; and the reaction of an individual in such a moment is necessarily a matter of profound seriousness, and very often tragic. (44)

    [on Gospel of Mark] Generally speaking, direct discourse is restricted in the antique historians to great continuous speeches…But here—in the scene of Peter’s denial—the dramatic tension of the moment when the actors stand face to face has been given a salience and immediacy compared with which the dialogue of antique tragedy appears highly stylized….I hope that this symptom, the use of direct discourse in living dialogue, suffices to characterize, for our purposes, the relation of the writings of the New Testament to classical rhetoric… (46)

    [on Tacitus] That he does not fall into the dry and unvisualized, is due not only to his genius but to the incomparably successful cultivation of the visual, of the sensory, throughout antiquity. (46)

    [on the story of Peter’s denial] Here we have neither survey and rational disposition, nor artistic purpose. The visual and sensory as it appears here is no conscious imitation and hence is rarely completely realized. It appears because it is attached to the events which are to be related… (47, emphasis mine)

    There is a lot to work with here, and the difference Auerbach is after is probably always going to be a matter of interpretation. The simple contrast seems to be that between the “equal light” that “floods persons and things” in Homer and the “living dialogue” of the Gospels. The classical presentation of reality is almost sculptural in the sense that every aspect of that reality is touched by the artistic designs of the writer. One chisel carves every surface. The rendering of reality in the Gospels, on the other hand, is partial and (changing metaphors here) shadowed. People of all kinds speak, encounter one another in “their random everyday circumstances,” and the immediacy of that encounter is what lends vividness to the story. The visual and sensory “appear…because [they are] attached to the events which are to be related.” Overt artistry is no longer required to dispose all the details in a single, frieze-like scene. Whatever is vivid becomes so, seemingly, as a consequence of what is said and done, and only as a consequence.

    These are powerful perceptions: they strike many literary critics as accurately capturing something of the difference between the two kinds of writing. It is difficult to say whether our own recognition of these contrasts, speaking now as readers of Auerbach, is the result of any one example or formulation that he offers. It may be the case, as Sharon Marcus is arguing, that Auerbach’s method works by “scaling” between the finely wrought example (in long passages excerpted from the texts he reads) and the broad generalizations that are drawn from them. The fact that I had to quote so many passages from Auerbach suggests that the sources of his own perceptions are difficult to discern.

    Can we now describe those sources by counting linguistic features in the texts Auerbach wants to contrast? What would a quantitative re-description of Auerbach’s claims look like? I attempted to answer these questions by tagging and then analyzing the Project Gutenberg texts of the Odyssey and the Gospels. I used the version of Docuscope currently in use by the Visualizing English Print team, a program that scans a corpus of texts and then tallies linguistic features according to hand-curated sets of words and phrases called “Language Action Types” (hereafter, “features”). Thanks to the Visualizing English Print project, I can share the raw materials of the analysis. Here you can download the full text of everything being compared. Each text can be viewed interactively according to the features (coded by color) that have been counted. When you open any of these files in a web browser, select a feature to explore by pressing on the feature names to the left. (This “lights up” the text with that feature’s color.)

    I encourage you to examine these texts as tagged by Docuscope for yourself. Like me, you will find many individual tagging decisions you disagree with. Because Docuscope assigns every word or phrase to one and only one feature (including the feature, “untagged”), it is doomed to imprecision and can be systematically off base. After some checking, however, I find that the things Docuscope counts happen often and consistently enough that the results are worth thinking about. (Hope and I found this to be the case in our Shakespeare Quarterly article on Shakespeare’s genres.) I always try to examine as many examples of a feature in context as I can before deciding that the feature is worth including in the analysis. Were I to develop this blog post into an article, I would spend considerably more time doing this. But the features included in the analysis here strike me as generally stable, and I have examined enough examples to feel that the errors are worth ignoring.
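    To make the one-feature-per-word constraint concrete, here is a minimal sketch of a dictionary tagger that prefers the longest matching phrase and labels everything else “Untagged.” It is not Docuscope’s actual algorithm or dictionary: the entries, the longest-match rule, and the feature names “ReportingEvent” and “Duration” are assumptions made up for illustration (only “SenseObject” is a name that appears in these posts).

```python
# Toy illustration only: this dictionary and the longest-match rule are
# assumptions for the example, not Docuscope's real data or procedure.
from collections import Counter

TOY_DICTIONARY = {
    ("he", "said"): "ReportingEvent",   # hypothetical feature name
    ("all", "night"): "Duration",       # hypothetical feature name
    ("wine",): "SenseObject",
    ("loom",): "SenseObject",
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in TOY_DICTIONARY)


def tag_tokens(tokens):
    """Assign every token to exactly one feature, preferring the longest phrase."""
    tagged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for length in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + length])
            if phrase in TOY_DICTIONARY:
                tagged.append((" ".join(phrase), TOY_DICTIONARY[phrase]))
                i += length
                break
        else:
            # No dictionary entry starts here: the token is left untagged.
            tagged.append((tokens[i], "Untagged"))
            i += 1
    return tagged


def feature_counts(tokens):
    """Tally how many tagged spans fall under each feature."""
    return Counter(feature for _, feature in tag_tokens(tokens))


tokens = "he said she wove at the loom all night".split()
counts = feature_counts(tokens)
print(counts)
```

    Because each span receives a single label, a word that could plausibly belong to two features is always forced into one of them, which is the source of the systematic imprecision described above.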

    Findings

    We can say with statistical confidence (p ≤ .001) that several of the features identified in this analysis occur significantly more often in one of the two types of writing than in the other. These and only these features are the ones I will discuss, starting with an example passage taken from the Odyssey. Names of highlighted features appear on the left-hand side of the screenshot below, while words or phrases assigned to those features are highlighted in the text to the right. Again, items highlighted in the following examples appear significantly more often in the Odyssey than in the New Testament Gospels:

    Odyssey, Book 1, Project Gutenberg Text (with discriminating features highlighted)

    Book I is bustling with description of the sensuous world. Words in pink describe concrete objects (“wine,” “the house”, “loom”) while those in green describe things involving motion (verbs indicating an activity or change of state). Below are two further examples of such features:

    Screen Shot 2016-01-05 at 8.33.24 AM

    Screen Shot 2016-01-05 at 8.30.03 AM

    Notice also the purple features above, which identify words involved in mediating spatial relationships. (I would quibble with “hearing” and “silence” as being spatial, per the long passage above, but in general I think this feature set is sound.) Finally, in yellow, we find a rather simple thing to tag: quotation marks at the beginning and end of a paragraph, indicating a long quotation.

    Continuing on to a shorter set of examples, orange features in the passages below and above identify the sensible qualities of a thing described, while blue elements indicate words that extend narrative description (“. When she” “, and who”) or words that indicate durative intervals of time (“all night”). Again, these are words and phrases that are more prevalent in the Homeric text:

    Screen Shot 2016-01-05 at 8.42.56 AM

    Screen Shot 2016-01-08 at 8.32.49 AM

    Screen Shot 2016-01-08 at 8.37.28 AM

    The items in cyan, particularly “But” and “, but”, are interesting, since both continue a description by way of contrast. This translation of the Odyssey is full of such contrastive words, for example, “though,” “yet,” “however,” “others,” many of which are mediated by Greek particles in the original.

    When quantitative analysis draws our attention to these features, we see that Auerbach’s distinction can indeed be tracked at this more granular level. Compared with the Gospels, the Odyssey uses significantly more words that describe physical and sensible objects of experience, contributing to what Auerbach calls the “successful cultivation of the visual.” For these texts to achieve the effects Auerbach describes, one might say that they can’t not use concrete nouns alongside adjectives that describe sensuous properties of things. Fair enough.
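    The post does not say which statistical test produced its p-values, so the following is only a sketch of one way to ask whether a feature’s rate differs between two groups of texts: an exact permutation test on the difference of group means. The function name and the rates are invented for illustration and are not the measurements from this analysis.

```python
# A swapped-in technique (exact permutation test), not necessarily the test
# used in the post. All numbers below are made up for illustration.
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference of group means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    n_b = len(group_b)
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    n_splits = 0
    # Enumerate every way of relabeling n_a of the texts as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            extreme += 1
        n_splits += 1
    return extreme / n_splits


# Hypothetical rates of one feature (occurrences per 100 words) per text.
odyssey_rates = [4.1, 3.8, 4.4, 4.0, 3.9, 4.3]
gospel_rates = [2.1, 2.4, 1.9, 2.2]
p = permutation_p_value(odyssey_rates, gospel_rates)
print(f"p = {p:.4f}")
```

    With only four Gospels, the smallest p-value such a test can return is bounded by the number of possible relabelings, which is one reason treating the individual books of the Odyssey as separate texts matters for a comparison like this.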

    Perhaps more interesting, though, are those features below in blue (signifying progression, duration, addition) and cyan (contrastive particles), features that manage the flow of what gets presented in the diegesis. If the Odyssey can’t not use these words and phrases to achieve the effect Auerbach is describing, how do they contribute to the overall impression? Let’s look at another sample from the opening book of the Odyssey, now with a few more examples of these cyan and blue words:

    Odyssey, Book 1, Project Gutenberg Text (with discriminating features highlighted)

    While this is by no means the only interpretation of the role of the words highlighted here, I would suggest that phrases such as “when she”, “, and who”, or “, but” also create the even illumination of reality to which Auerbach alludes. We would have to look at many more examples to be sure, but these types of words allow the chisel to remain on the stone a little longer; they continue a description by in-folding contrasts or developments within a single narrative flow.

    Let us now turn to the New Testament Gospels, which lack the above features but contain others to a degree that is statistically significant (i.e., we are confident that the generally higher measurements of these new features in the Gospels are not so by chance, and vice versa). I begin with a longer passage from Matthew 22, then a short passage from Peter’s denial of Jesus at Matthew 26:71. Please note that the colors employed below correspond to different features than they do in the passages above:

    Gospel of Matthew, Project Gutenberg Text (with discriminating features highlighted)
    Matthew 22, Project Gutenberg Text (with discriminating features highlighted)

    Matthew 26:71, Project Gutenberg Text (with discriminating features highlighted)

    The dialogical nature of the Gospels is obvious here. Features in blue, indicating reports of communication events, are indispensable for representing dialogical exchange (“he says”, “said”, “She says”). Features in orange, which indicate uses of the third person pronoun, are also integral to the representation of dialogue; they indicate who is doing the saying. The features in yellow represent (imperfectly, I think) words that reference entities carrying communal authority, words such as “lordship,” “minister,” “chief,” “kingdom.” (Such words do not indicate that the speaker recognizes that authority.) Here again it is unsurprising that the Gospels, which contrast spiritual and secular forms of obligation, would be obliged to make repeated reference to such authoritative entities.

    Things that happen less often may also play a role in differentiating these two kinds of texts. Consider now a group of features that, while present to a higher and statistically significant degree in the Gospels, are nevertheless relatively infrequent in comparison to the dialogical features immediately above. We are interested here in the words highlighted in purple, pink, gray and green:

    Matthew 13:5-6, Project Gutenberg Text (with discriminating features highlighted)

    Matthew 27:54, Project Gutenberg Text (with discriminating features highlighted)

    Matthew 23:16-17, Project Gutenberg Text (with discriminating features highlighted)

    Features in purple mark the process of “reason giving”; they identify moments when a reader or listener is directed to consider the cause of something, or to consider an action’s (spiritually prior) moral justification. In the quotation from Matthew 13, this form of backward looking justification takes the form of a parable (“because they had not depth…”). The English word “because” translates a number of ancient Greek words (διὰ, ὅτι); even a glance at the original raises important questions about how well this particular way of handling “reason giving” in English tracks the same practice in the original language. (Is there a qualitative parity here? If so, can that parity be tracked quantitatively?) In any event, the practice of letting a speaker — Jesus, but also others — reason aloud about causal or moral dependencies seems indispensable to the evangelical programme of the Gospels.

    To this rhetoric of “reason giving” we can add another of proverbiality. The word “things” in pink (τὰ in the Greek) is used more frequently in the Gospels, as are words such as “whoever,” which appears here in gray (for Ὃς and ὃς). We see comparatively higher numbers of the present tense form of the verb “to be” in the Gospels as well, here highlighted in green (“is” for ἐστιν). (See the adage, “many are called, but few are chosen” in the longer Gospel passage from Matthew 22 excerpted above, translating Πολλοὶ γάρ εἰσιν κλητοὶ ὀλίγοι δὲ ἐκλεκτοί.)

    These features introduce a certain strategic indefiniteness to the speech situation: attention is focused on things that are true from the standpoint of time immemorial or prophecy. (“Things” that just “are” true, “whatever” the case, “whoever” may be involved.) These features move the narrative into something like an “evangelical present” where moral reasoning and prophecy replace description of sensuous reality. In place of concrete detail, we get proverbial generalization. One further effect of this rhetoric of proverbiality is that the searchlight of narrative interest is momentarily dimmed, at least as a source illuminating an immediate physical reality.

    What Made Auerbach “Right,” And Why Can We Still See It?

    What have we learned from this exercise? Answering the most basic question, we can say that, after analyzing the frequency of a limited set of verbal features occurring in these two types of text (features tracked by Docuscope 3.91), we find that some of those features distribute unevenly across the corpus, and do so in a way that tracks the two types of texts Auerbach discusses. We have arrived, then, at a statistically valid description of what makes these two types of writing different, one that maps intelligibly onto the conceptual distinctions Auerbach makes in his own, mostly allusive analysis. If the test was to see if we can re-describe Auerbach’s insights by other means, Auerbach passes the test.

    But is it really Auerbach who passes? I think Auerbach was already “right” regardless of what the statistics say. He is right because generations of critics recognize his distinction. What we were testing, then, was not whether Auerbach was “right,” but whether a distinction offered by this domain expert could be re-described by other means, at the level of iterated linguistic features. The distinction Auerbach offered in Mimesis passes the re-description test, and so we say, “Yes, that can be done.” Indeed, the largest sources of variance in this corpus — features with the highest covariance — seem to align independently with, and explicitly elaborate, the mimetic strategies Auerbach describes. If we have hit upon something here, it is not a new discovery about the texts themselves. Rather, we have found an alternate description of the things Auerbach may be reacting to. The real object of study here is the reaction of a reader.

    Why insist that it is a reader’s reactions and not the texts themselves that we are describing? Because we cannot somehow deposit the sum total of the experience Auerbach brings to his reading in the “container” that is a text. Even if we are making exhaustive lists of words or features in texts, the complexity we are interested in is the complexity of literary judgment. This should not be surprising. We wouldn’t need a thing called literary criticism if what we said about the things we read exhausted or fully described that experience. There’s an unstatable fullness to our experience when we read. The enterprise of criticism is the ongoing search for ever more explicit descriptions of this fullness. Critics make gains in explicitness by introducing distinctions and examples. In this case, quantitative analysis extends the basic enterprise, introducing another searchlight that provides its own, partial illumination.

    This exercise also suggests that a mimetic strategy discernible in one language survives translation into another. Auerbach presents an interesting case for thinking about such survival, since he wrote Mimesis while in exile in Istanbul, without immediate access to all of the sources he wants to analyze. What if Auerbach was thinking about the Greek texts of these works while writing the “Fortunata” chapter? How could it be, then, that at least some of what he was noticing in the Greek carries over into English via translation, living to be counted another day? Readers of Mimesis who do not know ancient Greek still see what Auerbach is talking about, and this must be because the difference between classical and New Testament mimesis depends on words or features that can’t be omitted in a reasonably faithful translation. Now a bigger question comes into focus. What does it mean to say that both Auerbach and the quantitative analysis converge on something non-negotiable that distinguishes these two types of writing? Does it make sense to call this something “structural”?

    If you come from the humanities, you are taking a deep breath right about now. “Structure” is a concept that many have worked hard to put in the ground. Here is a context, however, in which that word may still be useful. Structure or structures, in the sense I want to use these words, refers to whatever is non-negotiable in translation and, therefore, available for description or contrast in both qualitative and quantitative terms. Now, there are trivial cases that we would want to reject from this definition of structure. If I say that the Gospels are different from the Odyssey because the word Jesus occurs more frequently in the former, I am talking about something that is essential but not structural. (You could create a great “predictor” of whether a text is a Gospel by looking for the word “Jesus,” but no one would congratulate you.)

    If I say, following Auerbach, that the Gospels are more dialogical than the Homeric texts, and so that English translations of the same must more frequently use phrases like “he said,” the difference starts to feel more inbuilt. You may become even more intrigued to find that other, less obvious features contribute to that difference which Auerbach hadn’t thought to describe (for example, the present tense forms of “to be” in the Gospels, or pronouns such as “whoever” or “whatever”). We could go further and ask, Would it really be possible to create an English translation of Homer or the Gospels that fundamentally avoids dialogical cues, or severs them from the other features observed here? Even if, like the translator of Perec’s La Disparition, we were extremely clever in finding a way to avoid certain features, the resulting translation would likely register the displacement in another form. (That difference would live to be counted another way.) To the extent that we have identified a set of necessary, indispensable, “can’t not occur” features for the mimetic practice under discussion, we should be able to count it in both the original language and a reasonably faithful translation.

    I would conjecture that for any distinction to be made among literary texts, there must be a countable correlate in translation for the difference being proposed. No correlate, no critical difference — at least, if we are talking about a difference a reader could recognize. Whether what is distinguished through such differences is a “structure,” a metaphysical essence, or a historical convention is beside the point. The major insight here is that the common ground between traditional literary criticism and the iterative, computational analysis of texts is that both study “that which survives translation.” There is no better or more precise description of our shared object of study.

  • The very strange language of A Midsummer Night’s Dream

    I just got back from a fun and very educative trip to Shakespeare’s Globe in London, hosted by Dr Farah Karim-Cooper, who is director of research there.

    The Globe stages an annual production aimed at schools (45,000 free tickets have been distributed over the past five years), and this year’s play is A Midsummer Night’s Dream. I was invited down to discuss the language of the play with the cast and crew as they begin rehearsals.

    This was a fascinating opportunity for me to test our visualisation tools and analysis on a non-academic audience – and the discussions I had with the actors opened my eyes to applications of the tools we haven’t considered before. They also came up with a series of sharp observations about the language of the play in response to the linguistic analysis.

    I began with a tool developed by Martin Mueller’s team at Northwestern University: Wordhoard, as a way of getting a quick overview of the lexical patterns in the play, and introducing people to thinking statistically about language.

    Here’s the wordcloud Wordhoard generates for a loglikelihood analysis of MSND compared with the whole Shakespeare corpus:

     


    Loglikelihood takes the frequencies of words in one text (in this case MSND) and compares them with the frequencies of words in a comparison, or reference, sample (in this case, the whole Shakespeare corpus). It identifies the words that are used significantly more or less frequently in the analysis text than would be expected given the frequencies found in the comparison sample. In the wordcloud, the size of a word indicates how strongly its frequency departs from the expected. Words in black appear more frequently than we would expect, and words in grey appear less frequently.
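    As a rough illustration of the calculation just described, the sketch below computes a log-likelihood (G²) score for a single word using a common two-term form of the statistic. Whether WordHoard uses exactly this variant is an assumption, and the function name and all counts are invented for the example rather than taken from the play.

```python
# Two-term log-likelihood (G2) for one word: a sketch under the assumption
# that this common form is close to what the tool computes.
import math


def log_likelihood(count_in_text, text_size, count_in_ref, ref_size):
    """Return (G2 score, True if the word is relatively more frequent in the text)."""
    total_count = count_in_text + count_in_ref
    total_size = text_size + ref_size
    # Expected counts if the word were used at the same rate in both samples.
    expected_text = text_size * total_count / total_size
    expected_ref = ref_size * total_count / total_size
    g2 = 0.0
    for observed, expected in (
        (count_in_text, expected_text),
        (count_in_ref, expected_ref),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    g2 *= 2
    overused = count_in_text / text_size > count_in_ref / ref_size
    return g2, overused


# Hypothetical counts: a word seen 30 times in a 1,000-word text and
# 100 times in a 10,000-word reference sample.
score, overused = log_likelihood(30, 1000, 100, 10000)
print(round(score, 2), overused)
```

    The score only measures how far the observed counts depart from the expected ones; the separate flag records the direction, which corresponds to the black versus grey words in the wordcloud.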

    As is generally the case with loglikelihood tests, the words showing the most powerful effects here are nouns associated with significant plot elements: ‘fairy’, ‘wall’, ‘moon’, ‘lion’ etc. If you’ve read the play, it is not hard to explain why these words are used in MSND more than in the rest of Shakespeare – and you really don’t need a computer, or complex statistics, to tell you that. To paraphrase Basil Fawlty, so far, so bleeding obvious.

    Where loglikelihood results normally get more interesting – or puzzling – is in results for function words (pronouns, auxiliary verbs, prepositions, conjunctions) and in those words that are significantly less frequent than you’d expect.

    Here we can see some surprising results: why does Shakespeare use ‘through’ far more frequently in this play than elsewhere? Why are the masculine pronouns ‘he’ and ‘his’ used less frequently? (And is this linked to the low use of ‘lord’?) Why is ‘it’ rare in the play? And ‘they’ and ‘who’ and ‘of’?

    At this stage we started to look at our results from Docuscope for the play, visualised using Anupam Basu’s LATtice.

     

     

    The heatmap shows all of the folio plays compared to each other: the darker a square is, the more similar the plays are linguistically. The diagonal of black squares running from bottom left to top right marks the points in the map where plays are ‘compared’ to themselves: the black indicates identity. Plays are arranged up the left hand side of the square in ascending chronological order from Comedy of Errors at the bottom to Henry VIII at the top – the sequence then repeats across the top from left to right – so the black square at the bottom left is Comedy of Errors compared to itself, while the black square at the top right is Henry VIII.
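    The data behind a heatmap of this kind is a square matrix of pairwise distances between texts. The sketch below builds one from made-up three-feature profiles using Euclidean distance. The choice of metric, the function names, and the numbers are all assumptions for illustration, since the post does not say how LATtice measures similarity.

```python
# Pairwise distance matrix for a similarity heatmap. Euclidean distance is an
# assumed metric; the profiles are invented and far smaller than real ones.
import math


def euclidean(u, v):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def distance_matrix(profiles):
    """Return (names, matrix) with matrix[i][j] the distance between texts i and j."""
    names = list(profiles)
    matrix = [[euclidean(profiles[r], profiles[c]) for c in names] for r in names]
    return names, matrix


toy_profiles = {
    "Play A": [5.0, 3.0, 1.0],
    "Play B": [4.5, 2.8, 1.2],
    "Play C": [1.0, 0.5, 4.0],
}

names, matrix = distance_matrix(toy_profiles)
for name, row in zip(names, matrix):
    print(name, [round(d, 2) for d in row])
```

    In such a matrix every text is at distance zero from itself, which corresponds to the diagonal of black squares, and a text that is unlike most others produces a row and column of large values, which would show up as a light line across the map.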

    One of the first things we noticed when Anupam produced this heatmap was the two plays which stand out as being unlike almost all of the others, producing four distinct light lines which divide the square of the map almost into nine equal smaller squares:

     

    These two anomalous plays are Merry Wives of Windsor (here outlined in blue) and A Midsummer Night’s Dream (yellow). It is not so surprising to find Wives standing out, given the frequent critical observation that this play is generically and linguistically unusual for Shakespeare: but A Midsummer Night’s Dream is a result we certainly would not have predicted.

    This visualisation of difference certainly caught the actors’ attention, and they immediately focussed in on the very white square about 2/3 of the way along the MSND line (here picked out in yellow):

     

    So which play is MSND even less like than all of the others? A tragedy? A history? Again, the answer is not one we’d have guessed: Measure for Measure.

    This is a good example of how a visualisation can alert you to a surprising finding. We would never have intuited that MSND was anomalous linguistically without this heatmap. It is also a good example of how visualisations should send you back to the data: we now need to investigate the language of MSND to explain what it is that Shakespeare does, or does not do, in this play that makes it stand out so clearly. The visualisation is striking – and it allowed the cast members to identify an interesting problem very quickly – but the visualisation doesn’t give us an explanation for the result. For that we need to dig a bit deeper.

    One of the most useful features of LATtice is the bottom right window, which identifies the LATs that account for the most distance between two texts:

     

    This is a very quick way of finding out what is going on – and here the results point us to two LATs which are much more frequent in MSND than Measure for Measure: SenseObject and SenseProperty. SenseObject picks up concrete nouns, while SenseProperty codes for adjectives describing their properties. A quick trip to the LATtice box plot screen (on the left of these windows):

     

    confirms that MSND (red dots) is right at the top end of the Shakespeare canon for these LATs (another surprise, since we’ve got used to thinking of these LATs as characteristic of History), while Measure for Measure (blue dots) has the lowest rates in Shakespeare for these LATs.
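    One simple way to find the LATs that account for the most distance between two texts is to rank each LAT by its squared difference, which is that LAT’s share of a squared Euclidean distance. The sketch below does this for two invented profiles. The ranking rule, the function name, and the values are assumptions, not a description of how LATtice itself computes its list.

```python
# Rank features by their contribution to the squared Euclidean distance
# between two texts. The rates below are made up for illustration.
def top_contributors(profile_a, profile_b, n=2):
    """Return the n (LAT, difference) pairs with the largest squared difference."""
    diffs = {lat: profile_a[lat] - profile_b[lat] for lat in profile_a}
    ranked = sorted(diffs.items(), key=lambda item: item[1] ** 2, reverse=True)
    return ranked[:n]


# Hypothetical relative frequencies for two texts.
text_a = {"SenseObject": 5.0, "SenseProperty": 3.0, "DirectAddress": 1.0, "Question": 0.5}
text_b = {"SenseObject": 2.0, "SenseProperty": 1.0, "DirectAddress": 2.0, "Question": 1.0}

top = top_contributors(text_a, text_b)
for lat, diff in top:
    direction = "higher in A" if diff > 0 else "higher in B"
    print(lat, round(diff, 2), direction)
```

    Keeping the sign of each difference alongside the ranking shows not only which LATs separate the two texts but also which text has more of each.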

    So Docuscope findings suggest that MSND is a play concerned with concrete objects and their descriptions – another counter-intuitive finding given the associations most of us have with the supposed ethereal, fairy, dream-like atmosphere of the play. Cast members were fascinated by this and its possible implications for how they should use props – and someone also pointed out that many of the names in the play are concrete nouns (Quince, Bottom, Flute, Snout, Peaseblossom, Cobweb, Mote and so on) – what is the effect on the audience of this constant linguistic wash of ‘things’?

    Here is a screenshot from Docuscope with SenseObject and SenseProperty tokens underlined in yellow. Reading these tokens in context, you realise that many of these concrete objects and qualities, in this section at least, are fictional in the world of the play. A wall is evoked – but it is one in a play, represented by a man. Despite the frequency of SenseObject in this play, we should be wary of assuming that this implies the straightforward evocation of a concrete reality (try clicking if you need to enlarge):

     

    Also raised in MSND are LATs to do with locating and describing space: Motions and SpaceRelations (as suggested by our loglikelihood finding for ‘through’?). So accompanying a focus on things is a focus on describing location and movement – perhaps, someone suggested, because the characters are often so unsure of their location? (In the following screenshot, Motions and SpatialRelation tokens are underlined in yellow.)

     

     

    Moving on, we also looked at those LATs that are relatively absent from MSND – and here the findings were very interesting indeed. We have seen that MSND does not pattern like a comedy – and the main reason for this is that it lacks the highly interactive language we expect in Shakespearean comedy: DirectAddress and Question are lowered. So too are PersonPronoun (which picks up third person pronouns, and matches our loglikelihood finding for ‘he’ and ‘his’), and FirstPerson – indeed, all types of pronoun are less frequent in the play than is normal for Shakespeare. At this point one of the actors suggested that the lack of pronouns might be because full names are used constantly – she’d noticed in rehearsal how often she was using characters’ names – and we wondered if this was because the play’s characters are so frequently uncertain of their own, and others’ identity.

    Also lowered in the play is PersonProperty, the LAT which picks up familial roles (‘father’, ‘mother’, ‘sister’ etc) and social ones (job titles) – if you add this to the lowered rate of pronouns, then a rather strange social world starts to emerge, one lacking the normal points of orientation (and the play is also low on CommonAuthority, which picks up appeals to external structures of social authority – the law, God, and so on).

    The visualisation, and Docuscope screens, provoked a discussion I found fascinating: we agreed that the action of the play seems to exist in an eternal present. There seems to be little sense of future or past (appropriately for a dream) – and this ties in with the relative absence of LATs coding for past tense and looking back. As the LATtice heatmap first indicated, MSND is unlike any of the recognised Shakespearean genres – but digging into the data shows that it is unlike them in different ways:

    • It is unlike comedy in its lack of features associated with verbal interaction
    • It is unlike tragedy in its lack of first person forms (though it is perhaps more like tragedy than any other genre)
    • It is unlike history in its lack of CommonAuthority

    Waiting for my train back to Glasgow (at the excellent Euston Tap bar near Euston Station), I tried to summarize our findings in four tweets (read them from the bottom, up!):

    I’ll try to keep in touch with the actors as they rehearse the play – this was a lesson for me in using the tools to spark an investigation into Shakespeare’s language, and I can now see that we could adapt these tools to various educational settings (including schools and rehearsal rooms!).

    Jonathan Hope February 2012

  • Visualizing Linguistic Variation with LATtice

    The transformation of literary texts into “data” – frequency counts, probability distributions, vectors – can often seem reductive to scholars trained to read closely, with an eye on the subtleties and slipperiness of language. But digital analysis, in its massive scale and its sheer inhuman capacity for repetitive computation, can register complex patterns and nuances that might be beyond even the most perceptive and industrious human reader. To detect and interpret these patterns, to tease them out from the quagmire of numbers without sacrificing the range and the richness of the data that a text analysis tool might accumulate, can be a challenging task. A program like DocuScope can easily take an entire corpus of texts and sort every word and phrase into groups of rhetorical features. It produces a set of numbers for each text in the corpus, representing the relative frequency counts for 101 “language action types” or LATs. Taken together, the numbers form a 101-dimensional vector that represents the rhetorical richness of a text, a literary “genetic signature” as it were.
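
    To make the idea of a per-text vector concrete, here is a minimal sketch in Python. It is not DocuScope’s own code: the function name, the tiny tagged sample, and the choice to divide by the total number of tokens are all assumptions made for illustration, and only three LAT names are used instead of the full 101.

```python
from collections import Counter

def relative_frequencies(tagged_tokens, lat_names):
    """Turn (token, LAT) pairs into a fixed-order vector of per-LAT shares.

    Assumption: the share is the LAT's count divided by the total number of
    tokens. DocuScope's actual normalisation may differ.
    """
    counts = Counter(lat for _, lat in tagged_tokens if lat is not None)
    total = len(tagged_tokens)
    return [counts[lat] / total for lat in lat_names]

# Hypothetical tagging of a four-token snippet (None = no LAT assigned).
tagged = [
    ("I", "FirstPerson"),
    ("saw", None),
    ("the", "SenseObject"),
    ("wall", "SenseObject"),
]
lat_order = ["FirstPerson", "SenseObject", "Question"]

vector = relative_frequencies(tagged, lat_order)
print(vector)
```

    Keeping the LATs in one fixed order matters: it is what lets the vectors for two different texts be compared position by position.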

    Once we have this data, however, how can we use it to compare texts, to explore how they are similar and how they differ? How can we return from the complex yet “opaque” collection of floating point numbers to the linguistic richness of the texts they represent? I wrote a program called LATtice that lets us explore and compare texts across entire corpora but also allows us to “drill down” to the level of individual LATs to ask exactly what rhetorical categories make texts similar or different. To visualize relations between texts or genres, we have to find ways to reduce the dimensionality of the vectors, to represent the entire “gene” of the text within a logical space in relation to other texts. But it is precisely this high dimensionality that accounts for the richness of the data that DocuScope produces, so it is important to be able to preserve it and to make comparisons at the level of individual LATs.

    LATtice addresses this problem by producing multiple visualizations in tandem to allow us to explore the same underlying LAT data from as many perspectives and in as much detail as possible. It reads a data file from DocuScope and draws up a grid, or heatmap, representing “similarity” or “difference” between texts. The heatmap is based on the Euclidean distance between vectors and uses a color coding scheme in which darker shades represent texts that are “closer” or more similar and lighter shades represent texts that are further apart or less similar according to DocuScope’s LAT counts. If there are N texts in the corpus, LATtice draws up an “N x N” grid where the distance of each text from every other text is represented. Of course, this table is symmetrical around the diagonal and the diagonal itself represents the intersection of each text with itself (a text is perfectly similar to itself, so the bottom right “difference” panel shows no bars for these cases). Moving the mouse around the heatmap allows one to quickly explore the LAT distribution for each text-pair.
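
    The grid described above can be sketched in a few lines of Python. This is an illustration only, not LATtice’s source: the text names and numbers are invented, and the vectors are cut down to four values rather than 101.

```python
import math

# Hypothetical relative-frequency vectors, truncated to four LATs for
# illustration (real DocuScope output would have 101 values per text).
lat_vectors = {
    "Text A": [1.2, 0.4, 3.1, 0.0],
    "Text B": [1.0, 0.5, 2.9, 0.2],
    "Text C": [0.1, 2.2, 0.3, 1.9],
}

def distance_matrix(vectors):
    """Return (names, matrix), where matrix[i][j] is the Euclidean distance
    between the LAT vectors of text i and text j."""
    names = list(vectors)
    matrix = [[math.dist(vectors[a], vectors[b]) for b in names] for a in names]
    return names, matrix

names, matrix = distance_matrix(lat_vectors)
for name, row in zip(names, matrix):
    # Smaller numbers would be drawn as darker (more similar) cells.
    print(name, [round(d, 2) for d in row])
```

    The zeros on the diagonal and the mirror symmetry of the table fall out of the distance measure itself, which is why only half of the grid carries new information.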

    Screenshot of LATtice

    While the main grid can reveal interesting relationships between texts, it hides the underlying factors that account for differences or similarities, the linguistic richness that DocuScope counts and categorizes so meticulously. However, LATtice provides multiple, overlapping visualizations to help us explore the relationship between any two texts in the corpus at the level of individual LATs. Any text-pair on the grid can be “locked” by clicking on it, allowing the user to move to the LATs to explore them in more detail. The top right panel shows how LATs from both the texts relate to each other. In this histogram, the text on the X axis of the heatmap is represented in red and the one on the Y axis in blue, for side by side comparison. All the other panels follow this red-blue color coding for the text-pair. The bottom panel displays only the LATs whose counts are most dissimilar. These are the LATs we will want to focus on in most cases, as they account most for the “difference” between the texts in DocuScope’s analysis. A red bar in this panel signifies that, for this LAT, the text on the X axis (our ‘red’ text) had a higher relative frequency count, while a blue bar signals that the Y axis text (our ‘blue’ text) had the higher count. This panel lets us quickly explore exactly which aspects the texts differ on. Finally, LATtice also produces a scatterplot as a very quick way of looking at “similarity” between texts. It plots LAT readings of the two texts against each other and color codes the dots to indicate which text has a higher relative frequency for a particular LAT (grey dots indicate that both texts have the same value for that LAT). The “spread” of dots gives a rough indication of difference or similarity between texts: a larger spread indicates dissimilar texts and dots clustering around the diagonal indicate very similar texts.
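
    The logic of the “difference” panel and its red, blue and grey coding can be sketched as follows. Again, this is a hedged illustration rather than LATtice’s actual code: the function name and the frequencies are made up, and only four LATs are shown.

```python
def most_dissimilar_lats(x_text, y_text, top_n=3):
    """Rank LATs by the absolute difference in relative frequency.

    Each result is (lat_name, signed_difference, colour): 'red' when the
    X-axis text is higher, 'blue' when the Y-axis text is higher, and
    'grey' when the two values are equal.
    """
    rows = []
    for lat in x_text:
        diff = x_text[lat] - y_text[lat]
        colour = "red" if diff > 0 else "blue" if diff < 0 else "grey"
        rows.append((lat, diff, colour))
    rows.sort(key=lambda row: abs(row[1]), reverse=True)
    return rows[:top_n]

# Hypothetical relative frequencies for a locked text-pair.
x_axis_text = {"SenseObject": 3.0, "FirstPerson": 1.0, "DirectAddress": 0.5, "Question": 0.75}
y_axis_text = {"SenseObject": 1.0, "FirstPerson": 2.5, "DirectAddress": 0.5, "Question": 1.0}

top = most_dissimilar_lats(x_axis_text, y_axis_text)
for lat, diff, colour in top:
    print(f"{lat}: {diff:+.2f} ({colour})")
```

    Sorting by the absolute difference is one simple way to decide which LATs count as “most dissimilar”; a real tool might instead scale each difference by how much that LAT varies across the whole corpus.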

    You can try LATtice out with two sample data-sets by clicking on the links below. The first is drawn from the plays of Shakespeare, which are in this case arranged in rough chronological order. As Hope and Witmore’s work has demonstrated, the possibilities opened up by applying DocuScope to the Shakespeare corpus are rich, and hopefully exploring the relationship between individual plays on the grid will produce new insights and new lines of inquiry. The second data-set is experimental – it tries to use DocuScope not to compare multiple texts but to explore a single text – Milton’s Paradise Lost – in detail. It might give us insights about how digital techniques can be applied on smaller scales with well-curated texts to complement literary close-reading. The poem was divided into sections based on the speakers (God, Satan, Angels, Devils, Adam, Eve) and the places being described (Heaven, Hell, Paradise). These chunks were then divided into roughly three hundred line sections. As an example, we might notice straightaway that speakers and place descriptions seem to have very distinct characteristics. Speeches are broadly similar to each other, as are place descriptions. This is not unexpected, but what accounts for these similarities and differences? Exploring the LATs helps us approach this question with a finer lens. Paradise, for example, is full of “sense objects” while Godly and angelic speech does not refer to them as often. Does Adam refer to “authority” more when he speaks to Eve? Does Satan’s defiance leave a linguistic trace that distinguishes him from unfallen angels? Hopefully LATtice will help us explore and answer such questions and let us bring DocuScope’s data closer to the nuances of literary reading.

    Finally, a few technical notes: The above links should load LATtice with the appropriate data-sets. Of course, you will need to have Java installed on your machine and to have applets enabled in your browser. You can also download LATtice and the sample data-sets, along with detailed instructions, as stand-alone applications for the following platforms:

    There are a few advantages to doing this. First, the standalone version offers an additional visualization panel which represents the distribution of LATs as box-and-whisker plots and shows where the text-pair’s frequency counts stand relative to the rest of the corpus. Secondly, the standalone application can make use of the entire screen, which can be a great advantage for larger and higher resolution monitors.

  • Comic Twelfth Night, Tragic Othello (Part 2)

    Here is a second comic exchange from Twelfth Night. Maria’s plan has worked wonderfully. Malvolio has arrived cross-gartered and is quoting to Olivia little bits of the love letter he believes she has written to him. The blue and red strings, First Person and Interaction, are again appearing fast and thick as the incomprehension builds. As in the previous passage, which dealt with Cesario’s resistance of Olivia, we have a resistant “you” here who keeps the game going. (Had she succumbed, dismissing Maria to go practice her penmanship, the dialogue would look very different: first and second person singular pronouns would most likely disappear.)

    OSSComedy2TN

    DSComedy2TN

    A few things worth noting about the coding in this passage. Docuscope is ignoring the single quotation marks from the Moby Shakespeare. It does not matter that these words are being “mentioned” rather than “used” in the Austinian sense: all “sightings” by Docuscope occur in a kind of weird citational indicative: there is no way for the machine to catch the fact that the speaker, Malvolio, is not really telling Olivia “Go to, thou art made.” This is a flat earth in the rhetorical sense: no ironic depth can be perceived when every item is tagged because it occurs, not because its use in a certain context means a certain thing. One should not be misled about Docuscope’s powers of interpretation here.

    Switching analogies, we might say that – like a Spinozan deity – Docuscope contemplates words from the perspective of eternity: it does not itself follow events from the standpoint of a moving present against which it measures temporally marked events as they arrive and withdraw through time. (Docuscope does not engage in phenomenological protention or retention in the Husserlian sense.) Nor does it situate events in space in any perspectivally located way. The history of what happens in the world of the play, if we were to think of it that way, is a history of “mentioned happenings.” No one does anything; rather, words are mentioned, and Docuscope keeps track of which kinds of words are used (but never how).

    Another interesting feature of the passage. Malvolio really doesn’t say anything directly to Olivia in this passage: he is talking past Maria, and is reciting to Olivia what he believes she actually wants to say to him. This sort of indirection, when it is not a group effort, also seems to be contributing to the proliferation of Interaction and First Person strings: the “how,” “what,” “what” paired with the “you” “thou” “thou.” We would expect to find a lot of passages like this in other plays that have disguise and supposition, most of all in Comedy of Errors. I suspect that in the future I will be able to put my finger on a number of passages which parallel this one in terms of their performance on the comedy factor that Docuscope found for the full plays.

    A final observation. Here and elsewhere in the play, Malvolio is often the one who supplies the Description strings, which as I have mentioned below, this play lacks in comparison with other plays (just as it has more, on average, Interaction and First Person). Is there anything about this passage that shows us why one cannot put one’s weight on both sides of this equation – Description on the one hand, First Person/Interaction on the other – in a single play or passage? Is there something about the comic posture, linguistically, that prevents such combinations? Malvolio and Feste are the two characters in the play who use the most Description strings, and during the fabulous speech in which Malvolio fantasizes about being married to Olivia while Toby and Maria look on, the linguistic texture of the scene is that of a History play. But as principal component analysis tells us, such moments of “historical” writing – oversimplified as the definition is – may occur occasionally in Comedy, but they will not occur repeatedly. Malvolio can only give so many such monologues, and Feste can only produce his rich, descriptive banter for so long.

    But isn’t it important that there is a “dash” of Description in the play, indeed, in this passage? One issue that we need to explore as we think about what it means to find “a lot” of something in a particular type of play is what it also means to find “a little” of something. Is there a sense in which things that occur in small amounts are important as well, and if so, how should we think about those “dashes” of a certain type of word?

  • Comic Twelfth Night, Tragic Othello? (Part I)

    Twelfth Night is one of the classic Shakespearean comedies and so it is unsurprising that it appears in the Comedy quadrant that we obtained in our initial analysis. What is it about the language in this play that pushes it toward this quadrant, and would we recognize this comic “itness” if we saw it in the form of an exemplary passage? That is the first question I’ll be looking at in the next series of posts, entitled “Comic Twelfth Night, Tragic Othello?” But there is another, more interesting question to ask, given the results we have obtained: why does Othello look to Docuscope like a comedy? Literary critics such as Susan Snyder and Stephen Orgel have noted genealogical links to comedy in this “high tragedy,” so it is particularly intriguing to find unsupervised statistical analysis of the language coming to a similar conclusion. I will try to provide more than one exemplary passage in this series of posts, since these tend to be where the analysis gets interesting (or not).

    So, Twelfth Night. In terms of plot, it has three interesting devices — a set of identical twins,  a shipwreck, and a disguise, all of which introduce a high degree of unintentional confusion into the action, driving it forward. In a plot that is driven on by accident and what you might call “congruent misunderstanding” (when two people don’t realize that they are speaking at cross-purposes), you expect to find a lot of back and forth between characters as they synch-up their erroneous suppositions (which is funny in and of itself), then more back and forth as they backtrack in order to rehearse why they didn’t understand what was going on when they were so deeply engaged with one another. I haven’t yet looked at the color coded play as I write this, but I expect to find the comic strings at the end, where the confusion is being unravelled, and in scenes of comic abuse (which I know from experience involves a lot of “I”/”thou” exchange characteristic of comedy). The exemplars are below, one from Open Source Shakespeare, the other a screen shot of the same passage as tagged by Docuscope:

    OSSComedy1TN3-1

    DSComedy1TN

    The first thing I notice about this exchange is that it involves an extended miscommunication, culminating in the wonderful line “I am not what I am.” The doubled first person is emblematic of the doubling of Viola’s person in Cesario (or in Olivia’s apprehension of Viola as Cesario). The underlined red passages refer to the Docuscope category First Person, which as we remember from the component loadings is high in all of the items on the upper half of the scatterplot. The other type of strings that push plays upward are those underlined in blue, which are coded in Docuscope under the category of Interactions. First Person is fairly self-explanatory here (look at the red items), but Interaction is worth pausing at. Notice first that question marks are being tagged here: a piece of punctuation and so not definitively Shakespearean. Maybe it matters that something that could have been added by a compositor is at work in this category, maybe it doesn’t. I don’t think question marks are as open to interpretation, grammatically, as say a comma or semicolon, but this is something for my colleague Jonathan to weigh in on. We see lots of “thee” and “thou” under Interaction, and these words seem to be the mainstay of comedy as a whole from what I’ve seen. “Thee,” “thou,” “thine,” “you,” and “your” are some of the most common words in the Shakespearean corpus that Docuscope tags, so we can be fairly sure that when we find Interaction coming up as a relevant loading in a component, it is words such as these that are driving the underlying pattern.

    Red and blue strings are pushing mostly comic plays up toward the top of the scatterplot. Yellow strings will push plays to the right, which means that the comedies clustering in the upper left exhibit a lack of yellow or Descriptive strings. The entire component that characterizes Comedy, then, is one in which First Person and Interaction strings are both elevated above the mean score of all plays, while Descriptive strings are (simultaneously) below the mean. Perhaps there is a reason that a linguist could provide that would explain this pattern as a general feature of the language. That is, someone might be able to show that our language is something that can only “bend” in certain ways, making it quite difficult to use a lot of concrete descriptive nouns and words describing motion or changes in states of objects while simultaneously juggling lots of I/you, my/your strings. But this would not be enough of an explanation for me. We need to say why this type of language pattern (whether or not it is constrained by limits in our grammar, cognition, or underlying semantic maps) coincides with genre classifications made by discriminating humans (Heminges and Condell, Shakespeare’s editors).
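
    One way to picture what “elevated above the mean” and “below the mean” amount to is the rough Python sketch below. It is a simplified stand-in, not the principal component analysis itself: the plays, the numbers, and the plus and minus signs are all assumptions chosen to mirror the pattern just described, whereas real component loadings would be estimated from the data.

```python
from statistics import mean, pstdev

# Hypothetical relative frequencies (made-up numbers, not Docuscope output).
plays = {
    "Play 1": {"FirstPerson": 5.0, "Interaction": 4.0, "Description": 1.0},
    "Play 2": {"FirstPerson": 3.0, "Interaction": 2.0, "Description": 3.0},
    "Play 3": {"FirstPerson": 1.0, "Interaction": 0.0, "Description": 5.0},
}

# Assumed signs only: +1 for categories described as elevated in comedy,
# -1 for the category described as depressed.
signs = {"FirstPerson": 1, "Interaction": 1, "Description": -1}

def z_scores(plays, category):
    """Standardise one category against the mean and spread of all plays."""
    values = [play[category] for play in plays.values()]
    mu, sigma = mean(values), pstdev(values)
    return {name: (play[category] - mu) / sigma for name, play in plays.items()}

def component_scores(plays, signs):
    """Add up the signed, standardised categories for each play."""
    scores = {name: 0.0 for name in plays}
    for category, sign in signs.items():
        for name, z in z_scores(plays, category).items():
            scores[name] += sign * z
    return scores

scores = component_scores(plays, signs)
for name, score in scores.items():
    print(name, round(score, 2))
```

    A play scores high here only when it is above the corpus mean on First Person and Interaction and below it on Description at the same time, which is the sense of “simultaneously” in the paragraph above.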

    Returning to the passage above, I would point out two things. First, the quick trading of I/you, my/your strings in comic dialogue suggests a world in which predicates are being attached to subjects from two and only two points of view. This is not a universe of one, nor is it a crowd. It is not surprising that comic plotting, built as it is on sexual pairings, would favor this type of bivalent, perspectival tagging of action by speakers. But there is something else going on here. Olivia is trying to make something happen. She says, “do not extort thy reasons from this clause,” and earlier, “I would you were as I would have you be.” The “thy” and “you” here are important because the speaker is trying to create or assert a particular interpretation of how these two individuals relate to one another (and the words traded between them). The essential drama in this situation is the asymmetry of desire that obtains between the two characters, an asymmetry that keeps Viola from assenting to Olivia’s advances. That resistance is actually what forces Olivia to make these statements that are rich with I/you, me/my, since she is using these words as anchors for a broader interpretation that does not yet obtain. She really wants to say we. And Cesario doesn’t, so they remain in I/you dialogue.

    So we could offer a preliminary hypothesis here. Shakespeare writes comedies in which characters, sometimes quite perversely, find the wrong way to the ones they love. Often it is chance or an onstage helper who sorts this out. Shakespeare is actually quite reserved when it comes to showing love as naturally progressing through its obstacles unassisted. But given that, in the initial stages of courtship, Shakespearean lovers almost never meet and join in a perfectly symmetrical way (they don’t start out as stones set in an arch, leaning perfectly on a keystone), we should expect this asymmetry to show itself in the language. Where does it show up? When a resistant individual, a “you,” prevents another “I” from arriving at an interpretation of their relationship that can be referred to as a “we” before others. Let’s call this the “resistant you” hypothesis. We can perhaps test it in the next passage, and in the passages we encounter from Othello.