What did Stanley Fish count, and when did he start counting it?

We have been observing the reaction to Stanley Fish’s critique of the Digital Humanities with great interest. Here is the full text of our comment, which could only be partially displayed in the New York Times comment window.

You know you’ve come up in the world if you’re being needled by Stanley Fish in The New York Times. Having done our share of work in the data mines, we believe Fish is right to insist that nothing in a text becomes evidence unless you have an interpretation which makes that evidence count. No amount of digital tabulation will substitute for a coherent, defensible reading.

As traditionally trained humanities scholars who use computers to study Shakespeare’s genres, we have pointed out repeatedly that nothing in literary studies will be settled by an algorithm or visualization, however seductively colorful. We have also argued that any pattern found through an iterative, computer-assisted analysis is meaningless without a larger interpretive framework in which to view it. It is the job of literary critics and historians to provide those interpretations, something they do by returning to the text and re-reading it with fresh eyes.

The job of digital tools is to draw our attention to evidence impossible or hard to see during normal reading, prompting us to ask new questions about our texts. This ability to redirect attention and pose new questions is the strong suit of certain kinds of digital humanities research. Indeed, we believe the addition of a digital prosthetic to our insistently human reading complements the skills of close textual analysis that are the staple of literary training. Not everyone in the so-called Digital Humanities community would agree with this position, but we believe the old and new techniques are entirely compatible.

What does it matter why Stanley Fish started minding his ps and bs in Milton? The point is that he has produced a plausible interpretation of Milton’s work based on evidence that fits his larger claim. The fact that an algorithm (“count ps and bs”) has directed his attention to something he hadn’t noticed doesn’t make the resulting pattern gibberish. You bet there are interesting patterns that show up in Milton when you mind his ps and bs. They existed before you counted them, and they exist after. However he found it, Fish has used that patterning to produce an interesting argument about the role of sound in Milton’s prose. And he has the evidence to back this argument up. In the end, he’s doing what most literary critics do in their work: create an interpretation that builds meaningfully on evidence in the text. Is there really any other way?

Yours sincerely,

Jonathan Hope, Strathclyde University

Michael Witmore, Folger Shakespeare Library

You can view a sample of our work here.

Posted in Quant Theory | 1 Comment

Visualizing Linguistic Variation with LATtice

The transformation of literary texts into “data” – frequency counts, probability distributions, vectors – can often seem reductive to scholars trained to read closely, with an eye on the subtleties and slipperiness of language. But digital analysis, in its massive scale and its sheer inhuman capacity for repetitive computation, can register complex patterns and nuances that might be beyond even the most perceptive and industrious human reader. To detect and interpret these patterns, to tease them out from the quagmire of numbers without sacrificing the range and the richness of the data that a text analysis tool might accumulate, can be a challenging task. A program like DocuScope can easily take an entire corpus of texts and sort every word and phrase into groups of rhetorical features. It produces a set of numbers for each text in the corpus, representing the relative frequency counts for 101 “language action types” or LATs. Taken together, the numbers form a 101-dimensional vector that represents the rhetorical richness of a text, a literary “genetic signature” as it were.

Once we have this data, however, how can we use it to compare texts, to explore how they are similar and how they differ? How can we return from the complex yet “opaque” collection of floating point numbers to the linguistic richness of the texts they represent? I wrote a program called LATtice that lets us explore and compare texts across entire corpora but also allows us to “drill down” to the level of individual LATs to ask exactly what rhetorical categories make texts similar or different. To visualize relations between texts or genres, we have to find ways to reduce the dimensionality of the vectors, to represent the entire “gene” of the text within a logical space in relation to other texts. But it is precisely this high dimensionality that accounts for the richness of the data that DocuScope produces, so it is important to be able to preserve it and to make comparisons at the level of individual LATs.

LATtice addresses this problem by producing multiple visualizations in tandem to allow us to explore the same underlying LAT data from as many perspectives and in as much detail as possible. It reads a data file from DocuScope and draws up a grid, or a heatmap representing “similarity” or “difference” between texts. The heatmap, based on the Euclidean distance between vectors, is drawn up based on a color coding scheme where darker shades represent texts that are “closer” or more similar and lighter shades represent texts further apart or less similar according to DocuScope’s LAT counts. If there are N texts in the corpus, LATtice draws up an “N x N” grid where the distance of each text from every other text is represented. Of course, this table is symmetrical around the diagonal and the diagonal itself represents the intersection of each text with itself (a text is perfectly similar to itself, so the bottom right “difference” panel shows no bars for these cases). Moving the mouse around the heatmap allows one to quickly explore the LAT distribution for each text-pair.
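For readers who want to see the arithmetic behind such a grid, here is a minimal sketch in Python. It is not LATtice's actual code: the function names, the play labels, and the four-number toy vectors are all assumptions made for illustration (real DocuScope vectors have 101 entries).

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def distance_matrix(vectors):
    """Return the text names and the N x N grid of pairwise distances."""
    names = list(vectors)
    grid = [[euclidean(vectors[a], vectors[b]) for b in names] for a in names]
    return names, grid

# Hypothetical toy data: three texts, four LATs each (not real DocuScope output).
toy = {
    "Play A": [1.0, 0.0, 2.0, 0.5],
    "Play B": [1.0, 0.0, 2.0, 0.5],
    "Play C": [4.0, 4.0, 2.0, 0.5],
}

names, grid = distance_matrix(toy)
for name, row in zip(names, grid):
    print(name, [round(d, 2) for d in row])
```

The diagonal is zero because every text sits at distance zero from itself, and the grid is symmetrical because the distance from A to C equals the distance from C to A. A heatmap is then just a matter of mapping small distances to dark shades and large ones to light shades.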

Screenshot of LATtice

While the main grid can reveal interesting relationships between texts, it hides the underlying factors that account for differences or similarities, the linguistic richness that DocuScope counts and categorizes so meticulously. However, LATtice provides multiple, overlapping visualizations to help us explore the relationship between any two texts in the corpus at the level of individual LATs. Any text-pair on the grid can be “locked” by clicking on it, allowing the user to move to the LATs to explore them in more detail. The top right panel shows how LATs from both the texts relate to each other. The text on the X axis of the heatmap is represented in red and the one on the Y axis is represented in blue in the histogram for side by side comparison. All the other panels follow this red-blue color coding for the text-pair. The bottom panel displays only the LATs whose counts are most dissimilar. These are the LATs we will want to focus on in most cases as they account most for the “difference” between the texts in DocuScope’s analysis. If a bar in this panel is red it signifies that for this LAT, the text on the X axis (our ‘red’ text) had a higher relative frequency count while a blue bar signals that the Y axis text (our ‘blue’ text) had a higher count for a particular LAT. This panel lets us quickly explore exactly in what respects texts differ from each other. Finally, LATtice also produces a scatterplot as a very quick way of looking at “similarity” between texts. It plots LAT readings of the two texts against each other and color codes the dots to indicate which text has a higher relative frequency for a particular LAT (grey dots indicate that both texts have the same value for that LAT). The “spread” of dots gives a rough indication of difference or similarity between texts: a larger spread indicates dissimilar texts and dots clustering around the diagonal indicate very similar texts.
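The logic of the “difference” panel can be sketched in a few lines of Python. This is an illustration only, not LATtice's implementation: the function name, the cut-off `k`, and the toy frequencies are assumptions, and the LAT labels are borrowed loosely from DocuScope's vocabulary.

```python
def most_dissimilar_lats(x_text, y_text, lat_names, k=3):
    """Rank LATs by absolute difference and color them by which text is higher.

    'red' means the X-axis text has the higher relative frequency,
    'blue' means the Y-axis text does. LATs with equal values are skipped.
    """
    diffs = []
    for name, x, y in zip(lat_names, x_text, y_text):
        if x == y:
            continue  # no bar is drawn when the two texts agree
        color = "red" if x > y else "blue"
        diffs.append((name, abs(x - y), color))
    diffs.sort(key=lambda d: d[1], reverse=True)
    return diffs[:k]

# Hypothetical toy frequencies for four LATs in two texts.
lat_names = ["SenseObject", "PersonProperty", "Negativity", "AbstractConcepts"]
x_text = [3.0, 1.0, 0.5, 2.0]   # the 'red' text on the X axis
y_text = [1.0, 1.0, 2.0, 2.5]   # the 'blue' text on the Y axis

top = most_dissimilar_lats(x_text, y_text, lat_names, k=2)
print(top)
```

In this toy case the largest gap is in SenseObject, where the red text is higher, followed by Negativity, where the blue text is higher. PersonProperty drops out because the two texts have identical values.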

You can try LATtice out with two sample data-sets by clicking on the links below. The first is drawn from the plays of Shakespeare which are in this case arranged in rough chronological order. As Hope and Witmore’s work has demonstrated, the possibilities opened up by applying DocuScope to the Shakespeare corpus are rich, and hopefully exploring the relationship between individual plays on the grid will produce new insights and new lines of inquiry. The second data-set is experimental – it tries to use DocuScope not to compare multiple texts but to explore a single text – Milton’s Paradise Lost – in detail. It might give us insights about how digital techniques can be applied on smaller scales with well-curated texts to complement literary close-reading. The poem was divided into sections based on the speakers (God, Satan, Angels, Devils, Adam, Eve) and the places being described (Heaven, Hell, Paradise). These chunks were then divided into roughly three hundred line sections. As an example, we might notice straightaway that speakers and place descriptions seem to have very distinct characteristics. Speeches are broadly similar to each other as are place descriptions. This is not unexpected, but what accounts for these similarities and differences? Exploring the LATs helps us approach this question with a finer lens. Paradise, for example, is full of “sense objects” while Godly and angelic speech does not refer to them as often. Does Adam refer to “authority” more when he speaks to Eve? Does Satan’s defiance leave a linguistic trace that distinguishes him from unfallen angels? Hopefully LATtice will help us explore and answer such questions and let us bring DocuScope’s data closer to the nuances of literary reading.


Finally, a few technical notes: The above links should load LATtice with the appropriate data-sets. Of course, you will need to have Java installed on your machine and to have applets enabled in your browser. You can also download LATtice and the sample data-sets, along with detailed instructions, as stand-alone applications for the following platforms:

There are a few advantages to doing this. First, the standalone version offers an additional visualization panel which represents the distribution of LATs as box-and-whisker plots and shows where the text-pair’s frequency counts stand relative to the rest of the corpus. Secondly, the standalone application can make use of the entire screen, which can be a great advantage for larger and higher resolution monitors.

Posted in Uncategorized

Tokens of Impersonation in Dekker’s City Comedies

In sixteenth- and seventeenth-century England, the relationship between clothing and identity was complex. As Ann Rosalind Jones and Peter Stallybrass have shown, the fact that clothing circulated as currency among different owners implicitly called into question its supposed correspondence with the wearer’s social and financial status. Stephen Orgel has explored how issues surrounding clothing and identity played out on the Elizabethan and Jacobean stage, a place where clothing was understood at once as the defining token of identity and as disguise, where audiences entered into the fiction that a dress could temporarily transform a lower-class boy into a noble woman. The possibility that appearance might not match reality was problematic for early modern audiences, however, because the English credit culture that emerged in this period depended on people’s ability to assess one another’s presentations of honesty and trustworthiness. By challenging the assumed correspondence between social performance and identity, cross-dressing figures like Moll Cutpurse in Dekker and Middleton’s The Roaring Girl (1611) suggest the fallibility of a system in which a person’s economic status is inferred from his or her appearance.

I wondered whether The Roaring Girl’s concern with the instability of credit might be visible at the linguistic level. In Witmore and Hope’s “very large dendrogram” (see Figure 9 here), three plays group tightly with The Roaring Girl: Westward Ho (Dekker and Webster, 1604), Northward Ho (Dekker and Webster, 1605), and The Honest Whore, Part 2 (Dekker, performed 1605 and published 1630). Based on where they cluster in the dendrogram, it is clear that these texts are not merely linked by authorship, genre, or time period. I hypothesized that these four plays might all share The Roaring Girl’s concern with disguise and credit, and that this concern would be one of the factors linking them together stylistically. Still, much of early modern drama, especially city comedy, is concerned with the economics of identity. Assuming that these plays’ treatment of credit and disguise contributes to their linkage, what is uniquely similar about them that pushes the plays together?

To answer this question, I performed Principal Component Analysis (PCA) on 130 plays performed between 1601 and 1621 and found a component that united the plays on The Roaring Girl twig. As it turns out, the cocktail of linguistic factors that joins these four plays includes the categories DocuScope labels “Person Properties” and “Sense Objects.” The component also discriminates against Positive and Negative Standards, Abstract Concepts, and Negativity.
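To make concrete what it means for a component to select for some categories and against others, here is a rough sketch in Python. It is not the analysis described above: it finds only the first principal component, uses plain power iteration rather than a statistics package, and runs on four invented “plays” with three invented category frequencies. All names and numbers are assumptions.

```python
import math

def center(rows):
    """Subtract each column's mean so every category averages zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return v

def scores(centered, component):
    """Project each text onto the component."""
    return [sum(x * c for x, c in zip(row, component)) for row in centered]

# Hypothetical frequencies: [Person Properties, Sense Objects, Abstract Concepts]
plays = [[5.0, 4.0, 1.0], [4.0, 5.0, 1.0], [1.0, 1.0, 5.0], [1.0, 2.0, 4.0]]

centered = center(plays)
loadings = first_component(covariance(centered))
play_scores = scores(centered, loadings)
print([round(x, 2) for x in loadings])
print([round(s, 2) for s in play_scores])
```

In the toy data the first two loadings come out with one sign and the third with the opposite sign: the component rewards Person Properties and Sense Objects together and penalizes Abstract Concepts. The first two plays score on one side of zero and the last two on the other, which is the sense in which a component “unites” some texts and separates them from the rest.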

The passage from the four plays that is most exemplary of this component comes from Westward Ho. Words underlined in purple are Person Properties, while bright yellow indicates Sense Objects:

In this scene, the bawd Birdlime tries to protect the identity of one of her clients, Tenterhook, from another who has entered her house. Tenterhook hides in a closet with the prostitute, Luce, and covers her eyes. She tries to identify him by the feel of his hands and what he wears on them. In guessing, she reveals the names of all her clients, thereby contradicting the bawd’s claim that whores practice a kind of doctor-patient confidentiality. The most frequent elements in this scene are Person Properties, Sense Objects, Questions and Direct Address. In other words, in this scene characters address one another based on their perceived identities (mistress, captain) and their interactions with the physical world.

The second most exemplary passage, this time from The Roaring Girl, is even more explicitly concerned with clothing. Here again, purple indicates words tagged as Person Properties, and yellow highlights Sense Objects:

In this scene, Moll’s man Trapdoor reports to Sir Alexander about his mistress, and they hatch a plan to catch her in flagrante delicto with Alexander’s son Sebastian. Again, the passage is dominated by Person Properties (linked mainly to gender and social position) and Sense Objects. Moll’s male apparel is thoroughly catalogued, and the interplay of the repeated terms “girl,” “mistress,” and “man”/ “male” highlights the instability of her identity when she wears these typically masculine items of clothing. The rapid-fire comedic exchange amplifies the effect of the patterns: for example, the repeated pun on “shirt of mail” / “male shirt” creates a glut of Person Properties and Sense Objects in those lines.

It would seem, then, that the component under consideration selects for descriptions of people—their social roles (Person Properties) and the way they dress (Sense Objects)—as well as descriptions of the material world. What does PC2 select against? The least exemplary passage comes from a scene in The Roaring Girl in which Sebastian attempts to persuade his father that Moll is a chaste woman, despite her propensity for brawling and wearing men’s clothing. In this passage, green indicates Positive Standards and Negative Standards; light purple flags Abstract Concepts and various narrative cues such as Reporting Events; and orange highlights Negativity as well as other indicators of interiority such as Subjective Perception:

Sebastian explicitly critiques his father for judging Moll by her appearances; yet the language of this passage is very different from previous ones in which the obsession with appearances and roles was implicit in the preponderance of Person Properties and Sense Objects. Here, the most common elements are Positive and Negative Standards, Abstract Concepts, and Negativity. Given that this passage is the opposite of the component that grouped these four plays together, it would seem that this particular combination of standards, judgment, and interior life is uncommon in the world of these plays.

While the component that sets these four texts apart selects for plays about sex and clothing, it is not merely a “disguise plot” component. Given its opposition to standards and interiority, it might be more broadly defined as language that explores the material world’s inability to accurately reflect abstract truths. I believe this component can show us something about Dekker’s engagement, not only with identity, but with credit culture. In selecting for moments where people are described based on their clothing, appearance, and/or social role, and selecting against value judgments of those people’s performances, this component might highlight plays that represent the impossibility of assessing people based on their public personae. Not only might a woman dress as a man, but a prostitute might present herself as a rich woman, provided she has wealthy enough customers. Similarly, an insolvent gallant might dress well to trick shopkeepers into extending him credit (or their wives into sleeping with him). The fact that Dekker’s treatment of disguise excludes judgments, standards, or appeals to authority suggests that his critique is not of the amorality of the city. Rather, it is of the way that credit relations punish perceived immorality, while often rewarding well-hidden immorality. This reading might help explain why these particular plays cluster together, rather than blending in with all the rest of Jacobean city comedy.

Richard Waswo argues that all Jacobean drama, through its concern with disguise, counterfeit, and crime, invites audiences to question the credibility of their neighbors. Certainly, Dekker’s stage comedies reflect a sustained interest in the unstable relationship between dress and character, but as this component reveals, they do so in a unique way. I hope my findings might help us begin to understand how different writers’ attitudes toward these issues register at the linguistic level, even when they use the same stock of plot points and characters. While a morally conservative writer like Jonson might condemn the coney-catchers and cross-dressers of the London underworld for wreaking havoc on the institutions of credit that undergird social commerce, Dekker seems more critical of the credit system itself. In its very structure, in its reliance on appearances, the system invites exploitation by those who are willing to play the game. We are able to see this critique coming through in these plays because, like an expert coney-catcher, DocuScope counts the tokens of texts’ identities, registering the affinities that are alternately hidden and revealed by the linguistic “clothing” they wear.

Posted in Early Modern Drama | 1 Comment

Finding the Sherlock in Shakespeare: some ideas about prose genre and linguistic uniqueness

An unexpected point of linguistic similarity between detective fiction and Shakespearean comedy recently led me to consider some of the theoretical implications of tools like DocuScope, which frequently identify textual similarities that remain invisible in the normal process of reading.

A Linguistic Approach to Suspense Plot

Playing around with a corpus of prose, we discovered that the linguistic specs associated with narrative plot are surprisingly distinctive. Principal Component Analysis performed on the linguistic features counted by DocuScope suggested the following relationship between the items in the corpus:

I interpreted the two strongest axes of differentiation seen in the graph (PC 1 and PC 2) as (1) narrative, and (2) plot. The two poles of the narrative axis are Wuthering Heights (most narrative) and The Communist Manifesto (least narrative). The plot axis is slightly more complicated. But on the narrative side of the spectrum, plot-driven mysteries like “The Speckled Band” and The Canterville Ghost score high on plot, while the least plotted narrative is Samuel Richardson’s Clarissa (9 vols.). For now, I won’t speculate about why Newton’s Opticks scores so astronomically high on plot. It is enough that when dealing with narrative, PC 2 predicts plot.

The fact that something as qualitative and amorphous as plot has a quantitative analogue leads to several questions about the meaning of the data tools like DocuScope turn up.

Linguistic Plot without Actual Plot

Because linguistic plot is quantifiable, it allows us to look for passages where plot is present to a relative degree. Given a large enough sample, it is more than likely that some relatively plotted passages will occur in texts that are not plotted in any normal sense. This would at minimum raise questions about how to handle genre boundaries in digital literary research.

Our relative-emplotment test (done in TextViewer) yielded intuitive results when performed on the dozen or so stories in The Adventures of Sherlock Holmes: the passages exhibiting the strongest examples of linguistic plot generally narrated moments of discovery, and moved the actual plot forward in significant ways. Often, these passages showed Holmes and Watson bursting into locked rooms and finding bodies.

When we performed the same test on the Shakespeare corpus, something intriguing happened. The passages identified by TextViewer as exhibiting linguistic plot look very different from the corresponding passages in Sherlock Holmes. There were no dead bodies, no broken-down doors, and no exciting discoveries. Nonetheless, the ‘plotted’ Shakespeare scenes were remarkably consistent with each other. Perhaps most significant in the context of their genre, these scenes had a strong tendency to show characters putting on performances for other characters. Additionally, in a factor that is fascinating even though it is probably a red herring, the ‘plotted’ Shakespeare scenes had an equally strong tendency to involve fairies.

The consistent nature of the ‘plotted’ Shakespeare scenes suggests that the linguistic specs associated with plot when they occur in Sherlock Holmes may have different, but equally specific, effects in other genres. The next step would be to find a meaningful correspondence between the two seemingly disparate literary devices that accompany linguistic plot – detectives bursting into rooms to solve murders, and plays within plays involving fairies. I have some hunches about this. But in many ways the more important question is what is at stake in using DocuScope to identify such unexpected points of overlap.

Enough measurable links between seemingly unlike texts could suggest an invisible web of cognates, which share an underlying structure despite their different appearances and literary classifications. Accordingly, we might hypothesize that reading involves selective ignorance of semantic similarities that could otherwise lead to the socially deviant perception that A Midsummer Night’s Dream resembles a Sherlock Holmes mystery.

The question, then, is this: if the act of reading consists in part of ignoring unfruitful similarities, then what happens when these similarities nonetheless become apparent to us? Looking back at the corpus graph, we begin to see all sorts of possibilities, many of which would be enough to make us literary outcasts if voiced in the wrong company. Could Newton’s Opticks contain the most exciting suspense plot no one has ever noticed? Could Martin Luther be secretly more sentimental than Clarissa?

Estranging Capacities of Digital Cognates

I have been using the term ‘cognate’ to describe the relationship between linguistically similar but otherwise dissimilar texts. These correspondences will only be meaningful if we can connect them in a plausible way to our readerly understanding of the texts or genres in question. In the case of detective fiction and Shakespearean comedy, this remains to be seen. But our current lack of an explanation does not mean we should feel shy about pursuing the cognates computers direct us to. My analogy is the pop-culture ritual of watching The Wizard of Oz, starting the Pink Floyd album Dark Side of the Moon on the third roar of the MGM lion. The movie and the record sync up in a plausible pattern, prompting the audience to grasp a connection between the cognate genres of children’s movies and psychedelic rock.

If digital methods routinely direct our attention to patterns we would never notice in the normal process of reading, then we can expect them to turn up a large number of such cognates. If we want to understand the results these tools are turning up, we should develop a terminology and start thinking about implications – not just for the few correspondences we can explain, but also for the vast number we cannot explain, at least right now.

Posted in Counting Other Things, Quant Theory, Shakespeare | 2 Comments

Why the Difference? Accounting for Variation between the Folio and Globe Editions of Shakespeare’s Plays

To what extent is modern text analysis software capable of dealing with historical data? This is a perennial question asked by those working with digitized historical texts who wish to see how an analysis of such texts can be facilitated by cutting-edge technologies. No doubt the best way to answer the question is to test this software on two versions of the same text, one of which is older and noticeably different from the other.

Enter the Folio and Globe editions of Shakespeare’s plays. The latter was published in 1867 and contains modernized spelling throughout, whereas the former was published in 1623 and maintains the original spelling of Shakespeare’s Early Modern English. Using DocuScope for text analysis and JMP for statistical visualizations, the following dendrogram was created:

The texts highlighted in red are from the Folio edition, whereas the texts highlighted in blue come from the Globe edition. One would expect all of Shakespeare’s Folio plays to cluster with their Globe counterparts here. Much Ado About Nothing is Much Ado About Nothing, after all, regardless of which edition it appears in. But for the most part, this neat pairing off is not what happens: instead, most of the Folio plays are grouping with other Folio plays, and the same is true for the Globe plays. Only a few plays are actually grouping with themselves at the top of the dendrogram. Methinks we have a problem.

Upon closer inspection, I found that 13,667 items were tagged by DocuScope in the Globe edition of Much Ado, but only 11,382 items were tagged in the Folio edition of the same play: 16.7% fewer. An inspection of eleven other Shakespeare plays gives an overall mean difference of 17.8%, a gap too large to be acceptable for tagging accuracy.
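The percentage for Much Ado can be reproduced from the two counts if we assume the shortfall is measured against the Globe total (measured against the Folio total it would come out nearer 20%). A quick check in Python, with a function name of my own choosing:

```python
def relative_shortfall(globe_tags, folio_tags):
    """Percentage of Globe-tagged items that go untagged in the Folio text."""
    return (globe_tags - folio_tags) / globe_tags * 100

much_ado = relative_shortfall(13667, 11382)
print(round(much_ado, 1))  # 16.7
```

In absolute terms that is 2,285 items tagged in the Globe text with no tagged match in the Folio text of this one play.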

But why the disparity? Maybe a closer look at DocuScope can give us an idea.

First the Folio version of the opening scene in Much Ado About Nothing (with the “Interior Thought” and “Public Values” clusters turned on):

And the Globe version of the same scene with the same clusters turned on:

One need not read far to discover what’s (not) being tagged in the older, Folio edition of Much Ado: Learne versus learn. It appears the orthographic rendering of the unstressed final –e is causing DocuScope to overlook this word altogether. We find the same mistake later on with indeede/indeed, kindnesse/kindness, helpe/help, and kinde/kind. Another common problem is Early Modern use of u, which is rendered v in modern orthography: deseru’d vs. deserved, seruice vs. service, and ouerflow vs. overflow. There are also a few punctuation issues causing problems: the use of apostrophe (as we see in deseru’d) and the use of | (con | flict vs. conflict), which probably results from some sort of scanning or other computer error. In other plays, the hyphen was also found to be a possible culprit of DocuScope overlooking certain items (ouer-charg’d vs. overcharged).

Although the overall number of DocuScope omissions on a Folio play is rather large, the actual number of error types is quite small. This gives us hope that, with a bit of modification, it may well be possible to train DocuScope to read non-modern(ized) texts.
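One simple way to act on such a short list of error types, short of modifying DocuScope itself, would be to normalize the Folio spelling before tagging. The sketch below is an assumption about how that might look, not a description of anything DocuScope does. It deliberately uses a lookup table built only from the variants mentioned above rather than general rules, because a blanket rule such as “strip every final e” or “turn every u before a vowel into v” would damage ordinary words like love or true.

```python
import re

# Variant spellings attested in the passage above, mapped to modern forms.
# A real normalizer would need a far larger lexicon; this table is illustrative.
VARIANTS = {
    "learne": "learn",
    "indeede": "indeed",
    "kindnesse": "kindness",
    "helpe": "help",
    "kinde": "kind",
    "deseru'd": "deserved",
    "seruice": "service",
    "ouerflow": "overflow",
    "ouer-charg'd": "overcharged",
}

# A word, optionally continued by apostrophe- or hyphen-joined parts.
WORD = re.compile(r"[A-Za-z]+(?:['’-][A-Za-z]+)*")

def normalize(text):
    """Rejoin words split by ' | ' and replace known Folio variants."""
    text = re.sub(r"(?<=\w) \| (?=\w)", "", text)

    def swap(match):
        word = match.group(0)
        key = word.lower().replace("’", "'")
        modern = VARIANTS.get(key)
        if modern is None:
            return word  # leave unknown words untouched
        return modern.capitalize() if word[0].isupper() else modern

    return WORD.sub(swap, text)

print(normalize("Learne of his seruice, deseru’d indeede in con | flict"))
```

Words that are not in the table pass through unchanged, so the cost of a missing entry is an untagged word rather than a corrupted one.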

 

Posted in Uncategorized | 4 Comments