Several years ago I did some experiments with Franco Moretti, Matt Jockers, Sarah Allison and Ryan Heuser on a set of Victorian novels, experiments that developed into the first pamphlet issued by the Stanford Literary Lab. Having never tried Docuscope on anything but Shakespeare, I was curious to see how the program would perform on other texts. Looking back on that work, which began with a comparison of tagging techniques using Shakespeare’s plays, I think the group’s most important finding was that different tagging schemes can produce convergent results. By counting different things in the texts – strings that Docuscope tags and, alternatively, words that occur with high frequency (most frequent words) – we were able to arrive at similar groupings of texts using different methods. The fact that literary genres could be rendered according to multiple tagging schemes sparked the idea that genre was not a random projection of whatever we had decided to count. What we began to think as we compared methods, and it is as exciting a thought now as it was then, was that genre was something real.
Real as an iceberg, perhaps, genre may have underwater contours that are invisible but mappable with complementary techniques. Without delving too deeply into the specifics of the pamphlet, I’d like to sketch its findings and then discuss them in some of the terms I outlined in the previous post on critical gestures. First the preliminaries. In the initial experiment, we established a corpus (the Globe Shakespeare) and then used two tagging schemes to assign the tokens in those documents to a smaller number of types. (This is the crucial step of reducing the dimensionality of the documents, or “caricaturing” them.) The first tagging scheme, Docuscope, rendered the plays as percentage scores on the types it counts; the second, implemented by Jockers, identified the most frequent words (MFWs) in the corpus and likewise used these as the types or variables for analysis.
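For readers who want to see the counting step concretely, here is a minimal sketch in Python of what a most-frequent-words “caricature” might look like. It is not Jockers’s implementation: the tokenizer, the function names (`tokenize`, `mfw_profiles`), and the two-document toy corpus are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def mfw_profiles(documents, n_types=5):
    """Find the corpus's n most frequent words, then score each document
    on them as a percentage of that document's tokens."""
    tokens_by_doc = {name: tokenize(text) for name, text in documents.items()}

    # Count every token across the whole corpus to pick the MFWs.
    corpus_counts = Counter()
    for tokens in tokens_by_doc.values():
        corpus_counts.update(tokens)
    mfws = [word for word, _ in corpus_counts.most_common(n_types)]

    # Reduce each document to one percentage score per MFW.
    profiles = {}
    for name, tokens in tokens_by_doc.items():
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against an empty document
        profiles[name] = [100.0 * counts[word] / total for word in mfws]
    return mfws, profiles


# Hypothetical toy corpus, invented for illustration only.
toy_corpus = {
    "play_a": "the king and the queen and the fool",
    "play_b": "she heard the door and she reached the stair",
}
mfws, profiles = mfw_profiles(toy_corpus, n_types=2)
```

Each document is now a short list of numbers rather than a sequence of words, which is all that “reducing the dimensionality” amounts to here. A Docuscope-style scheme would differ in what gets counted (hand-curated strings assigned to types) but would hand the same kind of table to the clustering step.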
What we found was that the circles drawn by critics around these texts – circles here bounding different genres – could be reproduced by multiple means. Docuscope’s hand-curated tagging scheme did a fairly good job of reproducing the genre groupings via an unsupervised clustering algorithm, but so did the MFWs. We were excited by these results, but also cautious. Perhaps the words counted by Docuscope might include the very MFWs that were producing such good results in the parallel trial, which would mean we were working with one tagging scheme rather than two. Subsequent experiments on Victorian novels curated by the Stanford team – for example, a comparison of the Gothic novel versus the Jacobin novel (see pp. 20-23) – showed that Docuscope was adding something over and above what was offered by counting MFWs. MFWs such as “was,” “had,” “who,” and “she,” for example, were quite good at pulling these two groups apart when used as variables in an unsupervised analysis. But these high-frequency words, even when they composed some of the Docuscope types that were helpful in sorting the genres, were correlated with other text strings that were more narrative in character, phrases such as “heard the,” “reached the,” and “commanded the.” So while we had some overlap in the two tagging schemes, what they shared did not explain the complementary sorting power each seemed to bring to the analysis. The rhetorical and semantic layers picked out in Docuscope were, so to speak, doing something alongside the more syntactically important function words that occur in texts with such high frequency.
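One hedged way to put a number on “convergence” between two groupings is the Rand index: the fraction of text pairs that both groupings treat the same way, whether by putting the pair together or by keeping it apart. The pamphlet does not report this measure, so the sketch below is an illustration of the idea rather than a record of our method, and the cluster assignments are invented.

```python
from itertools import combinations


def rand_index(grouping_a, grouping_b):
    """Fraction of text pairs on which two groupings agree
    (both place the pair together, or both place it apart)."""
    if set(grouping_a) != set(grouping_b):
        raise ValueError("both groupings must cover the same texts")
    pairs = list(combinations(sorted(grouping_a), 2))
    if not pairs:
        raise ValueError("need at least two texts to compare groupings")
    agreements = sum(
        (grouping_a[x] == grouping_a[y]) == (grouping_b[x] == grouping_b[y])
        for x, y in pairs
    )
    return agreements / len(pairs)


# Hypothetical cluster assignments, invented for illustration only.
# Cluster labels need not match across schemes; only the partition matters.
by_docuscope = {"text1": 0, "text2": 0, "text3": 1, "text4": 1}
by_mfw = {"text1": "x", "text2": "x", "text3": "y", "text4": "x"}

score = rand_index(by_docuscope, by_mfw)
```

A score of 1.0 would mean the two schemes draw identical circles. The same function can compare either scheme against the critics’ own genre labels, which is the comparison that matters most for the argument here.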
The nature of that parallelism or convergence continues to be an interesting subject for thought as we discover more tagging schemes and contemplate making our own. Discussions in the NEH-sponsored Early Modern Digital Agendas workshop at the Folger, some of which I have been lucky enough to attend, have pushed Hope and me to return to the issue of convergence and think about it again, especially as we think about how our research project, Visualizing English Print, 1470-1800, might implement new tagging schemes. If MFWs produce viable syntactical criteria for sorting texts, why would this “layer” of syntax be reliably coordinated with another, Docuscope-visible layer that is more obviously semantic or rhetorical? If different tagging schemes can produce convergent results, is it because they are invoking two perspectives on a single entity?
Because one doesn’t get completely different groupings of texts each time one counts new things, we must posit the existence of something underneath all the variation, something that can be differently “sounded” by counting different things. The main attribute of this entity is its capacity to encourage or limit certain sorts of linguistic entailments. As I think back on how the argument developed in the Stanford paper with Moretti et al., the crucial moment came when we found that we could describe the Gothic novel as having both more spatial prepositions (“from,” “on,” “in,” “to”) and more narrative verb phrases (“heard the,” “reached the”) than the Jacobin novel. Our next move was to begin asking whether either of the tagging schemes was picking out a more foundational or structural layer of the text – whether, for example, the decision to use a certain type of narrative convention and, so, narrative phrase, entailed the use of corresponding spatial prepositions. As soon as the word “structural” appeared, I think everyone’s heart began to beat a little faster. But why? What is so special about the word “structural,” and what does it mean?
In the context of this experiment, I think “structural” means “is the source of the entailment”; its use, moreover, suggests that the entailment has direction. We (the authors of the Stanford paper) were claiming that, in deciding to honor the plot conventions of a particular generic type, the writer of a Gothic novel had already committed himself or herself to using certain types of very frequent words that critics tend to ignore. The structure or plot was obligating, perhaps in an unconscious way.
I think now that I would pause before using the word “structure,” a word used liberally in that paper, not because I don’t think there is such a thing, but because I don’t know if it is one or many things. Jonathan Hope and I have been looking for a term to describe the entailments that are the focus of our digital work. We have chosen to adopt, in this context, a deliberately “fuzzy structuralism” when talking about entailments among features in texts. We would prefer to say, that is, that the presence of one type of token (spatial prepositions) seems to entail the presence of another type (narrative verb phrases), and remain agnostic about the direction of the entailment. Statistical analysis provides evidence of that relationship, and it is the first order of iterative criticism to describe such entailments, both exhaustively (by laying bare the corpus, counts, and classifying techniques) and descriptively (by identifying, through statistical means, passages that exemplify the variables that classify the texts most powerfully). Just as important, we feel one ought where possible to assign a shorthand name – “Gothicness,” “Shakespearean” – to the features that help sort certain kinds of texts. In doing so, we begin to build a bridge connecting our linguistic description to certain already known genre conventions that critics recognize or “circle” in their own thinking. But the application of the term “Gothic,” and the further claim that this names the cause of the entailments we discern by multiple means, deserves careful scrutiny.
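A small sketch may help show why a statistic can register an entailment while staying silent about its direction. Pearson correlation, used here purely as an illustration and not as the measure from the Stanford paper, is symmetric: it returns the same value whichever feature is listed first. The per-novel rates below are invented.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of feature rates."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when a feature never varies")
    return covariance / sqrt(spread_x * spread_y)


# Hypothetical percentage rates for four novels, invented for illustration only.
spatial_prepositions = [4.1, 3.8, 2.0, 2.3]
narrative_verb_phrases = [1.2, 1.0, 0.4, 0.5]

forward = pearson(spatial_prepositions, narrative_verb_phrases)
backward = pearson(narrative_verb_phrases, spatial_prepositions)
```

Because `forward` and `backward` are the same number, the statistic tells us that the two features travel together and nothing about which one obligates the other. Any claim about direction has to come from somewhere else.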
A series of questions about this entailment entity, then, which sits just under the waterline of our immediate reading:
• How does entailment work? This is a very important question, since it gets at the problem of layers and depth. At one point in the work with the Stanford team, Ryan Heuser offered the powerful analogy alluded to above: genre is like an iceberg, with features visible above the water but depths unseen below. Plot, we all agreed, is an above-the-waterline phenomenon, whereas MFW use and certain semantic choices are submerged below the threshold of conscious attention. In the article we say that the below-the-waterline phenomena sounded by our tagging schemes are entailed by the “higher order” choices made when the writer decided to write a “Gothic novel” or “history play.” I still like this idea, but worry it might suggest that all features of genre are the result of some governing, genre-conscious choice. What if some writers, in learning to mimic other writers, take sentence-level cues and work “upward” from there? Couldn’t there be some kind of semi-conscious or sentence-based absorption of literary conventions that is specifically not a mimicry of plot?
• Are the entailments pyramidal, with a governing apex at the top, or are they multi-nodal and so radiating from different points within the entity? I can see how syntax, which is mediated by function or high-frequency words, is closely tied to certain higher order choices. If I want to write stories about lovers who don’t get along, this will entail using a lot of singular pronouns in the first and second person alongside words that support mutual misunderstanding. There is a relationship of entailment between these two things, and the source of that entailment is often called “plot” or “genre.” Here again we are at an interpretive turning point, since the names applied to types of texts are as fluid, at least potentially, as those assigned to types of words. Such names can be misleading. Suppose, for example, that I have identified the distinct signature of something like a “Shakespearean sentence,” and that this signature is apparent in all of Shakespeare’s plays. (An author-specific linguistic feature set was created for J. K. Rowling just last week.) Suppose further that, as Shakespeare is almost singlehandedly launching the history play as a theatrical genre in the 1590s, this authorial feature propagates alongside the plot-level features he establishes for the genre. Now someone shows that this Shakespearean sentence signature is reliably present in most plays that critics now call histories. Is that entailment upheld by the force of genre or authorship? The question would be just as hard to answer if we noticed that the generic signal of history plays spans the rest of Shakespeare’s writing and is a useful feature for differentiating his works from those of other authors.
• If entailments can be resolved at varying depths of field, like the two cats below, which are simultaneously resolved by the Lytro Camera at multiple focal lengths, how can we be sure that they are individual pieces of a single entity or scene? Different tagging schemes support the same groupings of texts, so there must be something specific “there” to be tagged which has definite contours. I remain astonished that the groupings derived from tagging schemes like Docuscope and MFWs correspond to names we use in literary criticism, names that designate authors and genres of fiction. But entailments are plural: some seem to correspond to what we call authorship, others genre, and perhaps still others to the medium itself (the small twelvemo, for example, often contains different kinds of words than those found in the larger folio format). There are biological constraints on how long we can attend to a single sentence. The nature and source of these entailments have thus got to be the subject of ongoing study, one that bridges a range of fields as wide as there are forces that constrain language use.
Entailment is real; it suggests an entity. But how should we describe that entity, and with what terms or analogies can its depths be resolved? Sometimes there may be multiple cats, sitting apart in the same room. Sometimes what seems like two icebergs may in fact be one.
Image from the Lytro Camera resolving objects at multiple depths