Visualizing English Print, 1530-1800: The Corpus, Tag Sets, and Topics


Visualization of the corpus using a topic model. A prototype of this corpus exploration and visualization tool, Serendip, was in use this fall at the Folger Shakespeare Library. It was designed by Eric Alexander.

Here begins a series of posts on a larger dataset we have been studying at Wisconsin under the auspices of “Visualizing English Print, 1530-1800,” a Mellon-funded research project that brings together computer scientists and literary scholars from several institutions: UW-Madison, Strathclyde University (U.K.), and the Folger Shakespeare Library. A profile of the project members appears here.

I would like to begin by posting the data set that we are working with, which consists of texts drawn from the EEBO-TCP corpus. We assembled the corpus by drawing 40 texts at random from each of 27 decades within the corpus, beginning with the decade 1530-1539 and ending with 1790-1799; texts under 500 words in length were excluded from selection. We knew that this selection could not be truly random, since the TCP project selected texts for transcription that it felt would be of interest to scholars. (Full disclosure: I am on the TCP Executive Board.) But we did want to frustrate the natural urge to pick texts we knew and liked: that would limit the kind of lexical and generic variation we want to study.
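
As a rough illustration, the stratified sampling described above might look like the sketch below. The record fields (`decade`, `word_count`) are hypothetical stand-ins for the EEBO-TCP metadata; this is not the project's actual selection code.

```python
import random

def sample_corpus(records, per_decade=40, min_words=500, seed=0):
    """Stratified random sample: draw `per_decade` texts from each decade,
    skipping any text shorter than `min_words` words."""
    rng = random.Random(seed)
    by_decade = {}
    for rec in records:
        if rec["word_count"] >= min_words:
            by_decade.setdefault(rec["decade"], []).append(rec)
    sample = []
    for decade in sorted(by_decade):
        # sample without replacement within each decade stratum
        sample.extend(rng.sample(by_decade[decade], per_decade))
    return sample

# 27 decades x 40 texts = 1,080 documents, matching the "1080forWDS" file below
```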

These texts are now being tagged and analyzed according to several different schemes. Members of our team have created tools for visualizing patterns of variation within the corpus, tools that we want eventually to share with others. There are two techniques I want to discuss: analysis of texts using a tagging scheme derived from the original Docuscope (what we call “Docuscope Junior”) and analysis of texts using a topic model. The former was implemented by Mike Gleicher, the latter by Eric Alexander.

A .csv.zip file showing the texts and their scores on both the DSJ variables and the topics (each listed in ascending alphabetical order) can be downloaded at the following location: 1080forWDS. A zipped copy of a folder containing HTML pages of all the documents in the dataset, tagged with DSJ, can be downloaded here. (Note that some of these files, for example Ralegh’s History of the World, are very large and may swamp your browser.) These pages were generated by a utility called Ubiqu+ity, a Wisconsin-designed tagging tool that implements the DSJ tagging scheme. Ubiqu+ity can also tag a corpus with a user-defined tagging scheme, which should enable us to try different schemes on the same corpus.

Now a first picture of the corpus, showing a Docuscope LAT (Language Action Type) that increases steadily over time in the corpus:

[Graph: scores on the LAT plotted by date of publication, 1530-1800]

Perhaps this is not a surprising trend, but we were pleased to see something right away that made sense. Red dots in this graph are dramatic texts; the blue dots are all others. Here is an example of a text encoded by DSJ that scores high on this variable. In the HTML view, scroll down the panel at left and click on SubjectivePercept, and all of the items so tagged will be highlighted. The text is Abroad and at home: A comic opera, in three acts. Now performing at the Theatre-Royal, Covent-Garden. By J. G. Holman, title page publication date 1796; it is the red dot at the far right, significantly elevated above the trend line. You can view it here:

K035834.000_1796_AbroadandathomeAcomi_Docuscope

Please note that the texts passed through DSJ were modernized using VARD 2, developed at Lancaster. The topic model was implemented on texts that were not modernized, although data from both trials are included in the .csv file above.

 


Manuscript Average

Here’s a terrific post about examining manuscript pages as image aggregates by Jesse Hurlbut, entitled “Manuscript Average.” Below, 331 folios from Guillaume Fillastre, La Toison d’Or, livre I, BNF Fr 138.

BNF Fr 138 (165 folios)


Fuzzy Structuralism

Several years ago I did some experiments with Franco Moretti, Matt Jockers, Sarah Allison and Ryan Heuser on a set of Victorian novels, experiments that developed into the first pamphlet issued by the Stanford Literary Lab. Having never tried Docuscope on anything but Shakespeare, I was curious to see how the program would perform on other texts. Looking back on that work, which began with a comparison of tagging techniques using Shakespeare’s plays, I think the group’s most important finding was that different tagging schemes can produce convergent results. By counting different things in the texts – strings that Docuscope tags and, alternatively, words that occur with high frequency (most frequent words) – we were able to arrive at similar groupings of texts using different methods. The fact that literary genres could be rendered according to multiple tagging schemes sparked the idea that genre was not a random projection of whatever we had decided to count. What we began to think as we compared methods, and it is as exciting a thought now as it was then, was that genre was something real.

Real as an iceberg, perhaps, genre may have underwater contours that are invisible but mappable with complementary techniques. Without delving too deeply into the specifics of the pamphlet, I’d like to sketch its findings and then discuss them in some of the terms I outlined in the previous post on critical gestures. First the preliminaries. In the initial experiment, we established a corpus (the Globe Shakespeare) and then used two tagging schemes to assign the tokens in those documents to a smaller number of types. (This is the crucial step of reducing the dimensionality of the documents, or “caricaturing” them.) The first tagging scheme, Docuscope, rendered the plays as percentage scores on the types it counts; the second, implemented by Jockers, identified the most frequent words (MFWs) in the corpus and likewise used these as the types or variables for analysis.
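
The token-to-type reduction for the MFW scheme can be sketched in a few lines. This is a generic illustration of “caricaturing” documents as frequency vectors, not the code Jockers actually used:

```python
from collections import Counter

def mfw_features(docs, n_types=100):
    """Reduce each document (a list of word tokens) to percentage scores
    on the corpus's most frequent words -- the dimensionality-reduction step."""
    # Rank word types by frequency across the whole corpus
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    mfws = [w for w, _ in corpus_counts.most_common(n_types)]
    # Score each document as the percentage of its tokens falling on each MFW
    features = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        features.append([100.0 * counts[w] / total for w in mfws])
    return mfws, features
```

Each document thus becomes a short vector of percentage scores, directly comparable to Docuscope's percentage scores on its own types.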

What we found was that the circles drawn by critics around these texts – circles here bounding different genres – could be reproduced by multiple means. Docuscope’s hand-curated tagging scheme did a fairly good job of reproducing the genre groupings via an unsupervised clustering algorithm, but so did the MFWs. We were excited by these results, but also cautious. The words counted by Docuscope might include the very MFWs that were producing such good results in the parallel trial, which would mean we were working with one tokenization scheme rather than two. Subsequent experiments on Victorian novels curated by the Stanford team – for example, a comparison of the Gothic novel versus the Jacobin (see pp. 20-23) – showed that Docuscope was adding something over and above what was offered by counting MFWs. MFWs such as “was,” “had,” “who,” and “she,” for example, were quite good at pulling these two groups apart when used as variables in an unsupervised analysis. But these high-frequency words, even when they composed some of the Docuscope types that were helpful in sorting the genres, were correlated with other text strings that were more narrative in character, phrases such as “heard the,” “reached the,” and “commanded the.” So while we had some overlap in the two tagging schemes, what they shared did not explain the complementary sorting power each seemed to bring to the analysis. The rhetorical and semantic layers picked out by Docuscope were, so to speak, doing something alongside the more syntactically important function words that occur in texts with such high frequency.

The nature of that parallelism or convergence continues to be an interesting subject for thought as we discover more tagging schemes and contemplate making our own. Discussions in the NEH-sponsored Early Modern Digital Agendas workshop at the Folger, some of which I have been lucky enough to attend, have pushed Hope and me to return to the issue of convergence and think about it again, especially as we consider how our research project, Visualizing English Print, 1470-1800, might implement new tagging schemes. If MFWs produce viable syntactical criteria for sorting texts, why would this “layer” of syntax be reliably coordinated with another, Docuscope-visible layer that is more obviously semantic or rhetorical? If different tagging schemes can produce convergent results, is it because they are invoking two perspectives on a single entity?

Because one doesn’t get completely different groupings of texts each time one counts new things, we must posit the existence of something underneath all the variation, something that can be differently “sounded” by counting different things. The main attribute of this entity is its capacity to encourage or limit certain sorts of linguistic entailments. As I think back on how the argument developed in the Stanford paper with Moretti et al., the crucial moment came when we found that we could describe the Gothic novel as having both more spatial prepositions (“from,” “on,” “in,” “to”) and more narrative verb phrases (“heard the,” “reached the”) than the Jacobin novel. Our next move was to begin asking whether either of the tagging schemes was picking out a more foundational or structural layer of the text – whether, for example, the decision to use a certain type of narrative convention and, so, narrative phrase, entailed the use of corresponding spatial prepositions. As soon as the word “structural” appeared, I think everyone’s heart began to beat a little faster. But why? What is so special about the word “structural,” and what does it mean?

In the context of this experiment, I think “structural” means “is the source of the entailment”; its use, moreover, suggests that the entailment has direction. We (the authors of the Stanford paper) were claiming that, in deciding to honor the plot conventions of a particular generic type, the writer of a Gothic novel had already committed himself or herself to using certain types of very frequent words that critics tend to ignore. The structure or plot was obligating, perhaps in an unconscious way.

I think now that I would pause before using the word “structure,” a word used liberally in that paper, not because I don’t think there is such a thing, but because I don’t know if it is one or many things. Jonathan Hope and I have been looking for a term to describe the entailments that are the focus of our digital work. We have chosen to adopt, in this context, a deliberately “fuzzy structuralism” when talking about entailments among features in texts. We would prefer to say, that is, that the presence of one type of token (spatial preposition) seems to entail the presence of another type (narrative verb phrases), and to remain agnostic about the direction of the entailment. Statistical analysis provides evidence of that relationship, and it is the first order of iterative criticism to describe such entailments, both exhaustively (by laying bare the corpus, counts, and classifying techniques) and descriptively (by identifying, through statistical means, passages that exemplify the variables that classify the texts most powerfully). Just as important, we feel one ought where possible to assign a shorthand name – “Gothicness,” “Shakespearean” – to the features that help sort certain kinds of texts. In doing so, we begin to build a bridge connecting our linguistic description to certain already-known genre conventions that critics recognize or “circle” in their own thinking. But the application of the term “Gothic,” and the further claim that this names the cause of the entailments we discern by multiple means, deserves careful scrutiny.
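
The kind of entailment evidence invoked here (one feature type's presence tracking another's across a corpus) reduces, in its simplest form, to a correlation between per-document scores. The following is a minimal sketch of that check, not the Literary Lab's actual statistics:

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two per-document feature scores,
    e.g. spatial-preposition rate vs. narrative-verb-phrase rate."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # covariance of the two score series, unnormalized
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Note that a high correlation is symmetric: it licenses the claim that the two features co-occur, but says nothing about which entails which, which is exactly the agnosticism of a “fuzzy structuralism.”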

A series of questions about this entailment entity, then, which sits just under the waterline of our immediate reading:

• How does entailment work? This is a very important question, since it gets at the problem of layers and depth. At one point in the work with the Stanford team, Ryan Heuser offered the powerful analogy alluded to above: genre is like an iceberg, with features visible above the water but depths unseen below. Plot, we all agreed, is an above-the-waterline phenomenon, whereas MFW use and certain semantic choices are submerged below the threshold of conscious attention. In the article we say that the below-the-waterline phenomena sounded by our tagging schemes are entailed by the “higher order” choices made when the writer decided to write a “Gothic novel” or “history play.” I still like this idea, but worry it might suggest that all features of genre are the result of some governing, genre-conscious choice. What if some writers, in learning to mimic other writers, take sentence-level cues and work “upward” from there? Couldn’t there be some kind of semi-conscious or sentence-based absorption of literary conventions that is specifically not a mimicry of plot?

• Are the entailments pyramidal, with a governing apex at the top, or are they multi-nodal and so radiating from different points within the entity? I can see how syntax, which is mediated by function or high-frequency words, is closely tied to certain higher order choices. If I want to write stories about lovers who don’t get along, this will entail using a lot of singular pronouns in the first and second person alongside words that support mutual misunderstanding. There is a relationship of entailment between these two things, and the source of that entailment is often called “plot” or “genre.” Here again we are at an interpretive turning point, since the names applied to types of texts are as fluid, at least potentially, as those assigned to types of words. Such names can be misleading. Suppose, for example, that I have identified the distinct signature of something like a “Shakespearean sentence,” and that this signature is apparent in all of Shakespeare’s plays. (An author-specific linguistic feature set was created for J. K. Rowling just last week.) Suppose further that, as Shakespeare is almost singlehandedly launching the history play as a theatrical genre in the 1590s, this authorial feature propagates alongside the plot-level features he establishes for the genre. Now someone shows that this Shakespearean sentence signature is reliably present in most plays that critics now call histories. Is that entailment upheld by the force of genre or authorship? The question would be just as hard to answer if we noticed that the generic signal of history plays spans the rest of Shakespeare’s writing and is a useful feature for differentiating his works from those of other authors.

• If entailments can be resolved at varying depths of field, like the two cats below, which are simultaneously resolved by the Lytro Camera at multiple focal lengths, how can we be sure that they are individual pieces of a single entity or scene? Different tagging schemes support the same groupings of texts, so there must be something specific “there” to be tagged, something with definite contours. I remain astonished that the groupings derived from tagging schemes like Docuscope and MFWs correspond to names we use in literary criticism, names that designate authors and genres of fiction. But entailments are plural: some seem to correspond to what we call authorship, others to genre, and perhaps still others to the medium itself (the small twelvemo, for example, often contains different kinds of words than those found in the larger folio format). There are even biological constraints on how long we can attend to a single sentence. The nature and source of these entailments have thus got to be the subject of ongoing study, one that bridges a range of fields as wide as the range of forces that constrain language use.

Entailment is real; it suggests an entity. But how should we describe that entity, and with what terms or analogies can its depths be resolved? Sometimes there may be multiple cats, sitting apart in the same room. Sometimes what seems like two icebergs may in fact be one.

Image from the Lytro Camera resolving objects at multiple depths

 

 


New Image from Original Post from Google Books

We had a request for a clearer version of the image we discussed in our post last year, which shows changes in the catalogued subject of Library of Congress books over the course of several hundred years. Jon Orwant from Google was kind enough to send an updated image, which we’re sharing here. I include below his description of the visualization.

“The visualization…was derived exclusively from the metadata feed of the Library of Congress. The LoC catalog doesn’t restrict itself to American or even English language books, but likely does have some sample bias (as every union catalog does).”
