The Great Work Begins: EEBO-TCP in the wild

SAA2016 plenary round table

Session organiser: Jonathan Hope, Strathclyde University UK




Objectives of the session

The release of EEBO-TCP phase 1 on 1st January 2015 was a beginning, not an end. This round table will consider the work to be done to, and with, EEBO-TCP: curation, amelioration, and criticism.

What are the ongoing processes necessary to improve the texts and their metadata? Who should carry these out? How can this work be coordinated and preserved? What are the possibilities for teaching and research with the texts? What tools are available now, and what are desirable for the future? What are the limitations of the TCP corpus, and the dangers of the lure of ‘completeness’?

Participants have been selected with a view to a focus on the EEBO-TCP corpus itself, what needs to be done to the data in the short and medium term to allow the best possible informed use, and how the subject area should organize itself to achieve this.


EEBO-TCP links

about the texts


get the texts


fix the texts


search, tag, visualise the texts


Storify of #Shakeass16 tweets during the session (thanks Meaghan!)



Meaghan Brown, Folger Shakespeare Library, Washington DC, USA

Anupam Basu, Washington University, St Louis, USA

Laura Estill, Texas A&M, USA

Gabriel Egan, De Montfort University, UK

Martin Mueller, Northwestern University, USA

Janelle Jenstad, University of Victoria, Canada

Carl Stahmer, UC Davis, USA


Abstracts and session outline

0: Jonathan Hope: introductions and overview; difference between EEBO and TCP; phase 1 and phase 2 TCP; what we mean by ‘search’, ‘curation’, ‘modernisation’.


1: potential and use cases

Meaghan Brown: Origin stories and other bibliographical tales: representing and recording digital developments in the Folger’s Digital Anthology of Early Modern English Drama

paper            slides

The Folger’s Digital Anthology of Early Modern English Drama seeks to become a hub for exploring the dramatic publications of early modern playwrights other than Shakespeare. Building on the transcriptions produced by the EEBO-TCP and the encoding of Martin Mueller’s Shakespeare His Contemporaries project, we aim to present documentary editions of early modern plays in their bibliographic and developmental context. In our prototype metadata portal, constructed by the Roy Rosenzweig Center for History and New Media at George Mason University, you will be able to browse a company’s repertoire, an author’s oeuvre, or a printer’s output, as well as search for a specific play. On each play page, you’ll also see the encoding history of the represented first edition, and follow it from the catalogue record of the library which holds the volume depicted to its EEBO-TCP transcription, access its SHC encoding, and finally read, download, and manipulate it as encoded by the Folger’s Digital Anthology editors. We will provide reliable and flexible encoded texts to serve as the basis for a range of traditional and digital research inquiries, pedagogical exercises, and editorial endeavors, while being transparent about the implications of a corpus derived from individual copies of specific, often problematic playbooks. In June 2016, the Folger will hold the first in a series of workshops to explore the pedagogical potential of this corpus.

Anupam Basu

Overview of Early Print


2: limits and bounds

Laura Estill: “EEBO-TCP: The Searchable (Print) Text and Manuscript Studies”

paper        slides

The Early English Books Online Text Creation Partnership (EEBO-TCP) makes an unprecedented number of early modern texts searchable, which changes the way we research. Now, when faced with a print or manuscript miscellany full of, well, miscellaneity, researchers can go about finding out if the commonplaces, epithets, or turns of phrase have potential print sources. Previously, researchers were limited to first-line indices (and therefore poetry), Project Gutenberg’s poor OCR (Optical Character Recognition, automated text digitization), or the un-scholarly “I’m feeling lucky” Google approach. The danger of EEBO-TCP is the myth of comprehensive searching: the lure of the universal library. EEBO-TCP is a carefully selected corpus, but is far from representing all printed works in English. It is especially imperative that students and scholars recognize EEBO-TCP’s (ever-expanding) limits: the size of its corpus, its metadata, and the search functionality. Manuscript studies cannot be separated from texts and book history any more than manuscripts can be disentangled from print sources in the early modern period. EEBO-TCP will make new editions of manuscripts and new digital projects possible; if we can understand the bounds of EEBO-TCP, we can better understand early modern textual cultures.


Gabriel Egan: Satisfying the Need for Determinate Searching: Labs, APIs, and Search Engines


This talk is concerned with satisfying users who need to speak authoritatively about the presence and absence of particular words and phrases in a large dataset such as EEBO-TCP. (A typical application with this need is an authorship attribution study based on preferred phrasing.) As an alternative to providing a website for users to manually enter the terms they wish to search for and, optionally, the relationships between those terms, it is possible to provide an Application Programming Interface (API) that enables the user’s own software to interrogate the dataset directly. It is also possible to provide a Labs service to help users to develop their own software for interrogating the dataset. These various approaches will be discussed in connection with EEBO-TCP, the wider TCP project, and the UK-only rival to EEBO called JISC Historical Texts.


3: curation and correction

Martin Mueller: Collaborative curation and exploration of the EEBO-TCP texts

The EEBO-TCP project is magnificent and flawed. There are millions of known and millions of unknown errors in the digital transcriptions, which, mediated by mobile devices and for better or worse, will provide future scholars with the most common and often the only access to Early Modern print culture. The errors can and should be fixed by users over time.

“Citizen scholars” from high school students through undergraduates to retirees can make useful contributions. Over the past two years, Northwestern undergraduates have made substantial contributions to the correction of some 50,000 words in some 500 non-Shakespearean plays from 1550-1650. Experience has shown that some of the work can be “downsourced” to machines. The technical problems for a collaborative framework are not trivial, but with a modicum of trust and willingness to cooperate they can be solved. The key technical problem consists in creating an environment that lets people fix errors ‘en passant’, while working with texts they are interested in. An energetic project with the right balance of some centralization and a lot of distributed effort would produce significantly better texts over a five-year period.


Janelle Jenstad: Catch, Tag, and Release: Coordinating our Efforts to Build the Early Modern Corpus

The work of correcting EEBO-TCP texts is formidable. MoEML’s work with EEBO-TCP’s XML files shows that transcribers need to supply gaps, capture forme work, correct mis-transcriptions, and restore early modern typographical habits and idiosyncrasies. Only with many partners working in coordination will we be able to establish an accurate corpus suitable for text mining, copy-text editing, and critical editions. We might think of such work in terms of a “catch-tag-release” model, whereby various entities “catch” EEBO-TCP texts from the data stream, “tag” them in TEI Simple (developed by Mueller), correct both tagging and transcriptions through teams of emerging scholars, and then “release” the texts back into the scholarly wilds. Mueller has already described how a corrective tagging process might work, and the Folger’s Digital Anthology project prototypes a repository environment that will allow us to release texts back into the wild. We also need to capture corrective work that has already been done, such as the ISE’s transcriptions of the quarto and folio texts of Shakespeare’s plays. These transcriptions are highly accurate, having been double-keyed by research assistants, carefully checked by the play editors, and peer reviewed. Their markup predates the development of XML or TEI, but can be dynamically converted (with some effort) into TEI Simple for general “release” alongside other EEBO-TCP transcriptions. From this stage, we can use various XSLT scenarios to convert the TEI Simple both into the plaintext suitable for corpus-wide analyses and into a variety of XML forms suitable for web publication and further editorial work. The limitations of EEBO-TCP transcriptions and the effort required to correct them should make us mindful of the effect of “unevenness” across the corpus. The ISE proposes to replace reasonably good EEBO-TCP transcriptions of Shakespeare’s plays with excellent transcriptions. But what of the texts in which SAA members are less invested? Some of them have error rates of two or more errors per line. Which will we correct first? Will we bestow as much care and time on them as we have on Shakespeare? How will our answers to those questions affect the results of distant reading and data mining exercises?


Carl Stahmer, UC Davis, USA: “Social Curation: A Model for Peer Reviewed, Collaborative Collation of Metadata and Texts”

Since 1999, the Early English Books Online Text Creation Partnership (EEBO-TCP) has undertaken the gargantuan effort of making publicly available TEI encoded full-text versions of the Early English Books Online (EEBO) corpus. Like all projects of this magnitude, the text transcriptions in the corpus contain a variety of errors and omissions. Whether by hand or computer, textual transcription is a difficult and time consuming task that requires extensive editing and re-editing to produce accurate representations, and EEBO-TCP is no exception to this rule. On January 1, 2015, the EEBO-TCP corpus entered the public domain, opening the possibility for scholars outside of the TCP workforce to contribute to improving its accuracy. This work would, like the original creation of the texts, require a significant effort and would be best achieved by employing a wide and distributed body of scholars. To date, no infrastructure exists for managing this type of distributed textual scholarship. For the past three years the English Short Title Catalogue (ESTC), through the generous support of the Andrew W. Mellon Foundation, has been engaged in designing just such a social curation infrastructure for correcting and enhancing the bibliographic and holding metadata in its collection. The designed system, which is currently under production, will provide mechanisms for groups of scholars to engage in peer reviewed records management and improvement. This paper will investigate the ways in which this (or a similar) system could be leveraged to perform social curation of texts in the EEBO-TCP corpus.




Biographical statements

Jonathan Hope is Professor of Literary Linguistics at Strathclyde University, Glasgow. He is joint P-I on the Visualising English Print project, which is producing tools to work with the EEBO-TCP corpus, and was Director of EMDA2013 and EMDA2015, NEH Advanced Institutes in Digital Humanities, held at the Folger Shakespeare Library.

Meaghan Brown is CLIR-DLF Fellow for Data Curation in Early Modern Studies at the Folger Shakespeare Library. Her main project is a Digital Anthology of Early Modern English Drama. She is also the PI on the Identifying Early Modern Books project and writes for Folgerpedia.

Anupam Basu is Mark Steinberg Weil Early Career Fellow in Digital Humanities at Washington University, St Louis, where he is part of the Humanities Digital Workshop. The website Early Modern Print is leading the way in allowing users to search the EEBO-TCP database.

Laura Estill is an Assistant Professor of English at Texas A&M University, where she edits the World Shakespeare Bibliography. She is the author of Dramatic Extracts in Seventeenth-Century English Manuscripts: Watching, Reading, Changing Plays (2015). Her work has also appeared in The Oxford Handbook of Shakespeare, Shakespeare, Early Theatre, Huntington Library Quarterly, Studies in English Literature, and ArchBook: Architectures of the Book. She has articles forthcoming in Shakespeare Quarterly and Shakespeare and Textual Studies (Cambridge UP, 2015). She is currently working on DEx: A Database of Dramatic Extracts.

Gabriel Egan is Professor of Shakespeare Studies and Director of the Centre for Textual Studies at De Montfort University. He chairs the Advisory Board for JISC Historical Texts and has served as consultant on several mass digitization projects. He is a Technical Evaluator for the UK’s Arts and Humanities Research Council and a National Teaching Fellow of the UK’s Higher Education

Janelle Jenstad is Associate Professor of English at the University of Victoria. She directs The Map of Early Modern London (MoEML), which comprises a georeferenced critical edition of the Agas map, an encyclopedia of early modern London, an XML library of literary texts, and a versioned edition of Stow’s Survey of London. She is also Associate Coordinating Editor of the Internet Shakespeare Editions, for which she is editing The Merchant of Venice, and Lead Applicant on Linked Early Modern Drama Online. With Jennifer Roberts-Smith, she co-edited Shakespeare’s Language in Digital Media (forthcoming from Ashgate). Her essays have appeared in Shakespeare Bulletin, Elizabethan Theatre, EMLS, JMEMS, and other venues.

Martin Mueller is Professor of English and Classics at Northwestern University. He has written a book on the Iliad (1984, revised 2009) and “Children of Oedipus and other essays on the imitation of Greek tragedy, 1550-1800” (1980).

Carl Stahmer is Director of Digital Scholarship at University of California Davis Library, and Associate Director of the English Broadside Ballad Archive (EBBA). He is Technical Director of the English Short Title Catalogue (ESTC). While in the Marine Corps, Carl worked as a programmer on the ARPANET (Advanced Research Projects Agency Network). He left the Marines to pursue his Ph.D. in English, but the “ARPANET stuck with me, and I began to see strong connections between the way people there were talking about networks and exchange of information and the way people in English Departments were talking about how information gets puts together as narrative”.





Latour, the Digital Humanities, and the Divided Kingdom of Knowledge


Participants at the NLH Conference, “Recomposing the Humanities with Bruno Latour”: Barbara Herrnstein-Smith, Rita Felski, Bruno Latour, Nigel Thrift, Michael Witmore, Dipesh Chakrabarty, and Stephen Muecke.

The data and texts on this page serve as a companion to my article, “Latour, the Digital Humanities, and the Divided Kingdom of Knowledge” which appears in a special issue of New Literary History, 2016, 47:353-375. The article grew out of a presentation (“How to Exhaust Your Object”) which I made in September 2015 at a University of Virginia conference, organized by Rita Felski, “Recomposing the Humanities with Bruno Latour.”

Abstract: Talk about the humanities today tends to focus on their perceived decline at the expense of other, more technical modes of inquiry. The big S “sciences” of nature, we are told, are winning out against the more reflexive modes of humanistic inquiry that encompass the study of literature, history, philosophy, and the arts. This decline narrative suggests we live in a divided kingdom of disciplines, one composed of two provinces, each governed by its own set of laws. Enter Bruno Latour, who, like an impertinent Kent confronting an aging King Lear, looks at the division of the kingdom and declares it misguided, even disastrous. Latour’s narrative of the modern bifurcation of knowledge sits in provocative parallel with the narrative of humanities-in-decline: what humanists are trying to save (that is, reflexive inquiry directed at artifacts) was never a distinct form of knowledge. It is a province without borders, one that may be impossible to defend. We are now in the midst of a further plot turn with the arrival of digital methods in the humanities, methods that seem to have strayed into our province from the sciences.  As this new player weaves in and out of the plots I have just described, some interesting questions start to emerge. Does the use of digital methods in the humanities represent an incursion across battle lines that demands countermeasures, a defense of humanistic inquiry from the reductive methods of the natural or social sciences? Will humanists lose something precious by hybridizing with a strain of knowledge that sits on the far side of the modern divide? What is this precious thing that might be lost, and whose is it to lose?

The analysis presented in the article is based on a set of texts that were tagged (features were counted) using a tool called Ubiq+Ity, which counts features in texts specified by users or those captured by a default feature set known as Docuscope. The tool’s creation and associated research was funded by the Mellon Foundation under the “Visualizing English Print, 1530-1800” grant.

From this post, users can find the source texts, tagging code, data, and “marked up” Shakespeare plays as HTML documents (documents that show where the features “if,” “and,” or “but” occur in each of the 38 plays). The source texts were taken from the API created at the Folger Shakespeare Library for the Folger Editions, which are now available online. Thirty-eight Shakespeare plays were extracted from these online editions, excluding speech prefixes and stage directions, and then lightly curated (replacement of smart apostrophe with a regular one, emendation of é to e, insertion of spaces before and after em-dashes). Those texts were then uploaded in a zipped folder to Ubiq+Ity, along with a custom rules .csv that specified the features to be counted in this corpus (if, and, but). Once tagged, Ubiqu+Ity returned a .csv file containing the percentage counts for all of the plays. (I have removed some of the extraneous columns that do not pertain to the analysis, and added the genre metadata discussed in the article.) Ubiq+Ity also returned a set of dynamically annotated texts (HTML files of each individual play) that can be viewed in a browser, turning on and off the three features so that readers can see how and where they occur in the plays. Data from the counts were then visualized in three dimensions using the statistical software package JMP, which was also used to perform Student’s t-test. All of the figures from the article can be found here.
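The curation and counting steps described above can be approximated in a few lines of Python. This is a minimal sketch, not the tool's own code: the function names `curate` and `feature_percentages` are hypothetical, and the tokenization rule (runs of letters and apostrophes) and the choice of denominator (all word tokens) are assumptions that may differ from what the tagging tool actually does.

```python
import re
from collections import Counter

# The three features named in the custom rules file described above.
FEATURES = ("if", "and", "but")


def curate(text: str) -> str:
    """Apply the three light curation steps described in the post."""
    text = text.replace("\u2019", "'")          # smart apostrophe -> plain apostrophe
    text = text.replace("\u00e9", "e")          # e-acute -> plain e
    text = text.replace("\u2014", " \u2014 ")   # pad em-dashes with spaces
    return text


def feature_percentages(text: str) -> dict[str, float]:
    """Return each feature's share of all word tokens, as a percentage.

    Assumption: a token is a run of letters/apostrophes, case-insensitive.
    """
    tokens = re.findall(r"[a-z']+", curate(text).lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {f: 100.0 * counts[f] / total for f in FEATURES}


if __name__ == "__main__":
    sample = "If it rain\u2014and it may\u2014we stay, but Romeo\u2019s caf\u00e9 is open and dry"
    print(feature_percentages(sample))
```

Running the same function over each play and writing the results to a .csv would give a table comparable in shape to the one described, though the exact percentages depend on the tokenization assumption.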


Auerbach Was Right: A Computational Study of the Odyssey and the Gospels

Rembrandt, The Denial of St. Peter (1660), Rijksmuseum


In the “Fortunata” chapter of his landmark study, Mimesis: The Representation of Reality, Erich Auerbach contrasts two representations of reality, one found in the New Testament Gospels, the other in texts by Homer and a few other classical writers. As with much of Auerbach’s writing, the sweep of his generalizations is broad. Long excerpts are chosen from representative texts. Contrasts and arguments are made as these excerpts are glossed and related to a broader field of texts. Often Auerbach only gestures toward the larger pattern: readers of Mimesis must then generate their own (hopefully congruent) understanding of what the example represents.

So many have praised Auerbach’s powers of observation and close reading. At the very least, his status as a “domain expert” makes his judgments worth paying attention to in a computational context. In this post, I want to see how a machine would parse the difference between the two types of texts Auerbach analyzes, stacking the iterative model against the perceptions of a master critic. This is a variation on the experiments I have performed with Jonathan Hope, where we take a critical judgment (i.e., someone’s division of Shakespeare’s corpus of plays into genres) and then attempt to reconstruct, at the level of linguistic features, the perception which underlies that judgment. We ask, Can we describe what this person is seeing or reacting to in another way?

Now, Auerbach never fully states what makes his texts different from one another, which makes this task harder. Readers must infer both the larger field of texts that exemplify the difference Auerbach alludes to, and the difference itself as adumbrated by that larger field. Sharon Marcus is writing an important piece on this allusive play between scales — between reference to an extended excerpt and reference to a much larger literary field. Because so much goes unstated in this game of stand-ins and implied contrasts, the prospect of re-describing Auerbach’s difference in other terms seems particularly daunting. The added difficulty makes for a more interesting experiment.

Getting at Auerbach’s Distinction by Counting Linguistic Features

I want to offer a few caveats before outlining what we can learn from a computational comparison of the kinds of works Auerbach refers to in his study. For any of what follows to be relevant or interesting, you must take for granted that the individual books of the Odyssey and the New Testament Gospels (as they exist in translation from Project Gutenberg) represent adequately the texts Auerbach was thinking about in the “Fortunata” chapter. You must grant, too, that the linguistic features identified by Docuscope are useful in elucidating some kind of underlying judgments, even when it is used on texts in translation. (More on the latter and very important point below.) You must further accept that Docuscope, here version 3.91, has all the flaws of a humanly curated tag set. (Docuscope annotates all texts tirelessly and consistently according to procedures defined by its creators.) Finally, you must already agree that Auerbach is a perceptive reader, a point I will discuss at greater length below.

I begin with a number of excerpts that I hope will give a feel for the contrast in question, if it is a single contrast. This is Auerbach writing in the English translation of Mimesis:

[on Petronius] As in Homer, a clear and equal light floods the persons and things with which he deals; like Homer, he has leisure enough to make his presentation explicit; what he says can have but one meaning, nothing is left mysteriously in the background, everything is expressed. (26-27)

[on the Acts of the Apostles and Paul’s Epistles] It goes without saying that the stylistic convention of antiquity fails here, for the reaction of the casually involved person can only be presented with the highest seriousness. The random fisherman or publican or rich youth, the random Samaritan or adulteress, come from their random everyday circumstances to be immediately confronted with the personality of Jesus; and the reaction of an individual in such a moment is necessarily a matter of profound seriousness, and very often tragic. (44)

[on Gospel of Mark] Generally speaking, direct discourse is restricted in the antique historians to great continuous speeches…But here—in the scene of Peter’s denial—the dramatic tension of the moment when the actors stand face to face has been given a salience and immediacy compared with which the dialogue of antique tragedy appears highly stylized….I hope that this symptom, the use of direct discourse in living dialogue, suffices to characterize, for our purposes, the relation of the writings of the New Testament to classical rhetoric… (46)

[on Tacitus] That he does not fall into the dry and unvisualized, is due not only to his genius but to the incomparably successful cultivation of the visual, of the sensory, throughout antiquity. (46)

[on the story of Peter’s denial] Here we have neither survey and rational disposition, nor artistic purpose. The visual and sensory as it appears here is no conscious imitation and hence is rarely completely realized. It appears because it is attached to the events which are to be related… (47, emphasis mine)

There is a lot to work with here, and the difference Auerbach is after is probably always going to be a matter of interpretation. The simple contrast seems to be that between the “equal light” that “floods persons and things” in Homer and the “living dialogue” of the Gospels. The classical presentation of reality is almost sculptural in the sense that every aspect of that reality is touched by the artistic designs of the writer. One chisel carves every surface. The rendering of reality in the Gospels, on the other hand, is partial and (changing metaphors here) shadowed. People of all kinds speak, encounter one another in “their random everyday circumstances,” and the immediacy of that encounter is what lends vividness to the story. The visual and sensory “appear…because [they are] attached to the events which are to be related.” Overt artistry is no longer required to dispose all the details in a single, frieze-like scene. Whatever is vivid becomes so, seemingly, as a consequence of what is said and done, and only as a consequence.

These are powerful perceptions: they strike many literary critics as accurately capturing something of the difference between the two kinds of writing. It is difficult to say whether our own recognition of these contrasts, speaking now as readers of Auerbach, is the result of any one example or formulation that he offers. It may be the case, as Sharon Marcus is arguing, that Auerbach’s method works by “scaling” between the finely wrought example (in long passages excerpted from the texts he reads) and the broad generalizations that are drawn from them. The fact that I had to quote so many passages from Auerbach suggests that the sources of his own perceptions are difficult to discern.

Can we now describe those sources by counting linguistic features in the texts Auerbach wants to contrast? What would a quantitative re-description of Auerbach’s claims look like? I attempted to answer these questions by tagging and then analyzing the Project Gutenberg texts of the Odyssey and the Gospels. I used the latest version of Docuscope that is currently being used by the Visualizing English Print team, a program that scans a corpus of texts and then tallies linguistic features according to hand-curated sets of words and phrases called “Language Action Types” (hereafter, “features”). Thanks to the Visualizing English Print project, I can share the raw materials of the analysis. Here you can download the full text of everything being compared. Each text can be viewed interactively according to the features (coded by color) that have been counted. When you open any of these files in a web browser, select a feature to explore by pressing on the feature names to the left. (This “lights up” the text with that feature’s color.)

I encourage you to examine these texts as tagged by Docuscope for yourself. Like me, you will find many individual tagging decisions you disagree with. Because Docuscope assigns every word or phrase to one and only one feature (including the feature, “untagged”), it is doomed to imprecision and can be systematically off base. After some checking, however, I find that the things Docuscope counts happen often and consistently enough that the results are worth thinking about. (Hope and I found this to be the case in our Shakespeare Quarterly article on Shakespeare’s genres.) I always try to examine as many examples of a feature in context as I can before deciding that the feature is worth including in the analysis. Were I to develop this blog post into an article, I would spend considerably more time doing this. But the features included in the analysis here strike me as generally stable, and I have examined enough examples to feel that the errors are worth ignoring.
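To make the "one and only one feature" constraint concrete, here is a toy illustration of dictionary-based tagging in Python. It is a sketch under stated assumptions and not Docuscope's actual algorithm or dictionary: the greedy longest-match strategy, the `tag` function, and every entry in `TOY_DICTIONARY` (including the feature names) are invented for the example.

```python
# Hypothetical mini-dictionary: phrase (as a tuple of tokens) -> feature name.
# These entries and feature names are invented; they are not Docuscope's.
TOY_DICTIONARY = {
    ("all", "night"): "Duration",
    ("wine",): "Object",
    ("said",): "ReportedSpeech",
    ("but",): "Contrast",
}
MAX_PHRASE_LEN = max(len(phrase) for phrase in TOY_DICTIONARY)


def tag(tokens: list[str]) -> list[tuple[str, str]]:
    """Greedy longest-match tagging: every token lands in exactly one span."""
    tagged = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase starting at i, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(tokens) - i), 0, -1):
            phrase = tuple(t.lower() for t in tokens[i:i + n])
            if phrase in TOY_DICTIONARY:
                tagged.append((" ".join(tokens[i:i + n]), TOY_DICTIONARY[phrase]))
                i += n
                break
        else:
            # No dictionary entry starts here, so the token is "untagged".
            tagged.append((tokens[i], "untagged"))
            i += 1
    return tagged


if __name__ == "__main__":
    for span, feature in tag("They drank wine all night but".split()):
        print(f"{span!r}: {feature}")
```

Because each token is consumed by exactly one span, a word that could plausibly belong to two categories is counted under only one of them, which is the source of the imprecision noted above.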


We can say with statistical confidence (p < .001) that several of the features identified in this analysis occur significantly more often in one of the two types of writing than in the other. These and only these features are the ones I will discuss, starting with an example passage taken from the Odyssey. Names of highlighted features appear on the left hand side of the screen shot below, while words or phrases assigned to those features are highlighted in the text to the right. Again, items highlighted in the following examples appear significantly more often in the Odyssey than in the New Testament Gospels:
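The post does not say which test produced this p-value. As one hedged illustration of how a feature's per-text frequencies might be compared across two groups of texts, the sketch below computes Welch's t statistic and its approximate degrees of freedom using only the standard library. The function name `welch_t` is hypothetical and the sample numbers are made up, not taken from the analysis; converting the statistic to a p-value would need a t-distribution from a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    a and b hold one feature's frequency (e.g. percent of tokens) per text
    in each of the two groups being compared.
    """
    va = variance(a) / len(a)  # squared standard error of group a's mean
    vb = variance(b) / len(b)  # squared standard error of group b's mean
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


if __name__ == "__main__":
    # Made-up per-text percentages for one feature in two groups of texts.
    group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
    group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
    t, df = welch_t(group_a, group_b)
    print(f"t = {t:.4f}, df = {df:.2f}")
```

Welch's variant is chosen here only because it does not assume equal variances in the two groups; with roughly equal variances it gives results close to the classic Student's t-test.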

Odyssey, Book 1, Project Gutenberg Text (with discriminating features highlighted)


Book I is bustling with description of the sensuous world. Words in pink describe concrete objects (“wine,” “the house”, “loom”) while those in green describe things involving motion (verbs indicating an activity or change of state). Below are two further examples of such features:

[Two screenshots of tagged Odyssey passages]

Notice also the purple features above, which identify words involved in mediating spatial relationships. (I would quibble with “hearing” and “silence” as being spatial, per the long passage above, but in general I think this feature set is sound.) Finally, in yellow, we find a rather simple thing to tag: quotation marks at the beginning and end of a paragraph, indicating a long quotation.

Continuing on to a shorter set of examples, orange features in the passages below and above identify the sensible qualities of a thing described, while blue elements indicate words that extend narrative description (“. When she” “, and who”) or words that indicate durative intervals of time (“all night”). Again, these are words and phrases that are more prevalent in the Homeric text:

[Three screenshots of tagged Odyssey passages]

The items in cyan, particularly “But” and “, but”, are interesting, since both continue a description by way of contrast. This translation of the Odyssey is full of such contrastive words, for example, “though”, “yet,” “however”, “others”, many of which are mediated by Greek particles in the original.

When quantitative analysis draws our attention to these features, we see that Auerbach’s distinction can indeed be tracked at this more granular level. Compared with the Gospels, the Odyssey uses significantly more words that describe physical and sensible objects of experience, contributing to what Auerbach calls the “successful cultivation of the visual.” For these texts to achieve the effects Auerbach describes, one might say that they can’t not use concrete nouns alongside adjectives that describe sensuous properties of things. Fair enough.

Perhaps more interesting, though, are those features below in blue (signifying progression, duration, addition) and cyan (contrastive particles), features that manage the flow of what gets presented in the diegesis. If the Odyssey can’t not use these words and phrases to achieve the effect Auerbach is describing, how do they contribute to the overall impression? Let’s look at another sample from the opening book of the Odyssey, now with a few more examples of these cyan and blue words:

Odyssey, Book 1, Project Gutenberg Text (with discriminating features highlighted)


While this is by no means the only interpretation of the role of the words highlighted here, I would suggest that phrases such as “when she”, “, and who”, or “, but” also create the even illumination of reality to which Auerbach alludes. We would have to look at many more examples to be sure, but these types of words allow the chisel to remain on the stone a little longer; they continue a description by in-folding contrasts or developments within a single narrative flow.

Let us now turn to the New Testament Gospels, which lack the above features but contain others to a degree that is statistically significant (i.e., we are confident that the generally higher measurements of these new features in the Gospels are not so by chance, and vice versa). I begin with a longer passage from Matthew 22, then a short passage from Peter’s denial of Jesus at Matthew 26:71. Please note that the colors employed below correspond to different features than they do in the passages above:

Gospel of Matthew, Project Gutenberg Text (with discriminating features highlighted)

Matthew 22, Project Gutenberg Text (with discriminating features highlighted)

Matthew 26:71, Project Gutenberg Text (with discriminating features highlighted)

The dialogical nature of the Gospels is obvious here. Features in blue, indicating reports of communication events, are indispensable for representing dialogical exchange (“he says”, “said”, “She says”). Features in orange, which indicate uses of the third person pronoun, are also integral to the representation of dialogue; they indicate who is doing the saying. The features in yellow represent (imperfectly, I think) words that reference entities carrying communal authority, words such as “lordship,” “minister,” “chief,” “kingdom.” (Such words do not indicate that the speaker recognizes that authority.) Here again it is unsurprising that the Gospels, which contrast spiritual and secular forms of obligation, would be obliged to make repeated reference to such authoritative entities.

Things that happen less often may also play a role in differentiating these two kinds of texts. Consider now a group of features that, while present to a higher and statistically significant degree in the Gospels, are nevertheless relatively infrequent in comparison to the dialogical features immediately above. We are interested here in the words highlighted in purple, pink, gray and green:

Matthew 13:5-6, Project Gutenberg Text (with discriminating features highlighted)

Matthew 27:54, Project Gutenberg Text (with discriminating features highlighted)

Matthew 23:16-17, Project Gutenberg Text (with discriminating features highlighted)

Features in purple mark the process of “reason giving”; they identify moments when a reader or listener is directed to consider the cause of something, or to consider an action’s (spiritually prior) moral justification. In the quotation from Matthew 13, this form of backward looking justification takes the form of a parable (“because they had not depth…”). The English word “because” translates a number of ancient Greek words (διὰ, ὅτι); even a glance at the original raises important questions about how well this particular way of handling “reason giving” in English tracks the same practice in the original language. (Is there a qualitative parity here? If so, can that parity be tracked quantitatively?) In any event, the practice of letting a speaker — Jesus, but also others — reason aloud about causal or moral dependencies seems indispensable to the evangelical programme of the Gospels.

To this rhetoric of “reason giving” we can add another of proverbiality. The word “things” in pink (τὰ in the Greek) is used more frequently in the Gospels, as are words such as “whoever,” which appears here in gray (for Ὃς and ὃς). We see comparatively higher numbers of the present tense form of the verb “to be” in the Gospels as well, here highlighted in green (“is” for ἐστιν). (See the adage, “many are called, but few are chosen” in the longer Gospel passage from Matthew 22 excerpted above, translating Πολλοὶ γάρ εἰσιν κλητοὶ ὀλίγοι δὲ ἐκλεκτοί.)

These features introduce a certain strategic indefiniteness to the speech situation: attention is focused on things that are true from the standpoint of time immemorial or prophecy. (“Things” that just “are” true, “whatever” the case, “whoever” may be involved.) These features move the narrative into something like an “evangelical present” where moral reasoning and prophecy replace description of sensuous reality. In place of concrete detail, we get proverbial generalization. One further effect of this rhetoric of proverbiality is that the searchlight of narrative interest is momentarily dimmed, at least as a source illuminating an immediate physical reality.

What Made Auerbach “Right,” And Why Can We Still See It?

What have we learned from this exercise? Answering the most basic question, we can say that, after analyzing the frequency of a limited set of verbal features occurring in these two types of text (features tracked by Docuscope 3.91), we find that some of those features distribute unevenly across the corpus, and do so in a way that tracks the two types of texts Auerbach discusses. We have arrived, then, at a statistically valid description of what makes these two types of writing different, one that maps intelligibly onto the conceptual distinctions Auerbach makes in his own, mostly allusive analysis. If the test was to see if we can re-describe Auerbach’s insights by other means, Auerbach passes the test.

But is it really Auerbach who passes? I think Auerbach was already “right” regardless of what the statistics say. He is right because generations of critics recognize his distinction. What we were testing, then, was not whether Auerbach was “right,” but whether a distinction offered by this domain expert could be re-described by other means, at the level of iterated linguistic features. The distinction Auerbach offered in Mimesis passes the re-description test, and so we say, “Yes, that can be done.” Indeed, the largest sources of variance in this corpus — features with the highest covariance — seem to align independently with, and explicitly elaborate, the mimetic strategies Auerbach describes. If we have hit upon something here, it is not a new discovery about the texts themselves. Rather, we have found an alternate description of the things Auerbach may be reacting to. The real object of study here is the reaction of a reader.

Why insist that it is a reader’s reactions and not the texts themselves that we are describing? Because we cannot somehow deposit the sum total of the experience Auerbach brings to his reading in the “container” that is a text. Even if we are making exhaustive lists of words or features in texts, the complexity we are interested in is the complexity of literary judgment. This should not be surprising. We wouldn’t need a thing called literary criticism if what we said about the things we read exhausted or fully described that experience. There’s an unstatable fullness to our experience when we read. The enterprise of criticism is the ongoing search for ever more explicit descriptions of this fullness. Critics make gains in explicitness by introducing distinctions and examples. In this case, quantitative analysis extends the basic enterprise, introducing another searchlight that provides its own, partial illumination.

This exercise also suggests that a mimetic strategy discernible in one language survives translation into another. Auerbach presents an interesting case for thinking about such survival, since he wrote Mimesis while in exile in Istanbul, without immediate access to all of the sources he wanted to analyze. What if Auerbach was thinking about the Greek texts of these works while writing the “Fortunata” chapter? How could it be, then, that at least some of what he was noticing in the Greek carries over into English via translation, living to be counted another day? Readers of Mimesis who do not know ancient Greek still see what Auerbach is talking about, and this must be because the difference between classical and New Testament mimesis depends on words or features that can’t be omitted in a reasonably faithful translation. Now a bigger question comes into focus. What does it mean to say that both Auerbach and the quantitative analysis converge on something non-negotiable that distinguishes these two types of writing? Does it make sense to call this something “structural”?

If you come from the humanities, you are taking a deep breath right about now. “Structure” is a concept that many have worked hard to put in the ground. Here is a context, however, in which that word may still be useful. Structure or structures, in the sense I want to use these words, refers to whatever is non-negotiable in translation and, therefore, available for description or contrast in both qualitative and quantitative terms. Now, there are trivial cases that we would want to reject from this definition of structure. If I say that the Gospels are different from the Odyssey because the word Jesus occurs more frequently in the former, I am talking about something that is essential but not structural. (You could create a great “predictor” of whether a text is a Gospel by looking for the word “Jesus,” but no one would congratulate you.)

If I say, pace Auerbach, that the Gospels are more dialogical than the Homeric texts, and so that English translations of the same must more frequently use phrases like “he said,” the difference starts to feel more inbuilt. You may become even more intrigued to find that other, less obvious features contribute to that difference which Auerbach hadn’t thought to describe (for example, the present tense forms of “to be” in the Gospels, or pronouns such as “whoever” or “whatever”). We could go further and ask, Would it really be possible to create an English translation of Homer or the Gospels that fundamentally avoids dialogical cues, or severs them from the other features observed here? Even if, like the translator of Perec’s La Disparition, we were extremely clever in finding a way to avoid certain features, the resulting translation would likely register the displacement in another form. (That difference would live to be counted another way.) To the extent that we have identified a set of necessary, indispensable, “can’t not occur” features for the mimetic practice under discussion, we should be able to count it in both the original language and a reasonably faithful translation.

I would conjecture that for any distinction to be made among literary texts, there must be a countable correlate in translation for the difference being proposed. No correlate, no critical difference — at least, if we are talking about a difference a reader could recognize. Whether what is distinguished through such differences is a “structure,” a metaphysical essence, or a historical convention is beside the point. The major insight here is that the common ground between traditional literary criticism and the iterative, computational analysis of texts is that both study “that which survives translation.” There is no better or more precise description of our shared object of study.

Posted in Counting Other Things, Quant Theory

Visualizing the Big Names of Early Modern Science

Visualizing English Print is currently working with a new corpus of Big Name scientific texts. The corpus contains 329 texts by 100 authors, drawn from EEBO-TCP and covering the period 1530-1724. These Big Name authors were selected on the basis of their prominence as early modern writers who address scientific subjects. The process of selecting Big Name authors involved searching for the most well-known and influential figures of the period (e.g. Francis Bacon, Robert Boyle, René Descartes), followed by a search for key scientific terms in the metadata of the EEBO-TCP CSV file (e.g. ‘Physics’, ‘Astronomy’, ‘Atoms’, ‘Matter’).
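The keyword stage of this selection can be sketched in a few lines of Python. This is an illustration only, not the procedure VEP actually ran: the column names (tcp_id, author, keywords), the identifiers and the sample rows are all hypothetical, and the real EEBO-TCP metadata file may be laid out quite differently.

```python
import csv
import io

# Search terms quoted in the post; a real list would be longer.
SCIENCE_TERMS = ("physics", "astronomy", "atoms", "matter")

# Hypothetical stand-in for the metadata file; real column names may differ.
SAMPLE_CSV = """tcp_id,author,keywords
X0001,Author One,Astronomy -- Early works to 1800
X0002,Author Two,Sermons -- 17th century
X0003,Author Three,Matter -- Constitution
"""


def find_science_rows(handle, terms=SCIENCE_TERMS, field="keywords"):
    """Return rows whose subject field mentions any of the search terms."""
    matches = []
    for row in csv.DictReader(handle):
        subject = (row.get(field) or "").lower()
        if any(term in subject for term in terms):
            matches.append(row)
    return matches


candidates = find_science_rows(io.StringIO(SAMPLE_CSV))
candidate_ids = [row["tcp_id"] for row in candidates]
```

Plain substring matching of this kind is deliberately loose, so a list produced this way would still need checking by hand before any text was admitted to the corpus.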

What types of texts constitute ‘scientific texts’?

Since the ‘genres’ of early modern science were diverse (and the disciplinary boundaries rather fluid) the parameters of the corpus had to reflect this diversity. To this end, the corpus is divided into scientific subgenres (detailed below). Because of the ways in which these genres intersect, texts are assigned to subgenres based on the prominence of a particular feature of the work. For example, there are generic crossovers with texts on Mathematics, Astronomy and Instruments – Astronomy relies on geometry, and there are a number of mathematical instruments. If the texts appear to foreground Astronomy or Instruments they are assigned to the relevant groupings. This approach is observed with every subgenre.

With the finalised corpus we have been running some preliminary PCA experiments to see if any interesting patterns emerge. The following PCA visualizations provide a general overview of the data, and some first impressions of how the scientific subgenres are patterning. The diagrams below were produced using JMP.

Overview


This is a PCA visualization of the complete data set of the 329 texts. LATs with frequent zero values have been excluded.
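For readers who want to try something similar without JMP, here is a minimal sketch of the two steps just mentioned: dropping sparse columns, then finding a first principal axis. It makes several assumptions that are not in the post: the 50% zero threshold is arbitrary (the post does not say how ‘frequent’ was defined), the function names are invented, the toy numbers are made up, and the axis is found by power iteration on the covariance matrix, which is a simplification and need not match JMP’s settings.

```python
import random


def drop_sparse_columns(rows, max_zero_share=0.5):
    """Drop columns in which more than max_zero_share of the values are zero."""
    n_rows = len(rows)
    keep = [
        j for j in range(len(rows[0]))
        if sum(1 for row in rows if row[j] == 0) / n_rows <= max_zero_share
    ]
    return [[row[j] for j in keep] for row in rows], keep


def first_component(rows, iterations=200, seed=0):
    """Approximate the first principal axis by power iteration on the
    sample covariance matrix, and project each row onto it."""
    n, m = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in rows]
    cov = [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(m)]
        for a in range(m)
    ]
    rng = random.Random(seed)
    vec = [rng.random() for _ in range(m)]  # random start avoids most degenerate cases
    for _ in range(iterations):
        nxt = [sum(cov[a][b] * vec[b] for b in range(m)) for a in range(m)]
        norm = sum(x * x for x in nxt) ** 0.5
        if norm == 0:  # no variance left to find
            break
        vec = [x / norm for x in nxt]
    scores = [sum(r[j] * vec[j] for j in range(m)) for r in centred]
    return vec, scores


# Toy matrix: four texts by three hypothetical LATs (rates are invented).
toy = [
    [2.0, 1.0, 0.0],
    [4.0, 2.0, 0.0],
    [6.0, 3.0, 0.0],
    [8.0, 4.0, 1.0],
]
filtered, kept = drop_sparse_columns(toy)
axis, scores = first_component(filtered)
```

In the toy data the third column is zero for three of the four texts, so it is dropped before the axis is computed; the scores then place each text along the single direction of greatest variance.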


Here is a visualization with the subgenres highlighted:


Astronomy = Blue

Mathematics = Dark Green

Instruments = Red

Physics = Lilac

Philosophy of Science = Green

Science/Religion = Lime Green

Natural History = Purple

Occultism = Orange

Medicine-Anatomy = Brown

Medicine-Disorders = Green/Brown

Medicine-Treatments = Light Blue

There is a lot of information to process in this image, but if we isolate and compare specific subgenres, clearer patterns begin to emerge. For example, this image maps Astronomy:


And this image shows the related subgenres of Mathematics (Green) and Instruments (Red):


Mathematics and Instruments appear to be grouping in the upper left of the PCA space. The LATs associated with this space include Abstract Concepts, Space Relation and MoveBody – types of language that deal with the special terminology of abstraction, and with bodies extended and moving in space. Such a result would seem to confirm what we might expect of Mathematics – a genre concerned with representing the physical world and physical processes through abstraction. Astronomy is interesting in that, while it appears to share in the traits of abstraction and space, it is also drawn into the other three quadrants. Why might this be the case?

A possible reason may be that people studied the stars and planets, not just from a mathematical perspective, but from an imaginative one; astronomy is bound up with mythology and astrology. In the visualization above, one of the Astronomical outliers is marked with a triangle in the lower right of the PCA space – an area that contains LATs such as Positivity, Person Property and Personal Pronoun. The text in question is Robert Greene’s Planetomachia (1585), which blends classical mythology, religion and astrology in the form of a dramatic dialogue between the planets. The text has a high frequency of Abstract Concepts and Sense Objects; but it is drawn into this lower right quadrant by LATs such as Personal Pronouns, Person Property (formal titles, identity roles), Subjective Perception, and positive and negative language:


What we are seeing here is a division in the PCA space between broadly imaginative and instructional modes of writing.


This visualisation displays the three subgenres of Medicine: Medicine-Anatomy (Green), Medicine-Disorders (Red) and Medicine-Treatments (Blue):


The majority of the Medicine subgenres are patterning across the lower half of the PCA space. The LATs of the right quadrant include Support (i.e. justify an argument), Responsibility (i.e. answerability for a certain state of affairs) and Reassure (words of comfort) – all of which we can imagine in a medical context. On the left, we have the LATs Recurrence (over time), Imperative and Reporting Event. Medicine-Treatments (Blue) appears to be inclining towards the left; the reason for this may be the high frequency of Reporting Events found in these texts. The Docuscope definition of Reporting Events is: ‘learning about events that may not be known yet or that can lead to the learning of new information’. In these texts, however, the tagged language reads very much like imperative language: a recipe, or the instructions and recommendations of a doctor. See, for example, this extract from Sir Kenelm Digby’s Choice and experimented receipts in physick and chirurgery (1675):


Philosophy of Science & Science/Religion

Philosophy of Science and Science/Religion are also subgenres that share common features/generic crossovers. In these subgenres we find a number of the famous early modern scientists (or scientific theorists), who are concerned with questions of methodology, morality and science as a system of knowledge – figures like Francis Bacon, René Descartes and Margaret Cavendish. Here is a visualization of Philosophy of Science:


And here is Science/Religion:


Both subgenres are drawn towards the right of the PCA space. In the upper quadrant we find LATs such as Confidence, Uncertainty, Question, Contingency and Common Authorities, types of language that may be used in the service of discursive writing. For example, here is an extract from Bacon’s Novum Organum (1676):


The common occurrence of Subjective Perception (i.e. observation that tells us as much about the perceiver as the perceived) throughout this text is also a feature of the right side of the PCA space, where we find language that deals with subjectivity, inner thought and the disclosure of personal opinion.

In the Science/Religion visualization above, an extreme outlier in the upper right quadrant is marked in dark purple. Compared to the rest of the corpus, this text scores very high on LATs such as Private Thinking, First Person, Self-Disclosure and Uncertainty. Such a result may not be so surprising when we discover that this text is Descartes’ Meditations (1680); but it does, perhaps, indicate a gulf in scientific style, whereby the inquiring ‘subject’ begins to feature almost as much as the ‘object’ of inquiry:


What Next?

The notion of subjectivity raises some interesting questions about the scientific texts we see drawn into the upper right quadrant, compared to the rest of the corpus. For example, does this indicate a more modern, or individualistic approach to the study of science? If so, how do we square emerging scientific ideas of objectivity and ‘matters of fact’ with the subjective perception of the individual who reports these facts?

To begin answering these questions, we aim to examine a group of texts that are commonly thought of as ‘defining’ modern scientific discourse, against those that are/were considered archaic – namely the writings of the Royal Society and the literature of the Occult.

Watch this space.




Data and Metadata

(Post by Jonathan Hope and Beth Ralston; data preparation by Beth Ralston.)

It is all about the metadata. That and text processing. Currently (July 2015) Visualising English Print (Strathclyde branch) is focussed on producing a hand-curated list of all ‘drama’ texts up to 1700, along with checked, clean metadata. Meanwhile VEP (Wisconsin branch) works on text processing (accessing TCP texts in a suitable format, cleaning up rogue characters, splitting up collected volumes into individual plays, stripping-out speech prefixes and non-spoken text, modernising/regularising).
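One of the processing steps listed above, stripping out speech prefixes and non-spoken text, can be illustrated with a short sketch. It is not the VEP (Wisconsin) pipeline, which is not described in this post. It assumes TEI-style markup in which speech prefixes sit in speaker elements and stage directions in stage elements; the fragment and the function name are invented for the example.

```python
import xml.etree.ElementTree as ET

# Invented TEI-style fragment. Real TCP files are far larger and may declare
# an XML namespace, in which case the tag names would carry a prefix.
SAMPLE = """<div type="scene">
  <stage>Enter two Gentlemen.</stage>
  <sp><speaker>1. Gent.</speaker><l>Good morrow, sir.</l></sp>
  <sp><speaker>2. Gent.</speaker><l>And to you.</l> <stage>Exit.</stage></sp>
</div>"""

# Elements treated as non-spoken text in this sketch.
NON_SPOKEN = {"speaker", "stage"}


def spoken_text(element):
    """Collect text beneath an element, skipping speech prefixes and stage directions."""
    parts = []
    if element.text:
        parts.append(element.text)
    for child in element:
        if child.tag not in NON_SPOKEN:
            parts.append(spoken_text(child))
        # Tail text follows the child's closing tag, so it belongs to the parent.
        if child.tail:
            parts.append(child.tail)
    return "".join(parts)


root = ET.fromstring(SAMPLE)
# Collapse the whitespace left behind by the removed elements.
spoken = " ".join(spoken_text(root).split())
```

Which elements count as ‘non-spoken’ is itself a curatorial decision; songs, letters read aloud and headings would each need a ruling in a real pipeline.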

We are not the only people doing this kind of work on Early Modern drama: Meaghan Brown at The Folger Shakespeare Library is working on a non-Shakespearean corpus, and Martin Mueller has just released the ‘Shakespeare His Contemporaries’ corpus. We’ve been talking to both, and we are very grateful for their help, advice, and generosity with data. In a similar spirit, we are making our on-going metadata collections available – we hope they’ll be of use to people, and that you will let us know of any errors and omissions.

You are welcome to make use of this metadata in any way you like, though please acknowledge the support of Mellon to VEP if you do, and especially the painstaking work of Beth Ralston, who has compared and cross-checked the various sources of information about Early Modern plays.

We hope to be in a position to release tagged texts once we have finalised the make-up of the corpus, and established our processing pipeline. Watch this space.

Many of the issues surrounding the development of usable corpora from EEBO-TCP will be discussed at SAA in 2016 in a special plenary round-table:

SAA session


In preparing these lists of plays and metadata we have made extensive use of Martin Wiggins and Catherine Richardson, British Drama 1533-1642: A Catalogue (Oxford), Alfred Harbage, Annals of English Drama 975-1700, the ESTC, and, most of all, Zach Lesser and Alan Farmer’s DEEP (Database of Early English Playbooks).

Definitions and History 

One of the usefully bracing things about digital work is that it forces you to define your terms precisely – computers are unforgiving of vagueness, so a request for a corpus of ‘all’ Early Modern drama turns out to be no small thing. Of course everyone defines ‘all’, ‘Early Modern’ and ‘drama’ in slightly different ways – and those using these datasets should be aware of our definitions, and of the probability that they will want to make their own.

The current cut-off date for these files is the same as DEEP – 1660 (though one or two post-1660 plays have sneaked in). Before long, we will extend them to 1700.

By ‘drama’ we mean plays, masques, and interludes. Some dialogues and entertainments are included in the full data set, but we have not searched deliberately for them. We have included everything printed as a ‘play’, including closet dramas not intended for performance.

The immediate history of the selection is that we began with a ‘drama’ corpus chosen automatically by Martin Mueller (using XML tags in the TCP texts to identify dramatic genres). Beth Ralston then checked this corpus against the reference sources listed above for omissions, adding a considerable number of texts. This should not be regarded as ‘the’ corpus of Early Modern drama: it is one of many possible versions, and will continue to change as more texts are added to TCP (there are some transcriptions still in the TCP pipeline, and scholars are working on proposals to continue transcription of EEBO texts after TCP funding ends).

It is likely that each new scholar will want to re-curate a drama corpus to fit their research question – VEP is working on tools to allow this to be done easily.

Files and corpora

1  The 554 corpus

This spreadsheet lists only what we regard as the ‘central’ dramatic texts: plays.

Entertainments, masques, interludes, and dialogues are not included. We have also excluded around 35 play transcriptions in TCP which duplicate transcriptions of the same play made from different volumes (usually a collected edition and a stand-alone quarto).

The spreadsheet includes frequency counts for Docuscope LATs, tagged by Ubiquity, which can be visualised using any statistical analysis program (columns W-EE). For a descriptive list of the LATs, see <Docuscope LATs: descriptions>. For a description of all columns in the spreadsheet, see the <README> file.

[In some of their early work, Hope and Witmore used a corpus of 591 plays which included these duplicates.]
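If you load the 554 spreadsheet into a script rather than a statistics package, the LAT counts have to be located by position. The sketch below assumes the spreadsheet has been exported to CSV and that each row is read as a plain list of cells (for example with Python’s csv.reader); the function names are our own, and the column letters are those given above.

```python
def column_index(letters):
    """Convert a spreadsheet column label such as 'W' or 'EE' to a zero-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


# The LAT frequency counts are described as occupying columns W to EE.
LAT_FIRST = column_index("W")
LAT_LAST = column_index("EE")


def lat_counts(row):
    """Slice the LAT frequency cells (columns W to EE inclusive) out of one row."""
    return row[LAT_FIRST:LAT_LAST + 1]
```

Read this way, columns W to EE inclusive span 113 cells per row, so a quick length check on the slice is a cheap test that an export has not shifted the columns.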

554 metadata

README for 554 metadata

Docuscope LATs: descriptions 


2  The 704 corpus

The 704 corpus spreadsheet lists information for the 554 plays included above, and adds other types of dramatic text, such as masques, entertainments, dialogues, and interludes (mainly drawn from DEEP, and with the same date cut-off: 1660). This corpus also includes the 35 duplicate transcriptions excluded from the 554 spreadsheet.

Docuscope frequency counts are only available for items also in the 554 spreadsheet.

704 metadata

README for 704 metadata


3  The master metadata spreadsheet

Our ‘master metadata’ spreadsheet is intended to be as inclusive as possible. The current version has 911 entries, and we have sought to include a listing for every extant, printed ‘dramatic’ work we know about up to 1660 (from DEEP, Harbage, ESTC, and Wiggins). The spreadsheet does not include every edition of every text, but it does include the duplicate texts found in the 704 corpus. (When we extend the cut-off date to 1700, we expect the number of entries in this spreadsheet to exceed 1500.)

This master list includes all the texts in the 704 list (and therefore the 554 list as well). But it also includes:
• plays which are in TCP but which do not appear in the 554 or 704 corpora (i.e. they were missed first time round). These texts have ‘yes’ in the ‘missing from both’ column (M) of the master spreadsheet.
• plays which are absent from TCP at this time (we note possible reasons for this: some are in Latin, some are fragments, and we assume some have yet to be transcribed). These are texts which have ‘yes’ listed in the ‘missing from both’ column (M) of the master spreadsheet, as well as ‘not in tcp’ listed in the ‘tcp’ column (A).
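The two flag columns described above are enough to sort every row of the master list into one of three groups. The following sketch assumes the spreadsheet has been exported to CSV with header names matching the column descriptions (‘missing from both’ and ‘tcp’), and that each row is read as a dictionary; the sample rows and identifiers are invented.

```python
def classify(row):
    """Label a master-spreadsheet row using the two flag columns."""
    missing = (row.get("missing from both") or "").strip().lower() == "yes"
    absent = (row.get("tcp") or "").strip().lower() == "not in tcp"
    if missing and absent:
        return "absent from TCP"
    if missing:
        return "in TCP but missed by 554 and 704"
    # Everything else is assumed to be in the 704 list already.
    return "in the 704 list"


# Invented rows; the 'tcp' values here are placeholders, not real TCP ids.
sample_rows = [
    {"tcp": "X0001", "missing from both": ""},
    {"tcp": "X0002", "missing from both": "yes"},
    {"tcp": "not in tcp", "missing from both": "yes"},
]
labels = [classify(row) for row in sample_rows]
```

A count of each label is a useful sanity check against the totals quoted for the 704 and master lists.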

master metadata

README for master metadata


TCP transcriptions

TCP is one of the most important Humanities projects ever undertaken, and scholars should be grateful for the effort and planning that has gone into it, as well as the free release of its data. It is not perfect, however: as well as the issue of texts being absent from TCP, we are also currently dealing with problematic transcriptions on a play-by-play basis. Take Jonson’s 1616 folio (TCP: A04632, ESTC: S112455) for example – it has a very fragmentary transcription, especially during the masques.


First page of The Irish Masque


In the above image from The Irish Masque, you can see on the right-hand side that the text for this page is not available.


Second page of The Irish Masque

…However, on the next page the text is there (as far as we can work out, this seems to be due to problems with the original imaging of the book, rather than the transcribers).

Texts with fragmentary transcriptions have been excluded for now, assuming that at some point in the future TCP will re-transcribe them.

As we come across other examples of this, we will add them to page

Posted in Early Modern Drama, Shakespeare, Uncategorized, Visualizing English Print (VEP)