The Great Work Begins: EEBO-TCP in the wild

SAA2016 plenary round table

Session organiser: Jonathan Hope, Strathclyde University, UK

Objectives of the session

The release of EEBO-TCP phase 1 on 1st January 2015 was a beginning, not an end. This round table will consider the work to be done to, and with, EEBO-TCP: curation, amelioration, and criticism.

What are the ongoing processes necessary to improve the texts and their metadata? Who should carry these out? How can this work be coordinated and preserved? What are the possibilities for teaching and research with the texts? What tools are available now, and what are desirable for the future? What are the limitations of the TCP corpus, and the dangers of the lure of ‘completeness’?

Participants have been selected with a view to a focus on the EEBO-TCP corpus itself, what needs to be done to the data in the short and medium term to allow the best possible informed use, and how the subject area should organize itself to achieve this.

EEBO-TCP links

about the texts

http://www.textcreationpartnership.org/tcp-eebo/

http://blogs.bodleian.ox.ac.uk/eebotcp/

http://hfroehli.ch/tag/eebo-tcp/

get the texts

https://ota.ox.ac.uk/tcp/

http://quod.lib.umich.edu/e/eebogroup/

https://github.com/textcreationpartnership/Texts

fix the texts

http://annolex.at.northwestern.edu/about/

search, tag, visualise the texts

http://earlyprint.wustl.edu

http://vep.cs.wisc.edu

Storify of #Shakeass16 tweets during the session (thanks Meaghan!)

https://storify.com/EpistolaryBrown/the-great-work-begins

Participants

Meaghan Brown, Folger Shakespeare Library, Washington DC, USA

Anupam Basu, Washington University, St Louis, USA

Laura Estill, Texas A&M, USA

Gabriel Egan, De Montfort University, UK

Martin Mueller, Northwestern University, USA

Janelle Jenstad, University of Victoria, Canada

Carl Stahmer, UC Davis, USA

Abstracts and session outline

0: Jonathan Hope: introductions and overview; difference between EEBO and TCP; phase 1 and phase 2 TCP; what we mean by ‘search’, ‘curation’, ‘modernisation’.

1: potential and use cases

Meaghan Brown: Origin stories and other bibliographical tales: representing and recording digital developments in the Folger’s Digital Anthology of Early Modern English Drama

paper slides

The Folger’s Digital Anthology of Early Modern English Drama seeks to become a hub for exploring the dramatic publications of early modern playwrights other than Shakespeare. Building on the transcriptions produced by the EEBO-TCP and the encoding of Martin Mueller’s Shakespeare His Contemporaries project, we aim to present documentary editions of early modern plays in their bibliographic and developmental context. In our prototype metadata portal, constructed by the Roy Rosenzweig Center for History and New Media at George Mason University, you will be able to browse a company’s repertoire, an author’s oeuvre, or a printer’s output, as well as search for a specific play. On each play page, you’ll also see the encoding history of the represented first edition, and follow it from the catalogue record of the library which holds the volume depicted to its EEBO-TCP transcription, access its SHC encoding, and finally read, download, and manipulate it as encoded by the Folger’s Digital Anthology editors. We will provide reliable and flexible encoded texts to serve as the basis for a range of traditional and digital research inquiries, pedagogical exercises, and editorial endeavors, while being transparent about the implications of a corpus derived from individual copies of specific, often problematic playbooks. In June 2016, the Folger will hold the first in a series of workshops to explore the pedagogical potential of this corpus.

Anupam Basu

Overview of Early Print http://earlyprint.wustl.edu

2: limits and bounds

Laura Estill: “EEBO-TCP: The Searchable (Print) Text and Manuscript Studies”

paper slides

The Early English Books Online Text Creation Partnership (EEBO-TCP) makes an unprecedented number of early modern texts searchable, which changes the way we research. Now, when faced with a print or manuscript miscellany full of, well, miscellaneity, researchers can go about finding out if the commonplaces, epithets, or turns of phrase have potential print sources. Previously, researchers were limited to first-line indices (and therefore poetry), Project Gutenberg’s poor OCR (Optical Character Recognition, automated text digitization), or the un-scholarly “I’m feeling lucky” Google approach. The danger of EEBO-TCP is the myth of comprehensive searching: the lure of the universal library. EEBO-TCP is a carefully selected corpus, but is far from representing all printed works in English. It is especially imperative that students and scholars recognize EEBO-TCP’s (ever-expanding) limits: the size of its corpus, its metadata, and the search functionality. Manuscript studies cannot be separated from texts and book history any more than manuscripts can be disentangled from print sources in the early modern period. EEBO-TCP will make new editions of manuscripts and new digital projects possible; if we can understand the bounds of EEBO-TCP, we can better understand early modern textual cultures.

Gabriel Egan: Satisfying the Need for Determinate Searching: Labs, APIs, and Search Engines

paper

This talk is concerned with satisfying users who need to speak authoritatively about the presence and absence of particular words and phrases in a large dataset such as EEBO-TCP. (A typical application with this need is an authorship attribution study based on preferred phrasing.) As an alternative to providing a website for users to manually enter the terms they wish to search for and, optionally, the relationships between those terms, it is possible to provide an Application Programming Interface (API) that enables the user’s own software to interrogate the dataset directly. It is also possible to provide a Labs service to help users to develop their own software for interrogating the dataset. These various approaches will be discussed in connection with EEBO-TCP, the wider TCP project, and the UK-only rival to EEBO called JISC Historical Texts.
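To illustrate what "determinate" searching by a user's own software might look like, here is a minimal sketch in Python. It is not an EEBO-TCP or JISC Historical Texts API: the function names, the normalisation rules, and the two sample texts are all assumptions made up for this example. The point is only that the software reports a count for every text, zeros included, so that absence is stated as explicitly as presence.

```python
import re
from typing import Dict


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so that matching is repeatable."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def phrase_report(corpus: Dict[str, str], phrase: str) -> Dict[str, int]:
    """Count whole-word occurrences of `phrase` in every text.

    Texts with no match are reported with a count of 0 rather than omitted,
    so the result documents absence as well as presence.
    """
    pattern = re.compile(r"\b" + re.escape(normalise(phrase)) + r"\b")
    return {
        text_id: len(pattern.findall(normalise(text)))
        for text_id, text in corpus.items()
    }


# Hypothetical sample data; identifiers and contents are invented for illustration.
corpus = {
    "text_a": "To be, or not to be, that is the question.",
    "text_b": "The rest is   silence.",
}

report = phrase_report(corpus, "to be")
for text_id, count in sorted(report.items()):
    print(text_id, count)
```

A real study would also have to decide how to treat spelling variation and transcription gaps, which this sketch ignores.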

3: curation and correction

Martin Mueller: Collaborative curation and exploration of the EEBO-TCP texts

The EEBO TCP project is magnificent and flawed. There are millions of known and millions of unknown errors in the digital transcriptions, which, mediated by mobile devices and for better or worse, will provide future scholars with the most common and often the only access to Early Modern print culture. The errors can and should be fixed by users over time.

“Citizen scholars” from high school students through undergraduates to retirees can make useful contributions. Over the past two years, Northwestern undergraduates have made substantial contributions to the correction of some 50,000 words in some 500 non-Shakespearean plays from 1550-1650. Experience has shown that some of the work can be “downsourced” to machines. The technical problems for a collaborative framework are not trivial, but with a modicum of trust and willingness to cooperate they can be solved. The key technical problem consists in creating an environment that lets people fix errors ‘en passant’, while working with texts they are interested in. An energetic project with the right balance of some centralization and a lot of distributed effort would produce significantly better texts over a five-year period.

Janelle Jenstad: Catch, Tag, and Release: Coordinating our Efforts to Build the Early Modern Corpus

The work of correcting EEBO-TCP texts is formidable. MoEML’s work with EEBO-TCP’s XML files shows that transcribers need to supply gaps, capture forme work, correct mis-transcriptions, and restore early modern typographical habits and idiosyncrasies. Only with many partners working in coordination will we be able to establish an accurate corpus suitable for text mining, copy-text editing, and critical editions. We might think of such work in terms of a “catch-tag-release” model, whereby various entities “catch” EEBO-TCP texts from the data stream, “tag” them in TEI Simple (developed by Mueller), correct both tagging and transcriptions through teams of emerging scholars, and then “release” the texts back into the scholarly wilds. Mueller has already described how a corrective tagging process might work, and the Folger’s Digital Anthology project prototypes a repository environment that will allow us to release texts back into the wild. We also need to capture corrective work that has already been done, such as the ISE’s transcriptions of the quarto and folio texts of Shakespeare’s plays. These transcriptions are highly accurate, having been double-keyed by research assistants, carefully checked by the play editors, and peer reviewed. Their markup predates the development of XML or TEI, but can be dynamically converted (with some effort) into TEI Simple for general “release” alongside other EEBO-TCP transcriptions. From this stage, we can use various XSLT scenarios to convert the TEI Simple both into the plaintext suitable for corpus-wide analyses and into a variety of XML forms suitable for web publication and further editorial work. The limitations of EEBO-TCP transcriptions and the effort required to correct them should make us mindful of the effect of “unevenness” across the corpus. The ISE proposes to replace reasonably good EEBO-TCP transcriptions of Shakespeare’s plays with excellent transcriptions. But what of the texts in which SAA members are less invested? Some of them have error rates of two or more errors per line. Which will we correct first? Will we bestow as much care and time on them as we have on Shakespeare? How will our answers to those questions affect the results of distant reading and data mining exercises?
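The conversion of TEI into plaintext for corpus-wide analysis can be sketched briefly. The abstract names XSLT as the tool; the example below swaps in Python's standard `xml.etree.ElementTree` module instead, purely for illustration, and is not the MoEML, ISE, or Folger pipeline. The sample document, the `[gap]` placeholder, and the choice of which elements count as block-level are all assumptions.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"

# Assumption: these elements end a unit of text and should be followed by a space.
BLOCK_TAGS = {"l", "p", "speaker", "head"}

# Hypothetical TEI fragment, invented for this example.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>Sample</title></titleStmt></fileDesc></teiHeader>
  <text><body>
    <sp><speaker>Clown.</speaker><l>A merry <gap reason="illegible"/> indeed,</l><l>and so farewell.</l></sp>
  </body></text>
</TEI>"""


def tei_to_plaintext(xml_string: str, gap_marker: str = "[gap]") -> str:
    """Return the text of the TEI <body> as a single whitespace-normalised string.

    <gap> elements become `gap_marker` so that missing text stays visible
    instead of silently disappearing from the plaintext.
    """
    root = ET.fromstring(xml_string)
    body = root.find(f".//{TEI_NS}body")
    if body is None:
        raise ValueError("no TEI <body> element found")

    pieces = []

    def walk(elem):
        local = elem.tag.replace(TEI_NS, "")
        if local == "gap":
            pieces.append(gap_marker)
        else:
            if elem.text:
                pieces.append(elem.text)
            for child in elem:
                walk(child)
        if local in BLOCK_TAGS:
            pieces.append(" ")
        if elem.tail:
            pieces.append(elem.tail)

    walk(body)
    return re.sub(r"\s+", " ", "".join(pieces)).strip()


plain = tei_to_plaintext(SAMPLE)
print(plain)
```

Keeping a visible marker for gaps is one possible design choice: it lets later analyses see where a transcription is incomplete, which bears on the "unevenness" the abstract warns about.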

Carl Stahmer, UC Davis, USA: “Social Curation: A Model for Peer Reviewed, Collaborative Collation of Metadata and Texts”

Since 1999, the Early English Books Online Text Creation Partnership (EEBO-TCP) has undertaken the gargantuan effort of making publicly available TEI encoded full-text versions of the Early English Books Online (EEBO) corpus. Like all projects of this magnitude, the text transcriptions in the corpus contain a variety of errors and omissions. Whether by hand or computer, textual transcription is a difficult and time-consuming task that requires extensive editing and re-editing to produce accurate representations, and EEBO-TCP is no exception to this rule. On January 1, 2015, the EEBO-TCP corpus entered the public domain, opening the possibility for scholars outside of the TCP workforce to contribute to improving its accuracy. This work would, like the original creation of the texts, require a significant effort and would be best achieved by employing a wide and distributed body of scholars. To date, no infrastructure exists for managing this type of distributed textual scholarship. For the past three years the English Short Title Catalogue (ESTC), through the generous support of the Andrew W Mellon Foundation, has been engaged in designing just such a social curation infrastructure for correcting and enhancing the bibliographic and holding metadata in its collection. The designed system, which is currently under production, will provide mechanisms for groups of scholars to engage in peer-reviewed records management and improvement. This paper will investigate the ways in which this (or a similar) system could be leveraged to perform social curation of texts in the EEBO-TCP corpus.

Biographical statements

Jonathan Hope is Professor of Literary Linguistics at Strathclyde University, Glasgow. He is joint P-I on the Visualising English Print project, which is producing tools to work with the EEBO-TCP corpus, and was Director of EMDA2013 and EMDA2015, NEH Advanced Institutes in Digital Humanities, held at the Folger Shakespeare Library.

Meaghan Brown is CLIR-DLF Fellow for Data Curation in Early Modern Studies at the Folger Shakespeare Library. Her main project is a Digital Anthology of Early Modern English Drama. She is also the PI on the Identifying Early Modern Books project and writes for Folgerpedia.

Anupam Basu is Mark Steinberg Weil Early Career Fellow in Digital Humanities at Washington University, St Louis, where he is part of the Humanities Digital Workshop. The website Early Modern Print (http://earlyprint.wustl.edu) is leading the way in allowing users to search the EEBO-TCP database.

Laura Estill is an Assistant Professor of English at Texas A&M University, where she edits the World Shakespeare Bibliography (www.worldshakesbib.org). She is the author of Dramatic Extracts in Seventeenth-Century English Manuscripts: Watching, Reading, Changing Plays (2015). Her work has also appeared in The Oxford Handbook of Shakespeare, Shakespeare, Early Theatre, Huntington Library Quarterly, Studies in English Literature, and ArchBook: Architectures of the Book. She has articles forthcoming in Shakespeare Quarterly and Shakespeare and Textual Studies (Cambridge UP, 2015). She is currently working on DEx: A Database of Dramatic Extracts.

Gabriel Egan is Professor of Shakespeare Studies and Director of the Centre for Textual Studies at De Montfort University. He chairs the Advisory Board for JISC Historical Texts and has served as consultant on several mass digitization projects. He is a Technical Evaluator for the UK’s Arts and Humanities Research Council and a National Teaching Fellow of the UK’s Higher Education

Janelle Jenstad is Associate Professor of English at the University of Victoria. She directs The Map of Early Modern London (MoEML), comprised of a georeferenced critical edition of the Agas map, an encyclopedia of early modern London, an XML library of literary texts, and a versioned edition of Stow’s Survey of London. She is also Associate Coordinating Editor of the Internet Shakespeare Editions, for which she is editing The Merchant of Venice, and Lead Applicant on Linked Early Modern Drama Online. With Jennifer Roberts-Smith, she co-edited Shakespeare’s Language in Digital Media (forthcoming from Ashgate). Her essays have appeared in Shakespeare Bulletin, Elizabethan Theatre, EMLS, JMEMS, and other venues.

Martin Mueller is Professor of English and Classics at Northwestern University. He has written a book on the Iliad (1984, revised 2009) and “Children of Oedipus and other essays on the imitation of Greek tragedy, 1550-1800” (1980).

Carl Stahmer is Director of Digital Scholarship at University of California Davis Library, and Associate Director of the English Broadside Ballad Archive (EBBA). He is Technical Director of the English Short Title Catalogue (ESTC). While in the Marine Corps, Carl worked as a programmer on the ARPANET (Advanced Research Projects Agency Network). He left the Marines to pursue his Ph.D. in English, but the “ARPANET stuck with me, and I began to see strong connections between the way people there were talking about networks and exchange of information and the way people in English Departments were talking about how information gets put together as narrative”.