{"id":2206,"date":"2015-07-09T11:26:53","date_gmt":"2015-07-09T16:26:53","guid":{"rendered":"http:\/\/winedarksea.org\/?p=2206"},"modified":"2025-02-10T17:28:59","modified_gmt":"2025-02-10T22:28:59","slug":"data-and-metadata","status":"publish","type":"post","link":"https:\/\/winedarksea.org\/?p=2206","title":{"rendered":"Data and Metadata"},"content":{"rendered":"<p>(Post by Jonathan Hope and Beth Ralston; data preparation by Beth Ralston.)<\/p>\n<p>It is all about the metadata. That and text processing. Currently (July 2015) Visualising English Print (Strathclyde branch) is focussed on producing a hand-curated list of all &#8216;drama&#8217; texts up to 1700, along with checked, clean metadata. Meanwhile VEP (Wisconsin branch) works on text processing (accessing TCP texts in a suitable format, cleaning up rogue characters, splitting up collected volumes into individual plays, stripping-out speech prefixes and non-spoken text, modernising\/regularising).<\/p>\n<p>We are not the only people doing this kind of work on Early Modern drama:<span style=\"color: #0000ff;\"> <a href=\"http:\/\/www.meaghan-brown.com\"><span style=\"color: #0000ff;\">Meaghan Brown<\/span><\/a> <\/span>at The Folger Shakespeare Library is working on a non-Shakespearean corpus, and <span style=\"color: #0000ff;\"><a href=\"http:\/\/www.english.northwestern.edu\/people\/faculty\/emeritus\/martin-mueller.html\"><span style=\"color: #0000ff;\">Martin Mueller<\/span><\/a><\/span> has just released the &#8216;Shakespeare His Contemporaries&#8217; corpus. We&#8217;ve been talking to both, and we are very grateful for their help, advice, and generosity with data. 
In a similar spirit, we are making our on-going metadata collections available &#8211; we hope they&#8217;ll be of use to people, and that you will let us know of any errors and omissions.<\/p>\n<p>You are welcome to make use of this metadata in any way you like, though please acknowledge the support of Mellon to VEP if you do, and especially the painstaking work of Beth Ralston, who has compared and cross-checked\u00a0the various sources of information about Early Modern plays.<\/p>\n<p>We hope to be in a position to release tagged texts once we have finalised the make-up of the corpus, and established our processing pipeline. Watch this space.<\/p>\n<p>Many of the issues surrounding the development of usable\u00a0corpora from EEBO-TCP will be discussed at <span style=\"color: #0000ff;\"><a href=\"http:\/\/www.shakespeareassociation.org\/wp-content\/uploads\/2015\/06\/June-2015-Bulletin.pdf\"><span style=\"color: #0000ff;\">SAA in 2016<\/span><\/a> <\/span>in a special plenary round-table:<\/p>\n<p><a href=\"http:\/\/winedarksea.org\/?attachment_id=2385\" rel=\"attachment wp-att-2385\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-2385\" src=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/07\/SAA-session-1024x492.png\" alt=\"SAA session\" width=\"540\" height=\"259\" srcset=\"https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/07\/SAA-session-1024x492.png 1024w, https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/07\/SAA-session-300x144.png 300w, https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/07\/SAA-session.png 1261w\" sizes=\"auto, (max-width: 540px) 100vw, 540px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>In preparing these lists of plays and metadata we have made extensive use of Martin Wiggins and Catherine Richardson, <em>British Drama 1533-1642: A Catalogue (Oxford)<\/em>, Alfred Harbage,\u00a0<em>Annals of English Drama 975-1700<\/em>, the <span style=\"color: #0000ff;\"><a href=\"http:\/\/estc.bl.uk\"><span 
style=\"color: #0000ff;\">ESTC<\/span><\/a><\/span>, and, most of all, \u00a0Zach Lesser and Alan Farmer&#8217;s\u00a0<span style=\"color: #0000ff;\"><a href=\"http:\/\/deep.sas.upenn.edu\"><span style=\"color: #0000ff;\"><em>DEEP<\/em><\/span><\/a><\/span> (Database of Early English Playbooks).<\/p>\n<p><span style=\"text-decoration: underline;\"><span style=\"color: #ff0000; text-decoration: underline;\"><span style=\"color: #000000; text-decoration: underline;\">Definitions and History<\/span><\/span><\/span><span style=\"color: #ff0000;\">\u00a0<\/span><\/p>\n<p>One of the usefully bracing things about digital work is that it forces you to define your terms precisely &#8211; computers are unforgiving of vagueness, so a request for a corpus of &#8216;all&#8217; Early Modern drama turns out to be no small thing. Of course everyone defines &#8216;all&#8217;, &#8216;Early Modern&#8217; and &#8216;drama&#8217; in slightly different ways &#8211; and those using these datasets should be aware of our definitions, and of the probability that they will want to make their own.<\/p>\n<p>The current cut-off date for these files is the same as DEEP &#8211; 1660 (though one or two post-1660 plays have sneaked in). Before long, we will extend them to 1700.<\/p>\n<p>By &#8216;drama&#8217; we mean plays, masques, and interludes. Some dialogues and entertainments are included in the full data set, but we have not searched deliberately for them. We have included everything printed as a &#8216;play&#8217;, including closet dramas not intended for performance.<\/p>\n<p>The immediate history of the selection is that we began with a &#8216;drama&#8217; corpus chosen automatically by Martin Mueller (using XML tags in the TCP texts to identify dramatic genres). Beth Ralston then checked this corpus against the reference sources listed above for omissions, adding a considerable number of texts. 
This should not be regarded as &#8216;the&#8217; corpus of Early Modern drama: it is one of many possible versions, and will continue to change as more texts are added to TCP (there are some transcriptions still in the TCP pipeline, and scholars are working on proposals to continue transcription of EEBO texts after TCP funding ends).<\/p>\n<p>It is likely that each new scholar will want to re-curate a drama corpus to fit their research question &#8211; VEP is working on tools to allow this to be done easily.<\/p>\n<p><strong>Files and corpora<\/strong><\/p>\n<p><span style=\"text-decoration: underline;\">1 \u00a0 \u00a0The 554 corpus<\/span><\/p>\n<p>This spreadsheet lists only what we regard as the \u2018central\u2019 dramatic texts: plays.<\/p>\n<p>Entertainments, masques, interludes, and dialogues are not included. We have also excluded around 35 play transcriptions in TCP which duplicate transcriptions of the same play made from different volumes (usually a collected edition and a stand-alone quarto).<\/p>\n<p>The spreadsheet includes frequency counts for Docuscope LATs, tagged by <a href=\"http:\/\/vep.cs.wisc.edu\/ubiq\/\">Ubiquity<\/a>, which can be visualised using any statistical analysis program (columns W-EE). For a descriptive list of the LATs, see &lt;Docuscope LATs: descriptions&gt;. 
For a description of all columns in the spreadsheet, see the &lt;READ ME&gt; file.<\/p>\n<p>[In some of their early work, Hope and Witmore used a corpus of 591 plays which included these duplicates.]<\/p>\n<p><span style=\"color: #0000ff;\"><a href=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/05\/554-metadata.xlsx\"><span style=\"color: #0000ff;\">554 metadata<\/span><\/a><\/span><\/p>\n<p><span style=\"color: #0000ff;\"><a href=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/07\/README-for-554-metadata.docx\"><span style=\"color: #0000ff;\">README for 554 metadata<\/span><\/a><\/span><\/p>\n<p><a href=\"http:\/\/winedarksea.org\/?attachment_id=2387\"><span style=\"color: #0000ff;\">Docuscope LATs: descriptions<\/span>\u00a0<\/a><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"text-decoration: underline;\">2 \u00a0 The 704 corpus<\/span><\/p>\n<p>The 704 corpus spreadsheet lists information for the 554 plays included above, and adds other types of dramatic text, such as masques, entertainments, dialogues, and interludes (mainly drawn from DEEP, and with the same date cut-off: 1660). This corpus also includes the\u00a035 duplicate transcriptions excluded from the 554 spreadsheet.<\/p>\n<p>Docuscope frequency counts are only available for items also in the 554 spreadsheet.<\/p>\n<p><span style=\"color: #0000ff;\"><a href=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/05\/704-metadata.xlsx\"><span style=\"color: #0000ff;\">704 metadata<\/span><\/a><\/span><\/p>\n<p><span style=\"color: #0000ff;\"><a href=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/07\/README-for-704-metadata.docx\"><span style=\"color: #0000ff;\">README for 704 metadata<\/span><\/a><\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"text-decoration: underline;\">3 \u00a0The master metadata spreadsheet<\/span><\/p>\n<p>Our \u2018master metadata\u2019 spreadsheet is intended to be as inclusive as possible. 
The current version has 911 entries, and we have sought to include a listing for every extant, printed &#8216;dramatic&#8217; work we know about up to 1660 (from DEEP, Harbage, ESTC, and Wiggins). The spreadsheet does not include every edition of every text, but it does include the duplicate texts found in the 704 corpus. (When we extend the cut-off date to 1700, we expect the number of entries in this spreadsheet to exceed 1500.)<\/p>\n<p>This master list includes all the texts in the 704 list (and therefore the 554 list as well). But it also includes:<br \/>\n\u2022 plays which are in TCP but which do not appear in the 554 or 704 corpora (i.e. they were missed first time round). These texts have &#8216;yes&#8217; in the &#8216;missing from both&#8217; column (M) of the master spreadsheet.<br \/>\n\u2022 plays which are absent from TCP at this time (we note possible reasons for this: some are in Latin, some are fragments, and we assume some have yet to be transcribed). These are texts which have &#8216;yes&#8217; listed in the &#8216;missing from both&#8217; column (M) of the master spreadsheet, as well as &#8216;not in tcp&#8217; listed in the &#8216;tcp&#8217; column (A).<\/p>\n<p><span style=\"color: #0000ff;\"><a href=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/05\/master-metadata.xlsx\"><span style=\"color: #0000ff;\">master metadata<\/span><\/a><\/span><\/p>\n<p><span style=\"color: #0000ff;\"><a href=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/07\/README-for-master-metadata.docx\"><span style=\"color: #0000ff;\">README for master metadata<\/span><\/a><\/span><\/p>\n<p>&nbsp;<\/p>\n<p><strong>TCP transcriptions<\/strong><\/p>\n<p>TCP is one of the most important Humanities projects ever undertaken, and scholars should be grateful for the effort and planning that has gone into it, as well as the free release of its data. 
It is not perfect, however: as well as texts being absent from TCP, we are also currently dealing with problematic transcriptions on a play-by-play basis. Take Jonson\u2019s 1616 folio (TCP: A04632, ESTC: S112455) for example \u2013 it has a very fragmentary transcription, especially in the masques.<\/p>\n<figure id=\"attachment_2217\" aria-describedby=\"caption-attachment-2217\" style=\"width: 540px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/05\/page-1.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-2217 size-large\" src=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/05\/page-1-1024x819.jpg\" alt=\"page 1\" width=\"540\" height=\"431\" srcset=\"https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/05\/page-1-1024x819.jpg 1024w, https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/05\/page-1-300x240.jpg 300w, https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/05\/page-1.jpg 1280w\" sizes=\"auto, (max-width: 540px) 100vw, 540px\" \/><\/a><figcaption id=\"caption-attachment-2217\" class=\"wp-caption-text\">First page of The Irish Masque<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p>In the above image from <em>The Irish Masque<\/em>, you can see on\u00a0the right-hand side that the text for this page is not available.<\/p>\n<figure id=\"attachment_2218\" aria-describedby=\"caption-attachment-2218\" style=\"width: 540px\" class=\"wp-caption alignnone\"><a href=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/05\/page-2.jpg\"><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-2218 size-large\" src=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/05\/page-2-1024x819.jpg\" alt=\"page 2\" width=\"540\" height=\"431\" srcset=\"https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/05\/page-2-1024x819.jpg 1024w, https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/05\/page-2-300x240.jpg 300w, https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/05\/page-2.jpg 
1280w\" sizes=\"auto, (max-width: 540px) 100vw, 540px\" \/><\/a><figcaption id=\"caption-attachment-2218\" class=\"wp-caption-text\">Second page of The Irish Masque<\/figcaption><\/figure>\n<p>&#8230;However, on the\u00a0next page the\u00a0text is there (as far as we can work out, this seems to be due to problems with the original imaging of the book, rather than to the transcribers).<\/p>\n<p>We have excluded texts with fragmentary transcriptions for now, on the assumption that TCP will re-transcribe them at some point in the future.<\/p>\n<p>As we come across other examples of this, we will add them to this page.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>(Post by Jonathan Hope and Beth Ralston; data preparation by Beth Ralston.) It is all about the metadata. That and text processing. Currently (July 2015) Visualising English Print (Strathclyde branch) is focussed on producing a hand-curated list of all &#8216;drama&#8217; texts up to 1700, along with checked, clean metadata. Meanwhile VEP (Wisconsin branch) works on 
[&hellip;]<\/p>\n","protected":false},"author":18,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[51,8,1,144],"tags":[],"class_list":["post-2206","post","type-post","status-publish","format-standard","hentry","category-early-modern-drama","category-shakespeare","category-uncategorized","category-visualizing-english-print-vep"],"_links":{"self":[{"href":"https:\/\/winedarksea.org\/index.php?rest_route=\/wp\/v2\/posts\/2206","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/winedarksea.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/winedarksea.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/winedarksea.org\/index.php?rest_route=\/wp\/v2\/users\/18"}],"replies":[{"embeddable":true,"href":"https:\/\/winedarksea.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2206"}],"version-history":[{"count":22,"href":"https:\/\/winedarksea.org\/index.php?rest_route=\/wp\/v2\/posts\/2206\/revisions"}],"predecessor-version":[{"id":2410,"href":"https:\/\/winedarksea.org\/index.php?rest_route=\/wp\/v2\/posts\/2206\/revisions\/2410"}],"wp:attachment":[{"href":"https:\/\/winedarksea.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2206"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/winedarksea.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2206"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/winedarksea.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2206"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}