{"id":2225,"date":"2015-06-23T08:27:48","date_gmt":"2015-06-23T13:27:48","guid":{"rendered":"http:\/\/winedarksea.org\/?p=2225"},"modified":"2025-02-10T17:29:31","modified_gmt":"2025-02-10T22:29:31","slug":"feeding-folger-digital-texts-into-ubiquity","status":"publish","type":"post","link":"https:\/\/winedarksea.org\/?p=2225","title":{"rendered":"Finding &#8220;Distances&#8221; Between Shakespeare&#8217;s Plays 1"},"content":{"rendered":"<p><a href=\"http:\/\/winedarksea.org\/?attachment_id=2259\" rel=\"attachment wp-att-2259\"><img loading=\"lazy\" decoding=\"async\" class=\"alignleft wp-image-2259 size-full\" src=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/swallows-300x199.jpg\" alt=\"swallows-300x199\" width=\"300\" height=\"199\" \/><\/a>In honor of the latest meeting of our NEH sponsored Folger workshop, Early Modern Digital Agendas,\u00a0I wanted to start\u00a0a series of posts about how we find &#8220;distances&#8221; between texts\u00a0in quantitative terms, and about what those distances might mean. Why would I argue that two texts are &#8220;closer&#8221; to one another than they are to a third that lies somewhere else?\u00a0How do\u00a0those distances shift when they are measured on\u00a0different variables? When represented as\u00a0points\u00a0in different dataspaces, the distances between\u00a0texts can shift as variables change\u00a0\u2014\u00a0like\u00a0a murmuration of starlings. So what kind of cloud is a cloud of texts?<\/p>\n<p>This first post begins with some work on\u00a0the Folger Digital Texts of Shakespeare&#8217;s plays, which I&#8217;m making\u00a0available in &#8220;stripped&#8221; form <a href=\"http:\/\/winedarksea.org\/?attachment_id=2227\">here<\/a>. 
These texts were created by Mike Poston, who developed the encoding scheme for Folger Digital Texts, and who understands well the complexities involved in differentiating between the various encoded\u00a0elements of a play text.<\/p>\n<p>I&#8217;ve said the texts are &#8220;stripped.&#8221; What does that mean? It means that we have eliminated those words in the Folger Editions that are <em>not<\/em> spoken by characters. Speech prefixes, paratextual matter, and stage directions are absent from this corpus of Shakespeare plays. There are interesting and important reasons why these portions of the Editions are being set aside in the analyses that follow, and I\u00a0may\u00a0comment on that issue at a later date. (In some cases, stripping will even change the &#8220;distances&#8221; between texts!) For now, though,\u00a0I want to\u00a0run through a sequence of analyses using a corpus and tools that are available to as many people as possible. In this case that means\u00a0text files, a web utility, and in subsequent posts on &#8220;dimension reduction,&#8221; an excel spreadsheet\u00a0alongside\u00a0some code written for the statistics program R.<\/p>\n<p>The topic of this post, however, is &#8220;distance&#8221; &#8212; a term well worth thinking about as our work moves from corpus curation through the &#8220;tagging&#8221; of the text and on into analysis. As always, the goal of this work is to do the analysis and then return\u00a0to these texts with a deepened sense of how they achieve\u00a0their\u00a0effects \u2014\u00a0rhetorically, linguistically, and by engaging aesthetic conventions. It will take\u00a0more than one post to accomplish this full cycle.<\/p>\n<p>So, we take the zipped corpus of stripped Folger Edition plays and upload it to the online text tagger, <a href=\"http:\/\/vep-test.cs.wisc.edu\/ubiq\/\">Ubiqu+ity<\/a>. 
This tagger was created with support from the Mellon Foundation&#8217;s Visualizing English Print grant at the University of Wisconsin, in collaboration with the creators of the text tagging program Docuscope at Carnegie Mellon University.\u00a0Ubiqu+ity will pass a version of Docuscope over the plays, returning\u00a0a spreadsheet with percentage scores on the different categories or Language Action Types (LATs) that Docuscope\u00a0can tally. In this case, we upload the stripped texts and request that they be tagged with the earliest version of Docuscope available on the site, version 3.21 from 2012. (This is the version that Hope and I have used for most of our analyses in our published work. There may be some divergences in actual counts, as this is a new implementation of Docuscope for public use. But so far the results seem consistent with our past findings.) We have asked Ubiqu+ity to create a downloadable .csv file with the Docuscope counts, as well as a series of HTML files (see the checked box below) that will allow us to inspect the tagged items in textual form.<\/p>\n<p>&nbsp;<\/p>\n<p><a href=\"http:\/\/winedarksea.org\/?attachment_id=2228\" rel=\"attachment wp-att-2228\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-2228\" src=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-22-at-9.02.58-PM.png\" alt=\"Screen Shot 2015-06-22 at 9.02.58 PM\" width=\"540\" height=\"699\" srcset=\"https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-22-at-9.02.58-PM.png 658w, https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-22-at-9.02.58-PM-231x300.png 231w\" sizes=\"auto, (max-width: 540px) 100vw, 540px\" \/><\/a><\/p>\n<p>The results can be downloaded <a href=\"http:\/\/winedarksea.org\/?attachment_id=2257\">here<\/a>, where you will find a zipped folder containing the .csv file with the Docuscope counts and the HTML files for all the stripped Folger 
plays. The .csv file will look like the one below, with abbreviated\u00a0play names arrayed\u00a0vertically in the first column, then (moving columnwise to the right) various other pieces of metadata (text_key, html_name, and model_path), and finally the Docuscope counts, labelled by LAT. You will also find that a note on curation was fed into the program and appears as a row of its own. I\u00a0will want to remove this row when doing the analysis.<\/p>\n<p><a href=\"http:\/\/winedarksea.org\/?attachment_id=2249\" rel=\"attachment wp-att-2249\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-large wp-image-2249\" src=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-23-at-8.35.34-AM-1024x572.png\" alt=\"Screen Shot 2015-06-23 at 8.35.34 AM\" width=\"540\" height=\"301\" srcset=\"https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-23-at-8.35.34-AM-1024x572.png 1024w, https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-23-at-8.35.34-AM-300x167.png 300w, https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-23-at-8.35.34-AM.png 1082w\" sizes=\"auto, (max-width: 540px) 100vw, 540px\" \/><\/a><\/p>\n<p>For ease of explication, I&#8217;m going to pare down these columns to three: the name of the text in column 1, and then the scores that sit further to the right on the spreadsheet for two LATs: AbstractConcepts and FirstPerson. These scores are expressed as a proportion, which is to say,\u00a0the number\u00a0of tokens tagged under this LAT as a fraction\u00a0of all the included tokens. 
So now we are looking at something like this:<\/p>\n<p><a href=\"http:\/\/winedarksea.org\/?attachment_id=2231\" rel=\"attachment wp-att-2231\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-2231 size-full\" src=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-22-at-9.20.00-PM.png\" alt=\"Screen Shot 2015-06-22 at 9.20.00 PM\" width=\"352\" height=\"724\" \/><\/a><\/p>\n<p>Before doing any analysis,\u00a0I will make one further alteration, subtracting\u00a0the mean value for each column (the &#8220;average&#8221; score for the LAT)\u00a0from every\u00a0score in that\u00a0column. I\u00a0do this in order to center the data around the zero point of both axes:<\/p>\n<p><a href=\"http:\/\/winedarksea.org\/?attachment_id=2235\" rel=\"attachment wp-att-2235\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2235\" src=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-22-at-9.43.18-PM.png\" alt=\"Screen Shot 2015-06-22 at 9.43.18 PM\" width=\"335\" height=\"726\" srcset=\"https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-22-at-9.43.18-PM.png 335w, https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-22-at-9.43.18-PM-138x300.png 138w\" sizes=\"auto, (max-width: 335px) 100vw, 335px\" \/><\/a>Now some analysis. Having identified a corpus (Shakespeare&#8217;s plays) and curated our texts (stripping, processing), we have counted some agreed-upon features (Docuscope LATs).\u00a0The features upon which we are basing the analysis are those words or strings of words\u00a0that\u00a0Docuscope counts as AbstractConcepts and FirstPerson tokens.<\/p>\n<p>It&#8217;s important to note that at any point in this process, we could have made different choices, and that these choices would have\u00a0led\u00a0to different results. 
The choice of what to count is a vitally\u00a0important one, so we ought to give thought to what Docuscope counted as FirstPerson and AbstractConcepts. To get to know these LATs better\u00a0\u2014 to understand what exactly has been assigned these two tags \u2014\u00a0we can open one of the HTML files of the plays and &#8220;select&#8221; that category on the right hand side of the page, scrolling through the document to see what was tagged. Below is the opening scene of <em>Henry V<\/em>, so tagged:<\/p>\n<p><a href=\"http:\/\/winedarksea.org\/?attachment_id=2248\" rel=\"attachment wp-att-2248\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2248\" src=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-23-at-8.28.10-AM.png\" alt=\"Screen Shot 2015-06-23 at 8.28.10 AM\" width=\"622\" height=\"918\" srcset=\"https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-23-at-8.28.10-AM.png 622w, https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-23-at-8.28.10-AM-203x300.png 203w\" sizes=\"auto, (max-width: 622px) 100vw, 622px\" \/><\/a><\/p>\n<p>&nbsp;<\/p>\n<p>Before doing the analysis, we will want\u00a0to explore the features we have been counting by opening up different play files and turning\u00a0different\u00a0LATs &#8220;on and off&#8221; on the left hand side of the HTML page. This is how we get to know\u00a0what is being counted in the columns of the .csv file.<\/p>\n<p>I\u00a0look, then, at\u00a0some of our texts and the features that Ubiqu+ity tagged within them. I will be more or less unsatisfied with some of these choices, of course. (Look at &#8220;i&#8217; th&#8217; receiving earth&#8221;!) Because words are tagged according to inflexible rules, I will disagree with some of the things that are being included in the different categories. That&#8217;s life. 
Perhaps there&#8217;s some consolation in the fact that the choices I disagree with are, in the case of Docuscope,\u00a0(a) relatively infrequent and (b) implemented consistently across all of the texts (wrong in the same way across <em>all<\/em> types of document). If I really disagree, I have the option of creating my own text tagger. In practice, Hope and I have found that it is easier to continue to use Docuscope, since we do not want to build into the tagging scheme the self-evident\u00a0things <em>we<\/em> may be interested in. It&#8217;s a good thing that\u00a0Docuscope remains a little bit alien to us, and to everyone else who uses it.<\/p>\n<p>Now to the question of distance.<\/p>\n<p><a href=\"http:\/\/winedarksea.org\/?attachment_id=2241\" rel=\"attachment wp-att-2241\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-2241 size-full\" src=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-22-at-10.01.43-PM2.png\" alt=\"Screen Shot 2015-06-22 at 10.01.43 PM\" width=\"811\" height=\"833\" \/><\/a><a href=\"http:\/\/winedarksea.org\/?attachment_id=2239\" rel=\"attachment wp-att-2239\"><br \/>\n<\/a>When we look at the biplot above, generated in R from the mean-adjusted data above, we notice a\u00a0general shape to the data. We could use statistics to describe the trend \u2014\u00a0there is a negative covariance between FirstPerson and AbstractConcept LATs \u2014\u00a0but we can\u00a0already see that as FirstPerson tokens increase, the proportion of AbstractConcept tokens tends to decrease. The trend is a rough one, but there is the suggestion of a diagonal line running from the upper left hand side of the graph toward the lower right.<\/p>\n<p>What does &#8220;distance&#8221; mean in this space? It\u00a0depends on a few things. First, it depends on how the data is centered. Here we have centered the data by\u00a0subtracting the column means from each entry. 
Our choice of a scale on either axis will also affect apparent distances, as will our choice of the\u00a0units represented on the axes. (One can tick off standard deviations around the mean, for example, rather than the original units, which we have not done). These contingencies\u00a0point up an important fact: distance\u00a0is only meaningful because the space is itself meaningful \u2014\u00a0because we can give a precise account of what it means to move an item up or down either of these two axes.<\/p>\n<p>Just as important:\u00a0distances in this space are\u00a0a caricature of the linguistic\u00a0complexity of these plays. We have strategically <em>reduced<\/em> that complexity in order to simplify a set of comparisons. Under these constraints, it is\u00a0meaningful to say that <em>Henry V<\/em> is &#8220;closer&#8221; to <em>Macbeth<\/em> than it is to <em>Comedy of Errors<\/em>. In the image above, you can compare these distances between\u00a0the labelled texts. The first two plays, connected by the red line,\u00a0are &#8220;closer&#8221;\u00a0<em>given the definitions of what is being measured<\/em> and <em>how those measured differences are represented<\/em>\u00a0in a visual field.<\/p>\n<p>When we plot the data in a\u00a0two dimensional biplot, we can &#8220;see&#8221; closeness according to\u00a0these two dimensions. 
But if you recall\u00a0the initial .csv file returned by Ubiqu+ity, you know that there can be many more columns \u2014\u00a0and so, many more dimensions \u2014\u00a0that can be used to plot distances.<\/p>\n<p><a href=\"http:\/\/winedarksea.org\/?attachment_id=2250\" rel=\"attachment wp-att-2250\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2250\" src=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-23-at-8.56.23-AM.png\" alt=\"Screen Shot 2015-06-23 at 8.56.23 AM\" width=\"755\" height=\"81\" srcset=\"https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-23-at-8.56.23-AM.png 755w, https:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-23-at-8.56.23-AM-300x32.png 300w\" sizes=\"auto, (max-width: 755px) 100vw, 755px\" \/><\/a><\/p>\n<p>What if we had\u00a0scattered\u00a0all 38 of our points\u00a0(our plays) in a space that had more than the two dimensions shown in the\u00a0biplot above? We could have done so in\u00a0three dimensions \u2014\u00a0plotting three columns instead of two \u2014\u00a0but once we arrive at four dimensions we are beyond the capacity for simple visualization.\u00a0Yet there may be a similar co-patterning (covariance) among LATs in these higher dimensional spaces, analogous to the ones we can &#8220;see&#8221;\u00a0in two dimensions. What if, for example, the frequency of Anger\u00a0decreases alongside that of AbstractConcepts just when FirstPerson instances increase? How should we understand the meaning of comparatives\u00a0such as &#8220;closer together&#8221; and &#8220;further apart&#8221; in\u00a0such multidimensional spaces? For that, we need techniques of dimension reduction.<\/p>\n<p>In the next post, I will describe my own attempts to understand a common technique for dimension reduction known as Principal Component Analysis. It took about two years for me to figure that out, however imperfectly. 
I wanted to pass that along in case others are curious. But it is important to understand that these more complex techniques are just extensions of something we can imagine in simpler terms. And it is important to remember that there are very simple ways of visualizing distance\u00a0\u2014\u00a0for example, an ordered list. We assessed distance visually in the biplot above, a distance that was measured according to two variables or dimensions. But we could have just as easily used only one dimension, say, AbstractConcepts. Here is the list of Shakespeare&#8217;s plays, in descending\u00a0order, with respect to scores on AbstractConcepts:<\/p>\n<p><a href=\"http:\/\/winedarksea.org\/?attachment_id=2252\" rel=\"attachment wp-att-2252\"><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter size-full wp-image-2252\" src=\"http:\/\/winedarksea.org\/wp-content\/uploads\/2015\/06\/Screen-Shot-2015-06-23-at-9.04.56-AM.png\" alt=\"Screen Shot 2015-06-23 at 9.04.56 AM\" width=\"188\" height=\"701\" \/><\/a><\/p>\n<p>Even if we use only one dimension here, we can see once again that <em>Henry V<\/em> is &#8220;closer&#8221; to <em>Macbeth<\/em> than it is to <em>Comedy of Errors<\/em>. We could even remove the scores and simply use an ordinal sequence of this play, then this, then this. There would <em>still<\/em> be information about &#8220;distances&#8221; in this very simple, one dimensional, representation of the data.<\/p>\n<p>Now we ask ourselves: which way of representing the distances between these texts is <em>better<\/em>? Well, it depends on what you are trying to understand, since distances \u2014\u00a0whether in one, two, or many more dimensions \u2014\u00a0are only distances according to the variables or features (LATs) that have been measured. In the next post, I&#8217;ll try to explain how the thinking above helped me understand what is happening in a more complicated form of dimension reduction called Principal Component Analysis. 
I&#8217;ll use the same mean adjusted data for FirstPerson and AbstractConcepts discussed here, providing the\u00a0R code and spreadsheets so that others can follow along. The starting point for my understanding of PCA is an excellent <a href=\"http:\/\/arxiv.org\/pdf\/1404.1100.pdf\">tutorial<\/a> by Jonathon\u00a0Shlens, which will be the underlying basis for the discussion.<\/p>\n<p>&nbsp;<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In honor of the latest meeting of our NEH sponsored Folger workshop, Early Modern Digital Agendas,\u00a0I wanted to start\u00a0a series of posts about how we find &#8220;distances&#8221; between texts\u00a0in quantitative terms, and about what those distances might mean. Why would I argue that two texts are &#8220;closer&#8221; to one another than they are to a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[8,144],"tags":[],"class_list":["post-2225","post","type-post","status-publish","format-standard","hentry","category-shakespeare","category-visualizing-english-print-vep"],"_links":{"self":[{"href":"https:\/\/winedarksea.org\/index.php?rest_route=\/wp\/v2\/posts\/2225","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/winedarksea.org\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/winedarksea.org\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/winedarksea.org\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/winedarksea.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=2225"}],"version-history":[{"count":24,"href":"https:\/\/winedarksea.org\/index.php?rest_route=\/wp\/v2\/posts\/2225\/revisions"}],"predecessor-version":[{"id":2320,"href":"https:\/\/winedarksea.org\/index.php?rest_route=\/wp\/v2\/posts\/2225\/revisions\/2320"}],
"wp:attachment":[{"href":"https:\/\/winedarksea.org\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=2225"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/winedarksea.org\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=2225"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/winedarksea.org\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=2225"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}