Finding “Distances” Between Shakespeare’s Plays 1

In honor of the latest meeting of our NEH-sponsored Folger workshop, Early Modern Digital Agendas, I wanted to start a series of posts about how we find “distances” between texts in quantitative terms, and about what those distances might mean. Why would I argue that two texts are “closer” to one another than they are to a third that lies somewhere else? How do those distances shift when they are measured on different variables? When represented as points in different dataspaces, the distances between texts can shift as variables change — like a murmuration of starlings. So what kind of cloud is a cloud of texts?

This first post begins with some work on the Folger Digital Texts of Shakespeare’s plays, which I’m making available in “stripped” form here. These texts were created by Mike Poston, who developed the encoding scheme for Folger Digital Texts, and who understands well the complexities involved in differentiating between the various encoded elements of a play text.

I’ve said the texts are “stripped.” What does that mean? It means that we have eliminated those words in the Folger Editions that are not spoken by characters. Speech prefixes, paratextual matter, and stage directions are absent from this corpus of Shakespeare plays. There are interesting and important reasons why these portions of the Editions are being set aside in the analyses that follow, and I may comment on that issue at a later date. (In some cases, stripping will even change the “distances” between texts!) For now, though, I want to run through a sequence of analyses using a corpus and tools that are available to as many people as possible. In this case that means text files, a web utility, and, in subsequent posts on “dimension reduction,” an Excel spreadsheet alongside some code written for the statistics program R.

The topic of this post, however, is “distance” — a term well worth thinking about as our work moves from corpus curation through the “tagging” of the text and on into analysis. As always, the goal of this work is to do the analysis and then return to these texts with a deepened sense of how they achieve their effects — rhetorically, linguistically, and by engaging aesthetic conventions. It will take more than one post to accomplish this full cycle.

So, we take the zipped corpus of stripped Folger Edition plays and upload it to the online text tagger, Ubiqu+ity. This tagger was created with support from the Mellon Foundation’s Visualizing English Print grant at the University of Wisconsin, in collaboration with the creators of the text tagging program Docuscope at Carnegie Mellon University. Ubiqu+ity will pass a version of Docuscope over the plays, returning a spreadsheet with percentage scores on the different categories or Language Action Types (LATs) that Docuscope can tally. In this case, we upload the stripped texts and request that they be tagged with the earliest version of Docuscope available on the site, version 3.21 from 2012. (This is the version that Hope and I have used for most of our analyses in our published work. There may be some divergences in actual counts, as this is a new implementation of Docuscope for public use. But so far the results seem consistent with our past findings.) We have asked Ubiqu+ity to return, by email, a downloadable .csv file with the Docuscope counts, as well as a series of HTML files (see the checked box below) that will allow us to inspect the tagged items in textual form.


[Screenshot: Ubiqu+ity upload form, with the HTML-files option checked]

The results can be downloaded here, where you will find a zipped folder containing the .csv file with the Docuscope counts and the HTML files for all the stripped Folger plays. The .csv file will look like the one below, with abbreviated play names arrayed vertically in the first column, then (moving columnwise to the right) various other pieces of metadata (text_key, html_name, and model_path), and finally the Docuscope counts, labelled by LAT. You will also find a row for a note on curation that was fed into the program; I will remove this row when doing the analysis.

[Screenshot: the returned .csv of Docuscope counts]

For ease of explication, I’m going to pare down these columns to three: the name of the text in column 1, and then the scores that sit further to the right on the spreadsheet for two LATs: AbstractConcepts and FirstPerson. These scores are expressed as a proportion, which is to say, the number of all tokens tagged under this LAT as a fraction of all the included tokens. So now we are looking at something like this:

[Screenshot: pared-down spreadsheet of AbstractConcepts and FirstPerson scores]
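The paring-down step can be sketched in code. This is only an illustration: the header names, play abbreviations, and numbers below are hypothetical stand-ins for the real Ubiqu+ity column labels and scores, and the curation-note row is filtered out as described above.

```python
import csv
import io

# Hypothetical miniature of the Ubiqu+ity .csv; real files have many
# more metadata and LAT columns.
raw = io.StringIO(
    "text_name,text_key,AbstractConcepts,FirstPerson\n"
    "H5,001,0.031,0.018\n"
    "Mac,002,0.028,0.015\n"
    "A note on curation,999,0.0,0.0\n"
)

# Keep three columns and drop the curation-note row before analysis.
rows = [
    (r["text_name"], float(r["AbstractConcepts"]), float(r["FirstPerson"]))
    for r in csv.DictReader(raw)
    if r["text_name"] != "A note on curation"
]
print(rows)
```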

Before doing any analysis, I will make one further alteration, subtracting the mean value for each column (the “average” score for the LAT) from every score in that column. I do this in order to center the data around the zero point of both axes:

[Screenshot: mean-centered AbstractConcepts and FirstPerson scores]

Now some analysis. Having identified a corpus (Shakespeare’s plays) and curated our texts (stripping, processing), we have counted some agreed-upon features (Docuscope LATs). The features upon which we are basing the analysis are those words or strings of words that Docuscope counts as AbstractConcepts and FirstPerson tokens.
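The mean-centering just described can be sketched as follows. The scores are invented for illustration, not actual Docuscope output; the point is only that subtracting each column’s mean makes both axes pass through zero.

```python
# Hypothetical (AbstractConcepts, FirstPerson) proportions per play.
scores = {
    "H5":  (0.031, 0.018),
    "Mac": (0.028, 0.015),
    "Err": (0.019, 0.027),
}

n = len(scores)
mean_ac = sum(ac for ac, fp in scores.values()) / n
mean_fp = sum(fp for ac, fp in scores.values()) / n

# Subtract the column mean from every score, centering the cloud on (0, 0).
centered = {
    play: (ac - mean_ac, fp - mean_fp) for play, (ac, fp) in scores.items()
}
for play, (ac, fp) in centered.items():
    print(f"{play}: ({ac:+.4f}, {fp:+.4f})")
```

After centering, each column sums to (essentially) zero, which is what makes the covariance and distance calculations in the next steps straightforward.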

It’s important to note that at any point in this process we could have made different choices, and that these choices would have led to different results. The choice of what to count is a vitally important one, so we ought to give thought to what Docuscope counted as FirstPerson and AbstractConcepts. To get to know these LATs better — to understand what exactly has been assigned these two tags — we can open one of the HTML files of the plays and “select” that category on the right-hand side of the page, scrolling through the document to see what was tagged. Below is the opening scene of Henry V, so tagged:

[Screenshot: the opening scene of Henry V, tagged]


Before doing the analysis, we will want to explore the features we have been counting by opening up different play files and turning different LATs “on and off” on the left-hand side of the HTML page. This is how we get to know what is being counted in the columns of the .csv file.

I look, then, at some of our texts and the features that Ubiqu+ity tagged within them. I will be more or less unsatisfied with some of these choices, of course. (Look at “i’ th’ receiving earth”!) Because words are tagged according to inflexible rules, I will disagree with some of the things that are being included in the different categories. That’s life. Perhaps there’s some consolation in the fact that the choices I disagree with are, in the case of Docuscope, (a) relatively infrequent and (b) implemented consistently across all of the texts (wrong in the same way across all types of document). If I really disagree, I have the option of creating my own text tagger. In practice, Hope and I have found it easier to continue to use Docuscope, since we do not want to build into the tagging scheme the very things we may be interested in finding. It’s a good thing that Docuscope remains a little bit alien to us, and to everyone else who uses it.

Now to the question of distance.

[Biplot: mean-centered AbstractConcepts and FirstPerson scores]
When we look at the biplot above, generated in R from the mean-adjusted data above, we notice a general shape to the data. We could use statistics to describe the trend — there is a negative covariance between the FirstPerson and AbstractConcepts LATs — but we can already see that as FirstPerson tokens increase, the proportion of AbstractConcepts tokens tends to decrease. The trend is a rough one, but there is the suggestion of a diagonal line running from the upper left-hand side of the graph toward the lower right.
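That negative covariance can be computed directly. The numbers here are invented, chosen only to mimic the trend visible in the biplot (as FirstPerson rises, AbstractConcepts falls); both columns are already mean-centered, so their values sum to zero.

```python
# Two hypothetical mean-centered LAT columns, one value per play.
abstract = [0.012, 0.007, -0.003, -0.006, -0.010]
first_person = [-0.009, -0.004, 0.002, 0.004, 0.007]

# Sample covariance: average product of paired deviations from the mean.
n = len(abstract)
covariance = sum(a * f for a, f in zip(abstract, first_person)) / (n - 1)
print(covariance)
```

A negative result says the two LATs tend to move in opposite directions across the corpus, which is exactly the diagonal trend the eye picks out.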

What does “distance” mean in this space? It depends on a few things. First, it depends on how the data is centered — here we have done so by subtracting the column means from each entry. Our choice of a scale on either axis will also affect apparent distances, as will our choice of the units represented on the axes. (One can tick off standard deviations around the mean, for example, rather than the original units, which we have not done). These contingencies point up an important fact: distance is only meaningful because the space is itself meaningful — because we can give a precise account of what it means to move an item up or down either of these two axes.

Just as important: distances in this space are a caricature of the linguistic complexity of these plays. We have strategically reduced that complexity in order to simplify a set of comparisons. Under these constraints, it is meaningful to say that Henry V is “closer” to Macbeth than it is to Comedy of Errors. In the image above, you can compare these distances between the labelled texts. The first two plays, connected by the red line, are “closer” given the definitions of what is being measured and how those measured differences are represented in a visual field.
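The kind of “closeness” being compared here is plain Euclidean distance in the two-LAT plane. As a sketch, with invented mean-centered coordinates (not the actual Folger scores):

```python
import math

# Hypothetical (AbstractConcepts, FirstPerson) coordinates after centering.
points = {
    "H5":  (0.010, -0.008),
    "Mac": (0.007, -0.005),
    "Err": (-0.009, 0.011),
}

def distance(p, q):
    # Euclidean distance between two plays in the biplot space.
    return math.dist(points[p], points[q])

print(distance("H5", "Mac"))  # the shorter, red-line distance
print(distance("H5", "Err"))  # the longer one
```

Under these (made-up) coordinates, Henry V sits nearer to Macbeth than to Comedy of Errors, mirroring the comparison in the plot.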

When we plot the data in a two-dimensional biplot, we can “see” closeness according to these two dimensions. But if you recall the initial .csv file returned by Ubiqu+ity, you know that there can be many more columns — and so, many more dimensions — that can be used to plot distances.

[Screenshot: the full .csv with its many LAT columns]

What if we had scattered all 38 of our points (our plays) in a space that had more than the two dimensions shown in the biplot above? We could have done so in three dimensions — plotting three columns instead of two — but once we arrive at four dimensions we are beyond the capacity for simple visualization. Yet there may be a similar co-patterning (covariance) among LATs in these higher-dimensional spaces, analogous to the ones we can “see” in two dimensions. What if, for example, the frequency of Anger decreases alongside that of AbstractConcepts just when FirstPerson instances increase? How should we understand the meaning of comparatives such as “closer together” and “further apart” in such multidimensional spaces? For that, we need techniques of dimension reduction.
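Even when the space has too many dimensions to draw, the co-patterning itself can still be tallied: compute the covariance of every pair of LAT columns. A minimal sketch, with three invented mean-centered columns arranged so that Anger falls alongside AbstractConcepts just as FirstPerson rises:

```python
# Hypothetical mean-centered LAT columns, one value per play.
columns = {
    "AbstractConcepts": [0.012, 0.007, -0.003, -0.006, -0.010],
    "FirstPerson":      [-0.009, -0.004, 0.002, 0.004, 0.007],
    "Anger":            [0.008, 0.005, -0.001, -0.004, -0.008],
}

n = len(columns["AbstractConcepts"])

def cov(x, y):
    # Sample covariance of two mean-centered columns.
    return sum(a * b for a, b in zip(x, y)) / (n - 1)

# Print the full pairwise covariance table.
for a in columns:
    for b in columns:
        print(f"cov({a}, {b}) = {cov(columns[a], columns[b]):+.6f}")
```

This pairwise covariance table is, in fact, the object that Principal Component Analysis works on when it looks for the directions of greatest shared variation.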

In the next post, I will describe my own attempts to understand a common technique for dimension reduction known as Principal Component Analysis. It took about two years for me to figure that out — however imperfectly — and I wanted to pass that along in case others are curious. But it is important to understand that these more complex techniques are just extensions of something we can imagine in simpler terms. And it is important to remember that there are very simple ways of visualizing distance — for example, an ordered list. We assessed distance visually in the biplot above, a distance that was measured according to two variables or dimensions. But we could just as easily have used only one dimension, say, AbstractConcepts. Here is the list of Shakespeare’s plays, in descending order, with respect to scores on AbstractConcepts:

[Screenshot: Shakespeare’s plays ranked by AbstractConcepts score]

Even if we use only one dimension here, we can see once again that Henry V is “closer” to Macbeth than it is to Comedy of Errors. We could even remove the scores and simply use an ordinal sequence: this play, then this, then this. There would still be information about “distances” in this very simple, one-dimensional representation of the data.
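The ordered list is itself a one-dimensional distance measure. A sketch, with hypothetical stand-ins for the AbstractConcepts column:

```python
# Invented AbstractConcepts scores, one per play abbreviation.
abstract_scores = {"H5": 0.031, "Mac": 0.028, "Tmp": 0.025, "Err": 0.019}

# Rank plays in descending order of score.
ranked = sorted(abstract_scores, key=abstract_scores.get, reverse=True)
print(ranked)

# Even with the scores removed, ordinal position keeps some distance
# information: plays adjacent in the list are "closer" on this dimension.
print(ranked.index("Mac") - ranked.index("H5"))  # 1 step apart
print(ranked.index("Err") - ranked.index("H5"))  # 3 steps apart
```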

Now we ask ourselves: which way of representing the distances between these texts is better? Well, it depends on what you are trying to understand, since distances — whether in one, two, or many more dimensions — are only distances according to the variables or features (LATs) that have been measured. In the next post, I’ll try to explain how the thinking above helped me understand what is happening in a more complicated form of dimension reduction called Principal Component Analysis. I’ll use the same mean-adjusted data for FirstPerson and AbstractConcepts discussed here, providing the R code and spreadsheets so that others can follow along. The starting point for my understanding of PCA is an excellent tutorial by Jonathon Shlens, which will be the underlying basis for the discussion.



Posted in Shakespeare, Visualizing English Print (VEP)

Now Read This: A Thought Experiment

Let’s say that we believe we can learn something more about what literary critics call “authorial style” or “genre” by quantitative work. We want to say what that “more” is. We assemble a community of experts, convening a panel of early modernists to identify 10 plays that they feel are comedies based on prevailing definitions (they end in marriage), and 10 they feel are tragedies (a high-born hero falls hard). To test these classifications, we ask a random selection of others in the profession (who were not on the panel) to sort these 20 plays into comedies and tragedies and see how far they diverge from the classifications of our initial panel. That subsequent sorting matches the first one, so we start to treat these labels (comedy/tragedy) as “ground truths” generated by “domain experts.”

Now assume that I take a computer program (it doesn’t matter which one) and ask it to count things in these plays and come up with a “recipe” for each genre as identified by our experts. The computer is able to do so, and the recipes make sense to us. (Trivially: comedies are filled with words about love, for example, while tragedies use more words that indicate pain or suffering.)

A further twist: because we have an unlimited, thought-experiment budget, we decide to put dozens of early modernists into MRI machines and measure the activity in their brains while they are reading any of these 20 plays. After studying the brain activity of these machine-bound early modernists, we realize that there is a distinctive pattern of brain activity that corresponds with what our domain experts have called “comedies” and “tragedies.” When someone reads a comedy, regions A, B, and C become active, whereas when a person reads a tragedy, regions C, D, E, and F become active. These patterns are reliably different and track exactly the generic differences between the plays that our subjects are reading in the MRI machine.

So now we have three different ways of identifying – or rather, describing – our genre. The first is by expert report: I ask someone to read a play and she says, “This is a comedy.” If asked why, she can give a range of answers, perhaps connected to plot, perhaps to her feelings while reading the play, or even to a memory: “I learned to call this and other plays like it ‘comedies’ in graduate school.” The second is a description, not necessarily competing, in terms of linguistic patterns: “This play and others like it use the conjunction ‘if’ and ‘but’ comparatively more frequently than others in the pool, while using ‘and’ less frequently.” The last description is biological: “This play and others like it produce brain activity in the following regions and not in others.” In our perfect thought experiment, we now have three ways of “getting at genre.” They seem to be parallel descriptions, and if they are functionally equivalent, any one of them might just be treated as a “picture” of the other two. What is a brain scan of an early modernist reading comedy? It is a picture of the speech act: “The play I’m reading right now is a comedy.”

Now the question. The first three acts of a heretofore unknown early modern play are discovered in a Folger manuscript, and we want to say what kind of play it is. We have three options:

• asking an early modernist to read it and make his or her declaration

• running a computer program over it and rating it on our comedy/tragedy classifiers

• having an early modernist read it in an MRI machine and characterizing the play on the basis of brain activity.

Let’s say, for the sake of argument, that you can only pick one of these approaches. Which one would you pick, and why? If this is a good thought experiment, the “why” part should be challenging.

Posted in Quant Theory

Mapping the ‘Whole’ of Early Modern Drama

We’re currently working with two versions of our drama corpus: the earlier version contains 704 texts, while the later one has 554. The main distinction is that the later corpus has a four-way genre split – tragedy, comedy, tragicomedy, and history – while the earlier corpus also includes non-dramatic texts like dialogues, entertainments, interludes, and masques. Recently we’ve been doing PCA experiments with the 704 corpus to see what general patterns emerge, and to see how the non-dramatic genres pattern in the data. The following are a few of the PCA visualisations generated from this corpus, which provide a general overview of the data. We produced the diagrams here using JMP. The spreadsheets of the 704 and 554 corpora are included below as Excel files – please note we are still working on the metadata.

704 corpus

554 corpus


Overview (click to enlarge images):

[PCA plot: all 704 texts]

This is the complete data set visualised in PCA space. All 704 plays are included, but LATs with frequent zero values have been excluded.
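The exclusion step can be sketched simply: drop any LAT column where zeros dominate, since near-constant columns contribute little to a PCA. The column names, values, and the 50% threshold below are all hypothetical.

```python
# Miniature stand-in for the full table of LAT columns.
lat_columns = {
    "AbstractConcepts": [0.03, 0.02, 0.04, 0.03],
    "OralCues":         [0.05, 0.06, 0.04, 0.05],
    "RareLAT":          [0.0, 0.0, 0.01, 0.0],
}

def mostly_zero(col, threshold=0.5):
    # True when at least `threshold` of a column's values are zero.
    return sum(1 for v in col if v == 0) / len(col) >= threshold

# Keep only the columns that are not dominated by zeros.
kept = {name: col for name, col in lat_columns.items() if not mostly_zero(col)}
print(sorted(kept))
```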



If we highlight the genres, it looks like this:

[PCA plot: all texts, coloured by genre]

Comedies = red

Dialogues = green

Entertainments = blue

Histories = orange

Interludes = blue-green

Masques = dark purple

Non-dramatics = mustard

Tragicomedies = dark turquoise

Tragedies = pink-purple


If we tease this out even more – hiding, but not excluding, the non-dramatic genres – there is a clear diagonal divide between tragedies (red) and comedies (blue):

[Michael Witmore, Jonathan Hope, and Michael Gleicher, forthcoming, ‘Digital Approaches to the Language of Shakespearean Tragedy’, in Michael Neill and David Schalkwyk, eds, The Oxford Handbook of Shakespearean Tragedy (Oxford)]

[PCA plot: tragedies and comedies, showing the diagonal split]

With tragicomedies (green) and histories (purple) falling in the middle:

[PCA plot: tragicomedies and histories falling in the middle]

It seems that tragedies and comedies are characterised by sets of opposing LATs. The LATs associated with comedy are those capturing highly oral language behaviour, while those associated with tragedy capture negative language and psychological states. Tragicomedies and histories – although we have yet to investigate them in detail – seem to occupy an intermediate space. If we unhide the non-dramatic genres, we can see how they pattern in comparison.

In spite of their name, dialogues do not consist of rapid exchanges (e.g. Oral Cues, Direct Address, First Person etc., the LATs which make up the comedic side of the PCA space) but instead contain lengthy monologues, which might explain why they fall mostly on the side of the tragedies:

[PCA plot: dialogues]

Entertainments do not seem to be linguistically similar to each other:

[PCA plot: entertainments]

Interludes, on the other hand, seem to occupy a more tightly defined linguistic space:

[PCA plot: interludes]

Masques are pulled towards the left of the PCA space:

[PCA plot: masques]



Docuscope was designed to identify genre, rather than authorship, so perhaps we should not be surprised that authorship comes through less clearly than genre in these initial trials. We should also bear in mind that there are only 9 genres in the corpus, compared to approximately 200 authors.

This, for example, shows only the tragedies – all other genres are hidden – and each author is represented by a different colour:

[PCA plot: tragedies, coloured by author]

We get a clearer picture when considering a smaller group in relation to the whole – for example, one author compared to all the others. Take Seneca, whose tragedies are shown as the purple squares:

[PCA plot: Seneca’s tragedies highlighted]

From this we can deduce that Seneca’s tragedies are linguistically similar, as they are grouped tightly together.



The same applies when we look at date of writing across the corpus, with approximately 100 dates to consider.

This can be visualised on a continuous scale, e.g. the lighter the dot, the earlier the play; the darker the dot, the later the play. While this has a nice ‘heat map’ effect, it is difficult to interpret:

[PCA plot: dates on a continuous scale]

If we narrow this down to three groups of dates – early (red), central (yellow), and late (maroon) – it becomes a little easier to read. As with the Seneca example, the fewer factors there are to consider, the clearer the visualisations become:

[PCA plot: early, central, and late date groups]
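Binning the dates into three coarse groups is a one-line classification per play. The cut-off years below are invented for illustration; the real groupings would depend on the corpus metadata.

```python
# Assign each composition date to one of three coarse groups,
# matching the three colours in the plot above.
def date_group(year, early_end=1580, central_end=1620):
    if year <= early_end:
        return "early"     # red
    if year <= central_end:
        return "central"   # yellow
    return "late"          # maroon

print([date_group(y) for y in (1560, 1595, 1633)])
```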

Posted in Early Modern Drama, Shakespeare, Visualizing English Print (VEP)

‘the size of it all carries us along’ – a new kind of literary history?



references and links for a presentation by Jonathan Hope

pdf of slides Hope Helsinki 2014


‘the size of it all carries us along’

This Heat, ‘A New Kind of Water’, from Deceit (1981, Rough Trade)


part 1

Early Modern Print: Text Mining Early Printed English

(Anupam Basu et al.: Humanities Digital Workshop, Washington University, St Louis)



William Garrard, The Art of War 1591

Robert Barret, Theorike and Practike of Moderne Warres 1598


part 2

Visualising English Print: 1450-1800

(Mike Gleicher Wisconsin-Madison U, Michael Witmore Folger Shakespeare Library, Jonathan Hope Strathclyde U)





forthcoming papers on the material presented in this section of the talk:

Anupam Basu, Jonathan Hope, and Michael Witmore, ‘Networks and Communities in the Early Modern Theatre’, in Roger Sell and Anthony Johnson (eds), Community-making in Early Stuart Theatres: Stage and Audience (Ashgate)

Michael Witmore, Jonathan Hope, and Michael Gleicher, ‘Digital Approaches to the Language of Shakespearean Tragedy’, in Michael Neill and David Schalkwyk (eds), The Oxford Handbook of Shakespearean Tragedy (Blackwell)



Ted Underwood, 2013, Why Literary Periods Mattered: Historical Contrast and the Prestige of English Studies (Stanford UP)


Goldstone, Andrew, and Ted Underwood, 2014, ‘The Quiet Transformations of Literary Studies: What Thirteen Thousand Scholars Could Tell Us’, New Literary History


Ted Underwood, 2014, ‘Theorizing Research Practices We Forgot to Theorize Twenty Years Ago’, Representations <>.


Alan B. Farmer (forthcoming), ‘Playbooks and the Question of Ephemerality’


Lucy Munro, 2013, Archaic Style in English Literature (CUP)

Posted in Early Modern Drama, Uncategorized, Visualizing English Print (VEP)

The Novel and Moral Philosophy 3: What Does Lennox Do with Moral Philosophy Words?

The previous two posts explored how an eighteenth-century novel uses words from an associated topic to fulfill, and perhaps shape, the expectations of an audience looking to immerse themselves in a life as it is lived. In this post I want to think a little more about the idea that the red words identified by Serendip’s topic model do something exclusively “novel-like” and that the blue words are exclusively “philosophical.” Both sets of words seem, rather, to aim at a common target, since each contributes something distinctive to the common project of rendering a moral perspective on lived experience. I want to caution against thinking of these topics as “signatures” of different genres; they may instead index narrative strategies that criss-cross different types of writing.

Take, for example, the passage from Lennox’s Euphemia that appears toward the bottom of the screen shot below:

[Screenshot: passage from Lennox’s Euphemia]
After relating several details about her relationship with her aunt and uncle, Maria concludes: “BEING in this unfavourable disposition towards me, he [Sir John] was easily persuaded to press me to a marriage, in which my in|clinations were much less consulted than my interests.” This sentence illustrates some of the dynamics that Park described in her earlier post. On the one hand, Maria’s letter immerses the reader in a scene from life, rendering vivid the circumstances that led her uncle to make a fateful decision about Maria’s marriage prospects. Yet at the same time, the narrator dips frequently into the vocabulary of a more removed and somewhat static moral judgment – one that appraises “circumstance” in relation to “actions” and “interest.” The red words, novelistic in our analysis, are the words that show us how something happened: Maria’s uncle Sir John decided to force her into “marriage,” ignoring his niece’s wishes or inclinations because he was in an “unfavourable” disposition that made him more easily “persuaded” to this course of action. (We are getting contextual details – backstory – that make his decision intelligible.) These red, novelistic topic words – marriage, persuaded, unfavorable – are thus necessary for rendering the sequence of events that prompted her change of fortunes. A man was persuaded, his favor had changed, and a marriage ensued.

But the narrative sequence opens up onto a more general possibility for analysis. An abstract noun – “interest” – is offered as the nominal criterion for her uncle’s decision, but in the context of the sentence it seems to gloss the uncle’s reasoning as he might represent it to Maria (“this marriage is in your interest”), not the narrator’s feelings about that reasoning (“it was in my interest”). What we are getting, then, is the narrator’s view of how her uncle made his decision, what circumstances contributed to his thinking, even the abstract concept that he could have invoked in the absence of any residual “natural” sympathy for his niece’s inclinations. One sees, perhaps, a tension between the kinds of abstract nouns that appear in works of moral philosophy – in the screen shot above, “natural” “actions” “circumstance” “interest” – and the concrete terms of relation that render action for us in a more vivid, immediate way.

What is interesting about this passage is that it shows us how flexible the abstract vocabulary of moral philosophy can be when it is introduced into the narrative stream of a novel. In the passage above, Maria tells us that her aunt, Lady Harley, was stung by jealousy when she witnessed Sir John’s pleasure at hearing his niece read. Out of spite, the aunt insinuates that there is a contradiction between the “oppression and faintness” that Maria purportedly has complained of and her manifestly good spirits, which Sir John would otherwise take on face value. Maria then uses the abstract noun “circumstance” to characterize the fact of her good spirits, a fact which Sir John is now (culpably) discounting.

The shift in register becomes necessary because Sir John has abandoned his natural sympathy for Maria and is instead bringing a quasi-judicial process of weighing her actions (thinking “circumstantially”). It’s the intermixture of these fragments of moral reasoning with images of life as it unfolds – a didactic mix of abstract nouns and personal actions – that are allowing Lennox to stage distinct layers of sympathy and indifference, serving them all up for the reader’s observation. The shift to moral evaluation is even more decisive in the following passage from letter V, in which Maria tells Euphemia how Sir James came to doubt her aunt’s deprecations and once again view his niece in a favorable light:

[Screenshot: passage from letter V of Euphemia]

Maria is moving into the realm of generalization (“I have often observed…”), and this shift requires the writer to “investigate” the ways in which Sir James was led to “compare” Maria’s behavior with a secondhand “picture” that has been drawn of her “disposition” by her aunt. These blue words might be seen as pivots in a process of moral judgment – the same process that the novel’s reader had to employ in evaluating Sir James’ earlier souring on his niece. Because this process itself is now the subject of narration, it is not surprising that the vocabulary needs to be more structured and abstract.

In using Serendip to explore how Euphemia behaves linguistically qua novel, then, we must start with the idea that novels mix the vocabularies of these two topics in order to layer points of view and to involve the reader, experientially, in a world where actions have moral significance. Moral philosophy words (blue) are important because they mark occasions where that state of experiential immersion has been temporarily deflected onto some explicitly moralizing, explicitly generalizing consciousness, a consciousness which may or may not be that of the narrator. Regardless of its origin, the capacity of that consciousness to withdraw temporarily from the particulars of the narrative and to render judgment on a kind of act seems a crucial aspect of the novel’s program, which Julie Park described in her previous post in terms of the novel’s epistolarity and emphasis on sensibility.

We can say, moreover, that this procedure of mixing words from these two topics also occurs in formal works of moral philosophy. Consider the passage from Smith’s Theory of Moral Sentiments below:

[Screenshot: passage from Smith’s Theory of Moral Sentiments]
In this passage, Smith is describing the way in which a man – any man whatever – will alter his treatment of his friends if suddenly elevated in social status. Such a man becomes insolent and petulant, which is why Smith believes that one should slow one’s social rise whenever possible. “He is happiest,” Smith writes, “who advances more gradually to greatness, whom the public destines to every step of his preferment long before he arrives at it…” Smith is encouraging his audience to pass judgment on a drama whose characters are never rendered concrete, characters whose actions illustrate a concept. The closest Smith gets to a novelistic treatment of the life world occurs just after he has presented his maxim above. Instead of calculating and re-calculating one’s standing among friends, Smith writes, one should find “satisfaction in all the little occurrences of common life, in the company with which we spent the evening last night.” Smith modulates into the red here, drawing words from the life world as if he himself is reporting on events in his own life just the night before, events which ground and so justify the moral pleasure he takes in them precisely because they are not bloodless and calculating. Smith has, for a sentence or two, become an epistolary novelist, and it is this sudden (and relatively rare) excursion into the every day – the world of “last night” – that allows him to show the difference between happiness and its opposite.

As an excursion, this passage has to be brief. There is “a lot of blue” in moral philosophy because, as philosophy, it needs to be systematic – indifferent, in other words, to the most particular details of the life world. But the subject of this philosophy is certainly the stuff of novels: dramas of sympathy, judgments of circumstances and the precise analysis of the qualities and intentions suffusing different acts (including the quality of failing to be concrete in one’s observations). If the burden of system building were relaxed, Smith too might write volubly about the “satisfactions” one finds “in the little occurrences of common life.”

Posted in Visualizing English Print (VEP)