Category: Shakespeare

  • Macbeth: The State of Play

We have a new chapter on the language of Macbeth in this book from Arden. The chapter surveys previous work on the language of the play, and then offers some new analysis of our own, chiefly using WordHoard. Along the way, we consider the role of word frequency in literary analysis, especially the word ‘the’ in Macbeth (we also think about word frequency in this post). Of course you are going to buy the book, which is currently (February 2014) available at a reduced price at the link above, but here is a pre-print of our chapter.

    UPDATE: ‘the’ is attracting a lot of attention. Here is Bill Benzon discussing Matt Jockers’ discussion of it in Macroanalysis, and here is Mark Liberman responding, with references to other work on different rates of ‘the’ in language.

ISBN 9781472503206

INTRODUCTION Ann Thompson

THE TEXT AND ITS STATUS

    Notes and Queries Concerning the Text of Macbeth Anthony B. Dawson

    Dwelling ‘in doubtful joy’: Macbeth and the Aesthetics of Disappointment Brett Gamboa

     

    HISTORY AND TOPICALITY

    Politic Bodies in Macbeth Dermot Cavanagh

    ‘To crown my thoughts with acts’: Prophecy and Prescription in Macbeth Debapriya Sarkar

    Lady Macbeth, First Ladies and the Arab Spring: The Performance of Power on the Twenty-First Century Stage Kevin A. Quarmby

     

    CRITICAL APPROACHES AND CLOSE READING

    ‘A walking shadow’: Place, Perception and Disorientation in Macbeth Darlene Farabee

    Cookery and Witchcraft in Macbeth Geraldo U. de Sousa

    The Language of Macbeth Jonathan Hope and Michael Witmore

     

    ADAPTATION AND AFTERLIFE

    The Shapes of Macbeth: The Staged Text Sandra Clark

    Raising the Violence while Lowering the Stakes: Geoffrey Wright’s Screen Adaptation of Macbeth Philippa Sheppard

    The Butcher and the Text: Adaptation, Theatricality and the ‘Shakespea(Re)-Told’ Macbeth Ramona Wray

  • What happens in Hamlet?

We perform digital analysis on literary texts not to answer questions, but to generate questions. The questions digital analysis can answer are generally not ‘interesting’ in a humanist sense; but the questions it provokes often are. And these questions have to be answered by ‘traditional’ literary methods. Here’s an example.

Dr Farah Karim-Cooper, head of research at Shakespeare’s Globe, just asked on Twitter whether I had any suggestions for a lecture on Hamlet she was due to give. Ten minutes later I had some ‘interesting’ questions for her.

I began with WordHoard’s log-likelihood function, comparing Hamlet to the rest of Shakespeare’s plays. You can view the results as a tag cloud:

     

A tag cloud: looks good, immediate, doesn’t tell you much.
Tag cloud for Hamlet vs the rest of Shakespeare: black words are raised in frequency; grey words lowered; size indicates strength of effect.

    which is nice, but for real text analytics you need to read the spreadsheet of figures. Word-frequency analysis is limited in many ways, but it can surprise you if you look in the right places and at the right things.

Not nice to look at, but much more information.

     

    When I run log-likelihood, I always look first for the items that are lower than expected, rather than those that are raised (which tend to be content words associated with the topic of the text, and thus fairly obvious). I also tend to look at function words (pronouns, articles, auxiliary verbs) rather than nouns or adjectives.

If you look for absences of high-frequency items, you are using digital text analysis to do the things it does best compared to human reading: picking up absence, and analysing high-frequency items. Humans are good at spotting the presence of low-frequency items, items that disrupt a pattern (outliers, in statistical terms) – but we are not good at noticing things that are not there (dogs that don’t bark in the night) and we are not good at seeing woods (we see trees, especially unusual trees).

The Hamlet results were pretty outstanding in this respect: very high up the list, with three stars indicating very strong statistical significance, is a minus result for the pronoun ‘I’. A check across the figures shows that ‘I’ occurs in Hamlet about 184 times every 10,000 words (see the column headed ‘Analysis parts per 10,000’ – Hamlet is the ‘analysis text’ here), whereas in the rest of Shakespeare it occurs about 228 times every 10,000 words (see the column headed ‘Reference parts per 10,000’ – the reference corpus is the rest of Shakespeare) – so every 10,000 words of Hamlet have about 40 fewer ‘I’ pronouns than we’d expect.

     

Or, to put it another way, Shakespeare normally uses ‘I’ 228 times every 10,000 words. Hamlet is about 30,000 words long, so we’d expect, all other things being equal, that Shakespeare would use ‘I’ 684 times. In fact, he uses it just 546 times – and WordHoard checks the figures to see whether this drop could be due to chance or normal variation. The three stars next to the log-likelihood score for ‘I’ tell us that this figure is very unlikely to be due to chance – something is causing the drop.
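
If you want to check the sums, here is a minimal sketch in Python of the calculation involved – Dunning’s log-likelihood (G2) for a single word across two corpora, which is the statistic WordHoard reports. The Hamlet figures are the ones quoted above; the rest-of-canon word total (roughly 770,000) is an illustrative assumption, not WordHoard’s exact count:

```python
import math

def log_likelihood(count_a, total_a, count_b, total_b):
    """Dunning's log-likelihood (G2) for one word across two corpora."""
    combined_rate = (count_a + count_b) / (total_a + total_b)
    g2 = 0.0
    for observed, total in ((count_a, total_a), (count_b, total_b)):
        expected = total * combined_rate
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# 'I' in Hamlet vs the rest of Shakespeare. 546 and 30,000 come from the
# post; 770,000 is an assumed word count for the rest of the canon.
rest_total = 770_000
rest_count = round(228 / 10_000 * rest_total)   # 228 per 10,000 words
g2 = log_likelihood(546, 30_000, rest_count, rest_total)
print(f"G2 = {g2:.1f}")   # ~29, far beyond 10.83, the cut-off for p < 0.001
```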

    Digital analysis can’t explain the cause of the drop: the only question it is answering here is, ‘How frequently does Shakespeare use “I” in Hamlet compared to his other plays?’. On its own, this is not a very interesting question. But the analysis provokes the much more interesting question, ‘Why does Shakespeare use “I” far less frequently in Hamlet than normal?’.

    Given literary-critical claims that Hamlet marks the birth of the modern consciousness, it is surprising to find a drop in the frequency of first-person forms. But for an explanation of why this might happen, you’ll have to attend Dr Karim-Cooper’s lecture, ask on Twitter: @DrFarahKC – or go back to the play yourself.

     

     

     

  • Shakespeare’s mythic vocabulary – and his invisible grammar

Universities in the UK are under pressure to demonstrate the ‘impact’ of their research. In many ways, this is fair enough: public taxes account for the vast majority of UK university income, so it is reasonable for the public to expect academics to attempt to communicate with them about their work.

University press offices have become more pro-active in seeking out stories to present to the media as a way of raising the profile of institutions. Recently, the Strathclyde press office contacted me after they read one of my papers on Strathclyde’s internal research database: they wanted to do a press release to see if any outlets would follow up on the story.

The paper they’d read was a survey article I’d written for an Open University course reader. My article reported recent papers by Hugh Craig and by Ward Elliott & Robert Valenza, which demolish some common myths about Shakespeare’s vocabulary (its size and originality – and see Holger Syme on this too) – and went on to argue that Shakespeare’s originality might lie in his grammar, rather than in his vocabulary: words which, for the most part, he did not make up.

    Indeed they did want to pick up on the story, though I’d have preferred the article to have been a bit clearer, and not to have had a headline that was linguistic nonsense. The Huffington Post did a bit better.

One particularly galling aspect of the stories: the articles failed to attribute the work on Shakespeare’s vocabulary to Craig or to Elliott and Valenza, so it might have looked as though I was taking credit for other people’s work.

    Looking back, I don’t think I explained my ideas very well either to Strathclyde’s press office, or to the Daily Telegraph when they rang – hence the rather confused reports. But I was extremely careful to attribute the work to those who had done it – even to the point of sending my original text to the journalist I talked to, and pointing him to the relevant footnote. I did not expect a news story to contain full academic references of course – but a clearly written story could easily have mentioned the originators of the work.

A minor episode, but it also made me think that there is a fundamental problem with trying to explain complex linguistic issues in the daily press – even if you use Newcastle United’s greatest goalscorers to illustrate the statistics. They want a clear story: you want to get the nuances across. Luckily, this blog allows me to make the full text of my article available (click through twice for a pdf):

    Shakespeare and the English Language

     

    Jonathan Hope, Strathclyde University, Glasgow, February 2012

  • The very strange language of A Midsummer Night’s Dream

    I just got back from a fun and very educative trip to Shakespeare’s Globe in London, hosted by Dr Farah Karim-Cooper, who is director of research there.

    The Globe stages an annual production aimed at schools (45,000 free tickets have been distributed over the past five years), and this year’s play is A Midsummer Night’s Dream. I was invited down to discuss the language of the play with the cast and crew as they begin rehearsals.

This was a fascinating opportunity for me to test our visualisation tools and analysis on a non-academic audience – and the discussions I had with the actors opened my eyes to applications of the tools we hadn’t considered before. They also came up with a series of sharp observations about the language of the play in response to the linguistic analysis.

I began with WordHoard, a tool developed by Martin Mueller’s team at Northwestern University, as a way of getting a quick overview of the lexical patterns in the play and introducing people to thinking statistically about language.

Here’s the wordcloud WordHoard generates for a log-likelihood analysis of MSND compared with the whole Shakespeare corpus:

     


Log-likelihood takes the frequencies of words in one text (in this case MSND) and compares them with the frequencies of words in a comparison, or reference, sample (in this case, the whole Shakespeare corpus). It identifies the words that are used significantly more or less frequently in the analysis text than would be expected given the frequencies found in the comparison sample. In the wordcloud, the size of a word indicates how strongly its frequency departs from the expected. Words in black appear more frequently than we would expect, and words in grey appear less frequently.
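
For those who want the mechanics, here is a rough Python sketch of such a comparison: every word in the vocabulary gets a signed G2 score – positive if raised in the analysis text, negative if lowered – which is essentially the data behind the wordcloud and the spreadsheet. The token lists (msnd_tokens, rest_tokens) are hypothetical stand-ins for the two tokenised corpora:

```python
import math
from collections import Counter

def signed_g2(count_a, total_a, count_b, total_b):
    """Dunning G2, negated when the word is less frequent than expected."""
    rate = (count_a + count_b) / (total_a + total_b)
    g2 = 2 * sum(o * math.log(o / (n * rate))
                 for o, n in ((count_a, total_a), (count_b, total_b)) if o > 0)
    return g2 if count_a / total_a >= count_b / total_b else -g2

def keywords(analysis_tokens, reference_tokens):
    """Rank the whole vocabulary by |G2|, strongest effects first."""
    a, b = Counter(analysis_tokens), Counter(reference_tokens)
    n_a, n_b = sum(a.values()), sum(b.values())
    scored = [(signed_g2(a[w], n_a, b[w], n_b), w) for w in a.keys() | b.keys()]
    return sorted(scored, key=lambda sw: abs(sw[0]), reverse=True)

# Usage with real token lists – sign gives black/grey, |G2| gives size:
# for score, word in keywords(msnd_tokens, rest_tokens)[:30]:
#     print(f"{word:12s} {score:+8.1f}")
```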

As is generally the case with log-likelihood tests, the words showing the most powerful effects here are nouns associated with significant plot elements: ‘fairy’, ‘wall’, ‘moon’, ‘lion’ etc. If you’ve read the play, it is not hard to explain why these words are used in MSND more than in the rest of Shakespeare – and you really don’t need a computer, or complex statistics, to tell you that. To paraphrase Basil Fawlty, so far, so bleeding obvious.

Where log-likelihood results normally get more interesting – or puzzling – is in results for function words (pronouns, auxiliary verbs, prepositions, conjunctions) and in those words that are significantly less frequent than you’d expect.

    Here we can see some surprising results: why does Shakespeare use ‘through’ far more frequently in this play than elsewhere? Why are the masculine pronouns ‘he’ and ‘his’ used less frequently? (And is this linked to the low use of ‘lord’?) Why is ‘it’ rare in the play? And ‘they’ and ‘who’ and ‘of’?

    At this stage we started to look at our results from Docuscope for the play, visualised using Anupam Basu’s LATtice.

     

     

The heatmap shows all of the Folio plays compared to each other: the darker a square is, the more similar the plays are linguistically. The diagonal of black squares running from bottom left to top right marks the points in the map where plays are ‘compared’ to themselves: the black indicates identity. Plays are arranged up the left-hand side of the square in ascending chronological order, from Comedy of Errors at the bottom to Henry VIII at the top – the sequence then repeats across the top from left to right – so the black square at the bottom left is Comedy of Errors compared to itself, while the black square at the top right is Henry VIII compared to itself.
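
For readers who want to see how such a map is put together, here is a minimal Python sketch of the plotting idea – not LATtice’s actual code. It assumes a per-play matrix of LAT frequencies; random numbers stand in for real Docuscope output:

```python
import numpy as np
import matplotlib.pyplot as plt

# One row per play, in chronological order (Comedy of Errors first,
# Henry VIII last); columns are LAT frequencies. Random stand-in data.
rng = np.random.default_rng(0)
n_plays, n_lats = 36, 40
lat_freqs = rng.random((n_plays, n_lats))

# Euclidean distance between every pair of plays.
diff = lat_freqs[:, None, :] - lat_freqs[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=-1))

# Low distance plots dark, so each play against itself (the diagonal)
# comes out black; origin='lower' puts the first play at the bottom left.
plt.imshow(dist, cmap="gray", origin="lower")
plt.title("Pairwise linguistic distance (darker = more similar)")
plt.show()
```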

One of the first things we noticed when Anupam produced this heatmap was the two plays which stand out as being unlike almost all of the others, producing four distinct light lines that divide the square of the map into nine roughly equal smaller squares:

     

    These two anomalous plays are Merry Wives of Windsor (here outlined in blue) and A Midsummer Night’s Dream (yellow). It is not so surprising to find Wives standing out, given the frequent critical observation that this play is generically and linguistically unusual for Shakespeare: but A Midsummer Night’s Dream is a result we certainly would not have predicted.

    This visualisation of difference certainly caught the actors’ attention, and they immediately focussed in on the very white square about 2/3 of the way along the MSND line (here picked out in yellow):

     

    So which play is MSND even less like than all of the others? A tragedy? A history? Again, the answer is not one we’d have guessed: Measure for Measure.

    This is a good example of how a visualisation can alert you to a surprising finding. We would never have intuited that MSND was anomalous linguistically without this heatmap. It is also a good example of how visualisations should send you back to the data: we now need to investigate the language of MSND to explain what it is that Shakespeare does, or does not do, in this play that makes it stand out so clearly. The visualisation is striking – and it allowed the cast members to identify an interesting problem very quickly – but the visualisation doesn’t give us an explanation for the result. For that we need to dig a bit deeper.

    One of the most useful features of LATtice is the bottom right window, which identifies the LATs that account for the most distance between two texts:

     

This is a very quick way of finding out what is going on – and here the results point us to two LATs which are much more frequent in MSND than in Measure for Measure: SenseObject and SenseProperty. SenseObject picks up concrete nouns, while SenseProperty codes for adjectives describing their properties. A quick trip to the LATtice box plot screen (on the left of these windows):

     

    confirms that MSND (red dots) is right at the top end of the Shakespeare canon for these LATs (another surprise, since we’ve got used to thinking of these LATs as characteristic of History), while Measure for Measure (blue dots) has the lowest rates in Shakespeare for these LATs.
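
Here is a guess, in Python, at the kind of computation behind that ‘most distance’ window: take two plays’ LAT rate vectors and ask which features contribute most to the squared distance between them. The four LATs are categories mentioned in this post, but the rates are made-up numbers for illustration:

```python
import numpy as np

lat_names = ["SenseObject", "SenseProperty", "DirectAddress", "Question"]
msnd    = np.array([4.1, 2.3, 1.0, 0.6])   # hypothetical rates per 100 words
measure = np.array([1.8, 0.9, 1.9, 1.2])

# Each LAT's share of the squared Euclidean distance between the two plays.
contrib = (msnd - measure) ** 2
for name, share in sorted(zip(lat_names, contrib / contrib.sum()),
                          key=lambda pair: -pair[1]):
    print(f"{name:15s} {share:6.1%}")
```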

    So Docuscope findings suggest that MSND is a play concerned with concrete objects and their descriptions – another counter-intuitive finding given the associations most of us have with the supposed ethereal, fairy, dream-like atmosphere of the play. Cast members were fascinated by this and its possible implications for how they should use props – and someone also pointed out that many of the names in the play are concrete nouns (Quince, Bottom, Flute, Snout, Peaseblossom, Cobweb, Mote and so on) – what is the effect on the audience of this constant linguistic wash of ‘things’?

    Here is a screenshot from Docuscope with SenseObject and SenseProperty tokens underlined in yellow. Reading these tokens in context, you realise that many of these concrete objects and qualities, in this section at least, are fictional in the world of the play. A wall is evoked – but it is one in a play, represented by a man. Despite the frequency of SenseObject in this play, we should be wary of assuming that this implies the straightforward evocation of a concrete reality (try clicking if you need to enlarge):

     

Also raised in MSND are LATs to do with locating and describing space: Motions and SpaceRelation (as suggested by our log-likelihood finding for ‘through’?). So accompanying the focus on things is a focus on describing location and movement – perhaps, someone suggested, because the characters are so often unsure of their location? (In the following screenshot, Motions and SpaceRelation tokens are underlined in yellow.)

     

     

Moving on, we also looked at those LATs that are relatively absent from MSND – and here the findings were very interesting indeed. We have seen that MSND does not pattern like a comedy – and the main reason for this is that it lacks the highly interactive language we expect in Shakespearean comedy: DirectAddress and Question are lowered. So too are PersonPronoun (which picks up third-person pronouns, and matches our log-likelihood finding for ‘he’ and ‘his’) and FirstPerson – indeed, all types of pronoun are less frequent in the play than is normal for Shakespeare. At this point one of the actors suggested that the lack of pronouns might be because full names are used constantly – she’d noticed in rehearsal how often she was using characters’ names – and we wondered if this was because the play’s characters are so frequently uncertain of their own, and others’, identities.

    Also lowered in the play is PersonProperty, the LAT which picks up familial roles (‘father’, ‘mother’, ‘sister’ etc) and social ones (job titles) – if you add this to the lowered rate of pronouns, then a rather strange social world starts to emerge, one lacking the normal points of orientation (and the play is also low on CommonAuthority, which picks up appeals to external structures of social authority – the law, God, and so on).

    The visualisation, and Docuscope screens, provoked a discussion I found fascinating: we agreed that the action of the play seems to exist in an eternal present. There seems to be little sense of future or past (appropriately for a dream) – and this ties in with the relative absence of LATs coding for past tense and looking back. As the LATtice heatmap first indicated, MSND is unlike any of the recognised Shakespearean genres – but digging into the data shows that it is unlike them in different ways:

    • It is unlike comedy in its lack of features associated with verbal interaction
    • It is unlike tragedy in its lack of first person forms (though it is perhaps more like tragedy than any other genre)
    • It is unlike history in its lack of CommonAuthority

    Waiting for my train back to Glasgow (at the excellent Euston Tap bar near Euston Station), I tried to summarize our findings in four tweets (read them from the bottom, up!):

     

     

    I’ll try to keep in touch with the actors as they rehearse the play – this was a lesson for me in using the tools to spark an investigation into Shakespeare’s language, and I can now see that we could adapt these tools to various educational settings (including schools and rehearsal rooms!).

    Jonathan Hope February 2012

  • Finding the Sherlock in Shakespeare: some ideas about prose genre and linguistic uniqueness

    An unexpected point of linguistic similarity between detective fiction and Shakespearean comedy recently led me to consider some of the theoretical implications of tools like DocuScope, which frequently identify textual similarities that remain invisible in the normal process of reading.

    A Linguistic Approach to Suspense Plot

Playing around with a corpus of prose, we discovered that the linguistic specs associated with narrative plot are surprisingly distinctive. Principal Component Analysis performed on the linguistic features counted by DocuScope suggested the following relationship between the items in the corpus:

I interpreted the two strongest axes of differentiation seen in the graph (PC 1 and PC 2) as (1) narrative, and (2) plot. The two poles of the narrative axis are Wuthering Heights (most narrative) and The Communist Manifesto (least narrative). The plot axis is slightly more complicated. But on the narrative side of the spectrum, plot-driven mysteries like “The Speckled Band” and The Canterville Ghost score high on plot, while the least plotted narrative is Samuel Richardson’s Clarissa (9 vols.). For now, I won’t speculate about why Newton’s Opticks scores so astronomically high on plot. It is enough that, when dealing with narrative, PC 2 predicts plot.
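
The mechanics here are standard, if you are curious: Principal Component Analysis over a texts-by-features matrix. A minimal sketch under that assumption follows – random numbers stand in for the real DocuScope counts, and reading PC 1 as ‘narrative’ and PC 2 as ‘plot’ is my interpretation, not something the algorithm supplies:

```python
import numpy as np
from sklearn.decomposition import PCA

texts = ["Wuthering Heights", "The Communist Manifesto",
         "The Speckled Band", "Clarissa", "Opticks"]
rng = np.random.default_rng(1)
features = rng.random((len(texts), 100))   # stand-in DocuScope frequencies

pca = PCA(n_components=2)
coords = pca.fit_transform(features)       # PC 1 ~ 'narrative', PC 2 ~ 'plot'
for text, (pc1, pc2) in zip(texts, coords):
    print(f"{text:25s} PC1={pc1:+.3f}  PC2={pc2:+.3f}")
```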

    The fact that something as qualitative and amorphous as plot has a quantitative analogue leads to several questions about the meaning of the data tools like DocuScope turn up.

    Linguistic Plot without Actual Plot

Because linguistic plot is quantifiable, it allows us to look for passages where plot is present to a greater or lesser degree. Given a large enough sample, it is more than likely that some relatively plotted passages will occur in texts that are not plotted in any normal sense. This would, at minimum, raise questions about how to handle genre boundaries in digital literary research.

    Our relative-emplotment test (done in TextViewer) yielded intuitive results when performed on the dozen or so stories in The Adventures of Sherlock Holmes: the passages exhibiting the strongest examples of linguistic plot generally narrated moments of discovery, and moved the actual plot forward in significant ways. Often, these passages showed Holmes and Watson bursting into locked rooms and finding bodies.
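
TextViewer’s internals are not described here, but the test can be imagined along these lines: score each sliding-window passage by projecting its feature vector onto the corpus-wide ‘plot’ component, then read the highest-scoring windows. A hedged sketch, with hypothetical variable names:

```python
import numpy as np

def emplotment_scores(passage_features, plot_axis):
    """Project per-passage feature vectors onto the 'plot' component.

    passage_features: one row of DocuScope feature rates per sliding window
    plot_axis: the PC 2 loading vector from the corpus-wide PCA
    """
    centred = passage_features - passage_features.mean(axis=0)
    return centred @ plot_axis

# Hypothetical usage: split each Holmes story into 1,000-word windows,
# count features per window, then pull out the most 'plotted' passages.
# scores = emplotment_scores(window_features, pc2_loadings)
# top_windows = np.argsort(scores)[::-1][:5]
```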

When we performed the same test on the Shakespeare corpus, something intriguing happened. The passages identified by TextViewer as exhibiting linguistic plot look very different from the corresponding passages in Sherlock Holmes. There were no dead bodies, no broken-down doors, and no exciting discoveries. Nonetheless, the ‘plotted’ Shakespeare scenes were remarkably consistent with each other. Perhaps most significantly in the context of their genre, these scenes had a strong tendency to show characters putting on performances for other characters. Additionally – and this is fascinating even if it is probably a red herring – the ‘plotted’ Shakespeare scenes had an equally strong tendency to involve fairies.

    The consistent nature of the ‘plotted’ Shakespeare scenes suggests that the linguistic specs associated with plot when they occur in Sherlock Holmes may have different, but equally specific, effects in other genres. The next step would be to find a meaningful correspondence between the two seemingly disparate literary devices that accompany linguistic plot – detectives bursting into rooms to solve murders, and plays within plays involving fairies. I have some hunches about this. But in many ways the more important question is what is at stake in using DocuScope to identify such unexpected points of overlap.

    Enough measurable links between seemingly unlike texts could suggest an invisible web of cognates, which share an underlying structure despite their different appearances and literary classifications. Accordingly, we might hypothesize that reading involves selective ignorance of semantic similarities that could otherwise lead to the socially deviant perception that A Midsummer Night’s Dream resembles a Sherlock Holmes mystery.

The question, then, is this: if the act of reading consists in part of ignoring unfruitful similarities, then what happens when these similarities nonetheless become apparent to us? Looking back at the corpus graph, we begin to see all sorts of possibilities, many of which would be enough to make us literary outcasts if voiced in the wrong company. Could Newton’s Opticks contain the most exciting suspense plot no one has ever noticed? Could Martin Luther be secretly more sentimental than Clarissa?

    Estranging Capacities of Digital Cognates

I have been using the term ‘cognate’ to describe the relationship between linguistically similar but otherwise dissimilar texts. These correspondences will only be meaningful if we can connect them in a plausible way to our readerly understanding of the texts or genres in question. In the case of detective fiction and Shakespearean comedy, this remains to be seen. But our current lack of an explanation does not mean we should feel shy about pursuing the cognates computers direct us to. My analogy is the pop-culture ritual of watching The Wizard of Oz while starting the Pink Floyd album Dark Side of the Moon on the third roar of the MGM lion. The movie and the record sync up in a plausible pattern, prompting the audience to grasp a connection between the cognate genres of children’s movies and psychedelic rock.

    If digital methods routinely direct our attention to patterns we would never notice in the normal process of reading, then we can expect them to turn up a large number of such cognates. If we want to understand the results these tools are turning up, we should develop a terminology and start thinking about implications – not just for the few correspondences we can explain, but also for the vast number we cannot explain, at least right now.