We perform digital analysis on literary texts not to answer questions, but to generate them. The questions digital analysis can answer are generally not ‘interesting’ in a humanist sense, but the questions it provokes often are. And these questions have to be answered by ‘traditional’ literary methods. Here’s an example.
Dr Farah Karim-Cooper, head of research at Shakespeare’s Globe, just asked on Twitter if I had any suggestions for a lecture on Hamlet she was due to give. Ten minutes later I had some ‘interesting’ questions for her.
which is nice, but for real text analytics you need to read the spreadsheet of figures. Word-frequency analysis is limited in many ways, but it can surprise you if you look in the right places and at the right things.
When I run log-likelihood, I always look first for the items that are lower than expected, rather than those that are raised (which tend to be content words associated with the topic of the text, and thus fairly obvious). I also tend to look at function words (pronouns, articles, auxiliary verbs) rather than nouns or adjectives.
If you look for absences of high-frequency items, you are using digital text analysis to do the things it does best compared to human reading: picking up absence, and analysing high-frequency items. Humans are good at spotting the presence of low-frequency items, items that disrupt a pattern (outliers, in statistical terms), but we are not good at noticing things that are not there (dogs that don’t bark in the night) and we are not good at seeing woods (we see trees, especially unusual trees).
The Hamlet results were pretty outstanding in this respect: very high up the list, with 3 stars, indicating very strong statistical significance, is a minus result for the pronoun ‘I’. A check across the figures shows that ‘I’ occurs in Hamlet about 184 times every 10,000 words (see the column headed ‘Analysis parts per 10,000’; Hamlet is the ‘analysis text’ here), whereas in the rest of Shakespeare it occurs about 228 times every 10,000 words (see the column headed ‘Reference parts per 10,000’; the reference corpus is the rest of Shakespeare). So every 10,000 words in Hamlet have about 44 fewer ‘I’ pronouns than we’d expect.
Or, to put it another way, Shakespeare normally uses ‘I’ 228 times every 10,000 words. Hamlet is about 30,000 words long, so we’d expect, all other things being equal, that Shakespeare would use ‘I’ 684 times. In fact, he uses it just 546 times, a shortfall of 138, and WordHoard checks the figures to see if we could expect this drop due to chance or normal variation. The three stars next to the log-likelihood score for ‘I’ tell us that this figure is very unlikely to be due to chance: something is causing the drop.
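For anyone who wants to follow the arithmetic, here is a minimal sketch in Python. It is not WordHoard's own code: it uses one common two-term form of the log-likelihood (G²) statistic for comparing a word's frequency in two corpora, and WordHoard's exact calculation may differ. The Hamlet figures (546 uses of ‘I’ in about 30,000 words) and the reference rate (228 per 10,000) come from the discussion above; the size of the reference corpus, `ref_size`, is a round-number assumption of mine, and the reference count is derived from it.

```python
import math


def expected_count(ref_rate_per_10k, analysis_size):
    """Occurrences we would expect in the analysis text at the reference rate."""
    return ref_rate_per_10k * analysis_size / 10_000


def log_likelihood(a, b, c, d):
    """Two-term G2 for one word.

    a: count in analysis text     c: size of analysis text (words)
    b: count in reference corpus  d: size of reference corpus (words)
    """
    total = a + b
    e1 = c * total / (c + d)  # expected count in analysis text
    e2 = d * total / (c + d)  # expected count in reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


# Figures taken from the text above.
hamlet_size = 30_000      # approximate length of Hamlet in words
hamlet_i = 546            # occurrences of 'I' in Hamlet
ref_rate = 228            # 'I' per 10,000 words in the rest of Shakespeare

# ASSUMPTION: illustrative size for the reference corpus, not a WordHoard figure.
ref_size = 850_000
ref_i = round(ref_rate * ref_size / 10_000)

expected = expected_count(ref_rate, hamlet_size)
shortfall = expected - hamlet_i
g2 = log_likelihood(hamlet_i, ref_i, hamlet_size, ref_size)

# 10.83 is the chi-squared critical value for p < 0.001 with 1 degree of freedom.
print(f"expected: {expected:.0f}, observed: {hamlet_i}, shortfall: {shortfall:.0f}")
print(f"G2 = {g2:.2f}, significant at p < 0.001: {g2 > 10.83}")
```

Under these assumptions the score comes out well above the 0.001 threshold, which is consistent with the strong significance reported above. A different reference size would change the exact score, so treat the number as indicative only.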
Digital analysis can’t explain the cause of the drop: the only question it is answering here is, ‘How frequently does Shakespeare use “I” in Hamlet compared to his other plays?’. On its own, this is not a very interesting question. But the analysis provokes the much more interesting question, ‘Why does Shakespeare use “I” far less frequently in Hamlet than normal?’.
Given literary-critical claims that Hamlet marks the birth of the modern consciousness, it is surprising to find a drop in the frequency of first-person forms. But for an explanation of why this might happen, you’ll have to attend Dr Karim-Cooper’s lecture, ask her on Twitter (@DrFarahKC), or go back to the play yourself.