Tag: log likelihood

  • What happens in Hamlet?

    We perform digital analysis on literary texts not to answer questions, but to generate questions. The questions digital analysis can answer are generally not ‘interesting’ in a humanist sense: but the questions digital analysis provokes often are. And these questions have to be answered by ‘traditional’ literary methods. Here’s an example.

    Dr Farah Karim-Cooper, head of research at Shakespeare’s Globe, just asked on Twitter if I had any suggestions for a lecture on Hamlet she was due to give. Ten minutes later I had some ‘interesting’ questions for her.

    I began with Wordhoard’s log-likelihood function, comparing Hamlet to the rest of Shakespeare’s plays. You can view the results of this as a tag cloud:

     

    [Image: a tag cloud: looks good, immediate, doesn't tell you much]
    Tag cloud for Hamlet vs the rest of Shakespeare: black words are raised in frequency; grey words lowered; size indicates strength of effect

    which is nice, but for real text analytics you need to read the spreadsheet of figures. Word-frequency analysis is limited in many ways, but it can surprise you if you look in the right places and at the right things.

    [Image: spreadsheet of results: not nice to look at, but much more information]

    When I run log-likelihood, I always look first for the items that are lower than expected, rather than those that are raised (which tend to be content words associated with the topic of the text, and thus fairly obvious). I also tend to look at function words (pronouns, articles, auxiliary verbs) rather than nouns or adjectives.

    If you look for absences of high-frequency items, you are using digital text analysis to do the things it does best compared to human reading: picking up absence, and analysing high-frequency items. Humans are good at spotting the presence of low frequency items, items that disrupt a pattern (outliers, in statistical terms) – but we are not good at noticing things that are not there (dogs that don’t bark in the night) and we are not good at seeing woods (we see trees, especially unusual trees).

    The Hamlet results were pretty outstanding in this respect: very high up the list, with three stars indicating very strong statistical significance, is a minus result for the pronoun ‘I’. A check across the figures shows that ‘I’ occurs in Hamlet about 184 times every 10,000 words (see the column headed ‘Analysis parts per 10,000’; Hamlet is the ‘analysis text’ here), whereas in the rest of Shakespeare it occurs about 228 times every 10,000 words (see the column headed ‘Reference parts per 10,000’; the reference corpus is the rest of Shakespeare). So every 10,000 words in Hamlet have about 44 fewer ‘I’ pronouns than we’d expect.

     

    Or, to put it another way, Shakespeare normally uses ‘I’ 228 times every 10,000 words. Hamlet is about 30,000 words long, so we’d expect, all other things being equal, that Shakespeare would use ‘I’ 684 times. In fact, he uses it just 546 times – and Wordhoard checks the figures to see if we could expect this drop due to chance or normal variation. The three stars next to the log likelihood score for ‘I’ tell us that this figure is very unlikely to be due to chance – something is causing the drop.
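
    The arithmetic above, and the kind of check a log-likelihood test performs, can be sketched in a few lines of Python. This is only an illustration, not Wordhoard’s own code: it assumes the common two-corpus G² formulation of log likelihood, and because the size of the reference corpus is not given here, the figure of 800,000 words (and the reference count derived from it) is a made-up placeholder.

```python
import math

def expected_count(rate_per_10k, n_words):
    """Occurrences we would expect in n_words at a given rate per 10,000 words."""
    return rate_per_10k * n_words / 10_000

def log_likelihood(a, b, c, d):
    """Two-corpus G2 score (an assumed formulation, not necessarily Wordhoard's).

    a: count of the word in the analysis text
    b: count of the word in the reference corpus
    c: total words in the analysis text
    d: total words in the reference corpus
    Returns (G2, sign), where sign is '+' if the word is raised in the
    analysis text relative to the pooled rate and '-' if it is lowered.
    """
    e1 = c * (a + b) / (c + d)  # expected count in the analysis text
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    total = 0.0
    if a > 0:
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2 * total, ("+" if a > e1 else "-")

# Figures from the post: Hamlet is about 30,000 words, 'I' occurs 546 times,
# and the rest of Shakespeare uses 'I' about 228 times per 10,000 words.
hamlet_words = 30_000
hamlet_i = 546
print(expected_count(228, hamlet_words))

# ASSUMPTION: the reference corpus size is not stated in the post;
# 800,000 words is a hypothetical value chosen only to make the sketch run.
reference_words = 800_000
reference_i = round(expected_count(228, reference_words))

g2, sign = log_likelihood(hamlet_i, reference_i, hamlet_words, reference_words)
print(sign, round(g2, 1))
```

    Under these assumptions the score comes out well above 10.83, the conventional chi-squared critical value for p < 0.001 with one degree of freedom, and the sign is negative, which matches the kind of result described above.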

    Digital analysis can’t explain the cause of the drop: the only question it is answering here is, ‘How frequently does Shakespeare use “I” in Hamlet compared to his other plays?’. On its own, this is not a very interesting question. But the analysis provokes the much more interesting question, ‘Why does Shakespeare use “I” far less frequently in Hamlet than normal?’.

    Given literary-critical claims that Hamlet marks the birth of the modern consciousness, it is surprising to find a drop in the frequency of first-person forms. But for an explanation of why this might happen, you’ll have to attend Dr Karim-Cooper’s lecture, ask her on Twitter (@DrFarahKC), or go back to the play yourself.

  • The comic ‘I’ and the tragic ‘we’?

    In our Shakespeare Quarterly paper, we used Docuscope to come up with a description of Shakespeare’s comic language which centres on the rapid exchange of singular pronouns: I/you and my/your. We claimed there that Shakespearean comedies typically involve people arguing about things, striving to arrive at a ‘we’ of agreement, but not being able to until the final scene. Here’s what we said in more detail (we’re discussing Twelfth Night):

    The quick trading of I/you and my/your strings in Comic dialogue suggests a world in which predicates are attached to subjects from two, and only two, points of view. This is not a universe of one; nor is it a crowd. It is not surprising that Comic plotting, built as it is on sexual pairings, would favor this type of bivalent, perspectival tagging of action by speakers. But there is something else going on here. Olivia is trying to make something happen in this exchange. She says, “do not extort thy reasons from this clause,” and earlier, “I would you were as I would have you be!” (3.1/1392, 1381). The “thy” and “you” are important because the speaker is trying to create or assert a particular interpretation of how these two individuals relate to one another (and the words exchanged between them). The essential drama in this situation is the asymmetry of desire that obtains between the two characters, an asymmetry that keeps Viola from assenting to Olivia’s advances. That resistance is actually what forces Olivia to make these statements that are rich with I/you and me/my, since she uses these words as anchors for a broader interpretation that does not yet obtain. She really wants to say we. And Cesario doesn’t, so they remain in I/you dialogue…

    Shakespeare writes Comedies in which characters, sometimes quite perversely, find the wrong way to the ones they love. Often it is chance or an onstage helper who sorts this out. Shakespeare is actually quite reserved when it comes to showing love as naturally progressing through its obstacles unassisted. But given that in the initial stages of courtship Shakespearean lovers almost never meet and join in a perfectly symmetrical way—they don’t start out as stones set in an arch, leaning perfectly on a keystone—we should expect this asymmetry to show itself in the language. Where does it show up? It appears when a resistant individual, a “you,” prevents another “I” from arriving at an interpretation of a relationship that might be referred to as a “we” before others. Let’s call this the “resistant-you” hypothesis. Linguistically, the effect manifests itself in the assertion of the self (“FirstPerson”) and the rejection of suggested mental and emotional realities (“DenyDisclaim”).

    We’ve been finding that high frequencies of first-person pronouns, and other features associated with rapid dialogue, are characteristic of most types of Early Modern comedy. But what of the implied correlative to this? If comedies are the genre of ‘I’, are tragedies the genre of ‘we’?

    A quick way to test this is to use Martin Mueller et al.’s excellent Wordhoard tool to run a log likelihood vocabulary test on Shakespeare’s comedies and tragedies. This type of test takes an analysis corpus (in this case Shakespeare’s comedies), and compares it to a reference corpus (Shakespeare’s tragedies). The output flags those words that are either more or less frequent in the analysis corpus than we would expect, given the frequencies found in the reference corpus.
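
    For readers who want to see the shape of such a test, here is a minimal Python sketch of an analysis-versus-reference comparison. It is not Wordhoard’s implementation: it assumes the common two-corpus G² formulation, it counts raw tokens rather than lemmas, and the two word lists are toy data invented for the example, not text from the plays. The function names `g2` and `keyness` are likewise just illustrative.

```python
import math
from collections import Counter

def g2(a, b, c, d):
    """Two-corpus G2 score for one word (assumed formulation).

    a, b: counts in the analysis and reference corpora
    c, d: total sizes of the analysis and reference corpora
    """
    e1 = c * (a + b) / (c + d)  # expected count in the analysis corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    part_a = a * math.log(a / e1) if a else 0.0
    part_b = b * math.log(b / e2) if b else 0.0
    return 2 * (part_a + part_b)

def keyness(analysis_tokens, reference_tokens):
    """Return (word, relative use, G2) rows, strongest effects first."""
    a_counts = Counter(analysis_tokens)
    r_counts = Counter(reference_tokens)
    c, d = len(analysis_tokens), len(reference_tokens)
    rows = []
    for word in set(a_counts) | set(r_counts):
        a, b = a_counts[word], r_counts[word]
        if a / c > b / d:
            sign = "+"   # more frequent in the analysis corpus than expected
        elif a / c < b / d:
            sign = "-"   # less frequent in the analysis corpus than expected
        else:
            sign = "="
        rows.append((word, sign, g2(a, b, c, d)))
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows

# TOY DATA (invented): stand-ins for an analysis and a reference corpus.
analysis = ["i", "you", "i", "you", "we"] * 10
reference = ["we", "they", "we", "our", "i"] * 10

table = keyness(analysis, reference)
for word, sign, score in table:
    print(f"{word:5} {sign} {score:6.2f}")
```

    Sorting by score puts the strongest effects at the top regardless of direction, so the sign column has to be read alongside it to tell raised words from lowered ones.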

    The results in this case are as follows:

     

    What we are interested in here is the list of lemmas in column 1: ‘she’, ‘I’, ‘master’, ‘a’, ‘sir’ etc; and the symbol in column 3 ‘Relative use’ – which tells us if the frequency is greater (+) or less (-) than expected. (Column 4 gives the log likelihood value, and a number of asterisks indicating degree of statistical significance, but all the results we are looking at here are highly significant, so we can ignore this.)

    Behold: pronouns used more in the comedies than the tragedies are the singular ‘she’, ‘I’, ‘you’ (let’s assume these are mainly singular uses) – these are all marked + in column 3. Now look at the results for the plural pronouns ‘our’, ‘we’, ‘they’: all marked -, and so lowered in the comedies/raised in the tragedies.

    This is a very strong finding (especially considering how frequent pronouns are), and it invites further exploration of the dialogic nature of comedy in comparison with the communal nature of tragedy.
    jh/29.7.2011