New Image from Original Post from Google Books

We had a request for a clearer version of the image we discussed in our post last year, which shows changes in the catalogued subject of Library of Congress books over the course of several hundred years. Jon Orwant from Google was kind enough to send an updated image, which we’re sharing here. I include below his description of the visualization.

“The visualization…was derived exclusively from the metadata feed of the Library of Congress. The LoC catalog doesn’t restrict itself to American or even English language books, but likely does have some sample bias (as every union catalog does).”

Posted in Counting Other Things | 1 Comment

What is influence?

Offline, we’ve been discussing The New York Times’ article on Matt Jockers’ work, and the notion that iterative/digital analysis might be able to track literary influence.

My first reaction to these articles was that it would be hard to track something as high-level as influence with word counts.

But then I remembered that I would have said the same thing about genre several years ago.

I suspect the problem is not really the ability to track influence – let’s assume, for the sake of argument, that it leaves textual evidence just like genre does.

The problem is that I don’t think we have a stable definition of what influence is – so it is hard to make an educated guess about which countable features might track influence. And that is a problem for literary scholars to solve, not computer scientists. (Though I might be proved wrong on this when Matt Jockers’ book comes out.)

With genre, we have lots of theories of what it is, and, more importantly, many lists of plays put into generic categories for us to test. So we had real things to test when we started to look at the linguistic fingerprints of generic groups, even though we did not then know what features might encode genre at the level of the sentence.

The problems with defining ‘influence’ became clear to me when I read this piece:

http://www.newyorker.com/online/blogs/books/2013/01/without-austen-no-eliot.html

on the ‘influence’ of Jane Austen on George Eliot, which even cites empirical evidence of Eliot’s interest in, and re-reading of, Austen. I think it goes very well with the NYT piece on Jockers, and gives a sense of the slippery nature of ‘influence’.

The piece argues for Austen’s influence on Eliot (no Austen, no Eliot), but this influence isn’t really described – an unkind parody of it might say that the argument boils down to ‘Austen influenced Eliot to write long prose narratives which are called novels’.

Indeed, the article spends more time on the differences between the two than on the similarities.

So what exactly is ‘influence’ in literary terms? Jockers’ work identifies similarity with influence (or at least the newspaper reports do) – and this is a sensible first step. But is literary influence the use of similar frequencies of function words? Or does similarity at a higher level (plot, character-choreography, the use of certain types of point of view) produce similarities in linguistic frequencies?

Des Higham, professor of statistics at Strathclyde, is doing some interesting work producing algorithms to track the ‘influence’ of certain Twitter users over others, following responses to TV programmes. But with Twitter, you can measure ‘influence’ in terms of retweets. In what sense was George Eliot re-tweeting Jane Austen?

jh

Posted in Counting Other Things | 7 Comments

What happens in Hamlet?

We perform digital analysis on literary texts not to answer questions, but to generate questions. The questions digital analysis can answer are generally not ‘interesting’ in a humanist sense: but the questions digital analysis provokes often are. And these questions have to be answered by ‘traditional’ literary methods. Here’s an example.

Dr Farah Karim-Cooper, head of research at Shakespeare’s Globe, just asked on Twitter if I had any suggestions for a lecture on Hamlet she was due to give. Ten minutes later I had some ‘interesting’ questions for her.

I began with Wordhoard’s log-likelihood function, comparing Hamlet to the rest of Shakespeare’s plays. You can view the results of this as a tag cloud:

[Image: a tag cloud: looks good, immediate, doesn't tell you much]

Tag cloud for Hamlet vs the rest of Shakespeare: black words are raised in frequency; grey words lowered; size indicates strength of effect

which is nice, but for real text analytics you need to read the spreadsheet of figures. Word-frequency analysis is limited in many ways, but it can surprise you if you look in the right places and at the right things.

[Image: the spreadsheet of figures: not nice to look at, but much more information]

When I run log-likelihood, I always look first for the items that are lower than expected, rather than those that are raised (which tend to be content words associated with the topic of the text, and thus fairly obvious). I also tend to look at function words (pronouns, articles, auxiliary verbs) rather than nouns or adjectives.

If you look for absences of high-frequency items, you are using digital text analysis to do the things it does best compared to human reading: picking up absence, and analysing high-frequency items. Humans are good at spotting the presence of low frequency items, items that disrupt a pattern (outliers, in statistical terms) – but we are not good at noticing things that are not there (dogs that don’t bark in the night) and we are not good at seeing woods (we see trees, especially unusual trees).

The Hamlet results were pretty outstanding in this respect: very high up the list, with 3 stars, indicating very strong statistical significance, is a minus result for the pronoun ‘I’. A check across the figures shows that ‘I’ occurs in Hamlet about 184 times every 10,000 words (see the column headed ‘Analysis parts per 10,000’ – Hamlet is the ‘analysis text’ here), whereas in the rest of Shakespeare it occurs about 228 times every 10,000 words (see the column headed ‘Reference parts per 10,000’ – the reference corpus is the rest of Shakespeare). So every 10,000 words in Hamlet have about 44 fewer ‘I’ pronouns than we’d expect.

Or, to put it another way, Shakespeare normally uses ‘I’ 228 times every 10,000 words. Hamlet is about 30,000 words long, so we’d expect, all other things being equal, that Shakespeare would use ‘I’ 684 times. In fact, he uses it just 546 times – and Wordhoard checks the figures to see if we could expect this drop due to chance or normal variation. The three stars next to the log likelihood score for ‘I’ tell us that this figure is very unlikely to be due to chance – something is causing the drop.
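The arithmetic above, and the kind of significance check a log-likelihood score performs, can be sketched in a few lines of Python. This is only an illustration, not Wordhoard's own code: it uses a common two-term form of the log-likelihood ratio (G2), and the size of the reference corpus (`REFERENCE_WORDS`) is a made-up round number, since the post does not give it. Because 30,000 words is itself a rounded figure, the rate computed here comes out near 182 per 10,000 rather than the 184 shown in the spreadsheet.

```python
import math


def per_10k(count, total_words):
    """Occurrences per 10,000 words."""
    return count / total_words * 10_000


def expected_count(rate_per_10k, total_words):
    """Count we would expect in a text of this length at the given rate."""
    return rate_per_10k * total_words / 10_000


def log_likelihood(a, b, c, d):
    """G2 for a word seen a times in c analysis words and b times in d reference words."""
    e1 = c * (a + b) / (c + d)  # expected count in the analysis text
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    total = 0.0
    if a > 0:
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2 * total


# Figures taken from the post
HAMLET_WORDS = 30_000   # "about 30,000 words long"
HAMLET_I = 546          # observed uses of 'I' in Hamlet
REFERENCE_RATE = 228    # 'I' per 10,000 words in the rest of Shakespeare

# ASSUMPTION: hypothetical size of the reference corpus (not given in the post)
REFERENCE_WORDS = 850_000
REFERENCE_I = round(expected_count(REFERENCE_RATE, REFERENCE_WORDS))

expected_i = expected_count(REFERENCE_RATE, HAMLET_WORDS)
shortfall = expected_i - HAMLET_I
g2 = log_likelihood(HAMLET_I, REFERENCE_I, HAMLET_WORDS, REFERENCE_WORDS)

# Chi-square critical value for p < 0.001 with one degree of freedom
CRITICAL_P001 = 10.83

print(f"Hamlet rate per 10,000 words: {per_10k(HAMLET_I, HAMLET_WORDS):.0f}")
print(f"Expected 'I' in Hamlet: {expected_i:.0f}, observed: {HAMLET_I}")
print(f"Shortfall: {shortfall:.0f}")
print(f"G2 exceeds the p < 0.001 threshold: {g2 > CRITICAL_P001}")
```

A different reference-corpus size would change the exact score, so the number itself should not be read as Wordhoard's result; the point is only that a gap of this size in a text of this length is far larger than chance variation would normally produce.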

Digital analysis can’t explain the cause of the drop: the only question it is answering here is, ‘How frequently does Shakespeare use “I” in Hamlet compared to his other plays?’. On its own, this is not a very interesting question. But the analysis provokes the much more interesting question, ‘Why does Shakespeare use “I” far less frequently in Hamlet than normal?’.

Given literary-critical claims that Hamlet marks the birth of the modern consciousness, it is surprising to find a drop in the frequency of first-person forms. But for an explanation of why this might happen, you’ll have to attend Dr Karim-Cooper’s lecture, ask her on Twitter (@DrFarahKC), or go back to the play yourself.

Posted in Early Modern Drama, Shakespeare | 3 Comments

What Do People Read During a Revolution?

These two visualizations spark two interesting questions: What do people read during a revolution?  What is the connection between what people read and political events? Both images spike dramatically around moments of upheaval in the Western World: The English, American, and French Revolutions, the mid-19th-century Europe-wide overthrow of governments, and World War I, to name just a few.  These images are all the more striking because they did not arise from a historical study of warfare or publishing, but from a more workaday task—that of categorizing all books from 1600 to 2010 according to Library of Congress subject headings. (The source of the data was Google’s catalog of books as of 2010.)  The visualizations were shown in passing during a 2010 meeting between researchers at Google, where the data had been produced, and a group of humanities scholars and advocates, who were meeting with the Google team to exchange ideas. When Google’s Jon Orwant flashed this image on the screen, the professors in the assembly gasped. Genuinely gasped. We could see in this visualization of data things that had been debated for centuries, but that had never been seen: a connection between the world of print and the world of political action, a link between revolution and reading a certain kind of book.

We are experienced readers of books, book history, and—we like to think—of book diagrams. Humans invented stream charts well before the age of computing; this style of conveying information is at least 250 years old and draws on sources that are even older.  (See Rosenberg and Grafton, Cartographies of Time, 2010.)  However, the union of technologies—modern cataloging systems, the increasingly systematized concatenation of library catalogs worldwide, and the capacity to render data chronologically in the style of a geological diagram—produces a compact vision of Western print culture hitherto unseen. Simple in execution, the visualization prompts new thinking.

Like any metaphorical or mathematical rendering, the diagram below should be read with care: the strata are normalized so that the spikes do not necessarily indicate a greater number of books published, but rather a shift in the proportion of books composed of a given subject.  A spike in one layer of the diagram can give the illusion that all strata of the diagram have increased in size, a trick of the eye that the mind needs to combat.  The second visualization helps with this by zooming in and thereby singling out the area of greatest mathematical change, but it, too, needs to be viewed critically.
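The effect of normalization can be illustrated with a small Python sketch. The subject names and counts below are invented toy numbers, not values from the Google or Library of Congress data; they only show how a subject's share of the total can double while the number of books in that subject stays exactly the same.

```python
def shares(counts):
    """Convert raw counts per subject into proportions of the year's total."""
    total = sum(counts.values())
    return {subject: n / total for subject, n in counts.items()}


# ASSUMPTION: invented counts for two hypothetical years
quiet_year = {"history": 100, "theology": 500, "poetry": 400}
upheaval_year = {"history": 100, "theology": 250, "poetry": 150}

quiet = shares(quiet_year)
upheaval = shares(upheaval_year)

# The history count is unchanged, but its share of the normalized chart doubles
print(f"History books: {quiet_year['history']} -> {upheaval_year['history']}")
print(f"History share: {quiet['history']:.0%} -> {upheaval['history']:.0%}")
```

In this toy case the history stratum would spike in a normalized chart purely because the other subjects shrank, which is one of the readings a careful viewer has to rule in or out.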

Now that the caveats have been put to one side, we return to the original questions and then offer a reformulation.

What are people reading during a revolution? Poetry? Books on military technology? Theology? No. If we take the first spike, the years leading up to the English Revolution (the civil war began in 1642; the regicide followed in 1649), the answer seems to be “Old World History.” The second chronological peak, in the decades around the American (1776) and French (1789) Revolutions, shows the same pattern. In periods that historians would link to major political upheaval, the world of print shows similar disruptions: publishers are offering more history for readers who, perhaps, think of themselves as living through important historical changes.

We should be precise: these data don’t indicate that more people are reading history, but that a higher proportion of books published by presses can be classed by cataloguers as history. There are many follow-up questions one might ask here. Does publication tie strongly to actual reading, or are these only loosely connected? Are publishers reducing the number of books in other subject areas because of scarcity of resources or some other factor, which would also produce the proportional spikes seen above? Are the cataloguing definitions of what counts as Old World History or history in general themselves modeled on the books published during the spike years?

One has to ask questions about the size and representativeness of the dataset, the uniformity of the classifications, and the nature of the spatial plot in order to understand what is going on. And, crucially in this case, one has to have the initial insight, born of a reading knowledge of history itself, that the timing of the spikes is important. But if you’ve got that kind of knowledge in the room, you might see something you haven’t seen before.

Posted in Counting Other Things | 11 Comments