Blog

  • New Image from Original Post from Google Books

    We had a request for a clearer version of the image we discussed in our post last year, which shows changes in the catalogued subject of Library of Congress books over the course of several hundred years. Jon Orwant from Google was kind enough to send an updated image, which we’re sharing here. I include below his description of the visualization.

    “The visualization…was derived exclusively from the metadata feed of the Library of Congress. The LoC catalog doesn’t restrict itself to American or even English language books, but likely does have some sample bias (as every union catalog does).”

  • What is influence?

    Offline, we’ve been discussing The New York Times’ article on Matt Jockers’ work, and the notion that iterative/digital analysis might be able to track literary influence.

    My first reaction to these articles was that it would be hard to track something as high-level as influence with word counts.

    But then I remembered that I would have said the same thing about genre several years ago.

    I suspect the problem is not really the ability to track influence – let’s assume, for the sake of argument, that it leaves textual evidence just like genre does.

    The problem is that I don’t think we have a stable definition of what influence is – so it is hard to make an educated guess about which countable features might track influence. And that is a problem for literary scholars to solve, not computer scientists. (Though I might be proved wrong on this when Matt Jockers’ book comes out.)

    With genre, we have lots of theories of what it is, and, more importantly, many lists of plays put into generic categories for us to test. So we had real things to test when we started to look at the linguistic fingerprints of generic groups, even though we did not then know what features might encode genre at the level of the sentence.

    The problems with defining ‘influence’ became clear to me when I read this piece:

    http://www.newyorker.com/online/blogs/books/2013/01/without-austen-no-eliot.html

    on the ‘influence’ of Jane Austen on George Eliot, which even cites empirical evidence in terms of demonstrating Eliot’s interest in, and re-reading of Austen. I think it goes very well with the NYT piece on Jockers, and gives a sense of the slippery nature of ‘influence’.

    The piece argues for Austen’s influence on Eliot (no Austen, no Eliot), but this influence isn’t really described – an unkind parody of it might say that the argument boils down to ‘Austen influenced Eliot to write long prose narratives which are called novels’.

    Indeed, the article spends more time talking about the differences between the two, rather than the similarities.

    So what exactly is ‘influence’ in literary terms? Jockers’ work identifies similarity with influence (or at least the newspaper reports do) – and this is a sensible first step. But is literary influence the use of similar frequencies of function words? Or does similarity at a higher level (plot, character-choreography, the use of certain types of point of view) produce similarities in linguistic frequencies?

    Des Higham, professor of statistics at Strathclyde, is doing some interesting work producing algorithms to track the ‘influence’ of certain Twitter users over others, following responses to TV programmes. But with Twitter, you can measure ‘influence’ in terms of retweets. In what sense was George Eliot re-tweeting Jane Austen?

    jh

    UPDATE: There is a *very* interesting discussion of influence and Matt Jockers’ work here (Bill Benzon’s blog New Savanna).

  • What happens in Hamlet?

    We perform digital analysis on literary texts not to answer questions, but to generate questions. The questions digital analysis can answer are generally not ‘interesting’ in a humanist sense, but the questions digital analysis provokes often are. And these questions have to be answered by ‘traditional’ literary methods. Here’s an example.

    Dr Farah Karim-Cooper, head of research at Shakespeare’s Globe, just asked on Twitter if I had any suggestions for a lecture on Hamlet she was due to give. Ten minutes later I had some ‘interesting’ questions for her.

    I began with Wordhoard’s log-likelihood function, comparing Hamlet to the rest of Shakespeare’s plays. You can view the results of this as a tag cloud:

    [Image: tag cloud for Hamlet vs the rest of Shakespeare. Black words are raised in frequency, grey words are lowered, and size indicates the strength of the effect. A tag cloud looks good and is immediate, but doesn’t tell you much.]

    The tag cloud is nice, but for real text analytics you need to read the spreadsheet of figures. Word-frequency analysis is limited in many ways, but it can surprise you if you look in the right places and at the right things.

    [Image: the Wordhoard log-likelihood results spreadsheet. Not nice to look at, but much more information.]

    When I run log-likelihood, I always look first for the items that are lower than expected, rather than those that are raised (which tend to be content words associated with the topic of the text, and thus fairly obvious). I also tend to look at function words (pronouns, articles, auxiliary verbs) rather than nouns or adjectives.

    If you look for absences of high-frequency items, you are using digital text analysis to do the things it does best compared to human reading: picking up absence, and analysing high-frequency items. Humans are good at spotting the presence of low frequency items, items that disrupt a pattern (outliers, in statistical terms) – but we are not good at noticing things that are not there (dogs that don’t bark in the night) and we are not good at seeing woods (we see trees, especially unusual trees).

    The Hamlet results were pretty outstanding in this respect: very high up the list, with three stars indicating very strong statistical significance, is a minus result for the pronoun ‘I’. A check across the figures shows that ‘I’ occurs in Hamlet about 184 times every 10,000 words (see the column headed ‘Analysis parts per 10,000’ – Hamlet is the ‘analysis text’ here), whereas in the rest of Shakespeare it occurs about 228 times every 10,000 words (see the column headed ‘Reference parts per 10,000’ – the reference corpus is the rest of Shakespeare) – so every 10,000 words in Hamlet have about 40 fewer ‘I’ pronouns than we’d expect.

    Or, to put it another way, Shakespeare normally uses ‘I’ 228 times every 10,000 words. Hamlet is about 30,000 words long, so we’d expect, all other things being equal, that Shakespeare would use ‘I’ 684 times. In fact, he uses it just 546 times – and Wordhoard checks the figures to see if we could expect this drop due to chance or normal variation. The three stars next to the log likelihood score for ‘I’ tell us that this figure is very unlikely to be due to chance – something is causing the drop.
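
    For the curious, here is a minimal sketch of how a figure of this kind is computed, using Dunning’s log-likelihood (G2), the sort of statistic Wordhoard reports. The Hamlet counts are the ones quoted above; the reference-corpus figures are illustrative placeholders, not Wordhoard’s actual numbers.

    ```python
    import math

    def dunning_g2(count_a, total_a, count_r, total_r):
        """Dunning's log-likelihood (G2) for a word that occurs count_a times
        in an analysis corpus of total_a words and count_r times in a
        reference corpus of total_r words."""
        expected_a = total_a * (count_a + count_r) / (total_a + total_r)
        expected_r = total_r * (count_a + count_r) / (total_a + total_r)
        g2 = 0.0
        if count_a:
            g2 += count_a * math.log(count_a / expected_a)
        if count_r:
            g2 += count_r * math.log(count_r / expected_r)
        return 2 * g2

    # 'I' in Hamlet: 546 occurrences in roughly 30,000 words (figures from the post).
    # The reference corpus is a placeholder: ~228 per 10,000 words in an assumed
    # 800,000-word 'rest of Shakespeare'.
    print(dunning_g2(546, 30_000, 18_240, 800_000))
    ```

    For one degree of freedom, a G2 above roughly 10.8 corresponds to p < 0.001, the level usually flagged with three stars.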

    Digital analysis can’t explain the cause of the drop: the only question it is answering here is, ‘How frequently does Shakespeare use “I” in Hamlet compared to his other plays?’. On its own, this is not a very interesting question. But the analysis provokes the much more interesting question, ‘Why does Shakespeare use “I” far less frequently in Hamlet than normal?’.

    Given literary-critical claims that Hamlet marks the birth of the modern consciousness, it is surprising to find a drop in the frequency of first-person forms. But for an explanation of why this might happen, you’ll have to attend Dr Karim-Cooper’s lecture, ask on Twitter: @DrFarahKC – or go back to the play yourself.

  • What Do People Read During a Revolution?

    These two visualizations spark two interesting questions: What do people read during a revolution?  What is the connection between what people read and political events? Both images spike dramatically around moments of upheaval in the Western World: The English, American, and French Revolutions, the mid-19th-century Europe-wide overthrow of governments, and World War I, to name just a few.  These images are all the more striking because they did not arise from a historical study of warfare or publishing, but from a more workaday task—that of categorizing all books from 1600 to 2010 according to Library of Congress subject headings. (The source of the data was Google’s catalog of books as of 2010.)  The visualizations were shown in passing during a 2010 meeting between researchers at Google, where the data had been produced, and a group of humanities scholars and advocates, who were meeting with the Google team to exchange ideas. When Google’s Jon Orwant flashed this image on the screen, the professors in the assembly gasped. Genuinely gasped. We could see in this visualization of data things that had been debated for centuries, but that had never been seen: a connection between the world of print and the world of political action, a link between revolution and reading a certain kind of book.

    We are experienced readers of books, book history, and—we like to think—of book diagrams. Humans invented stream charts well before the age of computing; this style of conveying information is at least 250 years old and draws on sources that are even older.  (See Rosenberg and Grafton, Cartographies of Time, 2010.)  However, the union of technologies—modern cataloging systems, the increasingly systematized concatenation of library catalogs worldwide, and the capacity to render data chronologically in the style of a geological diagram—produces a compact vision of Western print culture hitherto unseen. Simple in execution, the visualization prompts new thinking.

    Like any metaphorical or mathematical rendering, the diagram below should be read with care: the strata are normalized, so the spikes do not necessarily indicate a greater number of books published, but rather a shift in the proportion of published books belonging to a given subject. A spike in one layer of the diagram can give the illusion that all strata of the diagram have increased in size, a trick of the eye that the mind needs to combat. The second visualization helps with this by zooming in and thereby singling out the area of greatest mathematical change, but it, too, needs to be viewed critically.
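
    To make the normalization caveat concrete, here is a minimal sketch, with invented numbers rather than the Google data, of how raw subject counts become the per-year proportions a stream chart plots.

    ```python
    # Hypothetical counts of catalogued books per subject in two successive years.
    # These numbers are invented for illustration; they are not the Google data.
    counts = {
        1640: {"Old World History": 300, "Theology": 500, "Poetry": 200},
        1641: {"Old World History": 600, "Theology": 500, "Poetry": 200},
    }

    for year, by_subject in counts.items():
        total = sum(by_subject.values())
        shares = {subject: n / total for subject, n in by_subject.items()}
        print(year, {subject: round(share, 2) for subject, share in shares.items()})

    # History's share rises from 0.30 to 0.46 even though the theology and poetry
    # counts are unchanged: their strata appear to shrink only because the chart
    # plots proportions rather than absolute numbers of books.
    ```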

    Now that the caveats have been put to one side, we return to the original questions and then offer a reformulation.

    What are people reading during a revolution? Poetry? Books on military technology? Theology? No. If we take the first spike, the years leading up to the outbreak of the English Revolution in 1642, the answer seems to be “Old World History.” The second chronological peak—in the decades around the American (1776) and French (1789) Revolutions—shows the same pattern. In periods that historians would link to major political upheaval, the world of print shows similar disruptions: publishers are offering more history for readers who, perhaps, think of themselves as living through important historical changes.

    We should be precise: these data don’t indicate that more people are reading history, but that a higher proportion of books published by presses can be classed by cataloguers as history. There are many follow-up questions one might ask here. Does publication tie strongly to actual reading, or are these only loosely connected? Are publishers reducing the number of books in other subject areas because of scarcity of resources or some other factor, which would again lead to the proportional spikes seen above? Are the cataloguing definitions of what counts as Old World History or history in general themselves modeled on the books published during the spike years?

    One has to ask questions about the size and representativity of the dataset, the uniformity of the classifications, and the nature of the spatial plot in order to understand what is going on. And, crucially in this case, one has to have the initial insight—born of a reading knowledge of history itself—that the timing of the spikes is important. But if you’ve got that kind of knowledge in the room, you might see something you haven’t seen before.

  • The Time Problem: Rigid Classifiers, Classifier Postmarks

    Here is a thought experiment. Make the following assumptions about a historically diverse collection of texts:

    1) I have classified them according to genre myself, and trust these classifications.

    2) I have classified the items according to time of composition, and I trust these classifications.

    So, my items are both historically and generically diverse, and I want to understand this diversity in a new way.

    The metadata I have now allows me to partition the set. The partition, by decade, number of items, and genre class (A, B, C), looks like this:

    Decade 1, 100 items: A, 25; B, 50; C, 25

    Decade 2, 100 items: A, 30; B, 40; C, 30

    Decade 3, 100 items: A, 30; B, 30; C, 40

    Decade 4, 100 items: A, 40; B, 40; C, 20

    Each decade is labeled (D1, D2, D3, D4) and each contains 100 items. These items are classed by genre (A, B, C) and the proportions of items belonging to each genre change from one decade to the next. What could we do with this collection partitioned in this way, particularly with respect to changes in time?

    I am interested in genre A, so I focus on that: how does A’ness change over time? Or how does what “counts as A” change over time? I derive a classifier (K) for A in the first decade and use it as a distance metric to arrange all items in this decade with respect to A’ness. So my new description allows me to supply the following information about every item: Item 1 participates in A to this degree, and A’ness means “not being B or C in D1.” Let’s call this classifier D1Ka. I can now derive the set of all classifiers with respect to these metadata: D1Ka, D1Kb, D1Kc, D2Ka, D2Kb, etc. And let’s say I derive a classifier for A using the whole dataset. So we add DKa, DKb, DKc. What are these things I have produced and how can they be used to answer interesting questions?
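
    As a concrete, entirely hypothetical illustration of what this family of classifiers might look like in code, here is a minimal sketch using scikit-learn; the toy texts, the bag-of-words features, and the logistic-regression model are assumptions made for illustration, not a description of the actual experiment.

    ```python
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-in for the partitioned collection: decade label -> (text, genre) pairs.
    # Real items would be whole texts; these snippets are placeholders.
    corpus = {
        "D1": [("a sad tale of kings and graves", "A"),
               ("merry wooing in the greenwood", "B")],
        "D4": [("merry wooing that ends in marriage", "A"),
               ("sad graves and broken kings", "B")],
    }

    def train_Ka(items):
        """Derive a 'does this item belong to genre A?' classifier from one partition."""
        texts = [text for text, _ in items]
        labels = [genre == "A" for _, genre in items]
        model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(texts, labels)
        return model

    # The family of classifiers described above: one per decade, plus one for the whole set.
    D1Ka = train_Ka(corpus["D1"])
    D4Ka = train_Ka(corpus["D4"])
    DKa = train_Ka(corpus["D1"] + corpus["D4"])
    ```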

    I live in D1, and am confident I know what belongs to A having seen lots of examples. But I get access to a time travel machine and someone sends me a text written much later in time. It is a visitor from D4, and by my own lights, it looks like another example of A. So, I have projected D1Ka onto an item from D4 and made a judgment. Now we lift the curtain and find that for a person living in D4, the item is not an A but a B. Is my classifier wrong? Is this type of projection illegitimate? I don’t think so. We have learned that classifiers themselves have postmarks, and these postmarks are specific to the population in which they are derived. D1Ka is an *artifact* of the initial partitioning of my data: if there were different proportions of A, B, and C within D1, or different items in each of these categories, the classifier would change.
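
    Continuing the hypothetical sketch above, the time-travel experiment amounts to asking the D1-postmarked classifier about an item from D4 and comparing its verdict with the D4-postmarked one; even with toy data the two can disagree, and retraining D1Ka on a D1 partition with different genre proportions can change its answer.

    ```python
    # A visitor from D4, judged by a classifier postmarked in D1 and by its own decade's classifier.
    visitor = "merry wooing that ends in marriage"
    print("D1Ka calls it A:", bool(D1Ka.predict([visitor])[0]))
    print("D4Ka calls it A:", bool(D4Ka.predict([visitor])[0]))
    # Neither verdict is simply 'wrong': each classifier is an artifact of the
    # partition (the decade and its genre proportions) from which it was derived.
    ```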

    Experiment two. I live in D4 and I go to a used bookstore, where I find a beautifully preserved copy of an item produced in D1. The title page of this book says, “The Merchant of Venice, a Comedy.” Nonsense, I say. There’s nothing funny about this repellent little play. So D1Ka fails to classify an A for someone in D4. Why? Because the classifier D4Ka is rigidly determined by the variety of the later population, and this variety is different from that found in D1. When classifiers are themselves rigidly aligned with their population of origin, they generalize in funny ways.

    Wait, you say. I have another classifier, namely DKa, produced over the entire population, which represents all of the time variation in the dataset of 400 items. Perhaps this is useful for describing how A’ness changes over time? Could I compare D1Ka, D2Ka, D3Ka and D4Ka to one another using DKa as my reference? Perhaps, but you have raised a new question: who, if anyone, ever occupies this long interval of time? What kind of abstraction or artifact is DKa, considering that most people really think 10 years ahead or behind when they classify a book? If we are dealing with 27 decades (as we do in the case of our latest big experiment), we have effectively created a classifier for a time interval that no one could ever occupy. Perhaps there is a very well-read person who has read something from each decade and so has an approximation of this longer perspective: that is the advantage of the durability of print, the capacity of memory, and perhaps the viability of reprinting, which in effect imports some of the variation from an earlier decade into a newer one. When we are working with DKa, everything is effectively written at the same time. Can we use this strange assumption — everything is written at once — to explore the real situation, which is that everything is written at a different time?

    Another interesting feature of the analysis. This same type of “all written at the same time” reasoning is occurring in our single-decade blocks, since when we create the metadata that allows us to treat a subpopulation of texts as belonging to *a* decade, we once again say they were written simultaneously. We use obvious untruths to get at underlying truths, like an astronomer using the inertial assumption to calculate forces, even though we’ve never seen a body travel in a straight line forever.

    If classifiers are artifacts of an arbitrarily scalable partitioning of the population, and if these partitions can be compared, what is the ideal form of “classifier time travel” to use when thinking about how actual writing is influenced by other writing, and how a writer’s memory of texts produced in the past can be projected forward into new spaces? Is there anything to be learned about genre A by comparing the classifiers that can be produced to describe it over time? If so, whose perspective are we approximating, and what does that implied perspective say about our underlying model of authorship and literary history?

    If classifiers have postmarks, when are they useful in generalizing over — or beyond — a lifetime’s worth of reading?