Lost Books, “Missing Matter,” and the Google 1-Gram Corpus

Mike Gleicher's Visualization of 1-gram Ranks in Google English Corpus

My colleague Mike Gleicher (UW-Madison Computer Science) has been working on a rough-and-ready visualization of popular words (1-grams) in the Google English Books corpus, which contains several million items. He produced this visualization after working with the dataset for a day. I find the visualization appealing because of what it shows us about English “closed-class” or “function” words.

When you explore the interactive version on Gleicher’s website by clicking on the image above, you can highlight the path of certain words as they increase or decrease in rank by decade. (Rank here means the popularity of a given word among all others in the scanned and published works for that decade.) Note that the stable items at the top of the visualization are function words; that is, they are words that carry syntactical or grammatical information rather than lexical content. (Function words are among the hardest to define in a dictionary; they also tend not to have synonyms.) We expect function words to be used frequently and for that use to be invariant over time: such words mediate functions that the language must repeatedly deploy. Notice that the function words at the top (“the,” “of,” “and,” “to”) tend to be stable throughout in terms of rank; they are also the function words that do not have multiple spellings. So, for example, “the” does not contain a long “ſ” in the same way that “was” does. We know that “s” and the long “ſ” were used interchangeably prior to the standardization of English orthography, so it makes sense that “the” remains quite stable while “was” does not.
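To make “rank” concrete, here is a minimal sketch of how a rank could be derived from raw counts for one decade. The counts are invented for illustration and are not taken from the Google corpus, and the function name rank_words is my own; this is not a description of how Gleicher's viewer computes its ranks.

```python
def rank_words(counts):
    """Map each word to its rank, where rank 1 is the most frequent word.

    Ties are broken alphabetically so the result is deterministic.
    """
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: i for i, (word, _) in enumerate(ordered, start=1)}


# Invented counts for a single decade (assumption: illustrative only).
decade_counts = {"the": 900, "of": 600, "and": 550, "haue": 120, "have": 80}

ranks = rank_words(decade_counts)
for word, rank in sorted(ranks.items(), key=lambda kv: kv[1]):
    print(rank, word)
```

Repeating this for every decade and connecting each word's ranks gives one line of the kind shown in the visualization.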

What should we say about the spaghetti-like tangle at the left-hand side of the graph? I think it's fair to say that this tangle shows two things at once: first, that function words are plentiful over time and, second, that such high-frequency words had multiple spellings. The viewer that Gleicher has created allows you to see how a single function word varies through multiple spellings. So, for example, the period of high rank-fluctuation in the function word “have” coincides with the high-fluctuation period of “haue,” which is its alternate spelling.

Rank of Function Word "Have" in Google English Corpus

Rank of the Function Word "Haue" in the Google English Corpus

Visualizations ought to provoke new ideas, not simply prove that certain relationships exist. So, what interesting thoughts can you have while looking at such visualizations? I was struck by the contrast between the straight lines at the top and the tangle further below. What does such a contrast tell us beyond what we already know about spelling variation in the early periods? Perhaps it allows us to imagine the conditions under which the highly ranked function words might progress in linear fashion across the x-axis, as in the case of “the.” What if we aggregated the counts for the function words by including occurrences of known alternate spellings and then recalculated? If we combined the counts of “have” and “haue,” for example, we might expect the rank of the lemma “have” (the aggregate of all alternative spellings) to go up, and for its path to be less wobbly. The end result of this process should be that lines across the top become increasingly straight, and that the lower left-hand side of the visualization becomes less tangled.
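The aggregation step can be sketched roughly as follows. Everything here is an assumption made for illustration: the variant table VARIANTS, the helper names, and the toy counts are hypothetical, and a real list of early modern spelling variants would need to come from a curated source.

```python
# Hypothetical variant-to-lemma table (assumption: a real one would be far larger).
VARIANTS = {"haue": "have"}


def merge_variants(counts, variants):
    """Add the count of each variant spelling to the count of its lemma."""
    merged = {}
    for word, n in counts.items():
        lemma = variants.get(word, word)
        merged[lemma] = merged.get(lemma, 0) + n
    return merged


def rank_of(word, counts):
    """Rank of a word within counts, where rank 1 is the most frequent."""
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return ordered.index(word) + 1


# Invented counts for one early decade (not real corpus figures).
toy = {"the": 900, "of": 600, "and": 550, "by": 150, "haue": 120, "have": 80}
merged = merge_variants(toy, VARIANTS)

print(rank_of("have", toy), rank_of("have", merged))
```

In this toy decade the combined lemma overtakes “by,” which is the kind of upward move the aggregation is expected to produce. Whether the real lines straighten out is an empirical question that the sketch cannot answer.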

Surely some remaining tangles would exist, however. Leftover tangles might be an effect of the limited size of this earlier portion of the corpus: we know, for example, that there are far fewer words in these earlier decades than in the later ones; we also know that the Optical Character Recognition process fails to capture all of the surviving words accurately when page images are converted to text. If we assume that certain function words are so essential that their relative rank in a given time period ought to be invariant, then the residual wobbles might provide us with a measure of how much linguistic variation is missing from the Google corpus in a given decade. It would suggest, like a companion sun wobbling around a black hole, the existence of lost books and lost letters. This is not the “dark matter” of the Google corpus referred to in a recent article in Science: the proper nouns that never make it into the dictionary. Rather, this is “missing matter,” things which existed but did not survive to be counted because books were destroyed or characters were not recognized.
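One possible way to put a number on the residual wobble for each decade is sketched below. The choice of statistic is my assumption rather than an established measure: each function word's median rank across all decades serves as its baseline, and the wobble for a decade is the mean absolute deviation from those baselines. The ranks are invented, and decade_wobble is a hypothetical name.

```python
from statistics import median


def decade_wobble(ranks):
    """Mean absolute deviation of function-word ranks from their baselines.

    ranks maps word -> {decade: rank}. A word's baseline is its median rank
    over all decades in which it appears.
    """
    baseline = {w: median(by_dec.values()) for w, by_dec in ranks.items()}
    decades = sorted({d for by_dec in ranks.values() for d in by_dec})
    wobble = {}
    for d in decades:
        devs = [abs(by_dec[d] - baseline[w])
                for w, by_dec in ranks.items() if d in by_dec]
        wobble[d] = sum(devs) / len(devs)
    return wobble


# Invented ranks after variant spellings have been merged (illustrative only).
toy_ranks = {
    "the": {1600: 1, 1700: 1, 1800: 1},
    "have": {1600: 30, 1700: 22, 1800: 20},
}

wobble = decade_wobble(toy_ranks)
print(wobble)
```

A decade with a larger value is one in which the function words sit further from their usual ranks. Reading that as evidence of missing books depends entirely on the invariance assumption stated above.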

Google is trying to quantify just how large the corpus of printed English books is so that it can say what percentage of books it has scanned. “Function word wobble” might be a proxy for such a measure. We already use function word counts to characterize differences in Shakespeare’s literary genres and Victorian novels. Perhaps they are useful for something other than genre discrimination within groups of texts – useful, that is, when profiled across the entire population of word occurrences in a decade rather than a generically diversified sub-population of books.

Prospero destroys his magnificent book of magic before leaving the island in The Tempest, saying “deeper than did ever plummet sound / I’ll drown my book.” When he does so, certain words disappear to the bottom of the sea. Much later, a plummet may sound the loss.
