Category: Counting Other Things

  • Lost Books, “Missing Matter,” and the Google 1-Gram Corpus


    Mike Gleicher's Visualization of 1-gram Ranks in Google English Corpus

    My colleague Mike Gleicher (UW-Madison Computer Science) has been working on a rough-and-ready visualization of popular words (1-grams) in the Google English Books corpus, which contains several million items. He produced this visualization after working with the dataset for a day. I find the visualization appealing because of what it shows us about English “closed-class” or “function” words.

    When you explore the interactive version on Gleicher’s website by clicking on the image above, you can highlight the path of certain words as they increase or decrease in rank by decade. (Rank here means the popularity of a given word among all others in the scanned and published works for that decade.) Note that the stable items at the top of the visualization are function words; that is, words that convey syntactic or grammatical information rather than lexical content. (Function words are the hardest to define in the dictionary; they also tend not to have synonyms.) We expect function words to be used frequently and for that use to be invariant over time: such words mediate functions that the language must repeatedly deploy. Notice that the function words at the top – “the,” “of,” “and,” “to” – tend to be stable in rank throughout; they are also the function words that do not have multiple spellings. So, for example, “the” does not contain a long “s” (ſ) in the way that “was” does. We know that the long and short forms of “s” were interchangeable prior to the standardization of English orthography, so it makes sense that “the” remains quite stable while “was” does not.

    What should we say about the spaghetti-like tangle at the left-hand side of the graph? I think it’s fair to say that this tangle shows two things at once: first, that function words are plentiful over time and, second, that such high-frequency words had multiple spellings. The viewer Gleicher has created allows you to see how a single function word varies through multiple spellings. So, for example, the period of high rank-fluctuation in the function word “have” coincides with the high-fluctuation period of “haue,” its alternate spelling.

    Rank of Function Word "Have" in Google English Corpus
    Rank of the Function Word "Haue" in the Google English Corpus

    Visualizations ought to provoke new ideas, not simply prove that certain relationships exist. So, what interesting thoughts can you have while looking at such visualizations? I was struck by the contrast between the straight lines at the top and the tangle further below. What does such a contrast tell us beyond what we already know about spelling variation in the early periods? Perhaps it allows us to imagine the conditions under which the highly ranked function words might progress in linear fashion across the x-axis, as in the case of “the.” What if we aggregated the counts for the function words by including occurrences of known alternate spellings and then recalculated? If we combined the counts of “have” and “haue,” for example, we might expect the rank of the lemma “have” (the aggregate of all alternate spellings) to go up, and its path to become less wobbly. The end result of this process should be that the lines across the top become increasingly straight, and that the lower left-hand side of the visualization becomes less tangled.
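
    Here, purely as a sketch, is what that pooling-and-re-ranking experiment might look like in Python. The counts and the variant table below are invented for illustration; the real corpus files are, of course, far larger:

```python
from collections import defaultdict

# Invented 1-gram counts: word -> {decade: count}
counts = {
    "the":  {1640: 90000, 1650: 95000},
    "have": {1640: 9000,  1650: 11000},
    "haue": {1640: 7000,  1650: 2000},
}

# Hand-made table mapping alternate spellings onto a lemma
variants = {"haue": "have"}

def pool_variants(counts, variants):
    """Fold the counts of known alternate spellings into their lemma."""
    pooled = defaultdict(lambda: defaultdict(int))
    for word, by_decade in counts.items():
        lemma = variants.get(word, word)
        for decade, n in by_decade.items():
            pooled[lemma][decade] += n
    return pooled

def ranks(counts, decade):
    """Rank words by frequency within one decade (1 = most frequent)."""
    order = sorted(counts, key=lambda w: -counts[w].get(decade, 0))
    return {w: i + 1 for i, w in enumerate(order)}

print(ranks(counts, 1640))                           # spellings ranked separately
print(ranks(pool_variants(counts, variants), 1640))  # lemma "have" re-ranked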

    Some tangles would surely remain, however. Leftover tangles might be an effect of the limited size of this earlier portion of the corpus: we know, for example, that there are far fewer words in these earlier decades than in the later ones; we also know that the Optical Character Recognition process fails to capture all of the surviving words accurately during transcription. If we assume that certain function words are so essential that their relative rank in a given time period ought to be invariant, then the residual wobbles might provide us with a measure of how much linguistic variation is missing from the Google corpus in a given decade. It would suggest, like a companion star wobbling around a black hole, the existence of lost books and lost letters. This is not the “dark matter” of the Google corpus referred to in a recent article in Science: the proper nouns that never make it into the dictionary. Rather, this is “missing matter,” things which existed but did not survive to be counted because books were destroyed or characters were not recognized.

    Google is trying to quantify just how large the corpus of printed English books is so that it can say what percentage of books it has scanned. “Function word wobble” might be a proxy for such a measure. We already use function word counts to characterize differences among Shakespeare’s literary genres and Victorian novels. Perhaps they are useful for something other than genre discrimination within groups of texts – useful, that is, when profiled across the entire population of word occurrences in a decade rather than a generically diversified sub-population of books.
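
    If we wanted to operationalize such a proxy, the crudest version would treat the standard deviation of a function word’s rank across decades as its “wobble.” A sketch, with invented rank trajectories:

```python
import statistics

def wobble(rank_by_decade):
    """Decade-to-decade instability of a word's rank (standard deviation)."""
    ranks = list(rank_by_decade.values())
    return statistics.stdev(ranks) if len(ranks) > 1 else 0.0

# Invented trajectories: a stable function word versus a wobbly one
trajectories = {
    "the": {1640: 1,  1650: 1, 1660: 1},
    "was": {1640: 14, 1650: 9, 1660: 22},
}
for word, trajectory in trajectories.items():
    print(word, round(wobble(trajectory), 2))  # higher = more residual wobble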

    Prospero destroys his magnificent book of magic before leaving the island in The Tempest, saying “deeper than did ever plummet sound / I’ll drown my book.” When he does so, certain words disappear to the bottom of the sea. Much later, a plummet may sound the loss.

  • Google n-grams and Philosophy: Use Versus Mention

    Well, the Google n-gram corpus is out, and the world has been introduced to a fabulous new intellectual parlor game. Here are a few searches I ran today which deal with philosophers and philosophical terms:

    A lot of people are going to be playing with this tool, and I think there are some genuine discoveries to be made. But here is a question: is what’s being counted in these n-gram searches “uses” of certain words or “mentions” of those words? The use/mention distinction is a favorite one in analytic philosophy, and has roots in the theory of “suppositio” explored by the medieval Terminists. It is useful here as well. The Google n-gram corpus is simply a bag of words and sequences of words divided by year. So what does it mean that an n-gram occurs more frequently in one bag rather than another? Does philosophy become more interested in “the subject” as opposed to “the object” around 1800? (Never mind that these terms have precisely the opposite meaning for medieval thinkers.) Does Heidegger eclipse Bergson in importance in the mid-1960s? Does “ethics” displace “morality” as a way of thinking about what is right or wrong in human action?

    These are different cases; in each, however, we ought to read the results returned from the n-gram corpus search as “mentionings” of these terms. Understanding how these words are used, and in what kinds of texts, is much more difficult than saying that they are mentioned in such and such a quantity. The important question, then, is what one can learn from the occurrence or mention of a word in a field as wide as this. I think the mention of a proper name like “Heidegger” is probably more revealing than the mention of a particular philosophical term like “subject” or “object.” While it’s not an earth-shaking discovery that Heidegger gets more mentions than Bergson in the latter half of the twentieth century, this fact is nevertheless interesting and useful. In the case of terms such as “subject” and “object,” however, we are dealing with terms that are regularly used outside of philosophical analysis: they may not have a “philosophical use” in the cases being counted. Another factor to consider: the name Heidegger likely refers to the German philosopher, but it could also point to other individuals sharing that name. The philosopher Donald Davidson, for example, who spent a lot of time thinking about the use/mention distinction, would not necessarily be picked out of a crowd by a search on his surname. Even with a rare proper name, we cannot be certain that mention accomplishes something like Kripke’s “rigid designation.”
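
    Concretely, the corpus behind such searches is just a set of year-keyed bags of words, and every query is a count of mentions within those bags. Here is a toy sketch of that structure (all numbers invented), with the kind of relative-frequency query the viewer answers:

```python
from collections import Counter

# Invented year-keyed bags of words
bags = {
    1800: Counter({"the": 500, "subject": 12, "object": 15}),
    1850: Counter({"the": 800, "subject": 40, "object": 22}),
}

def relative_frequency(ngram, year):
    """Mentions of an n-gram as a share of all tokens counted in that year."""
    bag = bags[year]
    return bag[ngram] / sum(bag.values())

for year in sorted(bags):
    print(year, f"subject: {relative_frequency('subject', year):.5f}")
```

    Nothing in this structure distinguishes a philosophical use of “subject” from any other mention of it; that discrimination happens outside the count.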

    We could get closer to a word’s use by trying a longer string, something along the lines of Daniel Shore’s study of uses of the subjunctive with reference to Jesus, as in “what would Jesus do?” When it is embedded in the strings Shore identifies, the proper name Jesus seems to designate its referent more precisely. So too, the word “do” refers to the context of ethical deliberation, although even now there are ironic uses of the phrase that are really “mentionings” of earnest uses of these words by evangelicals. The special use-case of irony would, I suspect, be the hardest to track in large numbers. But there may be phrases that are invented by philosophers precisely in order to specify their own use, which is what makes them reliably citable or iterable in philosophical discourse. Terms of art, such as “a priori synthetic judgment,” are actually highly compressed attempts to specify a writer’s use of terms. As use-specific strings, terms of art are likely to produce use-specific results when they are used as search terms. Indeed, it seems likely that most philosophers are actually doing a roundabout form of mentioning when they coin such phrases. Such moments are imperative contracts, meaning something like: “whenever you see the phrase ‘a priori synthetic,’ interpret it as meaning ‘a judgment that pertains to experience but is not itself derived experientially.’”

    It would be nice if we could see occurrences displayed by subject heading of book. That would allow the user to be more precise in linking occurrence claims to use claims, a link that must inevitably be made in quantitative studies of culture. I suspect it is much harder to link occurrence to use than most people think; this tool may have the unintended use of bearing out that fact.

  • Early and Late Plato II: The Apology and The Timaeus

    In the previous post we were examining three-dimensional clusterings of the Platonic dialogues as rated on scaled Principal Components 1, 2, and 5, a technique that allowed us to see the early Platonic dialogues (as defined by Vlastos) standing apart from the middle and later ones. Vlastos’ claim, we remember, was that these early dialogues represent the historical Socrates, whose technique of argumentation was elenctic: Socrates used this technique to draw out the implications of an opponent’s views until those views collapsed under their own contradictions.

    The translator of these dialogues, Jowett, would have had to preserve at least some of the linguistic “footings” required for such a dialogical structure in the early dialogues, and it was my contention in the previous post that Docuscope would detect these footings because they are exactly what a translator must preserve. Perhaps a more provocative claim, which I would like to advance now, is that the irony which attends this elenctic method — while not itself visible to Docuscope — might also require certain reliable linguistic pivots. In keeping with our analogy of the body of a dancer, certain upper-body moves, like the ironic twist in which Socrates seems to be asking a question for the sake of clarification but is actually pushing his interlocutor into deeper confusion, require a lower-body stance that can support the weight of the move. If we could define this lower-body stance, we would not be defining Socratic irony itself, but rather its linguistic correlates. (At some point the analogy will break down, since language is not a “weight-bearing system”; but it does support gestures and turns, so let’s see how far we can go with it.)

    What exactly is happening in these early dialogues that Docuscope and Principal Component Analysis are able to see from afar? Here is a scree plot, which shows the explanatory power of the principal components as they are derived sequentially, from most powerful to least:

    Scree Plot for Principal Components Derived from Cluster Docuscope Data on Jowett Translations of the Platonic Corpus
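
    (For the technically curious, a plot like this can be produced in a few lines. The sketch below assumes a hypothetical CSV of Docuscope cluster frequencies, one row per dialogue; PCA “on correlations” amounts to running PCA on standardized variables.)

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical file: one row per dialogue, one column per Docuscope cluster
freqs = pd.read_csv("plato_docuscope_clusters.csv", index_col=0)

# Standardizing the variables makes this a PCA "on correlations"
pca = PCA().fit(StandardScaler().fit_transform(freqs))

components = range(1, len(pca.explained_variance_ratio_) + 1)
plt.plot(components, pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Proportion of variance explained")
plt.show()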

    The first two principal components are shown here to be quite powerful: together they account for almost 54% of the variation in the entire corpus. When we rate all of the dialogues on just these first two components, we get the following bubble plot:

    Graph of Platonic Dialogues Scored on Principal Components 1 and 2

    I have highlighted the upper left quadrant, where almost all of the dialogues that Vlastos identified as “early” are clustering. Their presence in this quadrant means that they score low on PC1 and high on PC2. PC1 might be described as an anti-early component, because it powerfully discriminates against early dialogues. PC2, on the other hand, might be described as a pro-early component, since its highly loaded variables are more frequently used in early dialogues. We can literally see the sorting power of these two components here, but it can also be quantified by the Tukey test, which was applied to both principal components, the results being available here and here. Note that the Apology is one of the most strongly “early” dialogues by these measures, whereas the Timaeus is one of the least early. We will pay closer attention to these two dialogues as a way of exemplifying the differences that Docuscope sees between the two types of items.
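
    (The Tukey comparison itself can be run with standard tools. Below is a sketch using invented PC1 scores and the Vlastos period labels; the real test was of course run on the full set of dialogue scores.)

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Invented PC1 scores with Vlastos period labels
scores = pd.DataFrame({
    "pc1":    [-2.1, -1.8, -1.5, 0.4, 0.9, 1.7, 2.2],
    "period": ["early", "early", "early", "middle", "middle", "late", "late"],
})

# Which pairs of period means differ significantly on PC1?
result = pairwise_tukeyhsd(endog=scores["pc1"], groups=scores["period"])
print(result.summary())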

    Before making the comparison, let’s look at the variables that are most powerfully loaded on these components and so are most responsible for discriminating the early/non-early difference. We do this either by consulting the loadings of our variables on the two principal components or by looking at a biplot which arrays those variables in two dimensions, exactly the two that were used to produce the bubble plot above. First the loadings scores (reported as eigenvectors) and then the loadings biplot:

    Loadings of Cluster Scores for PC1 and PC2
    Loadings Biplot for PC1 and PC2

    The loadings biplot (lower diagram) is a two-dimensional image of the loadings scores (upper diagram), showing how these variables behave with respect to one another in the entire corpus. Clusters of words that oppose each other by 180 degrees — for example, [Public_Values] and [Special_Referencing] — tend not to co-occur with one another in the same text. Here we are interested in what makes a particular text cluster in the upper left-hand quadrant, so we are looking for vectors (red arrows) that extend furthest to the left and to the top of the diagram. Vectors extending to the left are: Reasoning, Interactivity, Directing Action, Interior Mind, and First Person. (These are the clusters that have significant negative loadings on the first column in the top diagram: if an item scores high on words contained in these clusters, it will be “punished” for that abundance and pushed to the left of the plot, as the red dots are above.) Note that we can also use our 180-degree rule to say something about items that are far left in the bubble plot: they must lack items contained in the clusters that are positively loaded on PC1, which are Narrating, Description, and Time Orientation.

    Similarly, with PC2, we are looking for the tall vectors heading upward: Emotion, Public Values and Topical Flow. Having tokens that were counted under these clusters will push an item up in the diagram, as will lacking items from the negatively loaded clusters: Directing Readers, Elaborating, Special Referencing. Note that Topical Flow (which is often populated by third person pronoun use) is loaded positively for the second principal component, but also positively for the first, which makes it fork upward and to the right. This means that an item scoring high on Topical Flow tokens will probably lack some of the items to the far left and contain items to the far right, which may discourage that item’s appearance in our “early” quadrant unless there are differences in these other variables.
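
    (A loadings biplot of this kind is also easy to sketch: each cluster becomes a vector whose coordinates are its loadings on PC1 and PC2. The sketch below makes the same assumptions as the scree-plot example above.)

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

freqs = pd.read_csv("plato_docuscope_clusters.csv", index_col=0)
pca = PCA(n_components=2).fit(StandardScaler().fit_transform(freqs))

# One row per Docuscope cluster; columns are its loadings on PC1 and PC2
loadings = pca.components_.T

fig, ax = plt.subplots()
for (x, y), name in zip(loadings, freqs.columns):
    ax.arrow(0, 0, x, y, color="red", head_width=0.01)
    ax.annotate(name, (x, y), fontsize=8)
ax.axhline(0, linewidth=0.5)
ax.axvline(0, linewidth=0.5)
ax.set_xlabel("PC1 loading")
ax.set_ylabel("PC2 loading")
plt.show()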

    I have discussed some of these clusters in earlier posts about Shakespeare, so my main focus here will not be on elaborating the contents of the clusters. Rather, I want to use these loadings to zero in on specific words in exemplary passages from the early and later dialogues to see what is captured and then leave it to readers to say what these particular tokens are doing. Looking at our bubble plot above, the two dialogues that exemplify these opposing linguistic trends — in translation — are the Apology and the Timaeus.

    Here are two passages from the Apology that exemplify “earliness” in the Platonic corpus, if we agree that the clustering above seems compelling. Note that these are screenshots from Docuscope in which the clusters that are doing the work of pushing the texts up and to the left are turned on, or color-coded. I have not turned on the clusters that are absent, since these will be exemplified in the Timaeus:

    I think these passages are certainly illustrative of the elenctic method described by Vlastos, although it ought to be said that the high amount of dialogical interaction here — a hallmark of comedy in Shakespearean drama — is sometimes implied by Socrates rather than really enacted by both speakers. That is, Socrates sometimes simulates a dialogue that is not really happening (“to him I may fairly answer”), and this procedure actually multiplies the Interaction strings (sky blue) beyond what might be the case in actual interaction. Note too that Docuscope is seeing lots of Public Values words, words that gesture toward communally sanctioned values, in this earlier style: demigods, heroes, fairly, mistaken, good for, doing right, disgrace. These values must be cited in elenctic exchange because they are the topic of conversation (people have opinions about them), but such implied communality may also coerce assent from an interlocutor for reasons that extend beyond mere shame at self-contradiction. We see, too, more emotionally charged words (in orange); the occasional Topical Flow token (their); and some Reason tokens (if he, thus, may, do not).

    Now look at a passage from the Timaeus, which does the things that items in the early quadrant (on the whole) cannot do:

    This is cosmogony, not dialogue, which is why we have a number of Narrative strings (the year when, then, the night, overtaken the, as they) and Description strings (orbit, the moon, stars, sun, wanderings, motion, swiftness). Special Referencing here is picking up a lot of abstract references (dark purple) such as animals, measure, relative, the whole, nature, variety and degrees. The slightly lighter purple Reporting strings are complementing the Narrative tokens: having, completion, After this, came into being, received, to the end that, created. This should not be surprising, since the two vectors for these clusters were almost overlapping in the loadings biplot above.

    Whereas the Apology is staging a dialogue (real or implied), the Timaeus is creating a world and pacing that act of creation (through narrative) with a set of abstract terms that can be referenced in conversation. Indeed, one of the burdens of this kind of world-making, I think, is that the abstractions must be folded in with the concrete descriptions in equal measure so that the passage is something more than a Georgic description of a natural scene or a praise poem to nature. Note too that there is absolutely no irony in this passage from the Timaeus. That is not because Docuscope has a category that allows it to discern irony in its local environs and so rule out such an effect in the Timaeus: only a human being can make such a discrimination, by virtue of being able to look beyond the simple mentioning of words to assess their use. (For Docuscope, all counted words are mentionings of words whose single use has been classed a priori in the categories assigned to them.)

    And yet, even in translation, Docuscope may be identifying the linguistic footings of irony: a necessary but not sufficient condition for its use.

  • Platonic Dialogues and the “Two Socrates”

    Press to Start: Vlastos (1991) Groupings, PCA on Correlations

    I have been thinking for a while now that Docuscope preserves, in its tagging structure, what a translator preserves — that this is a good definition of what it is looking to classify. One way to test this hypothesis would be to try Docuscope on a set of translations, which is what I’ve tried to do here.

    The visualization above (press to rotate) shows the Platonic Corpus as translated by the nineteenth-century classicist Benjamin Jowett, rated by principal components on correlations and color-coded by the divisions proposed by the great Plato scholar Gregory Vlastos (1991), whose division of the dialogues into early (red), middle (blue), and late (green) is highlighted here. (The semitransparent ellipsoids are drawn to capture 50 percent of the items in each group.) Vlastos argued, on the basis of the types of arguments used in these texts, that the early dialogues represent a distinct group from those produced in the middle or later periods. The mode of argument in these earlier dialogues, he observes, is elenctic or adversative, which means that in these dialogues Socrates does not “defend a thesis of his own” but rather examines one held by an interlocutor (113). Socrates thus avoids making knowledge claims in these dialogues, instead forcing his interlocutors to enunciate them as the weakness of their own positions becomes apparent. Believing that there are two “Socrates” presented in these dialogues, Vlastos argues that the early Socrates — who likely represents the philosophical position of the historical Socrates rather than Plato — must rely on the “‘say what you believe’ rule” (113), this rule supplying the rough materials of his proofs. As epistemologist (which he is not in these dialogues), Socrates does not advance certain knowledge claims: the elenctic method will not support them.

    The middle and later Socrates, by contrast, is fully willing to advance certain knowledge claims, which he seeks to present demonstratively (48). Rather than being simply a moral philosopher, he is now a “moral philosopher and metaphysician and epistemologist and philosopher of science and philosopher of language and philosopher of religion and philosopher of education and philosopher of art.” In these dialogues, Socrates advances a theory of knowledge as the recollection of separately existing Forms – a significant epistemological leap. This Socrates is now a spokesman for Plato, making the most important division of the corpus the one between the early dialogues and all the rest.

    Taking this division as a starting point, let’s look at how Docuscope divides the dialogues, which it does here simply on the basis of mean scores on all 101 of the Language Action Types. These scores are plotted in a hyperspace, and the least dissimilar items are then paired using Ward’s method on unscaled data. The technique is the same as the one that produced the most effective genre clustering of Shakespeare’s plays. I am thus taking what I know of a particular mathematical technique as it applies to historically accepted clusterings of Shakespeare’s plays and applying it to a body of works that is less familiar to me – not quite what Franco Moretti calls “the great unread,” but definitely a case of trying to understand the lesser known through the better known.
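
    Here, again as a sketch, is the shape of that clustering step, assuming a hypothetical CSV of mean Language Action Type scores with one row per dialogue:

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Hypothetical file: one row per dialogue, one column per Language Action Type
lats = pd.read_csv("plato_lat_means.csv", index_col=0)

# Ward's method on the unscaled scores: least dissimilar items pair first
Z = linkage(lats.values, method="ward")
dendrogram(Z, labels=lats.index.tolist(), leaf_rotation=90)
plt.tight_layout()
plt.show()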

    Ward’s Clustering on Translated Plato Dialogues

    As you can see from the clustering of red or early-period dialogues above, we can arrive at an arrangement of the dialogues using Docuscope data that is remarkably similar to the basic division of the dialogues that Vlastos argued for in 1991. But what is perhaps most interesting is that roughly the same division was arrived at stylometrically in the late nineteenth century, and that there has been at least some convergence within Plato studies between what we might call “intensive” techniques for sorting the dialogues (based on readers’ reactions to the doctrines or manner of presentation) and “extensive” ones (built on groups that themselves represent the capture of stylometrically significant counted items). As Brandwood shows in The Chronology of Plato’s Dialogues (1990), it was already apparent to computationally unassisted readers of Plato such as L. Campbell that the later dialogues exhibited more technical and rare words, as well as a “peculiar, stately rhythm.” These claims were advanced with quantitative evidence (Campbell, 1867) but were grounded in an impression gathered through close and repeated reading. This line of inquiry was also taken up by the German classicist W. Dittenberger, who in 1896 argued that early and later dialogues could be discriminated by looking at the particles καὶ μήν and ἀλλὰ μήν, which co-occur in the early dialogues, and τί μήν, ἀλλὰ…μήν, and γε μήν, which co-occur in the later ones. This essentially multivariate pattern yielded the early grouping: Crito, Euthyphro, Protagoras, Charmides, Laches, Euthydemus, Meno, Gorgias, Cratylus, Phaedo. As you can see from the above, Vlastos’ groupings and Dittenberger’s overlap significantly. To these we might add the groupings derived from the Docuscope codings.

    This convergence is interesting for a number of reasons. First, it shows us extensive and intensive techniques working in tandem, which raises the basic question of how these two things are related. Second, it shows us how a certain conversational style or dialogical setting connects with a philosophical position, and how both may become available for analysis through the counting of seemingly inconsequential particles such as μήν. The Platonic corpus is an excellent one to work with because it has been well studied, and we have the advantage of pre-computational techniques to examine alongside actual readers’ responses. In my next post, I will examine those features in the translated dialogues that – once tagged by Docuscope – seem to be doing a good job of reproducing the scholarly divisions described above.

  • Pre-Digital Iteration: The Lindstrand Comparator

    Photo: The Lindstrand Comparator

    I’ve just finished a terrific conference at Loyola organized by Suzanne Gossett on “The Future of Shakespeare’s Text(s).” This photo shows a device, used by one of the conference organizers, Peter Shillingsburg, to perform manual collation of printed editions of texts. There is a long tradition of using optical collators to find and identify differences in printed editions of texts; this one, the Lindstrand Comparator, works on a deviously clever principle. Exploiting a feature of the human visual system, the Lindstrand Comparator allows you to view two different pages simultaneously, each page’s image being fed to a different eye through a series of mirrors. When the brain tries to reconcile the two disparate images — a divergence caused by print differences on the page — these textual differences will levitate on the surface of the page or, conversely, sink into it. What is in actuality a spatial disparity becomes a disparity of depth via the contributions of the brain (which is clearly an integral part of the apparatus).

    In this photo, Shakespeare scholar Gabriel Egan compares two variant printed editions of a modern novel. The device is an excellent example of mechanical-optical technology being used to assist in the iterative tasks of scholarship — iterations we now perform with computers. It is also the only technology I know of that lets you see depth in a page, something you cannot do with hypertexts or e-readers. Maybe we should stop writing code for fancy web pages and start working with wood and mirrors?