Blog

  • The Ancestral Text

    Rosamond Purcell, "The Book, the Land"

    In this post I want to understand the consequences of “massive addressability” for “philosophies of access”–philosophies which assert that all beings exist only as correlates of our own consciousness. The term “philosophy of access” is used by members of the Speculative Realist school: it seems to have been coined largely as a means of rejecting everything the term names. Members of this school dismiss the idea that speculative analysis of the nature of beings must be replaced by an apparently more basic inquiry into how we access the world, an access obtained either through language or consciousness. The major turn to “access” occurs with Kant, but the move is continued in an explicitly linguistic register by Heidegger, Wittgenstein, Derrida, and a range of post-structuralists.

    One reason for jettisoning the priority of access, according to Ray Brassier, is that it violates “the basic materialist requirement that being, though perfectly intelligible, remain irreducible to thought.” As will become clear below, I am sympathetic to this materialist requirement, and more broadly to the Speculative Realist project of dethroning language as our one and only mode of access to the world. (There are plenty of ways of appreciating the power and complexity of language without making it the wellspring of Being, as some interpreters of Heidegger have insisted.) Our quantitative work with texts adds an unexpected twist to these debates: as objects of massive and variable address, we grasp things about texts in precisely the ways usually reserved for non-linguistic entities. When taken as objects of quantitative description, texts possess qualities that–at some point in the future–could be said to have existed in the present, regardless of our knowledge of them. There is thus a temporal asymmetry surrounding quantitative statements about texts: if one accepts the initial choices about what gets counted, such statements can be “true” now even if they can only be produced and recognized later. Does this asymmetry, then, mean that language itself, “though perfectly intelligible, remain[s] irreducible to thought?” Do iterative methods allow us to satisfy Brassier’s materialist requirement in the realm of language itself?

    Let us begin with the question of addressability and access. The research described on this blog involves the creation of digitized corpora of texts and the mathematical description of elements within those corpora. These descriptions obtain at varying degrees of abstraction (nouns describing sensible objects, past forms of verbs with an auxiliary, etc.). If we say that we know something quantitatively about a given corpus, then, we are saying that we know it on the basis of a set of relations among elements that we have provisionally decided to treat as countable unities. Our work is willfully abstract in the sense that, at crucial moments of the analysis, we foreground relations as such, relations that will then be reunited with experience. When I say that objects of the following kind – “Shakespearean texts identified as comedies in the First Folio” – contain more of this type of thing – first and second person singular pronouns – than objects of a different kind (Shakespeare’s tragedies, histories), I am making a claim about a relation between groups and what they contain. These groupings and the types of things that we use to sort them are provisional unities: the circle we draw around a subset of texts in a population could be drawn another way if we had chosen to count other things. And so, we must recognize several reasons why claims about these relations might always be revised.
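    The kind of counting at issue can be sketched in a few lines of Python. The snippets and genre labels below are invented stand-ins for the actual Folio texts, and the pronoun list is deliberately minimal; the point is only the shape of the claim – a relation between groups and the rates of a countable feature within them.

    ```python
    # Toy sketch of counting first/second person singular pronouns by group.
    # The texts and genre labels are hypothetical stand-ins for the corpus.
    FIRST_SECOND_SINGULAR = {"i", "me", "my", "mine", "thou", "thee", "thy", "thine"}

    def pronoun_rate(text):
        """Fraction of tokens that are first/second person singular pronouns."""
        tokens = text.lower().split()
        hits = sum(1 for t in tokens if t in FIRST_SECOND_SINGULAR)
        return hits / len(tokens) if tokens else 0.0

    corpus = {
        "comedy": ["thou art my love and I thee", "give me thy hand"],
        "tragedy": ["the state is fallen and all men mourn", "they march upon the city"],
    }

    rates = {genre: sum(pronoun_rate(t) for t in texts) / len(texts)
             for genre, texts in corpus.items()}
    print(rates)
    ```

    Nothing in these toy numbers is evidence, of course; the sketch only fixes what “contains more of this type of thing” means operationally: a comparison of rates computed over provisional unities.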

    Every decision about what to count offers a caricature of the corpus and the modes of access this corpus allows. A caricature is essentially a narrowing of address: it allows us to make contact with an object in some of the ways Graham Harman has described in his work on vicarious causation. One can argue, for example, that the unity “Shakespeare’s Folio comedies” is really a subset of a larger grouping, or that the group can itself be subdivided into smaller groups. Similarly, one might say that the individual plays in a given group aren’t really discrete entities and so cannot be accurately counted in or out of that group. There are certain words that Hamlet may or may not contain, for example, because print variants and multiple sources have made Hamlet a leaky unity. (Accommodating such leaky unities is one of the major challenges of digital text curation.) Finally, I could argue that addressing these texts on the level of grammar–counting first and second person singular pronouns–is just one of many modes of address. Perhaps we will discover that these pronouns are fundamentally linked to semantic patterns that we haven’t yet decided to study, but should. All of these alternatives demonstrate the provisional nature of any decision to count and categorize things: such decisions are interpretive, which is why iterative criticism is not going to put humanities professors out of business. But such counting decisions are not–and this point is crucial–simply another metaphoric reduction of the world. PCA, cluster analysis and the other techniques we use are clearly inhuman in the number of comparisons they are able to make. The detour through mathematics is a detour away from consciousness, even if that detour produces findings that ultimately converge with consciousness (i.e., groupings produced by human reading).

    Once the counting decisions are made, our claims to know something in a statistical sense about texts boil down to a claim that a particular set of relations pertains among entities in the corpus. Indeed, considered mathematically, the things we call texts, genres, or styles simply are such sets of relations–the mathematical reduction being one of many possible caricatures. But counting is a very interesting caricature: it yields what is there now–a real set of relations–but is nevertheless impossible to contemplate at present. Once claims about texts become mathematical descriptions of relations, such statements possess what the philosopher Quentin Meillassoux calls ancestrality, a quality he associates primarily with statements about the natural world. Criticizing the ascendance of what he calls the Kantian dogma of correlationism—the assumption that everything which can be said “to be” exists only as a correlate of consciousness—Meillassoux argues that the idealist or critical turn in Continental philosophy has impoverished our ability to think about anything that exceeds the correlation between mind and world. This “Great Outdoors,” he goes on to suggest, is a preserve that an explicitly speculative philosophy must now rediscover, one which Meillassoux believes becomes available to us through mathematics. So, for example, Meillassoux would agree with the statement, “the earth existed 4.5 billion years ago,” precisely because it can be formulated mathematically using measured decay rates of radioactive isotopes. The statement itself may be ideal, but the reality it points to is not. What places The Great Outdoors out of doors, then, is its indifference to our existence or presence as an observer. Indeed, for Meillassoux, it is only those things which are “mathematically conceivable” that exceed the post-Kantian idealist correlation. For Meillassoux,

    all those aspects of the object that can be formulated in mathematical terms can be meaningfully conceived as properties of the object in itself.

    Clearly such a statement is a goad for those who place mind or natural language at the center of philosophy. But the statement is also a philosophical rallying cry: be curious about objects or entities that do not reference human correlates! I find this maxim appealing in the wake of the “language is everything” strain of contemporary theory, which is itself a caricature of the work of Wittgenstein, Derrida and others. Such exaggerations have been damaging to those of us working in the humanities, not least because they suggest that our colleagues in the sciences do nothing but work with words. By making language everything–and, not accidentally, making literary studies the gatekeeper of all disciplines–this line of thought amounts to a new kind of species narcissism. Meillassoux and others are finding ways to not talk about language all the time, which seems like a good thing to me.

    But would Meillassoux, Harman and other Speculative Realists consider texts to be part of The Great Outdoors? Wouldn’t they have to? After all, statements about groupings in the corpus can be true now even when there is no human being to recognize that truth as a correlate of thought. Precisely because texts are susceptible to address and analysis on a potentially infinite variety of levels, we can be confident that a future scholar will find a way of counting things that turns up a new-but-as-yet-unrecognized grouping. Human reading turned up such a thing when scholars in the late nineteenth century “discovered” the genre of Shakespeare’s Late Romances. (Hope and I have, moreover, re-described these groupings statistically.) As our future mathematical sleuth might do a century from now, nineteenth-century scholars were arguing that Romance was already a real feature of the Shakespearean corpus, albeit one that no one had yet recognized. They had, in effect, picked out a new object by emphasizing a new set of relations among elements in a collection of words. Couldn’t we expect another genre to emerge from this sort of analysis – a Genre X, let’s say – given sufficient time and resources? Would we accept such a genre if derived through iterative means?

    I can imagine a day, 100 years from now, when we have different dictionaries that address the text on levels we have not thought to explore at present. What if someone creates a dictionary that allows me to use differences in a word’s linguistic origin (Latinate, Anglo-Saxon, etc.) to relate the contents of one text to another? What if a statistical procedure is developed that allows us to “see” groupings we could recognize today but simply have not developed the mathematics to expose? When you pair the condition of massive addressability with (a) the possibility of new tokenizations (new elements or strata of address) or (b) the possibility that all token counts past and future can be subjected to new mathematical procedures, you arrive at a situation in which something that is arguably true now about a collection of texts can only be known in the future.

    And if something can be true about an object now without itself being a correlate of human consciousness, isn’t that something part of the natural world, the one that is supposed to be excluded from the charmed circle of the correlation? Does this make texts more like objects in nature, or objects in nature more like texts? Either way, The Great Outdoors has become larger.

  • Lost Books, “Missing Matter,” and the Google 1-Gram Corpus


    Mike Gleicher's Visualization of 1-gram Ranks in Google English Corpus

    My colleague Mike Gleicher (UW-Madison Computer Science) has been working on a rough and ready visualization of popular words (1-grams) in the Google English Books corpus, which contains several million items. He produced this visualization after working with the dataset for a day. I find the visualization appealing because of what it shows us about English “closed-class” or “function words.”

    When you explore the interactive version on Gleicher’s website by clicking on the image above, you can highlight the path of certain words as they increase or decrease in rank by decade. (Rank here means the popularity of a given word among all others in the scanned and published works for that decade.) Note that the stable items at the top of the visualization are function words; that is, words which carry syntactical or grammatical information rather than lexical content. (Function words are the hardest to define in a dictionary; they also tend not to have synonyms.) We expect function words to be used frequently and for that use to be invariant over time: such words mediate functions that the language must repeatedly deploy. Notice that the function words at the top – “the” “of” “and” “to” – tend to be stable throughout in terms of rank; they are also the function words that do not have multiple spellings. So, for example, “the” does not contain a long “ſ” in the way that “was” does. We know that these characters were interchangeable prior to the standardization of English orthography, so it makes sense that “the” remains quite stable while “was” does not.
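    Rank, in this sense, is straightforward to compute once a decade’s counts are in hand. A minimal sketch, with invented counts standing in for the Google data:

    ```python
    from collections import Counter

    # Invented mini 1-gram counts per decade, standing in for the Google corpus.
    decade_counts = {
        1630: Counter({"the": 900, "of": 700, "haue": 300, "have": 60}),
        1830: Counter({"the": 9000, "of": 7000, "have": 3500, "haue": 2}),
    }

    def rank(counter, word):
        """1-based popularity rank of a word within one decade's counts."""
        ordered = [w for w, _ in counter.most_common()]
        return ordered.index(word) + 1

    for decade, counts in decade_counts.items():
        print(decade, {w: rank(counts, w) for w in ("the", "have", "haue")})
    ```

    In this toy data, “the” holds rank 1 in both decades while “have” and “haue” trade places – exactly the kind of stability and wobble the visualization makes visible.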

    What should we say about the spaghetti-like tangle at the left-hand side of the graph? I think it’s fair to say that this tangle shows two things at once: first, that function words are plentiful over time and, second, that such high-frequency words had multiple spellings. The viewer that Gleicher has created allows you to see how a single function word varies through multiple spellings. So, for example, the period of high rank-fluctuation in the function word “have” coincides with the high-fluctuation period of “haue,” which is its alternate spelling.

    Rank of Function Word "Have" in Google English Corpus
    Rank of the Function Word "Haue" in the Google English Corpus

    Visualizations ought to provoke new ideas, not simply prove that certain relationships exist. So, what interesting thoughts can you have while looking at such visualizations? I was struck by the contrast between the straight lines at the top and the tangle further below. What does such a contrast tell us beyond what we already know about spelling variation in the early periods? Perhaps it allows us to imagine the conditions under which the highly ranked function words might progress in linear fashion across the x-axis, as in the case of “the.” What if we aggregated the counts for the function words by including occurrences of known alternate spellings and then recalculated? If we combined the counts of “have” and “haue,” for example, we might expect the rank of the lemma “have” (the aggregate of all alternative spellings) to go up, and its path to be less wobbly. The end result of this process should be that the lines across the top become increasingly straight, and that the lower left-hand side of the visualization becomes less tangled.
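    The aggregation proposed here is easy to sketch. The variant map below is a hypothetical stand-in for a real orthographic dictionary of early modern spellings:

    ```python
    # Fold counts for known variant spellings into a single lemma before
    # recomputing rank. The VARIANTS map is an invented stand-in for a real
    # orthographic dictionary.
    VARIANTS = {"haue": "have", "vvas": "was", "loue": "love"}

    def aggregate(counts):
        out = {}
        for word, n in counts.items():
            lemma = VARIANTS.get(word, word)
            out[lemma] = out.get(lemma, 0) + n
        return out

    raw = {"the": 900, "of": 700, "haue": 300, "have": 60}
    agg = aggregate(raw)
    print(agg)  # counts for "have" and "haue" folded together
    ```

    Recomputing ranks over the aggregated counts would then show whether the lemma’s path really does straighten out.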

    Surely some tangles would remain, however. Leftover tangles might be an effect of the limited size of this earlier portion of the corpus: we know, for example, that there are far fewer words in these earlier decades than in the later ones; we also know that the Optical Character Recognition process fails to capture all of the surviving words accurately during transcription. If we assume that certain function words are so essential that their relative rank in a given time period ought to be invariant, then the residual wobbles might provide us with a measure of how much linguistic variation is missing from the Google corpus in a given decade. It would suggest, like a companion sun wobbling around a black hole, the existence of lost books and lost letters. This is not the “dark matter” of the Google corpus referred to in a recent article in Science: the proper nouns that never make it into the dictionary. Rather, this is “missing matter”: things which existed but did not survive to be counted because books were destroyed or characters were not recognized.

    Google is trying to quantify just how large the corpus of printed English books is so that it can say what percentage of books it has scanned. “Function word wobble” might be a proxy for such a measure. We already use function word counts to characterize differences in Shakespeare’s literary genres and Victorian novels. Perhaps they are useful for something other than genre discrimination within groups of texts – useful, that is, when profiled across the entire population of word occurrences in a decade rather than a generically diversified sub-population of books.

    Prospero destroys his magnificent book of magic before leaving the island in The Tempest, saying “deeper than did ever plummet sound / I’ll drown my book.” When he does so, certain words disappear to the bottom of the sea. Much later, a plummet may sound the loss.

  • Text: A Massively Addressable Object

    Phone Book Dress by Jolis Paons

    At the Working Group for Digital Inquiry at Wisconsin, we’ve just begun our first experiment with a new order of magnitude of texts. Hope and I started working with 36 items about 6 years ago when we began to study Shakespeare’s First Folio plays. Last year we expanded to 320 items with the help of Martin Mueller at Northwestern, exploring the field of early modern drama. Now that UW has negotiated a license with the University of Michigan to begin working with the files from the Text Creation Partnership (TCP, which contains over 27000 items from early modern print), we can up the number again. By January we will have begun our first 1000 item experiment, spanning items printed in Britain and North America from 1530-1809. Robin Valenza and I, along with our colleagues in Computer Science and the Library, will begin working up the data in the spring. Stay tuned for results.

    New experiments provide opportunities for thought that precede the results. What does it mean to collect, tag and store an array of texts at this level of generality? What does it mean to be an “item” or “computational object” within this collection? What is such a collection? In this post, I want to think further about the nature of the text objects and populations of texts we are working with.

    What is the distinguishing feature of the digitized text – that ideal object of analysis considered in all of its hypothetical relations with other ideal objects? The question itself goes against the grain of recent materialist criticism, which focuses on the physical existence of books and the practices involved in making and circulating them. Unlike someone buying an early modern book in the bookstalls around St. Paul’s four hundred years ago, we encounter our TCP texts as computational objects. That doesn’t mean that they are immaterial, however. Human labor has transformed them from microfilm facsimiles of real pages into diplomatic-quality digital transcripts, marked up in TEI so that different formatting features can be distinguished. That labor is as real as any other.

    What distinguishes this text object from others? I would argue that a text is a text because it is massively addressable at different levels of scale. Addressable here means that one can query a position within the text at a certain level of abstraction. In an earlier post, for example, I argued that a text might be thought of as a vector through a meta-table of all possible words. Why is it possible to think of a text in this fashion? Because a text can be queried at the level of single words and then related to other texts at the same level of abstraction: the table of all possible words could be defined as the aggregate of points of address at a given level of abstraction (the word, as in Google’s new n-gram corpus). Now, we are discussing ideal objects here: addressability implies different levels of abstraction (character, word, phrase, line, etc.), which are stipulative or nominal: such levels are not material properties of texts or Pythagorean ideals; they are, rather, conventions.
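    The vector idea can be made concrete in a few lines. Here the “table of all possible words” shrinks to the vocabulary of two invented snippets, but the principle is the same: each text becomes a row of counts indexed by a shared level of address (the word).

    ```python
    # Each text as a vector through a shared vocabulary: a row of counts
    # indexed by points of address at the level of the word. The snippets
    # are invented for illustration.
    texts = {
        "A": "the quality of mercy is not strained",
        "B": "the rain it raineth every day",
    }
    vocab = sorted({w for t in texts.values() for w in t.split()})

    def to_vector(text):
        counts = {}
        for w in text.split():
            counts[w] = counts.get(w, 0) + 1
        return [counts.get(w, 0) for w in vocab]

    vectors = {name: to_vector(t) for name, t in texts.items()}
    print(vocab)
    print(vectors["A"])
    ```

    Once two texts occupy the same table, they can be compared, grouped, and re-grouped at that level of abstraction – which is all the later statistical work requires.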

    Here’s the twist. We have physical manifestations of ideal objects (the ideal 1 Henry VI, for example), but these manifestations are only provisional realizations of that ideal. (I am using the word manifestation in the sense advanced in OCLC’s FRBR hierarchy.) The book or physical instance, then, is one of many levels of address. Backing out into a larger population, we might take a genre of works to be the relevant level of address. Or we could talk about individual lines of print; all the nouns in every line; every third character in every third line. All of this variation implies massive flexibility in levels of address. And more provocatively: when we create a digitized population of texts, our modes of address become more and more abstract: all concrete nouns in all the items in the collection, for example, or every item identified as a “History” by Heminges and Condell in the First Folio. Every level is a provisional unity: stable for the purposes of address, but also: stable because it is the object of address. Books are such provisional unities. So are all the proper names in the phone book.

    The ontological status of the individual text is the same as that of the population of texts: both are massively addressable, and when they are stored electronically, we are able to act on this flexibility in more immediate ways through iterative searches and comparisons. At first glance, this might seem like a Galilean insight, similar to his discipline-collapsing claim that the laws which apply to heavens (astronomy) are identical with the ones that apply to the sublunar realm (physics). But it is not.

    Physical texts were already massively addressable before they were ever digitized, and this variation in address was and is registered at the level of the page, chapter, the binding of quires, and the like. When we encounter an index or marginal note in a printed text — for example, a marginal inscription linking a given passage of a text to some other in a different text — we are seeing an act of address. Indeed, the very existence of such notes and indexes implies just this flexibility of address.

    What makes a text a text – its susceptibility to varying levels of address – is a feature of book culture and the flexibility of the textual imagination. We address ourselves to this level, in this work, and think about its relation to some other. “Oh, this passage in Hamlet points to a verse in the Geneva bible,” we say. To have this thought is to dispose relevant elements in the dataset in much the same way a spreadsheet aggregates a text in ways that allow for layered access. A reader is a maker of such a momentary dispositif, and reading might be described as the continual redisposition of levels of address in this manner. We need a phenomenology of these acts, one that would allow us to link quantitative work on a culture’s “built environment” of words to the kinesthetic and imaginative dimensions of life at a given moment.

    A physical text or manifestation is a provisional unity. There exists a potentially infinite array of such unities, some of which are already lost to us in history: what was a relevant level of address for a thirteenth century monk reading a manuscript? Other provisional unities can be operationalized now, as we are doing in our experiment at Wisconsin, gathering 1000 texts and then counting them in different ways. Grammar, as we understand it now, affords us a level of abstraction at which texts can be stabilized: we lemmatize texts algorithmically before modernizing them, and this lemmatization implies provisional unities in the form of grammatical objects of address.

    One hundred years from now, the available computational objects may be related to one another in new ways. I can only imagine what these are: every fourth word in every fourth document, assuming one could stabilize something like “word length” in any real sense. (The idea of a word is itself an artifact of manuscript culture, one that could be perpetuated in print through the affordances of moveable type.) What makes such thought experiments possible is, once again, the addressability of texts as such. Like a phone book, they aggregate elements and make these elements available in multiple ways. You could even think of such an aggregation as the substance of another aggregation, for example, the phone book dress designed by Jolis Paons above. But unlike a phone-book, the digitized text can be reconfigured almost instantly into various layers of arbitrarily defined abstraction (characters, words, lines, works, genres). The mode of storage or virtualization is precisely what allows the object to be addressed in multiple ways.

    Textuality is massive addressability. This condition of texts is realized in various manifestations, supported by different historical practices of reading and printing. The material affordances of a given medium put constraints on such practices: the practice of “discontinuous reading,” for example, develops alongside the fingerable discrete leaves of a codex. But addressability as such – this is a condition rather than a technology, action or event. And its limits cannot be exhausted at a given moment. We cannot, in a Borgesian mood, query all of the possible datasets that will appear in the fullness of time. And we cannot import future query types into the present. But we can and do approximate such future searches when we automate our modes of address in unsupervised multivariate statistical analysis – for example, factor analysis or PCA. We want all the phone books. And we can simulate some of them now.
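    As a gesture toward what “automating our modes of address” looks like in practice, here is a minimal PCA sketch via the singular value decomposition. The matrix of counts is invented, and a real analysis would run over thousands of features; the point is only that the components propose groupings without anyone having read anything.

    ```python
    import numpy as np

    # Rows = texts, columns = counted features (e.g., function words).
    # The counts are invented for illustration.
    X = np.array([
        [12., 3., 7., 1.],
        [11., 4., 6., 2.],
        [2., 9., 1., 8.],
        [1., 10., 2., 9.],
    ])

    Xc = X - X.mean(axis=0)               # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    coords = U[:, :2] * S[:2]             # project onto first two components

    # The first component separates texts 0-1 from texts 2-3.
    print(coords[:, 0])
    ```

    The grouping falls out of the mathematics – an inhuman number of comparisons – even if, as with the Folio genres, the result may converge with groupings human readers already recognize.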

  • Texts as Probability Clouds

    Electron probability density cloud of hydrogen atom

    We have thought a lot about what a “text” is in literary studies over the last few decades, spurred on by editorial theory, deconstruction, new media studies and book history. A nominalist by inclination, I tend to think of a text (real or digitized) as a provisional state of something, this other something being a hypothetical ideal or a fiction of analysis. So when I encounter a print version of a Shakespeare play, I am encountering an entity (for example, 1 Henry VI) in a state that is more or less suitable to the medium of print. But the printed play is not the performance. Nor is it whatever idea Shakespeare had when he began working with his company on the play.

    An additional complication: versions of any given Shakespeare play in print — those found in the 1623 First Folio — may contain variation at the level of the individual word or character, variation that (in the case of the Folio) is corrected during the print run. Whatever is “behind” the First Folio, then, that original is a reconstruction of something that can only be said to exist in an ideal sense. We can think of that meta-object as having a probabilistic character: different letters in particular positions have a likelihood of being x or y, for example. But in the end, the actual identity of even an individual character must be understood as a likelihood.
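    This probabilistic picture can be made literal. Superimpose several print variants of one line (the variant readings below are invented, not actual Folio states) and each character position becomes a distribution:

    ```python
    from collections import Counter

    # Invented print variants of a single line; one hypothetical
    # compositor's error at the fourth character.
    variants = [
        "to be or not to be",
        "to be or not to be",
        "to he or not to be",
    ]

    def position_distributions(readings):
        """For each character position, the probability of each reading."""
        dists = []
        for chars in zip(*readings):  # align character positions
            c = Counter(chars)
            total = sum(c.values())
            dists.append({ch: n / total for ch, n in c.items()})
        return dists

    dists = position_distributions(variants)
    print(dists[3])  # the disputed character: a distribution, not a letter
    ```

    The “text” here is the whole list of distributions: each print copy realizes it provisionally, while the meta-object remains a set of likelihoods.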

    None of these ideas except the last is particularly novel in Shakespeare studies. Peter Stallybrass and Margreta de Grazia, among many others, have already made the point that the sources behind Shakespeare plays are an editorial ideal — approximated in practice but unreachable in an ideal sense. Less has been said, however, about the probabilistic nature of the text itself: its existence as a set of likelihoods that are realized provisionally in different cases. A text as a cloud of probabilities. That’s interesting.

  • Google n-grams and Philosophy: Use Versus Mention

    Well, the Google n-gram corpus is out, and the world has been introduced to a fabulous new intellectual parlor game. Here are a few searches I ran today which deal with philosophers and philosophical terms:

    A lot of people are going to be playing with this tool, and I think there are some genuine discoveries to be made. But here is a question: is what’s being counted in these n-gram searches “uses” of certain words or “mentions” of those words? The use/mention distinction is a favorite one in analytic philosophy, and has roots in the theory of “suppositio” explored by the medieval Terminists. It is useful here as well. The Google n-gram corpus is simply a bag of words and sequences of words divided by year. So what does it mean that an n-gram occurs more frequently in one bag rather than some other? Does philosophy become more interested in “the subject” as opposed to “the object” around 1800? (Never mind that these terms have precisely the opposite meaning for medieval thinkers.) Does Heidegger eclipse Bergson in importance in the mid-1960s? Does “ethics” displace “morality” as a way of thinking about what is right or wrong in human action?

    These are different cases; in each, however, we ought to read the results returned from the n-gram corpus search as “mentionings” of these terms. Understanding how these words are used, and in what kinds of texts, is much more difficult than saying that they are mentioned in such and such a quantity. The important question, then, concerns what you can learn from the occurrence or mention of a word in a field as wide as this. I think the mention of a proper name like “Heidegger” is probably more revealing than the mention of a particular philosophical term like “subject” or “object.” While it’s not an earth-shaking discovery that Heidegger gets more mentions than Bergson in the latter half of the twentieth century, this fact is nevertheless interesting and useful. In the case of terms such as “subject” and “object,” however, we are dealing with terms that are regularly used outside of philosophical analysis: they may not have a “philosophical use” in the cases being counted. Another factor to consider: the name Heidegger likely refers to the German philosopher, but it could also point to other individuals sharing this name. The philosopher Donald Davidson, for example, who spent a lot of time thinking about the use/mention distinction, would not necessarily be picked out of a crowd by a search on his surname. Even with a rare proper name we can’t be certain that mention accomplishes something like Kripke’s “rigid designation.”

    We could get closer to a word’s use by trying a longer string, something along the lines of Daniel Shore’s study of uses of the subjunctive with reference to Jesus, as in “what would Jesus do?” When it is embedded in the strings Shore identifies, the proper name Jesus seems to designate its referent more precisely. So too, the word “do” refers to the context of ethical deliberation, although even now there are ironic uses of the phrase that are really “mentionings” of earnest uses of these words by evangelicals. The special use-case of irony would, I suspect, be the hardest to track in large numbers. But there may be phrases that are invented by philosophers precisely in order to specify their own use, which is what makes them reliably citable or iterable in philosophical discourse. Terms of art, such as “a priori synthetic judgment,” are actually highly compressed attempts to specify a writer’s use of terms. As use-specific strings, terms of art are likely to produce use-specific results when they are used as search terms. Indeed, it seems likely that most philosophers are actually doing a roundabout form of mentioning when they coin such phrases. Such moments are imperative contracts, meaning something like: “whenever you see the phrase ‘a priori synthetic,’ interpret it as meaning ‘a judgment that pertains to experience but is not itself derived experientially.’”

    It would be nice if we could see occurrences displayed by the subject heading of the book. That would allow the user to be more precise in linking occurrence claims to use claims, a link that must inevitably be made in quantitative studies of culture. I suspect it is much harder to link occurrence to use than most people think; this tool may have the unintended use of bearing out that fact.