At the Working Group for Digital Inquiry at Wisconsin, we’ve just begun our first experiment with a new order of magnitude of texts. Hope and I started working with 36 items about six years ago when we began to study Shakespeare’s First Folio plays. Last year we expanded to 320 items with the help of Martin Mueller at Northwestern, exploring the field of early modern drama. Now that UW has negotiated a license with the University of Michigan to begin working with the files from the Text Creation Partnership (TCP, which contains over 27,000 items from early modern print), we can up the number again. By January we will have begun our first 1,000-item experiment, spanning items printed in Britain and North America from 1530-1809. Robin Valenza and I, along with our colleagues in Computer Science and the Library, will begin working up the data in the spring. Stay tuned for results.
New experiments provide opportunities for thought that precede the results. What does it mean to collect, tag and store an array of texts at this level of generality? What does it mean to be an “item” or “computational object” within this collection? What is such a collection? In this post, I want to think further about the nature of the text objects and populations of texts we are working with.
What is the distinguishing feature of the digitized text – that ideal object of analysis considered in all of its hypothetical relations with other ideal objects? The question itself goes against the grain of recent materialist criticism, which focuses on the physical existence of books and practices involved in making and circulating them. Unlike someone buying an early modern book in the bookstalls around St. Paul’s four hundred years ago, we encounter our TCP texts as computational objects. That doesn’t mean that they are immaterial, however. Human labor has transformed them from microfilm facsimiles of real pages into diplomatic-quality digital transcripts, marked up in TEI so that different formatting features can be distinguished. That labor is as real as any other.
What distinguishes this text object from others? I would argue that a text is a text because it is massively addressable at different levels of scale. Addressable here means that one can query a position within the text at a certain level of abstraction. In an earlier post, for example, I argued that a text might be thought of as a vector through a meta-table of all possible words. Why is it possible to think of a text in this fashion? Because a text can be queried at the level of single words and then related to other texts at the same level of abstraction: the table of all possible words could be defined as the aggregate of points of address at a given level of abstraction (the word, as in Google’s new n-gram corpus). Now, we are discussing ideal objects here. Addressability implies different levels of abstraction (character, word, phrase, line, etc.) which are stipulative or nominal: such levels are not material properties of texts or Pythagorean ideals; they are, rather, conventions.
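To make the "vector through a table of words" image concrete, here is a minimal sketch in Python. It is an illustration only, not the pipeline we use at Wisconsin: the two-text collection, the crude tokenizer, and the function names (`tokenize`, `build_table`, `text_vector`) are all assumptions made for the example, and the "table" here is just the vocabulary of this toy collection rather than all possible words.

```python
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def build_table(texts):
    """Sorted list of every word that occurs anywhere in the collection."""
    return sorted({w for t in texts.values() for w in tokenize(t)})


def text_vector(text, table):
    """Count how often each word in the table occurs in one text."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in table]


# Hypothetical two-item collection, invented for illustration.
texts = {
    "a": "To be, or not to be",
    "b": "Not to be trusted",
}

table = build_table(texts)
vectors = {name: text_vector(t, table) for name, t in texts.items()}

print(table)
for name, vec in vectors.items():
    print(name, vec)
```

Each text becomes a row of counts over the same columns, which is what allows two texts to be related "at the same level of abstraction." Choosing characters or lines instead of words would simply mean a different tokenizer and a different table.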
Here’s the twist. We have physical manifestations of ideal objects (the ideal 1 Henry VI, for example), but these manifestations are only provisional realizations of that ideal. (I am using the word manifestation in the sense advanced in OCLC’s FRBR hierarchy.) The book or physical instance, then, is one of many levels of address. Backing out into a larger population, we might take a genre of works to be the relevant level of address. Or we could talk about individual lines of print; all the nouns in every line; every third character in every third line. All of this variation implies massive flexibility in levels of address. And more provocatively: when we create a digitized population of texts, our modes of address become more and more abstract: all concrete nouns in all the items in the collection, for example, or every item identified as a “History” by Heminges and Condell in the First Folio. Every level is a provisional unity: stable for the purposes of address, but also: stable because it is the object of address. Books are such provisional unities. So are all the proper names in the phone book.
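The more arbitrary modes of address mentioned above, such as every third character in every third line, are easy to state once a text is stored as a string. The following Python sketch assumes a made-up six-line sample and a hypothetical helper, `every_nth`; it is meant only to show that such a level of address is a convention one can stipulate and then query.

```python
def every_nth(items, n):
    """Return items n, 2n, 3n, ... (counting from 1) of a string or list."""
    return items[n - 1::n]


# Invented sample "page": six lines of six characters each.
sample = "abcdef\nghijkl\nmnopqr\nstuvwx\nyzabcd\nefghij"

lines = sample.splitlines()

# Level of address 1: every third line (lines 3 and 6).
third_lines = every_nth(lines, 3)

# Level of address 2: every third character within those lines.
third_chars = [every_nth(line, 3) for line in third_lines]

print(third_lines)
print(third_chars)
```

The same helper works on a list of words, a list of works, or a list of genres. Nothing in the text privileges one of these slicings over another.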
The ontological status of the individual text is the same as that of the population of texts: both are massively addressable, and when they are stored electronically, we are able to act on this flexibility in more immediate ways through iterative searches and comparisons. At first glance, this might seem like a Galilean insight, similar to his discipline-collapsing claim that the laws which apply to heavens (astronomy) are identical with the ones that apply to the sublunar realm (physics). But it is not.
Physical texts were already massively addressable before they were ever digitized, and this variation in address was and is registered at the level of the page, chapter, the binding of quires, and the like. When we encounter an index or marginal note in a printed text — for example, a marginal inscription linking a given passage of a text to some other in a different text — we are seeing an act of address. Indeed, the very existence of such notes and indexes implies just this flexibility of address.
What makes a text a text – its susceptibility to varying levels of address – is a feature of book culture and the flexibility of the textual imagination. We address ourselves to this level, in this work, and think about its relation to some other. “Oh, this passage in Hamlet points to a verse in the Geneva bible,” we say. To have this thought is to dispose relevant elements in the dataset in much the same way a spreadsheet aggregates a text in ways that allow for layered access. A reader is a maker of such a momentary dispositif, and reading might be described as the continual redisposition of levels of address in this manner. We need a phenomenology of these acts, one that would allow us to link quantitative work on a culture’s “built environment” of words to the kinesthetic and imaginative dimensions of life at a given moment.
A physical text or manifestation is a provisional unity. There exists a potentially infinite array of such unities, some of which are already lost to us in history: what was a relevant level of address for a thirteenth-century monk reading a manuscript? Other provisional unities can be operationalized now, as we are doing in our experiment at Wisconsin, gathering 1,000 texts and then counting them in different ways. Grammar, as we understand it now, affords us a level of abstraction at which texts can be stabilized: we lemmatize texts algorithmically before modernizing them, and this lemmatization implies provisional unities in the form of grammatical objects of address.
One hundred years from now, the available computational objects may be related to one another in new ways. I can only imagine what these are: every fourth word in every fourth document, assuming one could stabilize something like “word length” in any real sense. (The idea of a word is itself an artifact of manuscript culture, one that could be perpetuated in print through the affordances of moveable type.) What makes such thought experiments possible is, once again, the addressability of texts as such. Like a phone book, they aggregate elements and make these elements available in multiple ways. You could even think of such an aggregation as the substance of another aggregation, for example, the phone book dress designed by Jolis Paons above. But unlike a phone book, the digitized text can be reconfigured almost instantly into various layers of arbitrarily defined abstraction (characters, words, lines, works, genres). The mode of storage or virtualization is precisely what allows the object to be addressed in multiple ways.
Textuality is massive addressability. This condition of texts is realized in various manifestations, supported by different historical practices of reading and printing. The material affordances of a given medium put constraints on such practices: the practice of “discontinuous reading,” for example, develops alongside the fingerable discrete leaves of a codex. But addressability as such – this is a condition rather than a technology, action or event. And its limits cannot be exhausted at a given moment. We cannot, in a Borgesian mood, query all of the possible datasets that will appear in the fullness of time. And we cannot import future query types into the present. But we can and do approximate such future searches when we automate our modes of address in unsupervised multivariate statistical analysis – for example, factor analysis or PCA. We want all the phone books. And we can simulate some of them now.
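For readers who have not met PCA, a toy version may help. The Python sketch below finds the first principal component of a small table of word counts by power iteration on the covariance matrix. Everything in it is an assumption made for illustration: the four texts and their counts are invented, the function names are hypothetical, and real work would use an established statistics library rather than hand-rolled arithmetic.

```python
import math


def center(rows):
    """Subtract each column's mean so every column has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance(centered):
    """Sample covariance matrix of rows that are already centered."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]


def first_component(cov, iterations=100):
    """Approximate the leading eigenvector of cov by power iteration."""
    v = [float(i + 1) for i in range(len(cov))]  # arbitrary starting vector
    for _ in range(iterations):
        w = [sum(c * x for c, x in zip(row, v)) for row in cov]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            raise ValueError("no variance along the starting direction")
        v = [x / norm for x in w]
    return v


# Invented counts of three words in four texts (rows = texts).
counts = [
    [4, 0, 2],
    [3, 1, 2],
    [1, 3, 2],
    [0, 4, 2],
]

centered = center(counts)
pc = first_component(covariance(centered))

# Position of each text along the first component.
scores = [sum(x * c for x, c in zip(row, pc)) for row in centered]

print(pc)
print(scores)
```

In this toy table the first two words trade off against each other while the third never varies, so the component the procedure finds is a contrast between the first two columns, and the texts line up along it. The sign of a component is arbitrary. The point is that no one told the procedure which contrast to look for: the mode of address was arrived at automatically.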