At the Working Group for Digital Inquiry at Wisconsin, we’ve just begun our first experiment with a new order of magnitude of texts. Hope and I started working with 36 items about 6 years ago when we began to study Shakespeare’s First Folio plays. Last year we expanded to 320 items with the help of Martin Mueller at Northwestern, exploring the field of early modern drama. Now that UW has negotiated a license with the University of Michigan to begin working with the files from the Text Creation Partnership (TCP, which contains more than 27,000 items from early modern print), we can up the number again. By January we will have begun our first 1,000-item experiment, spanning items printed in Britain and North America from 1530 to 1809. Robin Valenza and I, along with our colleagues in Computer Science and the Library, will begin working up the data in the spring. Stay tuned for results.
New experiments provide opportunities for thought that precede the results. What does it mean to collect, tag and store an array of texts at this level of generality? What does it mean to be an “item” or “computational object” within this collection? What is such a collection? In this post, I want to think further about the nature of the text objects and populations of texts we are working with.
What is the distinguishing feature of the digitized text – that ideal object of analysis considered in all of its hypothetical relations with other ideal objects? The question itself goes against the grain of recent materialist criticism, which focuses on the physical existence of books and practices involved in making and circulating them. Unlike someone buying an early modern book in the bookstalls around St. Paul’s four hundred years ago, we encounter our TCP texts as computational objects. That doesn’t mean that they are immaterial, however. Human labor has transformed them from microfilm facsimiles of real pages into diplomatic-quality digital transcripts, marked up in TEI so that different formatting features can be distinguished. That labor is as real as any other.
What distinguishes this text object from others? I would argue that a text is a text because it is massively addressable at different levels of scale. Addressable here means that one can query a position within the text at a certain level of abstraction. In an earlier post, for example, I argued that a text might be thought of as a vector through a meta-table of all possible words. Why is it possible to think of a text in this fashion? Because a text can be queried at the level of single words and then related to other texts at the same level of abstraction: the table of all possible words could be defined as the aggregate of points of address at a given level of abstraction (the word, as in Google’s new n-gram corpus). Now, we are discussing ideal objects here: addressability implies different levels of abstraction (character, word, phrase, line, etc) which are stipulative or nominal: such levels are not material properties of texts or Pythagorean ideals; they are, rather, conventions.
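To make the “vector through a meta-table of all possible words” concrete, here is a minimal sketch in Python. It assumes a deliberately crude tokenizer, and the two short sample strings and the function names are hypothetical illustrations, not the procedure used in the Wisconsin experiments. Each text becomes a row of counts over a shared vocabulary, so that any two texts can be compared at the same level of abstraction (the word).

```python
from collections import Counter
import re


def tokenize(text):
    """Split a text into lowercase word tokens: one stipulated level of address."""
    return re.findall(r"[a-z']+", text.lower())


def word_vectors(texts):
    """Return (vocabulary, vectors) for a dict mapping title -> text.

    The vocabulary stands in for the table of all words in the collection;
    each vector records how often a text visits each column of that table.
    """
    counts = {title: Counter(tokenize(body)) for title, body in texts.items()}
    vocabulary = sorted(set().union(*counts.values()))
    vectors = {
        title: [count[word] for word in vocabulary]
        for title, count in counts.items()
    }
    return vocabulary, vectors


# Two tiny sample "texts" (hypothetical stand-ins for whole items).
texts = {
    "a": "To be, or not to be",
    "b": "Not a mouse stirring",
}
vocabulary, vectors = word_vectors(texts)
print(vocabulary)
print(vectors["a"])
print(vectors["b"])
```

Changing the tokenizer (to characters, phrases, or lines) changes the table and therefore the vector, which is one way of seeing that the level of address is a convention rather than a property of the text.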
Here’s the twist. We have physical manifestations of ideal objects (the ideal 1 Henry VI, for example), but these manifestations are only provisional realizations of that ideal. (I am using the word manifestation in the sense advanced in OCLC’s FRBR hierarchy.) The book or physical instance, then, is one of many levels of address. Backing out into a larger population, we might take a genre of works to be the relevant level of address. Or we could talk about individual lines of print; all the nouns in every line; every third character in every third line. All of this variation implies massive flexibility in levels of address. And more provocatively: when we create a digitized population of texts, our modes of address become more and more abstract: all concrete nouns in all the items in the collection, for example, or every item identified as a “History” by Heminges and Condell in the First Folio. Every level is a provisional unity: stable for the purposes of address, but also: stable because it is the object of address. Books are such provisional unities. So are all the proper names in the phone book.
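The flexibility of address described here can be sketched as a single query function that works across several levels at once. This is only an illustration under stated assumptions: the collection layout (the keys `title`, `genre`, and `lines`) and the function names are hypothetical, and the “lines” are placeholder strings rather than transcribed text.

```python
def every_nth(items, n):
    """Return the n-th, 2n-th, 3n-th ... elements of a sequence."""
    return items[n - 1::n]


def address(collection, genre=None, line_step=1, char_step=1):
    """Resolve one address against a collection of items.

    collection: list of dicts with the hypothetical keys
    'title', 'genre' and 'lines' (a list of strings).
    genre: keep only items carrying this genre label, if given.
    line_step / char_step: keep every n-th line and every n-th character.
    """
    result = {}
    for item in collection:
        if genre is not None and item["genre"] != genre:
            continue
        lines = every_nth(item["lines"], line_step)
        result[item["title"]] = [every_nth(line, char_step) for line in lines]
    return result


# Placeholder content only; real items would hold transcribed lines.
collection = [
    {"title": "1 Henry VI", "genre": "History",
     "lines": ["abcdef", "ghijkl", "mnopqr", "stuvwx"]},
    {"title": "Hamlet", "genre": "Tragedy",
     "lines": ["uvwxyz"]},
]

# Every third character in every third line of every "History".
print(address(collection, genre="History", line_step=3, char_step=3))
```

The same function returns whole items when called with its defaults, which is the point: the book, the genre, and the arbitrary slice are all answers to queries of the same kind.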
The ontological status of the individual text is the same as that of the population of texts: both are massively addressable, and when they are stored electronically, we are able to act on this flexibility in more immediate ways through iterative searches and comparisons. At first glance, this might seem like a Galilean insight, similar to his discipline-collapsing claim that the laws which apply to the heavens (astronomy) are identical with the ones that apply to the sublunar realm (physics). But it is not.
Physical texts were already massively addressable before they were ever digitized, and this variation in address was and is registered at the level of the page, chapter, the binding of quires, and the like. When we encounter an index or marginal note in a printed text — for example, a marginal inscription linking a given passage of a text to some other in a different text — we are seeing an act of address. Indeed, the very existence of such notes and indexes implies just this flexibility of address.
What makes a text a text – its susceptibility to varying levels of address – is a feature of book culture and the flexibility of the textual imagination. We address ourselves to this level, in this work, and think about its relation to some other. “Oh, this passage in Hamlet points to a verse in the Geneva bible,” we say. To have this thought is to dispose relevant elements in the dataset in much the same way a spreadsheet aggregates a text in ways that allow for layered access. A reader is a maker of such a momentary dispositif, and reading might be described as the continual redisposition of levels of address in this manner. We need a phenomenology of these acts, one that would allow us to link quantitative work on a culture’s “built environment” of words to the kinesthetic and imaginative dimensions of life at a given moment.
A physical text or manifestation is a provisional unity. There exists a potentially infinite array of such unities, some of which are already lost to us in history: what was a relevant level of address for a thirteenth-century monk reading a manuscript? Other provisional unities can be operationalized now, as we are doing in our experiment at Wisconsin, gathering 1,000 texts and then counting them in different ways. Grammar, as we understand it now, affords us a level of abstraction at which texts can be stabilized: we lemmatize texts algorithmically before modernizing them, and this lemmatization implies provisional unities in the form of grammatical objects of address.
One hundred years from now, the available computational objects may be related to one another in new ways. I can only imagine what these are: every fourth word in every fourth document, assuming one could stabilize something like “word length” in any real sense. (The idea of a word is itself an artifact of manuscript culture, one that could be perpetuated in print through the affordances of moveable type.) What makes such thought experiments possible is, once again, the addressability of texts as such. Like a phone book, they aggregate elements and make these elements available in multiple ways. You could even think of such an aggregation as the substance of another aggregation, for example, the phone book dress designed by Jolis Paons above. But unlike a phone book, the digitized text can be reconfigured almost instantly into various layers of arbitrarily defined abstraction (characters, words, lines, works, genres). The mode of storage or virtualization is precisely what allows the object to be addressed in multiple ways.
Textuality is massive addressability. This condition of texts is realized in various manifestations, supported by different historical practices of reading and printing. The material affordances of a given medium put constraints on such practices: the practice of “discontinuous reading,” for example, develops alongside the fingerable discrete leaves of a codex. But addressability as such – this is a condition rather than a technology, action or event. And its limits cannot be exhausted at a given moment. We cannot, in a Borgesian mood, query all of the possible datasets that will appear in the fullness of time. And we cannot import future query types into the present. But we can and do approximate such future searches when we automate our modes of address in unsupervised multivariate statistical analysis – for example, factor analysis or PCA. We want all the phone books. And we can simulate some of them now.
8 Comments
Paul Saenger might disagree with you on the word as a convention controlled by the protocols of moveable type. He would locate the origin of the convention in two inventions of Irish monks: space between words and lower case letters, which created visually distinct word shapes.
I like very much what you say about the scalable addressability of digital texts. It makes me think of the Stephanus Bible of the 1550s. Before then the Bible was citable by book and chapter, the chapter being a medieval convention. The “chapter and versification” introduced by Stephanus not only turned the Bible into a sequence of smaller citable objects. It also turned every verse into a self-contained object of sorts. One can carry on an argument about the Bible simply by exchanging citations of the type “John 3:16.”
Citation schemes are primitive and mechanical finding aids. But they are also more than that. They change the conditions of access and alter the calculus of the possible. I have from time to time wondered whether a large digital corpus of, say, Early Modern English, should get its own citation scheme as one of the conditions of interoperability. I could think of a digital equivalent to Stephanus Plato pages, which are quintiles of the physical page. Thus 81c directs your attention to the middle of page 81. I have toyed with the idea of ‘centuries’ and ‘decades’, where every digital text in an archive is divided into arbitrary chunks of 1,000 words (a magnitude on the order of a printed double page), which are then subdivided into 100 lines of ten words each.
But to underscore what I think is a major point in your argument: the different forms of “addressability” of digital texts build on a long history of citable texts.
I would defer to Paul Saenger as to the origins of words. Print, I suppose, was a technology with the proper affordances to perpetuate the Irish invention. These are wonderful comments, and raise the basic question of what the most useful addressable unit in a digitized text might be.
I have changed the text of the original post to reflect Martin’s comment (1.12.11).
Very interesting! Two notes:
1) The kind of addressability which is at work daily on the Google computers that crawl 15 billion web pages, create massive indexes of text and links, extract hundreds of other features from every page, etc. seems to me qualitatively different from what was done 500, 100, or 20 years ago. It is also qualitatively different from even the most daring digital humanities projects done today.
For more details, see http://en.wikipedia.org/wiki/Search_engine.
2) Although the discrete nature of text gives a different kind of addressability than analog media (images, sound, video), digitization and the use of computers to extract features and algorithmically manipulate these media bridge the gap significantly. So I can, for instance, make a new film by extracting and combining all frames from all 20th-century films which fit particular criteria – or even do the same with individual pixels.
Thanks for these comments, Lev. I enjoy reading your work.
I agree that the scale of addressability increases massively with digital texts. The point for me is to show that older procedures of paper book indexing, commonplacing, formalizations of mise-en-page, etc. demonstrate prior awareness of the fact that what makes a text a text is its ability to serve as an open-ended destination of address. Digitization makes a text’s inherent addressability actionable in ways that are breathtaking, to be sure. But perhaps addressability itself was part of the appeal of inscriptions–whatever the medium–in the first place.
As for making mashups of images, sounds, etc., I agree that we can now automate this procedure in a way that we once could not. The metadata required for the actual automation, however, cannot itself be algorithmically produced. I can tag a portion of available film images as “twentieth century,” but this act of designating what counts as an element for recombination is part of an older indexing tradition. The larger point I really want to make is that texts cannot specify the scale or manner in which they will be addressed: nevertheless, they are intrinsically susceptible to such variable forms. This is as true of a text printed on paper as one stored as a vector in a data table.
Mike
Thanks for this post!
I first read it when you published it and added it to the compulsory reading assignments of my textuality PhD class. Today I reread it with PhD students and your post led to many interesting discussions about academic blog posts, addressability, modelling traditional questions addressed to literary texts, levels of abstraction and their relatedness to queries concerning computational objects. This was a very good class owing to your thought-provoking paper.
So thanks for this post again.
10 Trackbacks
[…] This post was mentioned on Twitter by Zsolt Almási. Zsolt Almási said: A must read by Michael Witmore: Text: A Massively Addressable Object – http://goo.gl/xIrH8 […]
[…] this post I want to understand the consequences of “massive addressability” for “philosophies of access,” that is, philosophies which assert that anything that can […]
[…] lingers, I would like to redirect thoughts to what Michael Witmore wrote in one of his recent posts. In it, he speaks about a text as being “massively addressable at different levels of […]
[…] Witmore, Michael. “Text: A Massively Addressable Object.” Wine Dark Sea, December 31, 2010. http://winedarksea.org/?p=926. […]
[…] is Michael Witmore‘s term. It means “one can query a position within the text at a certain level of […]
[…] a blog post by Michael Witmore entitled “Text, A Massively Addressable Object,” Witmore explains that texts have always been massively addressable at different levels of […]
[…] Witmore writes, “[R]eading might be described as the continual redisposition of levels of address.” To read is […]
[…] reading and thinking about the A Massively Addressable short essay gave me some juice to keep my gears running. What was appealing behind that text was […]
[…] (1994). Jerome J. McGann, Radiant Textuality: Literature after the World Wide Web (2004). Michael Witmore, “Text: A Massively Addressable Object,” Published on Wine Dark Sea Dece… Ian Small and Marcus Walsh, The Theory and Practice of Text-Editing: Essays in Honour of James T. […]