These two visualizations spark two interesting questions: What do people read during a revolution? What is the connection between what people read and political events? Both images spike dramatically around moments of upheaval in the Western World: The English, American, and French Revolutions, the mid-19th-century Europe-wide overthrow of governments, and World War I, to name just a few. These images are all the more striking because they did not arise from a historical study of warfare or publishing, but from a more workaday task—that of categorizing all books from 1600 to 2010 according to Library of Congress subject headings. (The source of the data was Google’s catalog of books as of 2010.) The visualizations were shown in passing during a 2010 meeting between researchers at Google, where the data had been produced, and a group of humanities scholars and advocates, who were meeting with the Google team to exchange ideas. When Google’s Jon Orwant flashed this image on the screen, the professors in the assembly gasped. Genuinely gasped. We could see in this visualization of data things that had been debated for centuries, but that had never been seen: a connection between the world of print and the world of political action, a link between revolution and reading a certain kind of book.
We are experienced readers of books, book history, and—we like to think—of book diagrams. Humans invented stream charts well before the age of computing; this style of conveying information is at least 250 years old and draws on sources that are even older. (See Rosenberg and Grafton, Cartographies of Time, 2010.) However, the union of technologies—modern cataloging systems, the increasingly systematized concatenation of library catalogs worldwide, and the capacity to render data chronologically in the style of a geological diagram—produces a compact vision of Western print culture hitherto unseen. Simple in execution, the visualization prompts new thinking.
Like any metaphorical or mathematical rendering, the diagram below should be read with care: the strata are normalized so that the spikes do not necessarily indicate a greater number of books published, but rather a shift in the proportion of books composed of a given subject. A spike in one layer of the diagram can give the illusion that all strata of the diagram have increased in size, a trick of the eye that the mind needs to combat. The second visualization helps with this by zooming in and thereby singling out the area of greatest mathematical change, but it, too, needs to be viewed critically.
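The normalization caveat above can be made concrete with a little arithmetic. What follows is a minimal sketch, not the procedure used to build the Google visualization: the function name `subject_shares` and all of the counts are invented for illustration. It shows how a subject's share of a year's output can jump even when the number of books in that subject has not changed at all.

```python
def subject_shares(counts_by_year):
    """Convert raw per-subject counts into each subject's share of that year's total."""
    shares = {}
    for year, counts in counts_by_year.items():
        total = sum(counts.values())
        shares[year] = {subject: n / total for subject, n in counts.items()}
    return shares


# Invented counts, for illustration only (not drawn from the Google catalog).
counts = {
    1630: {"history": 20, "theology": 60, "poetry": 20},
    1640: {"history": 20, "theology": 15, "poetry": 5},
}

shares = subject_shares(counts)
for year in sorted(shares):
    print(year, {subject: round(p, 2) for subject, p in shares[year].items()})
```

In this made-up case the history count stays at 20 in both years, yet its share rises from one fifth to one half because the other subjects shrink. A normalized stratum can therefore spike without a single additional history book being published.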
Now that the caveats have been put to one side, we return to the original questions and then offer a reformulation.
What are people reading during a revolution? Poetry? Books on military technology? Theology? No. If we take the first spike, the years leading up to the English Revolution and the regicide of 1649, the answer seems to be “Old World History.” The second chronological peak, in the decades around the American (1776) and French (1789) Revolutions, shows the same pattern. In periods that historians would link to major political upheaval, the world of print shows similar disruptions: publishers are offering more history for readers who, perhaps, think of themselves as living through important historical changes.
We should be precise: these data don’t indicate that more people are reading history, but that a higher proportion of books published by presses can be classed by cataloguers as history. There are many follow-up questions one might ask here. Does publication tie strongly to actual reading, or are these only loosely connected? Are publishers reducing the number of books in other subject areas because of scarcity of resources or some other factor, which would again lead to the proportional spikes seen above? Are the cataloguing definitions of what counts as Old World History or history in general themselves modeled on the books published during the spike years?
One has to ask questions about the size and representativity of the dataset, the uniformity of the classifications, and the nature of the spatial plot in order to understand what is going on. And, crucially in this case, one has to have the initial insight—born of a reading knowledge of history itself—that the timing of the spikes is important. But if you’ve got that kind of knowledge in the room, you might see something you haven’t seen before.
Press to Play: 767 Pieces of Shakespeare in Scaled PCA Space
Now for something a little different. I mentioned before that we can conduct similar analyses on pieces of the plays rather than the plays as a whole. In this experiment, I have been working with 1000-word chunks of Shakespeare plays, which allows me to use many more variables in the analysis. (This was the technique that Hope and I used in our 2007 article on Tragicomedy.) Obviously the plays weren’t written to be read, much less analyzed, in identically sized pieces: the procedure is artificial through and through. It does allow us, however, to see things that Shakespeare does consistently throughout different genres, things that happen repeatedly throughout an entire play rather than just at the beginning or end. Another caveat: we partitioned the plays starting at the beginning of each text, making the first 1000 words the first “piece.” This results in a loss of some of the playtext at the end, since any remainder that is less than 1000 words is dropped. In future analyses, we will take evenly spaced 1000-word samples from beginning to end, partitioning losses in between. There are no perfect answers here when it comes to dividing the plays into working units. So this is a first installment.
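The two ways of cutting a play into pieces can be sketched in a few lines. This is a hedged illustration rather than the code used for the experiment: the function names are hypothetical, and the second function is only one possible reading of “evenly spaced samples,” in which the words that would otherwise be lost at the end are spread across the gaps between pieces.

```python
def leading_chunks(words, size=1000):
    """Cut consecutive chunks from the start; drop a final remainder shorter than `size`."""
    n_chunks = len(words) // size
    return [words[i * size:(i + 1) * size] for i in range(n_chunks)]


def evenly_spaced_chunks(words, size=1000):
    """Take the same number of full chunks, spreading the leftover words across the gaps."""
    n_chunks = len(words) // size
    if n_chunks == 0:
        return []
    if n_chunks == 1:
        return [words[:size]]
    leftover = len(words) - n_chunks * size
    gaps = n_chunks - 1
    chunks = []
    for i in range(n_chunks):
        # Chunk i starts after i full chunks plus a proportional share of the leftover words.
        start = i * size + (i * leftover) // gaps
        chunks.append(words[start:start + size])
    return chunks


# A stand-in "play" of 2500 numbered tokens, so that positions are easy to check.
play = [str(n) for n in range(2500)]
print(len(leading_chunks(play)), leading_chunks(play)[-1][-1])
print(len(evenly_spaced_chunks(play)), evenly_spaced_chunks(play)[-1][-1])
```

Both functions return the same number of pieces. The first never reaches the closing words of the text, while the second always ends on the final word and loses its words from the middle instead.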
The video above (press to play) is a three dimensional JMP plot of 767 pieces of Shakespeare in a dataspace of three scaled Principal Components (1, 4, and 9) which I have chosen based on their power to sort the plays using the Tukey Test. (See Tukey results for PCs 1 and 4.) When you run the video capture, you’ll see a series of dots that are color coded based on generic differences: red is comedy, green is history, blue is late plays and orange is tragedy. Early in the capture, I move an offscreen slider that creates a series of chromatic “halos” or ellipsoid bubbles around neighboring dots: these halos envelop dot groupings as they meet certain contiguity thresholds. You see the two major clusters I am interested in here, histories and comedies, forming in the lower left and upper right respectively. (Green on lower left, red on upper right.) Interestingly enough, the see-saw effect we saw in our analysis of entire plays is repeated here: comedies and histories are the most easily separated, because whenever Shakespeare is using strings associated with comedy, he can’t or won’t simultaneously use strings associated with history (and vice versa). Linguistic weight cannot be placed on both sides of this particular generic fulcrum at once.
Now the resulting encrusted object, which I have rotated in three dimensions, is a lot less elegant than the object we would be contemplating were we to do discriminant analysis of these groups. I am saving Discriminant Analysis for a later post. For all its imperfections, Principal Component Analysis is still going to give us some results or linguistic patterns we can make sense of, which is the ultimate measure of success here. I think it’s worth appreciating the spatial partitioning here in all of its messiness: the multicolored object presents both a pattern that we are familiar with (comedies and histories really do flock to opposite ends of the containing dataspace) and some jagged edges that show the imperfections of the analysis. Imperfections are good: we want to find exceptions to generic rules, not just confirmations of a pattern.
Looking at the upper right hand quadrant, we see the items that are high on both PC1 and PC4. In this analysis we are using Language Action Types or LATs, the finest-grained categories that Docuscope uses (it has 101 of them). We will want to ask which specific LATs are pushing items into the different areas here, and to do so, I have produced the following loadings biplot:
A loadings biplot gives information about components in spatial form, showing our different analytic categories (LATs such as “Common Authorities,” “DenyDisclaim,” “SelfDisclosure,” etc.) as red arrows or vectors. To read this diagram, consider the two components individually. What makes an item high on PC1? Since PC1 is rated on the horizontal axis, we scan left to right for the vectors or arrows that are at the extremes. To my eye, SelfDisclosure, FirstPer[son] and DirectAddress are the most strongly “loaded” on this component, which means that any piece that has a relatively high score on these variables will be favored by this component and thus pushed to the right-hand side of a scatterplot (see below). Conversely, any item that is relatively high in the words that fall under categories such as Motions, SenseProperty, SenseObject, and Inclusive will be pushed to the left. Notice that the two variables SelfDisclosure and SenseObject are almost directly opposed: the loadings biplot is telling us here that, statistically at least, the use of this one type of word (or string of words) seems to preclude the use of its opposite. This would be true of all the longer vector arrows in the diagram that extend from opposite sides of the origin.
We can then do the same thing with the vertical axis, which represents PC4. Here we see that LangRef [Language Reference], DenyDisclaim and Uncertainty strings are used in opposition to those classed under the LAT Common Authority. If an item scores high on PC4 (which most comedies do), it will be high in LangRef, Uncertainty and DenyDisclaim strings while simultaneously lacking Common Authority strings. So what about the vectors that bisect the axes, for example, DenyDisclaim, which appears to load positively on both PC1 and PC4? This LAT is shared by the two components: it does something for both. We can learn a lot by looking at this diagram, since, once we’ve decided that these components track a viable historical or critical distinction among texts, it shows us certain types of language “schooling together” in the process of making this distinction. DirectAddress and FirstPer [or, First Person], Autobio and Acknowledge thus tend to go together here (lower right), as do Motions, SenseProperties, and SenseObjects (upper left).
In fact, the designer of Docuscope saw these LATs as being related, which is why elsewhere he aggregated them together into larger “buckets” such as Dimensions or Clusters, the latter being the aggregate we used in our analysis of full plays. What we’re seeing here is a kind of “schooling of like LATs in the wild,” where words that are grouped together on theoretical grounds are associating with one another statistically in a group of texts. If the intellectual architecture of Docuscope’s categories is good, this schooling should happen with almost any biplot of components, no matter what types of texts they discriminate. The power of this combination of Principal Components, then, is that it aligns the filiations and exclusions of the underlying language architecture with genres that we recognize, and will hopefully suggest theatrical or narrative strategies that support these recognizable divisions.
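The way opposed loadings push a piece toward one side of a component can be sketched numerically. This is an illustration under stated assumptions, not output from JMP or Docuscope: the loading magnitudes and the standardized frequencies are invented, only the signs follow the biplot as described above (SelfDisclosure opposed to SenseObject), and the rescaling that produces “scaled” components is left out.

```python
def component_score(loadings, z_scores):
    """Score of one piece on one component: loading-weighted sum of standardized LAT frequencies."""
    return sum(loadings[lat] * z_scores.get(lat, 0.0) for lat in loadings)


# Hypothetical PC1 loadings: signs follow the biplot as described, magnitudes are invented.
pc1_loadings = {
    "SelfDisclosure": 0.6,
    "FirstPerson": 0.5,
    "DirectAddress": 0.4,
    "SenseObject": -0.6,
    "Motions": -0.4,
}

# Invented standardized frequencies (z-scores) for two imaginary 1000-word pieces.
first_person_piece = {"SelfDisclosure": 1.5, "FirstPerson": 1.0, "DirectAddress": 0.8,
                      "SenseObject": -1.0, "Motions": -0.5}
descriptive_piece = {"SelfDisclosure": -1.2, "FirstPerson": -0.9, "DirectAddress": -0.7,
                     "SenseObject": 1.4, "Motions": 0.9}

print(component_score(pc1_loadings, first_person_piece))
print(component_score(pc1_loadings, descriptive_piece))
```

The first imaginary piece lands on the positive side of the component and the second on the negative side. Because the two sets of loadings carry opposite signs, a piece cannot score at both extremes at once, which is the see-saw effect in arithmetic form.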
The loadings biplot shows us how the variables in our analysis are pushing items in the corpus into different regions of a dataspace. We can now populate that dataspace with the 767 pieces of Shakespeare’s plays, rating each of them on the two components. Here is how the plays appear in a plot of scaled Component 1 against Component 4, again, color coded with the scheme used above:
Notice the pattern we’ve seen before: comedies (here represented in red) are opposite histories (green) in diagonal quadrants. In general, they don’t mingle. The upper right hand quadrant, which is where the comedies tend to locate, contains the first item that I’d like to discuss: the red dot labelled Merry Wives (circa 2.1). This dot represents a piece of the first scene, second act of The Merry Wives of Windsor. As the item that rates highest on both PC1 and PC4, the components which the Tukey Test shows us to be best at discriminating comedy, this piece of The Merry Wives of Windsor is, by this measure, the most comic 1000-word passage that Shakespeare wrote. Here is an excerpt:
“I’ll entertain myself like one that I am not acquainted withal; for, sure, unless he know some strain in me, that I know not myself, he would never have boarded me in this fury.” In this color coded sentence we can see diagrammed the comic dance step. While I think there are funnier lines (“I had rather be a giantess, and lie under Mount Pelion”), the former is significant for what it does linguistically: it shows a speaker entertaining and then rejecting a perspective on her own situation (that of Falstaff) while comparing it with another (her own). The uncertainty strings (orange) such as “know not,” “doubt” and the indefinite “some” contribute to this mock searching rhetoric. Self-disclosure strings such as “myself” and “makes me” anchor the reality testing exercise to the speaker, who must make explicit her own place in the sentence as the object of doubt, while the oppositional reasoning strings such as “never” and “not” mark the mobility of this speaker’s perspective: I will try this toying perspective on my honesty, seeing myself as Jack Falstaff does, but will reject it soon enough. The reason that this passage is so highly rated on these two factors has something to do with the multiplication of perspectives that are being juggled onstage: there are two individuals here, Mistress Page and Mistress Ford, who are, as it were, rising above an embedded perspective contained in Falstaff’s letter, commenting upon that perspective, and then rejecting it. Each time a partition in reality (a level) is broached in the stage action and dialogue, comic language appears.
We can oppose this most comic piece of writing — again, according to PCA — to its opposite in linguistic terms, a piece that contains what the comic one lacks and lacks what the comic one has. Here, then, is a portion of the “most historical” piece of Shakespeare, from Richard II 1.3:
Here we see the formal settings of royal display, a herald offering Mowbray’s formal challenge: no surprise that this exemplifies history, a genre in which the nation and its kings are front and center. Yet where the passage really begins to rack up points is in its use of descriptive words, which are underlined in yellow. Chairs, helmets, blood, earth, gentle sleep, drums, quiet confines… we don’t think of history as the genre of objects and adjectives, but linguistically it is. Inclusive strings, in olive green, are perhaps less surprising given our previous analyses. We expect kings to speak about “our council” and what “we have done.” But notice that such language is quite difficult to use in comedy: even in a passage of collusion, where we would expect Mistress Page and Mistress Ford to be using first person plural pronouns, the language tends to pivot off of first person singular perspectives. The language of “we” really isn’t a part of comedy.
I am less surprised to find, at this finer grained level of analysis, words from official life (what Docuscope tags as Commonplace Authority, in bright green) associated with history, since these are context specific. More interesting is the presence of the purple words, which Docuscope tags as person properties. These are high in history, but show up in comedy as well, as you can see on the loadings biplot above. This marked up passage is also useful because it shows us something we’d want to disagree with: you don’t have to be Saul Kripke to see that a proper name like Henry is an imperfect designator of persons, particularly because other proper names such as Richard do not get counted under this category by Docuscope. We live with the imperfections, unless it appears that there are so many mentions of the name Henry in the plays that this entire LAT category must be discounted.
This passage from the Open Source Shakespeare’s Love’s Labour’s Lost shows language patterns that push the play into the area where the Histories cluster, something visible in the scatterplot discussed below. Returning to the taxonomy of Docuscope, this passage has a lot of Description strings combined with a relative lack of Interaction and First Person strings, both of which can be seen in the Docuscope screen shot below. We are looking at something slightly more complicated in this visualization of the text, however, because I have “turned on” the First Person and Interaction strings in the Docuscope Single Text Viewer. I did this because I want to show what cannot be shown: a relative lack of blue (Interaction) and red (First Person) strings combined with a relative abundance of yellow (Description) ones. To really “see” this in the wild, you would have to consult a completely color tagged text of the complete Folio Works and — while reading — keep track of the relative differences in quantities of blue, red and yellow in the different containers (the plays themselves). Only an Argus-eyed text tagger and a statistical analysis can do this. The results are heuristic in that they lead us toward certain areas of the text for continued interpretation. In this case, I have used the color coding facility in Docuscope to scan the entire play (once I knew the categories I was interested in) in order to find a passage like the one above: one that has lots of yellow and very little blue or red.
Inspection of such candidate “History” passages reveals a number of pedantic exchanges like this one between Sir Nathaniel, Holofernes and Dull. This scene is a hilarious sendup of rhetorical display and vacuous learning, and it burlesques the famous Renaissance idea of verbal variety or copia that was recommended by Erasmus. It makes sense that this kind of passage would withdraw the play from the type of comic verbal interaction analyzed in the previous post. Because these characters are not speaking with one another, but rather are addressing an invisible audience of discerning rhetorical literates, there is not much interaction in the form of second person pronouns or corresponding first person singular pronouns, the very strings one would tend to find in Comic exchanges about acts or actions taken by characters themselves.
We expect this kind of thing from the pedants, but the analysis reveals a continuity of this History-like pattern among the French nobles who have vowed to live a life of Platonic study, characters like Biron who can never resist plumping their own rhetorical plumage. Don Armado, another parodic figure with an almost Quixotic appreciation for his own courtly expression, is also linguistically self-indulgent, and his passages would look similar to the one I have excerpted here and shown color coded below (although with Don Armado, there are marginally more interactions with his Page). The point of this analysis is to show that there are reasons why Docuscope would place Love’s Labour’s Lost with the Histories, and these reasons make sense to us once we begin to think about how the play is put together. This is a world of narcissists, something the Princess and her ladies point out when they defer the proposed courtship that is offered at the end of the play. That narcissism shows itself as a tendency to monologue, which cuts out the interaction that is characteristic of comedy and highlights instead a description-rich kind of oration that pushes these plays into the realm of History.
Would it be fair then to call Love’s Labour’s Lost a History? It depends. I would be comfortable saying that on the level of plot it has the elements of a Comedy, but on the level of its language, it is a History.
What, then, do we make of the historical decision made by Heminges and Condell to call this play a Comedy? Unsupervised statistical analysis has shown us (1) a pattern of groupings among the plays that roughly approximates at least two of H&C’s generic groupings of 1623 but also (2) exceptions to those classifications that make a certain amount of critical sense when we look at the construction of those plays. I would argue that we need an ontology here to sort out what elements of the analysis are fundamental as opposed to derivative. We could have an ontology of levels, for example, which says that “on the linguistic level, the play belongs with one group,” but “on the level of plot, the play belongs with another group.”
But eventually we would have to decide how the levels go together. That is also part of the point of this kind of work, since the overlap-with-divergence of linguistic and historical groupings of the plays introduces the possibility that there are levels of coherence here whose interaction needs to be explained. The language of levels needs a complement in a theory of objects: what are the things that are being compared here? Aren’t the tagged texts themselves a kind of hypothetical or abstracted version of the text itself? And what is the relationship between this hypothetical object and those that are arrayed into a generic group by, say, the historical editors of the First Folio? I will try, in future posts, to show why these are not trivial metaphysical questions.
By way of preview, however, I think the most fundamental “level” here is the one on which individuals or groups make decisions and act. So I would say that Heminges and Condell’s decision about how to order the plays in the First Folio is the most real thing in the analysis, while the statistical objects (tagged texts, Principal Components, regions of a scatterplot) are derivative. How else could we be “surprised” to find LLL clustering with the Histories, unless we were already enticed by the idea (as I was) that the initial clusterings themselves coincided with the classes stipulated by Shakespeare’s editors? More interesting: what is the abstract recipe of family resemblances or species traits that human beings like Heminges and Condell are carrying around in their heads? Their decision to sort the plays a certain way is real. It is a historical fact. But the “sensibility” or “weightings” that led them to take this empirical action must itself be hypothesized or modeled. We might be able to reconstruct this model, but even H&C may not have had direct access to it. This detour might change the way we think about the status of our statistical model, since that model may be only an approximation of something far more comprehensive (the capacity for literary judgment in historical actors) whose dynamic, differential powers of comparison are suggestively approximated by things like “principal components.”
The latitude in linguistic practice that makes Love’s Labour’s Lost look like a History is evidently something that Heminges and Condell did not notice, and I’m not sure why they should have. But once we have noticed it, this latitude in terms of linguistic practice may make sense to us. Why couldn’t there be a filiation of Love’s Labour’s Lost with Histories on the level of stance and language that does not “show up” on the level of plot? Surely this filiation is real too. The question is, where and on what level?