Tag: genre

  • Now Read This: A Thought Experiment

Let’s say that we believe we can learn something more about what literary critics call “authorial style” or “genre” by quantitative work. We want to say what that “more” is. We assemble a community of experts, convening a panel of early modernists to identify 10 plays that they feel are comedies based on prevailing definitions (they end in marriage), and 10 they feel are tragedies (a high-born hero falls hard). To test these classifications, we randomly ask others in the profession (who were not on the panel) to sort these 20 plays into comedies and tragedies and see how far they diverge from the classifications of our initial panel. That subsequent sorting matches the first one, so we start to treat these labels (comedy/tragedy) as “ground truths” generated by “domain experts.” Now assume that I take a computer program – it doesn’t matter what that program is – and ask it to count things in these plays and come up with a “recipe” for each genre as identified by our experts. The computer is able to do so, and the recipes make sense to us. (Trivially: comedies are filled with words about love, for example, while tragedies use more words that indicate pain or suffering.) A further twist: because we have an unlimited, thought-experiment budget, we decide to put dozens of early modernists into MRI machines and measure the activity in their brains while they are reading any of these 20 plays. After studying the brain activity of these machine-bound early modernists, we realize that there is a distinctive pattern of brain activity that corresponds with what our domain experts have called “comedies” and “tragedies.” When someone reads a comedy, regions A, B, and C become active, whereas when a person reads tragedies, regions C, D, E, and F become active. These patterns are reliably different and track exactly the generic differences between plays that our subjects are reading in the MRI machine.

So now we have three different ways of identifying – or rather, describing – our genre. The first is by expert report: I ask someone to read a play and she says, “This is a comedy.” If asked why, she can give a range of answers, perhaps connected to plot, perhaps to her feelings while reading the play, or even to a memory: “I learned to call this and other plays like it ‘comedies’ in graduate school.” The second is a description, not necessarily competing, in terms of linguistic patterns: “This play and others like it use the conjunctions ‘if’ and ‘but’ comparatively more frequently than others in the pool, while using ‘and’ less frequently.” The last description is biological: “This play and others like it produce brain activity in the following regions and not in others.” In our perfect thought experiment, we now have three ways of “getting at genre.” They seem to be parallel descriptions, and if they are functionally equivalent, any one of them might just be treated as a “picture” of the other two. What is a brain scan of an early modernist reading comedy? It is a picture of the speech act: “The play I’m reading right now is a comedy.”

Now the question. The first three acts of a heretofore unknown early modern play are discovered in a Folger manuscript, and we want to say what kind of play it is. We have our choice of:

    • asking an early modernist to read it and make his or her declaration

    • running a computer program over it and rating it on our comedy/tragedy classifiers

    • having an early modernist read it in an MRI machine and characterizing the play on the basis of brain activity.

    Let’s say, for the sake of argument, that you can only pick one of these approaches. Which one would you pick, and why? If this is a good thought experiment, the “why” part should be challenging.

  • Fuzzy Structuralism

    Several years ago I did some experiments with Franco Moretti, Matt Jockers, Sarah Allison and Ryan Heuser on a set of Victorian novels, experiments that developed into the first pamphlet issued by the Stanford Literary Lab. Having never tried Docuscope on anything but Shakespeare, I was curious to see how the program would perform on other texts. Looking back on that work, which began with a comparison of tagging techniques using Shakespeare’s plays, I think the group’s most important finding was that different tagging schemes can produce convergent results. By counting different things in the texts – strings that Docuscope tags and, alternatively, words that occur with high frequency (most frequent words) – we were able to arrive at similar groupings of texts using different methods. The fact that literary genres could be rendered according to multiple tagging schemes sparked the idea that genre was not a random projection of whatever we had decided to count. What we began to think as we compared methods, and it is as exciting a thought now as it was then, was that genre was something real.

Real as an iceberg, perhaps, genre may have underwater contours that are invisible but mappable with complementary techniques. Without delving too deeply into the specifics of the pamphlet, I’d like to sketch its findings and then discuss them in some of the terms I outlined in the previous post on critical gestures. First the preliminaries. In the initial experiment, we established a corpus (the Globe Shakespeare) and then used two tagging schemes to assign the tokens in those documents to a smaller number of types. (This is the crucial step of reducing the dimensionality of the documents, or “caricaturing” them.) The first tagging scheme, Docuscope, rendered the plays as percentage scores on the types it counts; the second, implemented by Jockers, identified the most frequent words (MFWs) in the corpus and likewise used these as the types or variables for analysis.

What we found was that the circles drawn by critics around these texts – circles here bounding different genres – could be reproduced by multiple means. Docuscope’s hand-curated tagging scheme did a fairly good job of reproducing the genre groupings via an unsupervised clustering algorithm, but so did the MFWs. We were excited by these results, but also cautious. Perhaps the words counted by Docuscope included the very MFWs that were producing such good results in the parallel trial, which would mean we were working with one tokenization scheme rather than two. Subsequent experiments on Victorian novels curated by the Stanford team – for example, a comparison of the Gothic novel versus the Jacobin (see pp. 20-23) – showed that Docuscope was adding something over and above what was offered by counting MFWs. MFWs such as “was,” “had,” “who,” and “she,” for example, were quite good at pulling these two groups apart when used as variables in an unsupervised analysis. But these high-frequency words, even when they composed some of the Docuscope types that were helpful in sorting the genres, were correlated with other text strings that were more narrative in character, phrases such as “heard the,” “reached the,” and “commanded the.” So while we had some overlap in the two tagging schemes, what they shared did not explain the complementary sorting power each seemed to bring to the analysis. The rhetorical and semantic layers picked out by Docuscope were, so to speak, doing something alongside the more syntactically important function words that occur in texts with such high frequency.
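The convergence described above can be sketched in miniature. The snippet below is a toy illustration, not the Literary Lab’s actual pipeline: the four texts, the MFW cutoff, and the little category dictionary (standing in for Docuscope’s curated types) are all invented for the example. The point is only that two different tokenization schemes can independently arrive at the same grouping of texts.

```python
# Toy illustration of convergent tagging schemes: Scheme 1 counts
# most-frequent words (MFWs); Scheme 2 uses a tiny hand-made category
# dictionary standing in for a curated scheme like Docuscope.
from collections import Counter
import math

texts = {
    "gothic1": "she heard the door and reached the dark hall in fear",
    "gothic2": "she reached the tower and heard the wind in the night",
    "jacobin1": "reason and justice demand that society reform its laws",
    "jacobin2": "the rights of man demand reason justice and reform",
}

def mfw_features(text, vocab):
    """Scheme 1: relative frequencies of the corpus's most frequent words."""
    toks = text.split()
    counts = Counter(toks)
    return [counts[w] / len(toks) for w in vocab]

CATEGORIES = {  # Scheme 2: hypothetical semantic 'types'
    "narrative": {"heard", "reached", "door", "tower"},
    "space": {"in", "hall", "dark", "night"},
    "abstraction": {"reason", "justice", "reform", "rights", "society"},
}

def category_features(text):
    toks = text.split()
    return [sum(t in words for t in toks) / len(toks)
            for words in CATEGORIES.values()]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(b * b for b in v)) or 1.0
    return dot / (nu * nv)

def nearest_neighbor(feats):
    """For each text, find its closest companion under a feature scheme."""
    return {a: max((b for b in feats if b != a),
                   key=lambda b: cosine(feats[a], feats[b]))
            for a in feats}

all_tokens = Counter(t for txt in texts.values() for t in txt.split())
vocab = [w for w, _ in all_tokens.most_common(8)]  # the top MFWs

scheme1 = {name: mfw_features(t, vocab) for name, t in texts.items()}
scheme2 = {name: category_features(t) for name, t in texts.items()}

# Both schemes pair the Gothic texts together and the Jacobin texts together,
# even though they count entirely different things.
print(nearest_neighbor(scheme1))
print(nearest_neighbor(scheme2))
```

Counting different things, in other words, can still sound the same underlying contours: the MFW scheme and the category scheme draw the same circles around these texts.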

The nature of that parallelism or convergence continues to be an interesting subject for thought as we discover more tagging schemes and contemplate making our own. Discussions in the NEH-sponsored Early Modern Digital Agendas workshop at the Folger, some of which I have been lucky enough to attend, have pushed Hope and me to return to the issue of convergence and think about it again, especially as we think about how our research project, Visualizing English Print, 1470-1800, might implement new tagging schemes. If MFWs produce viable syntactical criteria for sorting texts, why would this “layer” of syntax be reliably coordinated with another, Docuscope-visible layer that is more obviously semantic or rhetorical? If different tagging schemes can produce convergent results, is it because they are invoking two perspectives on a single entity?

    Because one doesn’t get completely different groupings of texts each time one counts new things, we must posit the existence of something underneath all the variation, something that can be differently “sounded” by counting different things. The main attribute of this entity is its capacity to encourage or limit certain sorts of linguistic entailments. As I think back on how the argument developed in the Stanford paper with Moretti et al., the crucial moment came when we found that we could describe the Gothic novel as having both more spatial prepositions (“from,” “on,” “in,” “to”) and more narrative verb phrases (“heard the,” “reached the”) than the Jacobin novel. Our next move was to begin asking whether either of the tagging schemes was picking out a more foundational or structural layer of the text – whether, for example, the decision to use a certain type of narrative convention and, so, narrative phrase, entailed the use of corresponding spatial prepositions. As soon as the word “structural” appeared, I think everyone’s heart began to beat a little faster. But why? What is so special about the word “structural,” and what does it mean?

    In the context of this experiment, I think “structural” means “is the source of the entailment;” its use, moreover, suggests that the entailment has direction. We (the authors of the Stanford paper) were claiming that, in deciding to honor the plot conventions of a particular generic type, the writer of a Gothic novel had already committed him or herself to using certain types of very frequent words that critics tend to ignore. The structure or plot was obligating, perhaps in an unconscious way.

I think now that I would pause before using the word “structure,” a word used liberally in that paper, not because I don’t think there is such a thing, but because I don’t know if it is one or many things. Jonathan Hope and I have been looking for a term to describe the entailments that are the focus of our digital work. We have chosen to adopt, in this context, a deliberately “fuzzy structuralism” when talking about entailments among features in texts. We would prefer to say, that is, that the presence of one type of token (spatial preposition) seems to entail the presence of another type (narrative verb phrases), and remain agnostic about the direction of the entailment. Statistical analysis provides evidence of that relationship, and it is the first order of iterative criticism to describe such entailments, both exhaustively (by laying bare the corpus, counts, and classifying techniques) and descriptively (by identifying, through statistical means, passages that exemplify the variables that classify the texts most powerfully). Just as important, we feel one ought where possible to assign a shorthand name – “Gothicness,” “Shakespearean” – to the features that help sort certain kinds of texts. In doing so, we begin to build a bridge connecting our linguistic description to certain already known genre conventions that critics recognize or “circle” in their own thinking. But the application of the term “Gothic,” and the further claim that this names the cause of the entailments we discern by multiple means, deserves careful scrutiny.

    A series of questions about this entailment entity, then, which sits just under the waterline of our immediate reading:

• How does entailment work? This is a very important question, since it gets at the problem of layers and depth. At one point in the work with the Stanford team, Ryan Heuser offered the powerful analogy alluded to above: genre is like an iceberg, with features visible above the water but depths unseen below. Plot, we all agreed, is an above-the-waterline phenomenon, whereas MFW use and certain semantic choices are submerged below the threshold of conscious attention. In the article we say that the below-the-waterline phenomena sounded by our tagging schemes are entailed by the “higher order” choices made when the writer decided to write a “Gothic novel” or “history play.” I still like this idea, but worry it might suggest that all features of genre are the result of some governing, genre-conscious choice. What if some writers, in learning to mimic other writers, take sentence-level cues and work “upward” from there? Couldn’t there be some kind of semi-conscious or sentence-based absorption of literary conventions that is specifically not a mimicry of plot?

    • Are the entailments pyramidal, with a governing apex at the top, or are they multi-nodal and so radiating from different points within the entity? I can see how syntax, which is mediated by function or high-frequency words, is closely tied to certain higher order choices. If I want to write stories about lovers who don’t get along, this will entail using a lot of singular pronouns in the first and second person alongside words that support mutual misunderstanding. There is a relationship of entailment between these two things, and the source of that entailment is often called “plot” or “genre.” Here again we are at an interpretive turning point, since the names applied to types of texts are as fluid, at least potentially, as those assigned to types of words. Such names can be misleading. Suppose, for example, that I have identified the distinct signature of something like a “Shakespearean sentence,” and that this signature is apparent in all of Shakespeare’s plays. (An author-specific linguistic feature set was created for J. K. Rowling just last week.) Suppose further that, as Shakespeare is almost singlehandedly launching the history play as a theatrical genre in the 1590s, this authorial feature propagates alongside the plot-level features he establishes for the genre. Now someone shows that this Shakespearean sentence signature is reliably present in most plays that critics now call histories. Is that entailment upheld by the force of genre or authorship? The question would be just as hard to answer if we noticed that the generic signal of history plays spans the rest of Shakespeare’s writing and is a useful feature for differentiating his works from those of other authors.

• If entailments can be resolved at varying depths of field, like the two cats below, which are simultaneously resolved by the Lytro Camera at multiple focal lengths, how can we be sure that they are individual pieces of a single entity or scene? Different tagging schemes support the same groupings of texts, so there must be something specific “there” to be tagged which has definite contours. I remain astonished that the groupings derived from tagging schemes like Docuscope and MFWs correspond to names we use in literary criticism, names that designate authors and genres of fiction. But entailments are plural: some seem to correspond to what we call authorship, others to genre, and perhaps still others to the medium itself (the small twelvemo, for example, often contains different kinds of words than those found in the larger folio format). There are even biological constraints, such as limits on how long we can attend to a single sentence. The nature and source of these entailments has thus got to be the subject of ongoing study, one that bridges a range of fields as wide as there are forces that constrain language use.

    Entailment is real; it suggests an entity. But how should we describe that entity, and with what terms or analogies can its depths be resolved? Sometimes there may be multiple cats, sitting apart in the same room. Sometimes what seems like two icebergs may in fact be one.

    Image from the Lytro Camera resolving objects at multiple depths


  • The Time Problem: Rigid Classifiers, Classifier Postmarks


    Here is a thought experiment. Make the following assumptions about a historically diverse collection of texts:

    1) I have classified them according to genre myself, and trust these classifications.

    2) I have classified the items according to time of composition, and I trust these classifications.

    So, my items are both historically and generically diverse, and I want to understand this diversity in a new way.

The metadata I have now allows me to partition the set. The partition, by decade, item count, and genre class (A, B, C), looks like this:

    Decade 1, 100 items: A, 25; B, 50; C, 25

    Decade 2, 100 items: A, 30; B, 40; C, 30

    Decade 3, 100 items: A, 30; B, 30; C, 40

    Decade 4, 100 items: A, 40; B, 40; C, 20

Each decade is labeled (D1, D2, D3, D4) and each contains 100 items. These items are classed by genre (A, B, C), and the proportions of items belonging to each genre change from one decade to the next. What could we do with this collection partitioned in this way, particularly with respect to changes in time?

I am interested in genre A, so I focus on that: how does A’ness change over time? Or how does what “counts as A” change over time? I derive a classifier (K) for A in the first decade and use it as a distance metric to arrange all items in this decade with respect to A’ness. So my new description allows me to supply the following information about every item: Item 1 participates in A to this degree, and A’ness means “not being B or C in D1.” Let’s call this classifier D1Ka. I can now derive the set of all classifiers with respect to these metadata: D1Ka, D1Kb, D1Kc, D2Ka, D2Kb, etc. And let’s say I derive a classifier for A using the whole dataset. So we add DKa, DKb, DKc. What are these things I have produced and how can they be used to answer interesting questions?

    I live in D1, and am confident I know what belongs to A having seen lots of examples. But I get access to a time travel machine and someone sends me a text written much later in time. It is a visitor from D4, and by my own lights, it looks like another example of A. So, I have projected D1Ka onto an item from D4 and made a judgment. Now we lift the curtain and find that for a person living in D4, the item is not an A but a B. Is my classifier wrong? Is this type of projection illegitimate? I don’t think so. We have learned that classifiers themselves have postmarks, and these postmarks are specific to the population in which they are derived. D1Ka is an *artifact* of the initial partitioning of my data: if there were different proportions of A, B, and C within D1, or different items in each of these categories, the classifier would change.
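The “postmark” idea can be sketched in code. The example below is a minimal, hypothetical implementation: the classifier K is rendered as a simple nearest-centroid rule, and the two-dimensional features (say, the rates of two word types) and the decade populations are invented. The point is only that the same item receives different labels from classifiers derived in different partitions of the population.

```python
# Sketch of classifiers with "postmarks": a classifier derived from one
# decade's partition (D1Ka) and one derived from another (D4Ka) disagree
# about the same item, because each is an artifact of its own population.
import math

def centroid(vectors):
    """Mean vector of a list of feature vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def derive_classifier(labeled_items):
    """A nearest-centroid 'K': maps a feature vector to the closest genre."""
    cents = {genre: centroid(vecs) for genre, vecs in labeled_items.items()}
    def classify(vec):
        return min(cents, key=lambda g: math.dist(vec, cents[g]))
    return classify

# Hypothetical two-dimensional features for items in decades D1 and D4,
# labeled with genres A and B. What "counts as A" has drifted by D4.
decade1 = {"A": [(0.50, 0.10), (0.55, 0.12)], "B": [(0.20, 0.40), (0.25, 0.38)]}
decade4 = {"A": [(0.90, 0.05), (0.85, 0.08)], "B": [(0.50, 0.20), (0.48, 0.22)]}

D1K = derive_classifier(decade1)  # the classifier postmarked D1
D4K = derive_classifier(decade4)  # the classifier postmarked D4

time_traveler = (0.50, 0.15)      # one item, sent across the decades
print(D1K(time_traveler))         # by D1's lights, an A
print(D4K(time_traveler))         # by D4's lights, the same item is a B
```

Neither classifier is “wrong”: each generalizes rigidly from the variety of its population of origin, which is exactly what the thought experiment above describes.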

Experiment two. I live in D4 and I go to a used bookstore, where I find a beautifully preserved copy of an item produced in D1. The title page of this book says, “The Merchant of Venice, a Comedy.” Nonsense, I say. There’s nothing funny about this repellent little play. So an item classed as an A in D1 fails to register as an A for someone in D4. Why? Because the classifier D4Ka is rigidly determined by the variety of the later population, and this variety is different from that found in D1. When classifiers are themselves rigidly aligned with their population of origin, they generalize in funny ways.

Wait, you say. I have another classifier, namely Ka produced over the entire population, which represents all of the time variation in the dataset of 400 items. Perhaps this is useful for describing how A’ness changes over time? Could I compare D1Ka, D2Ka, D3Ka and D4Ka to one another using DKa as my reference? Perhaps, but you have raised a new question: who, if anyone, ever occupies this long interval of time? What kind of abstraction or artifact is DKa, considering that most people really think 10 years ahead or behind when they classify a book? If we are dealing with 27 decades (as we do in the case of our latest big experiment), we have effectively created a classifier for a time interval that no one could ever occupy. Perhaps there is a very well-read person who has read something from each decade and so has an approximation of this longer perspective: that is the advantage of the durability of print, the capacity of memory, and perhaps the viability of reprinting, which in effect imports some of the variation from an earlier decade into a newer one. When we are working with DKa, everything is effectively written at the same time. Can we use this strange assumption — everything is written at once — to explore the real situation, which is that everything is written at a different time?

Another interesting feature of the analysis. This same type of “all written at the same time” reasoning is occurring in our single-decade blocks, since when we create the metadata that allows us to treat a subpopulation of texts as belonging to *a* decade, we once again say they were written simultaneously. We use obvious untruths to get at underlying truths, like an astronomer using the inertial assumption to calculate forces, even though we’ve never seen a body travel in a straight line forever.

    If classifiers are artifacts of an arbitrarily scalable partitioning of the population, and if these partitions can be compared, what is the ideal form of “classifier time travel” to use when thinking about how actual writing is influenced by other writing, and how a writer’s memory of texts produced in the past can be projected forward into new spaces? Is there anything to be learned about genre A by comparing the classifiers that can be produced to describe it over time? If so, whose perspective are we approximating, and what does that implied perspective say about our underlying model of authorship and literary history?

    If classifiers have postmarks, when are they useful in generalizing over — or beyond — a lifetime’s worth of reading?


  • The Ancestral Text

    Rosamond Purcell, "The Book, the Land"

In this post I want to understand the consequences of “massive addressability” for “philosophies of access”–philosophies which assert that all beings exist only as correlates of our own consciousness. The term “philosophy of access” is used by members of the Speculative Realist school: it seems to have been coined largely as a means of rejecting everything the term names. Members of this school dismiss the idea that any speculative analysis of the nature of beings can be replaced by an apparently more basic inquiry into how we access the world, an access obtained either through language or consciousness. The major turn to “access” occurs with Kant, but the move is continued in an explicitly linguistic register by Heidegger, Wittgenstein, Derrida, and a range of post-structuralists.

    One reason for jettisoning the priority of access, according to Ray Brassier, is that it violates “the basic materialist requirement that being, though perfectly intelligible, remain irreducible to thought.” As will become clear below, I am sympathetic to this materialist requirement, and more broadly to the Speculative Realist project of dethroning language as our one and only mode of access to the world. (There are plenty of ways of appreciating the power and complexity of language without making it the wellspring of Being, as some interpreters of Heidegger have insisted.) Our quantitative work with texts adds an unexpected twist to these debates: as objects of massive and variable address, we grasp things about texts in precisely the ways usually reserved for non-linguistic entities. When taken as objects of quantitative description, texts possess qualities that–at some point in the future–could be said to have existed in the present, regardless of our knowledge of them. There is thus a temporal asymmetry surrounding quantitative statements about texts: if one accepts the initial choices about what gets counted, such statements can be “true” now even if they can only be produced and recognized later. Does this asymmetry, then, mean that language itself, “though perfectly intelligible, remain[s] irreducible to thought?” Do iterative methods allow us to satisfy Brassier’s materialist requirement in the realm of language itself?

Let us begin with the question of addressability and access. The research described on this blog involves the creation of digitized corpora of texts and the mathematical description of elements within that corpus. These descriptions obtain at varying degrees of abstraction (nouns describing sensible objects, past forms of verbs with an auxiliary, etc.). If we say that we know something quantitatively about a given corpus, then, we are saying that we know it on the basis of a set of relations among elements that we have provisionally decided to treat as countable unities. Our work is willfully abstract in the sense that, at crucial moments of the analysis, we foreground relations as such, relations that will then be reunited with experience. When I say that objects of the following kind – “Shakespearean texts identified as comedies in the First Folio” – contain more of this type of thing–first and second person singular pronouns–than objects of a different kind (Shakespeare’s tragedies, histories), I am making a claim about a relation between groups and what they contain. These groupings and the types of things that we use to sort them are provisional unities: the circle we draw around a subset of texts in a population could be drawn another way if we had chosen to count other things. And so, we must recognize several reasons why claims about these relations might always be revised.
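The kind of claim at issue here – that one provisional grouping contains more of a given countable unity than another – can be sketched as follows. The pronoun list and the text snippets below are invented stand-ins, not actual Folio text; they only illustrate what it means to treat a feature as a countable unity and compare groups.

```python
# Sketch: counting one countable unity (first- and second-person singular
# pronouns) within provisional groupings of texts, then comparing groups.
import re

# A hypothetical list of early modern 1st/2nd person singular pronouns.
PRONOUNS = {"i", "me", "my", "mine", "thou", "thee", "thy", "thine"}

def pronoun_rate(text):
    """Relative frequency of the chosen pronouns in a text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in PRONOUNS for t in tokens) / len(tokens)

# Invented snippets standing in for texts in two provisional groupings.
groups = {
    "comedy": ["if thou lovest me tell me thy mind for I am thine",
               "I pray thee speak to me as to thy friend"],
    "tragedy": ["the state has fallen and the crown lies in the dust",
                "a kingdom mourns the death of its great general"],
}

# The 'relation between groups and what they contain': mean rate per group.
means = {g: sum(map(pronoun_rate, texts)) / len(texts)
         for g, texts in groups.items()}
print(means)
```

Redraw the circle (regroup the texts) or change the dictionary (count other things), and the relation changes with it: the claim is always relative to these provisional decisions.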

    Every decision about what to count offers a caricature of the corpus and the modes of access this corpus allows. A caricature is essentially a narrowing of address: it allows us to make contact with an object in some of the ways Graham Harman has described in his work on vicarious causation. One can argue, for example, that the unity “Shakespeare’s Folio comedies” is really a subset of a larger grouping, or that the group can itself be subdivided into smaller groups. Similarly, one might say that the individual plays in a given group aren’t really discrete entities and so cannot be accurately counted in or out of that group. There are certain words that Hamlet may or may not contain, for example, because print variants and multiple sources have made Hamlet a leaky unity. (Accommodating such leaky unities is one of the major challenges of digital text curation.) Finally, I could argue that addressing these texts on the level of grammar–counting first and second person singular pronouns–is just one of many modes of address. Perhaps we will discover that these pronouns are fundamentally linked to semantic patterns that we haven’t yet decided to study, but should. All of these alternatives demonstrate the provisional nature of any decision to count and categorize things: such decisions are interpretive, which is why iterative criticism is not going to put humanities professors out of business. But such counting decisions are not–and this point is crucial–simply another metaphoric reduction of the world. PCA, cluster analysis and the other techniques we use are clearly inhuman in the number of comparisons they are able to make. The detour through mathematics is a detour away from consciousness, even if that detour produces findings that ultimately converge with consciousness (i.e., groupings produced by human reading).

Once the counting decisions are made, our claims to know something in a statistical sense about texts boil down to a claim that a particular set of relations pertains among entities in the corpus. Indeed, considered mathematically, the things we call texts, genres, or styles simply are such sets of relations–the mathematical reduction being one of many possible caricatures. But counting is a very interesting caricature: it yields what is there now–a real set of relations–but is nevertheless impossible to contemplate at present. Once claims about texts become mathematical descriptions of relations, such statements possess what the philosopher Quentin Meillassoux calls ancestrality, a quality he associates primarily with statements about the natural world. Criticizing the ascendance of what he calls the Kantian dogma of correlationism—the assumption that everything which can be said “to be” exists only as correlate of consciousness—Meillassoux argues that the idealist or critical turn in Continental philosophy has impoverished our ability to think about anything that exceeds the correlation between mind and world. This “Great Outdoors,” he goes on to suggest, is a preserve that an explicitly speculative philosophy must now rediscover, one which Meillassoux believes becomes available to us through mathematics. So, for example, Meillassoux would agree with the statement, “the earth existed 4.5 billion years ago,” precisely because it can be formulated mathematically using measured decay rates of radioactive isotopes. The statement itself may be ideal, but the reality it points to is not. What places The Great Outdoors out of doors, then, is its indifference to our existence or presence as an observer. Indeed, for Meillassoux, it is only those things which are “mathematically conceivable” that exceed the post-Kantian idealist correlation. For Meillassoux,

    all those aspects of the object that can be formulated in mathematical terms can be meaningfully conceived as properties of the object in itself.

    Clearly such a statement is a goad for those who place mind or natural language at the center of philosophy. But the statement is also a philosophical rallying cry: be curious about objects or entities that do not reference human correlates! I find this maxim appealing in the wake of the “language is everything” strain of contemporary theory, which is itself a caricature of the work of Wittgenstein, Derrida and others. Such exaggerations have been damaging to those of us working in the humanities, not least because they suggest that our colleagues in the sciences do nothing but work with words. By making language everything–and, not accidentally, making literary studies the gatekeeper of all disciplines–this line of thought amounts to a new kind of species narcissism. Meillassoux and others are finding ways to not talk about language all the time, which seems like a good thing to me.

But would Meillassoux, Harman and other Speculative Realists consider texts to be part of The Great Outdoors? Wouldn’t they have to? After all, statements about groupings in the corpus can be true now even when there is no human being to recognize that truth as a correlate of thought. Precisely because texts are susceptible to address and analysis on a potentially infinite variety of levels, we can be confident that a future scholar will find a way of counting things that turns up a new-but-as-yet-unrecognized grouping. Human reading turned up such a thing when scholars in the late nineteenth century “discovered” the genre of Shakespeare’s Late Romances. (Hope and I have, moreover, re-described these groupings statistically.) As our future mathematical sleuth might do a century from now, nineteenth-century scholars were arguing that Romance was already a real feature of the Shakespearean corpus, albeit one that no one had yet recognized. They had, in effect, picked out a new object by emphasizing a new set of relations among elements in a collection of words. Couldn’t we expect another genre to emerge from this sort of analysis–a Genre X, let’s say–given sufficient time and resources? Would we accept such a genre if derived through iterative means?

I can imagine a day, 100 years from now, when we have different dictionaries that address the text on levels we have not thought to explore at present. What if someone creates a dictionary that allows me to use differences in a word’s linguistic origin (Latinate, Anglo-Saxon, etc.) to relate the contents of one text to another? What if a statistical procedure is developed that allows us to “see” groupings we could recognize today but simply have not developed the mathematics to expose? When you pair the condition of massive addressability with (a) the possibility of new tokenizations (new elements or strata of address) or (b) the possibility that all token counts past and future can be subjected to new mathematical procedures, you arrive at a situation in which something that is arguably true now about a collection of texts can only be known in the future.

    And if something can be true about an object now without itself being a correlate of human consciousness, isn’t that something part of the natural world, the one that is supposed to be excluded from the charmed circle of the correlation? Does this make texts more like objects in nature, or objects in nature more like texts? Either way, The Great Outdoors has become larger.

  • Shakespeare Quarterly 61.3 Figures

    The color figures below correspond to those published in black and white in the print edition of Jonathan Hope and Michael Witmore, “The Hundredth Psalm to the Tune of ‘Green Sleeves’: Digital Approaches to Shakespeare’s Language of Genre,” Shakespeare Quarterly 61.3 (fall 2010), Special Issue: New Media Approaches to Shakespeare, edited by Katherine Rowe. These figures can also be found at the University of Wisconsin-Madison Memorial Library server.

    Figure 1 (above): A total of 776 pieces of Shakespeare’s plays from the First Folio, each piece consisting of 1,000 words, rated on two scaled principal components (PCs 1 and 4). The cumulative proportion of variation accounted for by the first four principal components is 12.33 percent, with component 1 accounting for 3.83 percent and component 4 accounting for 2.35 percent.

    Figure 2 (above): Loadings biplot for scaled PCs 1 and 4 used to create the scatterplot in Figure 1.
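    The kind of pipeline behind Figures 1 and 2 (count feature tokens in fixed-size pieces, scale each feature, project onto principal components) can be sketched as follows. The count matrix here is invented for illustration; the published analysis used 1,000-word pieces of the Folio plays profiled with Docuscope’s categories, not these numbers.

    ```python
    import numpy as np

    # Hypothetical counts of rhetorical-feature tokens: rows are
    # 1,000-word segments, columns are feature categories (stand-ins
    # for Docuscope categories such as "FirstPerson").
    X = np.array([
        [30.0, 12.0,  4.0],
        [28.0, 14.0,  5.0],
        [ 6.0,  3.0, 20.0],
        [ 5.0,  2.0, 22.0],
    ])

    # Scale each feature to zero mean and unit variance, then obtain
    # the principal components from the SVD of the scaled matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)

    scores = Z @ Vt.T                # coordinates of each segment on each PC
    explained = s**2 / (s**2).sum()  # proportion of variance per component
    ```

    A scatterplot like Figure 1 plots two columns of `scores` against each other; a loadings biplot like Figure 2 plots the rows of `Vt`, showing how strongly each feature pulls on each component.
    
    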

    Figure 3 (above): Docuscope screenshot of exemplary Comic strings from Twelfth Night, 3.1. “SelfDisclosure” and “FirstPerson” strings appear in red; “Uncertainty” in orange; “DenyDisclaim” in cyan; “DirectAddress” in blue; “LanguageReference” in violet.

    Figure 4 (above): A total of 767 1,000-word pieces of the Folio plays rated on scaled PCs 1 and 4. This image is the same as Figure 1, except that all the plays are displayed as red dots with the exception of Othello, whose pieces appear as blue dots and collect mostly in the upper-right-hand quadrant, where the Comedies tend to cluster.

    Figure 5 (above): Docuscope screenshot illustrating exemplary Comic strings from Othello, 3.3. “SelfDisclosure” and “FirstPerson” strings appear in red; “Uncertainty” in orange; “DenyDisclaim” in cyan; “DirectAddress” in blue.

    Figure 6 (above): Docuscope screenshot illustrating exemplary Comic strings from Othello, 4.2. “SelfDisclosure” and “FirstPerson” strings appear in red; “Uncertainty” in orange; “DenyDisclaim” in cyan; “DirectAddress” in blue; “LanguageReference” in violet.

    Figure 7 (above): Docuscope screenshot illustrating exemplary History strings from Richard II, 1.3. “CommonplaceAuthority” strings appear in lime green; “Inclusiveness” in olive; “SenseProperties,” “Sense Objects,” and “Motion” in yellow.

    Figure 8 (above): Docuscope screenshot illustrating exemplary History strings from Romeo and Juliet, 1.1. “CommonplaceAuthority” strings appear in lime green; “Inclusiveness” in olive; “SenseProperties,” “Sense Objects,” and “Motion” in yellow.

    Figure 9 (above): Dendrogram produced by Ward’s clustering method on scaled data using ninety-eight LATs to profile 320 plays written between 1519 and 1659. Individual items are colored according to genre, with the exception of plays written by Shakespeare, which all appear in yellow. To examine the diagram, click on it. We are grateful to Martin Mueller at Northwestern University for providing us with the modernized versions of these texts from the TCP collection.
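    A dendrogram of the sort shown in Figure 9 can be sketched with Ward’s method from SciPy. The feature profiles below are invented stand-ins for the real data (320 plays profiled on ninety-eight LATs); only the procedure is the point.

    ```python
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    # Hypothetical LAT-frequency profiles for six plays (rows) across
    # three LATs (columns); the study used 320 plays and 98 LATs.
    profiles = np.array([
        [0.9, 0.1, 0.1],
        [0.8, 0.2, 0.1],
        [0.1, 0.9, 0.2],
        [0.2, 0.8, 0.1],
        [0.1, 0.1, 0.9],
        [0.2, 0.2, 0.8],
    ])

    # Ward's method merges, at each step, the pair of clusters whose
    # union least increases total within-cluster variance. The linkage
    # matrix Z encodes the full merge history, i.e. the dendrogram.
    Z = linkage(profiles, method="ward")

    # Cutting the tree into three clusters recovers the three groups.
    labels = fcluster(Z, t=3, criterion="maxclust")
    ```

    Passing `Z` to `scipy.cluster.hierarchy.dendrogram` would draw the tree itself; coloring the leaves by genre, as in Figure 9, is then a matter of labeling.
    
    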

    The authors would like to thank the University of Wisconsin Memorial Library for its assistance in hosting these illustrations on a permanent basis. Thanks also go to Kate Fedewa for her assistance in preparing Figure 9.