Shakespeare Out of Place?

When Jonathan Hope and I did our initial Docuscope study of over 300 Renaissance plays, we found Shakespeare’s plays clustering together for the most part. One explanation for this clustering was that it was caused by something distinctive in Shakespeare’s writing, and that this authorial signature becomes visible in the same way genre does: at the level of the sentence. Indeed, in our first approach to this larger dataset (one we’d assembled from the Globe Shakespeare and Martin Mueller’s semi-algorithmically modernized TCP plays), we thought that authorship was overriding genre as a source of patterned variance.

But everything which goes into the dataset also comes out. And in this case, it was editorial difference that was helping to isolate Shakespeare’s plays. When we did a further study of the clusters containing works by Shakespeare, we noticed that their elevated levels of two different LATs that dealt with punctuation – TimeDate and LanguageReference – were an artifact of hand modernization.

Several contracted items from the Globe/Moby Shakespeare edition, tagged as Language Reference Strings by Docuscope

The variability in early modern orthography is well known, and we also know that there were many ways of punctuating early modern texts. (In the case of Shakespeare’s plays, we assume that most of the punctuation originated with the compositors who set up the text in the printhouse rather than Shakespeare himself.) But when the Globe editors modernized their sources in the nineteenth century, they consistently applied certain rules of punctuation that skew Docuscope’s counts when these texts (as a group) were compared with the more varied punctuation to be found in the TCP texts. Sequences that were dealt with consistently in the Globe texts – for example, contractions such as ['tis] or ['twas] or [o'clock] – were being handled much more variously in the original-spelling texts that Martin Mueller was modernizing. (He was only modernizing words in his procedure.)

So, the punctuation was a tip off, increasing the chances that Shakespeare’s plays would cluster together.

We now have the ability to skip or blacklist certain word strings, thanks to a newly updated version of Docuscope created by Suguru Ishizaki. At some point, we will open this can of worms–actually modifying Docuscope’s original tagging protocols–but not yet. There is still more to be learned from the results from an unmodified Docuscope: when we don’t touch the contents of its internal dictionaries, we have the ability to compare results across periods or corpora.

In this case, we learn that Docuscope is sensitive to human editorial intervention in texts. So sensitive, in fact, that it produced an almost complete clustering of Shakespeare’s plays in the larger group of 320 that we profiled in the online draft of our “Hundredth Psalm” article.

The large cluster of Shakespeare plays that resulted from our initial comparison of Globe texts with Mueller's semi-algorithmically modernized TCP plays

Once we realized that this grouping was at least partly artifactual–a product of different editorial procedures applied to our combined corpus–we eliminated the LATs that were registering this difference (TimeDate and LangReference). Of course, by eliminating these, we lost their sorting power on the rest of the corpus, so there was a tradeoff. But we felt that it was not fair to give Docuscope this kind of advantage in sorting text when it was the result of modern editorial intervention. In the future, we might blacklist a word like ['tis] so that we can retain the rest of the category, but I don’t think this is necessary. What really needs to happen is that, in our editorial preparation of texts and corpora,  we must ensure that no set of texts is isolated from the others through special editorial preparation. The fact that “anything goes” in the current TCP collection – it is full of various compositorial and printhouse styles and conventions – is probably a good thing. And in any event, we still see authors’ works and genres clustering together even where printers are multiple. Here, now, is one of the new Shakespeare clusters once the editorial “tell” of certain types of punctuation was removed:

New clustering of Shakespeare's plays with TimeDate and LangRef eliminated from analysis

Now we see that plays by Munday, Heywood, Marlowe, Shirley, Rowley, Webster, Middleton, and Massinger are showing greater similarities with Shakespeare: the variability of their punctuation is not being used against them. Within the Shakespeare plays that do cluster together, we see some of the same similarities–Coriolanus with Cymbeline, for example. But the terms on which Shakespeare’s plays are related to each other are now more limited–we have eliminated two categories of LATs that may have been sorting Shakespeare’s plays with respect to each other. This relative loss of sorting power within Shakespeare’s works seems tolerable to us, however, because it allows for a more meaningful portrait of Shakespeare’s relationship to other dramatists of the period. What excited us about this large diagram was that it says something about 150 years of early modern drama as a whole, inasmuch as that whole could be represented by over 300 works.
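The step described above, leaving the two punctuation-sensitive LATs out before measuring similarity, can be sketched in a few lines. This is only an illustration under stated assumptions: the play labels and frequencies are invented, plain Euclidean distance stands in for whatever distance the real analysis used, and the helper names `distance` and `nearest` are hypothetical.

```python
import math

# Invented LAT frequencies for three imaginary plays (not real Docuscope output).
plays = {
    "Globe-A": {"TimeDate": 0.9, "LangReference": 1.1, "Question": 2.0, "OralElement": 1.5},
    "Globe-B": {"TimeDate": 1.0, "LangReference": 1.2, "Question": 1.5, "OralElement": 2.0},
    "TCP-C":   {"TimeDate": 0.2, "LangReference": 0.3, "Question": 2.1, "OralElement": 1.6},
}

# The two LATs the post identifies as registering editorial modernization.
EDITORIAL_LATS = ("TimeDate", "LangReference")

def distance(a, b, excluded=()):
    """Euclidean distance over every LAT that is not excluded."""
    keys = [k for k in a if k not in excluded]
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def nearest(name, excluded=()):
    """Name of the play closest to `name` under the chosen set of LATs."""
    others = [n for n in plays if n != name]
    return min(others, key=lambda n: distance(plays[name], plays[n], excluded))

print(nearest("Globe-A"))                           # all four LATs in play
print(nearest("Globe-A", excluded=EDITORIAL_LATS))  # editorial LATs dropped
```

With these made-up numbers, the first call pairs the two hand-modernized texts and the second pairs Globe-A with the TCP text instead. That is the kind of regrouping the revised diagram shows, though the toy data proves nothing about the real corpus.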

Here is the entire diagram, then, constructed without the LATs that capture the nineteenth-century modernization of the Shakespearean texts. (Many thanks to Kate Fedewa for helping us create this large image.)

Revised dendrogram comparing early modern plays from the TCP collection and the Globe Shakespeare

Posted in Shakespeare

Crowdsourced Peer Review in NY Times

The Times this morning did a piece on the Shakespeare Quarterly New Media issue that Jonathan Hope and I participated in. We received some terrific feedback, mostly from Shakespeareans, on the article that was posted to Media Commons–feedback that helped us rewrite the essay for the print edition which will be appearing this fall. There was also a piece on the process by Jennifer Howard in the Chronicle for Higher Education, itself the topic of an opinion piece in the Chronicle’s Brainstorm section.

The idea of open peer review in the humanities raises basic questions about the “specialized” nature of our knowledge in the humanities. Could simply anyone weigh in on a debate about a particular text and its interpretation? Wouldn’t that, in principle, be a good thing? I assume that knowledge in the humanities is in principle available to all. But it is also clearly specialized. The word “allegory,” for example, has a deep history and set of contextual meanings that you just couldn’t pick up from a good dictionary. Our research does expand what is known about certain literatures, cultures and writers, and in this sense, we look like a science that aims to extend the range of objects that are understood. We also refine our terms of art and build communities around these terms (e.g., différance, queering, hegemony, subaltern, hybridity, racialization). One could learn to throw these terms around, as Alan Sokal did in his famous hoax and as graduate students do every day in their seminars, but a good critic or editor should be able to say whether or not the writer really understands the terms. (This is where the editors of Social Text failed.) Perhaps if the paper Sokal submitted to Social Text had been vetted through crowdsourced open peer review, the article would have been rejected. In any event, the hoax itself provides an interesting limit case with which to evaluate the promise of open peer review: a writer acting in bad faith, either as author of the article or peer reviewer.

One last thought: the trajectory of learning in the humanities is intensive rather than cumulative. This is what differentiates us from, say, molecular biology, where you must learn certain things first (organic chemistry, cell physiology) in order to understand other things later (gene transcription). Within the humanities, acquiring expertise might mean re-orienting our approach to existing works rather than expanding the range of objects that can be known, although the latter is always possible. But the underlying assumption – that in the humanities one can make qualitative advances in knowledge that do not necessarily fit into a progressive sequence – makes any comparison between the humanities and sciences difficult.

Posted in Shakespeare

Penalty Kicks and Distributed Movement

Gabriel Dias, graduate student at RPI, has recently modeled the way in which penalty kickers move their bodies as they prepare for a shot. His findings suggest that there are several “tells” – for example, the angle of the hips, or the position of the planted foot – which predict the ultimate direction of the shot. In the PBS interview that I’ve linked to above, he alludes to the existence of “distributed” movements which show the physical commitment of the kicker to one outcome or another. I hear the word distributed and I immediately think, “integrated physical system,” like a body that is constrained to do certain things because of the way its different parts interact. We see this integration in the competitive world of athletics and the expressive realm of dance. (Perhaps the adjectives here should be reversed?)

In our analysis of texts, we have also found distributed movements of a sort. We find, that is, that certain types of words tend to move with each other in some genres, and others move away from one another. Does this mean that genre is a physical system like penalty kicking, and that our explanations of these distributed movements – of words rather than points on a body – are themselves grounded in a physical reality? I have myself offered analogies to describe this “commitment of weight” in the process of using words to do certain things: if you want to write a Shakespearean comedy, there are certain things you are likely to do: you will tend to use more first and second person singular pronouns and less description than you would in, say, a history play. If Docuscope is the goalie/keeper, it may need only 30 or 40 lines to decide that the ball is going to go toward comedy rather than history. Other things will be ignored as incidental. If I say that tagging a play and watching how its “points” move in a mathematical space is like biotagging a kicker and studying his or her movements, I am proposing an analogy. Like kicking, writing is a behavior. In certain situations (penalty kicking, writing for the stage), some aspects of this behavior are signal or cardinal (position of hips, use of pronouns) while others are inessential, like the curve of the kicker’s index finger. (Actually, given the dynamism of the human body, I would be surprised to find out that there is not, on some level, a connection between finger position and kick.)

So, does this mean I am advocating an essentially structuralist account of genre? Am I saying that, because language use is a behavior, then writing in a particular genre is also a behavior with certain “tells” that are, in a sense, built into the physical system of writing? I think people who are doing iterative criticism need to have an intelligent answer to this question, complete with an analysis of its underlying analogy. My answer would be that writing fiction in a historically bound literary field does, like penalty kicking, count as a behavior and that such behaviors will exhibit coordination. There is as much connective tissue in language, grammar, plot and audience expectation as there is in the fabric of the human body. But this is not the same thing as saying that there is an essential structure to particular types of writing – that the existence of a tell implies an underlying recipe, essence or structure that is genetically dictating the behavior of the writer.

Why doesn’t structuralism follow from linguistic integration? First, writing is not like penalty kicking. Dias chose penalty kicking because it is a binary physical outcome. With respect to the standing keeper, the ball goes left or right. Language, on the other hand, is like a flock of birds: it can break any way, 360 degrees, and is doing so dynamically at all times. “Yet the flock shows direction,” you say. “Individual birds may be wobbling left and right, up or down, but there is a recognizable trajectory within the group.” Perhaps there are deterministic ways of saying where this group is going to go next, but I doubt it. The total behavior is distributed, immanent: it has massive integrity as an aggregate, but the existence of that integrity does not imply some non-negotiable locus of control. Another way of saying this, and now I am channeling Whitehead, is to say that the direction of the flock is a continuously unfolding event or “society” of actual occasions. Thus, the penalty kicking example is good for showing entailment and distributed connection in the elements of literary linguistic analysis, but bad as a model for the errant and multiple trajectories of writing.

The existence of the tell essentially pushes back the timeline of intelligibility of the direction of the ball. A good keeper or student of physiology – like a good literary critic – will know earlier than most what kind of behavior is being exhibited. But unlike a keeper in a football game, the critic is not looking for a binary outcome. Rather, the critic or spectator is comparing the unfolding action onstage to any number of possible theatrical “types” of entertainment and generic conventions. Shakespeare takes five penalty shots at a time, all the time. If you are interested in this aspect of the play – its participation in comic conventions – yes, there will be “signal” or orienting linguistic events at the level of the line which you could consult to predict what he is about to do. But you don’t have to consult the tells and this is not a penalty kick: you already know what is going on and, indeed, are a better judge of the texture and generic tonalities of the play as it unfolds than a keeper who has to wait for the ball to be kicked. (Docuscope really is a keeper; it knows nothing until the event happens.) As we have seen in our research, human beings are massively sensitive to variations and distributed cues in linguistic behavior. We make an astonishing number of connections between the kinds of variation we see among the plays and texts we have encountered. Finding out that there is a linguistic “tell” for comedy doesn’t then mean that comedy essentially or structurally “is” the series of tells we reliably find for it. The “tells” here are a parallel description – and this, after the fact — of a perceptual reality that we render qualitatively and immediately, in our feel for certain types of writing or stories.

I have used the words “signal,” “cardinal” and “orienting” to describe the types of tokens that serve as good landmarks for genre in this alternative descriptive universe. I do not use “essential.” As we work further through this analogy between physical and linguistic behaviors, I think we should adopt Spinoza’s metaphysical position from the Ethics, that there is a parallelism between the twin domains of thought and extended physical beings. Neither has priority. When understood as a species of behavior, theatrical writing or literary production must obviously exhibit certain empirical regularities: it takes place on the fleshy platform of human consciousness and is constrained by the physical limits of our bodies, environment and history. As critic, I would want to insist that no material factor – the practices and limitations of stagecraft, the documented or remembered history of past performances, the politically charged distribution of resources and cultural actors – can be a priori excluded as unexpressable in the behavior that is writing. All constraints are summed and expressed, but in different amounts. But I would also want to insist that– whatever the behavior is that we are tracking – there has to be in place a certain set of agreements to make sense of the “movements” in this system as such. I have to want to count “these types” of words and not those. I have to search for significant coordination of these counted things with respect to “this type of outcome” and not another. Someone has to have the desire to study penalty kicks, for example, or authorship, or genre: behaviors don’t simply want to study themselves.

The tell is a “sign” that speaks for the kicker, and speaks early. It is a signal event worth attending to if you are a keeper. It is simultaneously an element in a causal sequence, constrained by events prior to it, and a negotiable sign or expression of an intention to do something. It is a physical way of saying, “I mean to kick the ball this way.” The point of the parallelism is that you never get to dump one half of the phenomenon. Leaning to the left, we acknowledge: all physical tells may be redescribed as expressions of an intention, and so tokens of meaning. But inclining to the right, we say: all tokens of meaning are, on some level, also indexes of empirical constraints. The keeper has to dive both ways.

Posted in Quant Theory, Shakespeare

Genre Dependence on Character Ideolects? (by Mike Stumpf, UW Undergrad)

And yet, we know that when human beings are involved, all findings are provisional. Odd.

Dendrogram displaying various segments from Romeo and Juliet

To expand on Michael Witmore’s comments in his previous post, it is indeed odd how provisional our results are. Case in point: I have been examining what John Burrows and Hugh Craig have called the “ideolects” of characters in connection with the plays in which they appear. I stumbled upon this idea while looking at Shakespeare’s Romeo and Juliet and asking how the language of the title characters may be steering this play towards tragedy or comedy. (This was done for a panel I presented on with Witmore and William Blake for a digital salon at UW-Madison.) Witmore and Blake are themselves working on an analysis of Hamlet without the prince, and the 1 Henry plays/Merry Wives of Windsor without Falstaff: we’re all interested in this kind of “subtraction experiment.” To see my initial findings using this technique, you can visit my blog, All Is True.

Posted in Shakespeare

Presentation at London Forum for Authorship Studies/Digital Text and Scholarship Seminar

Jonathan Hope and I presented here in London on a trip arranged by Brian Vickers and Willard McCarty. It was a lovely occasion held in Senate House, attended by some we knew and others we got to know. We began by rolling out paper copies (six-foot-long scrolls!) of the very large diagram that you saw in the last post. One of the things we have begun to discuss is the ways in which different forces seem to be expressed on various twigs of this dendrogram illustrating relationships among 318 early modern plays. On some twigs, everything that is being grouped together has a common author. On others, the situation is not so clear. Why, for example, aren’t there large groupings of texts written at the same time? (There are some smaller clusters of these.) The principle at work here, when texts are matched in terms of their distance scores on all of Docuscope’s available features (LATs), is that every type of difference present in the population being studied will be expressed in the result. The difficulty is disentangling which type of difference (generational, authorial, generic, company, etc.) is at work in a given grouping.

One thing we spent some time discussing yesterday was three clusters in which Jonson’s plays appear. Here they are below:

All of Jonson’s masques are clustered at the bottom of the diagram (except Cynthia’s Revels, which is clustered in the middle). These are possibly the most distinct items in the entire corpus we are currently working with. Notice how far right the cluster extends before joining with the rest of the diagram: this indicates its dissimilarity with other clusters. But notice too that, within this cluster (as Jonathan pointed out yesterday), there is also a lot of variation. Not only are Jonson’s masques very different from the rest of Renaissance drama (including several interludes), but they are quite different from one another. It’s like a galaxy that is far away from all of the others, but whose stars are themselves quite spread out.

So, what about the other two clusters? We decided to profile all three and came up with some interesting findings. First, the masques. After performing PCA and then rating the clusters on the different components, we found several that were quite good at isolating the items on particular twigs. (This is not a scientific procedure, but it is our first attempt.) With the masques, we found that the language is high in StandardsPositive, StandardsNegative, and ReportingStates. Here’s an exemplary passage, with both StandardsPositive and StandardsNegative in green, and Reporting States in purple:

Masques describe what you are seeing or have just seen in a comparatively static fashion, hence the reporting states. As Brian Vickers pointed out in the question period, the genre of encomium deals with praise and blame, which are the words that are being picked up in the positive and negative standards.

Compare this, now, to the profile of some of Jonson’s other comedies: Poetaster, Volpone, and the other items in the top group. These items are characterized by OralElement (yellow), Question (blue), Intensity (orange), and Person Property (purple):

Here we see a pattern we also saw in Shakespearean comedy: a lot of items associated with one to one interaction. The OralElement here marks the bustle of persons whose social function is marked (PersonProperty) and who are mixing in a state where contact must be established or maintained. Some of the satirical force of the scene is bundled into the intensity strings, which show the emphatic nature of certain social performances that are mannered and so open to mockery. We noticed these intensity strings in Middleton as well, which makes us suspect that a combination of PersonProperty strings and intensity might be a feature of City Comedy. Something to check out in the future.

What makes this top cluster different from the second? Different author? No. Different genre? Not really, at least, not according to the ones we recognize critically. And note too that there are multiple authors on this middle cluster: Chapman, Jonson and Fletcher. Perhaps we should be thinking in terms of modes instead of genres: is there a different mode of storytelling, dramaturgy, or conducting comic business here? When we use PCA to characterize this cluster and compare the results with those that characterize the one at top, we find similarity and difference. What’s similar is the OralElement (yellow), Question (blue), and PersonProperty strings (purple):

But we now see strings associated with TimeShift (scarlet), which indicate that a person is marking the difference between two temporal frames (then/now, now/future), and here seems to be associated with figuring out what someone might do or bring about in the present or near future. Here they are anticipatory, looking at what is to come from the standpoint of the present. (In Shakespeare’s late plays, by contrast, we found that action from the past is frequently narrated from the standpoint of the present.) The other thing that is different in this cluster is something that we would never see, because it is not there. The plays in this cluster lack something:

These purple strings, which are classed as ReportingStates. They are tokens that occur frequently in this text — look at how many of them are in this play, which is from the second cluster — but as a whole the plays in this group lack these strings with respect to the larger population of early modern drama (whereas the top group did not). This kind of relative difference between generally quite frequent items is one that you could probably only grasp with the aid of statistics. We hypothesize that these strings are allowing the actors to report action that has taken place offstage in the past, keeping attention focused on the present which is hurtling forward in time. Should this be its own subgenre of Jonson that includes Fletcher and Chapman? Would it be worth naming a grouping like this? Another question for further study.

We received some terrific comments and questions. To our comment that the first Principal Component for this population does seem to track a broad and evolving temporal shift (plays score lower on the component as time goes on), Richard Proudfoot asked if there was more variation in the very early plays in our collection. This is indeed the case, and he followed with the point that we have an uncertain grip on this earlier population because little of it survives. Other explanations for wider variation in the pre-1590 items: English as a language is more fluid prior to 1600, as Jonathan pointed out. It may also be the case that the genre system itself has not stabilized because the professional theater is still gaining its footing in London.

Erica Fudge asked another interesting question: some of the comic strings associated with interaction and comedy (we showed our Shakespeare comedy results) reminded her of the writing in Montaigne. What, she asked, is the relationship between skepticism and comedy, and would we be interested in tracing the presence of something like a skeptical inclination across prose writing and drama? This is a very good question. I would hope that we could study, with these techniques, something like the “sentence level intellectual culture” of the period, one that extends across genres like drama and the essay. As with most of our presentations, we left with more questions and ideas about future experiments. This work seems to us to be provisional in a way that other humanities research is not. You get an idea, talk about it with others, try it, and then decide to try something else. Academic papers at humanities conferences, on the other hand, usually present findings with an air of categorical certainty. And yet, we know that when human beings are involved, all findings are provisional. Odd.

Posted in Early Modern Drama

Docuscope Goes Live on Shakespeare Quarterly Open Peer Review

Jonathan Hope and I have written a new piece that we submitted to the special issue of Shakespeare Quarterly on “Shakespeare and New Media.” The essay cleared the first stage of editorial review, and is now posted at MediaCommons for general comment and critique prior to final editorial evaluation. Please visit the essay here and make your views known. The abstract and title are as follows:

“The Hundredth Psalm to the Tune of ‘Green Sleeves’”: Digital Approaches to Shakespeare’s Language of Genre
In this essay, we explore the underlying linguistic matrix of Shakespeare’s dramatic genres using multivariate statistics and a text tagging device known as Docuscope, a hand-curated corpus of several million English words (and strings of words) that have been sorted into grammatical, semantic and rhetorical categories. Taking Heminges and Condell’s designations of the Folio plays as comedies, histories and tragedies as our starting point, we offer a portrait of Shakespearean genre at the level of the sentence, showing how an identification of frequently iterated combinations of words (either in their presence or absence) can allow us to appreciate the integrity and fluidity of Shakespeare’s genres in new ways. Calling this approach “iterative criticism,” we situate our critical practice in the context of both Shakespearean criticism and more general protocols of reading in the humanities, concluding with a genre map of Shakespeare’s plays in the context of 282 other early modern plays.

As the last line suggests, we have now managed–with the help of Martin Mueller at Northwestern–to produce an analysis of plays written between 1519 and 1659: 282 from the TCP database alongside the Moby Shakespeare. I think this is the first visualization of its kind purporting to treat 150 years of Renaissance drama, which itself feels like something of a hurdle overcome. Here it is:

Dendrogram Produced using Ward’s clustering method on scaled data using 99 LATs to profile 318 plays written between 1519-1659, color coded by genre and separating out the works of Shakespeare as a category of their own: Red=Comedy, Blue=Interlude, Green=History, Cyan=Tragedy, Purple=Tragicomedy, Orange=Masque, Gold=Shakespeare. The item names follow the protocol: (genre)-(date)-(author)-(title).

Two points to make here, although there could be many more. First, this diagram was constructed using scaled data, which means that the “mile away” linguistic markers of similarity and dissimilarity are being balanced with markers whose variation is less visible from a distance. Variables with large standard deviations are not dominating with respect to those with smaller ones. Note then that most of Shakespeare’s works cluster together here, comedies, tragedies and late plays all on the same twig. When I tried this analysis using non-scaled data, these genres split up and Shakespeare’s comedies clustered together with Jonson’s, suggesting that Ward’s clustering procedure on unscaled data is better for picking up genre differences, while the same procedure conducted on scaled data (as is the case here) is more sensitive to authorship. (For an earlier analysis of Shakespeare’s plays only using scaled data with Ward’s clustering technique, see this.) This finding should be tested in other contexts and with other data sets, but it is interesting, since it suggests that authorship becomes legible when fluctuations in variables that contain lots of tokens (say, Description) are coordinated with those that have many fewer tokens. It may be this “adding a dash of something” that pulls the author as such to the fore in an analysis.
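To make the effect of scaling concrete, here is a minimal sketch, assuming that "scaled" means each variable is standardized to mean 0 and standard deviation 1 (z-scores). The play labels and counts are invented, the helper names are hypothetical, and a simple nearest-neighbour comparison stands in for Ward's clustering, which is not implemented here.

```python
import math
from statistics import mean, pstdev

# Invented counts per play: [Description, Question].
# Description has many tokens and a wide spread; Question has few and a narrow one.
rows = {
    "play1": [120.0, 2.0],
    "play2": [100.0, 2.1],
    "play3": [118.0, 3.0],
    "play4": [60.0, 2.2],
    "play5": [180.0, 1.9],
}

def zscale(table):
    """Standardize every column to mean 0 and standard deviation 1."""
    names = list(table)
    cols = list(zip(*(table[n] for n in names)))
    stats = [(mean(c), pstdev(c)) for c in cols]
    return {n: [(v - m) / s for v, (m, s) in zip(table[n], stats)] for n in names}

def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closest_to(name, table):
    """The other play with the smallest distance to `name`."""
    others = (n for n in table if n != name)
    return min(others, key=lambda n: dist(table[name], table[n]))

print(closest_to("play1", rows))          # unscaled: Description dominates
print(closest_to("play1", zscale(rows)))  # scaled: Question counts as much
```

On the raw counts, play1 sits next to play3 because their Description counts are nearly equal. After scaling, the difference in the low-frequency Question variable carries as much weight, and play2 becomes the nearest neighbour. Whether this mechanism is what brings authorship forward in the real data is the open question raised above; the sketch only shows how the same numbers can group differently.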

I’d like also to offer another observation here about the fact that so many Shakespeare plays are hanging together (as are Shirley’s and Middleton’s), remaining agnostic for the time being about whether it is authorship or genre that is producing these clusterings. The majority of Shakespeare’s plays are clustering on a twig that contains mostly comedies. So when compared with 282 other items written between 1519-1659, Shakespeare’s plays look for the most part like plays that Harbage (in the Annals of English Drama) classed as comedies as opposed to some other genre. (Martin tells me that he followed Harbage for the most part, but made some guesses himself about genre designations based on title page information and common sense.) The thing to remember here is that an individual genre may cluster in different ways depending upon the larger population in which it is situated. That is, a fuller collection of texts from the period–not just the ones that Martin was able to modernize so that we could run a test on them–might show new subdivisions that end up splitting the Shakespeare block into a number of smaller splinters. (Or it may not: this may be a stabilized portrait, more or less.) The best way to understand more about the groupings themselves is to begin looking at them with the help of PCA and other techniques we’ve been using already. That’s where we’re headed next.

Posted in Shakespeare

Early and Late Plato II: The Apology and The Timaeus

In the previous post we were examining three dimensional clusterings of the Platonic dialogues as rated on scaled Principal Components 1, 2 and 5, a technique that allowed us to see the early Platonic dialogues (as defined by Vlastos) standing apart from the middle and later ones. Vlastos’ claim, we remember, was that these early dialogues represent the historical Socrates, whose technique of argumentation was elenctic. Socrates used this technique to draw out the implications of an opponent’s views until those views collapsed under their own contradictions.

The translator of these dialogues, Jowett, would have had to preserve at least some of the linguistic “footings” required for such a dialogical structure in the early dialogues, and it was my contention in the previous post that Docuscope would detect these footings because they are exactly what a translator must preserve. Perhaps a more provocative claim, which I would like to advance now, is that the irony which attends this elenctic method — while not itself visible to Docuscope — might also require certain reliable linguistic pivots. In keeping with our analogy of the body of a dancer, certain upper body moves like the ironic twist in which Socrates seems to be asking a question for the sake of clarification but is actually pushing his interlocutor into deeper confusion, require a lower body stance that can support the weight of the move. If we could define this lower body stance, we would not be defining Socratic irony itself, but rather its linguistic correlates. (At some point, the analogy will break down, since language is not a “weight bearing system”: but it does support gestures and turns, so let’s see how far we can go with it.)

What is it exactly that is happening in these early dialogues that Docuscope and Principal Component Analysis are able to see from afar? Here is a scree plot which rates the power of the principal components as they are derived sequentially, from most powerful to least:

Scree Plot for Principal Components Derived from Cluster Docuscope Data on Jowett Translations of the Platonic Corpus

The first two principal components are shown here to be quite powerful: together they account for almost 54% of the variation in the entire corpus. When we rate all of the dialogues on just these first two components, we get the following bubble plot:

Graph of Platonic Dialogues Scored on Principal Components 1 and 2

I have highlighted the upper left quadrant, where almost all of the dialogues that Vlastos identified as “early” are clustering. Their presence in this quadrant means that they score low on PC1 and high on PC2. PC1 might be described as an anti-early component, because it powerfully discriminates against early dialogues. PC2, on the other hand, might be described as a pro-early component, since its highly loaded variables are more frequently used in early dialogues. We can literally see the sorting power of these two components here, but it can also be quantified by the Tukey test, which was applied to both principal components, the results being available here and here. Note that the Apology is one of the most strongly “early” dialogues by these measures, whereas the Timaeus is one of the least early. We will pay closer attention to these two items as a way of exemplifying the differences that Docuscope sees between the two types of items.
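
For readers who want to see how component scores of this kind are derived, here is a minimal sketch of PCA on correlations using NumPy. It is an illustration under stated assumptions, not the procedure that produced the plots in this post: the data are random stand-ins rather than Docuscope counts, and the names `pca_on_correlations` and `X` are hypothetical.

```python
import numpy as np


def pca_on_correlations(X):
    """PCA on the correlation matrix of X (rows = texts, columns = variables).

    Returns (explained_ratio, loadings, scores), with components ordered
    from most to least powerful, as in a scree plot.
    """
    # Standardize each column; the covariance of Z is the correlation of X.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(X, rowvar=False)
    # eigh returns eigenvalues in ascending order, so reverse them.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    explained_ratio = eigenvalues / eigenvalues.sum()
    # Each text's score on a component is its standardized row times the loadings.
    scores = Z @ eigenvectors
    return explained_ratio, eigenvectors, scores


rng = np.random.default_rng(0)
X = rng.normal(size=(30, 6))   # 30 made-up "dialogues", 6 made-up "clusters"
X[:, 1] += 0.8 * X[:, 0]       # induce some correlation between two variables
ratio, loadings, scores = pca_on_correlations(X)
print("share of variance carried by PC1 and PC2:", ratio[:2].sum())
```

The first two columns of `scores` are what a bubble plot of PC1 against PC2 would display, and `ratio` holds the proportions that a scree plot ranks.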

Before making the comparison, let’s look at the variables that are most powerfully loaded on these components and so are most responsible for discriminating the early/non-early difference. We do this either by consulting the loadings of our variables on the two principal components or by looking at a biplot which arrays those variables in two dimensions, exactly the two that were used to produce the bubble plot above. First the loadings scores (reported as eigenvectors) and then the loadings biplot:

Loadings of Cluster Scores for PC1 and PC2

Loadings Biplot for PC1 and PC2

The loadings biplot (lower diagram) is a two dimensional image of the loadings scores (upper diagram), showing how these variables behave with respect to one another in the entire corpus. Clusters of words that oppose each other by 180 degrees — for example, [Public_Values] and [Special_Referencing] — tend not to co-occur with one another in the same text. Here we are interested in what makes a particular text cluster in the upper left-hand quadrant, so we are looking for vectors (red arrows) that extend furthest to the left and to the top of the diagram. Vectors extending to the left are: Reasoning, Interactivity, Directing Action, Interior Mind and First Person.  (These are the clusters that have significant negative loadings on the first column in the top diagram: if an item scores high on words contained in these clusters, it will be “punished” for that abundance and pushed to the left of the plot, as the red dots are above.) Note that we can also use our 180 rule to say something about items that are far left in the bubble plot as well: they must lack items contained in the clusters that are positively loaded on PC1, which are Narrating, Description, and Time Orientation.

Similarly, with PC2, we are looking for the tall vectors heading upward: Emotion, Public Values and Topical Flow. Having tokens that were counted under these clusters will push an item up in the diagram, as will lacking items from the negatively loaded clusters: Directing Readers, Elaborating, Special Referencing. Note that Topical Flow (which is often populated by third person pronoun use) is loaded positively for the second principal component, but also positively for the first, which makes it fork upward and to the right. This means that an item scoring high on Topical Flow tokens will probably lack some of the items to the far left and contain items to the far right, which may discourage that item’s appearance in our “early” quadrant unless there are differences in these other variables.
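
The arithmetic behind this "pushing" can be sketched in a few lines. A text's score on a component is the sum of its standardized counts, each multiplied by the corresponding loading. The loadings and z-scores below are made-up numbers chosen for illustration, not the values from the table above, and `component_score` is a hypothetical helper.

```python
# Hypothetical PC1 loadings for four clusters (illustrative numbers only).
pc1_loadings = {
    "Reasoning": -0.40,
    "Interactivity": -0.35,
    "Narrating": 0.45,
    "Description": 0.38,
}


def component_score(standardized_counts, loadings):
    """Sum over variables of (standardized count x loading)."""
    return sum(standardized_counts[name] * weight for name, weight in loadings.items())


# Two made-up texts expressed as z-scores (0 = corpus average).
dialogue_like = {"Reasoning": 1.5, "Interactivity": 1.2, "Narrating": -1.0, "Description": -0.8}
narrative_like = {"Reasoning": -1.1, "Interactivity": -0.9, "Narrating": 1.4, "Description": 1.3}

print(component_score(dialogue_like, pc1_loadings))   # negative: pushed to the left
print(component_score(narrative_like, pc1_loadings))  # positive: pushed to the right
```

Note that the dialogue-like text moves left for two reasons at once: it is rich in the negatively loaded clusters and poor in the positively loaded ones.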

I have discussed some of these clusters in earlier posts about Shakespeare, so my main focus here will not be on elaborating the contents of the clusters. Rather, I want to use these loadings to zero in on specific words in exemplary passages from the early and later dialogues to see what is captured and then leave it to readers to say what these particular tokens are doing. Looking at our bubble plot above, the two dialogues that exemplify these opposing linguistic trends — in translation — are the Apology and the Timaeus.

Here are two passages from the Apology that exemplify “earliness” in the Platonic corpus, if we agree that the clustering above seems compelling. Note that these are screenshots from Docuscope in which the clusters that are doing the work of pushing the texts up and to the left are turned on or color coded. I have not turned on the clusters that are absent, since these will be exemplified in the Timaeus:

I think these passages are certainly illustrative of the elenctic method described by Vlastos, although it ought to be said that the high amount of dialogical interaction here — one that was a hallmark of comedy in Shakespearean drama — is sometimes implied by Socrates rather than really enacted by both speakers. That is, Socrates sometimes simulates a dialogue that is not really happening (“to him I may fairly answer”), and this procedure actually multiplies the Interaction strings (sky blue) beyond what might be the case in actual interaction. Note too that Docuscope is seeing lots of Public Values words, words that gesture toward communally sanctioned values, in this earlier style: demigods, heroes, fairly, mistaken, good for, doing right, disgrace. These values must be cited in elenctic exchange because they are the topic of conversation (people have opinions about them), but such implied communality may also coerce assent from an interlocutor for reasons that extend beyond mere shame at self-contradiction. We see, too, more emotionally charged words (in orange); the occasional Topical Flow token (their); and some Reason tokens (if he, thus, may, do not).

Now look at a passage from the Timaeus, which does the things that items in the early quadrant (on the whole) cannot do:

This is cosmogony, not dialogue, which is why we have a number of Narrative strings (the year when, then, the night, overtaken the, as they) and Description strings (orbit, the moon, stars, sun, wanderings, motion, swiftness). Special Referencing here is picking up a lot of abstract references (dark purple) such as animals, measure, relative, the whole, nature, variety and degrees. The slightly lighter purple, Reporting strings, are complementing the Narrative tokens: having, completion, After this, came into being, received, to the end that, created. This should not be surprising since the two vectors for these clusters were almost overlapping in the loadings biplot above.

Whereas the Apology is staging a dialogue (real or implied), the Timaeus is creating a world and pacing that act of creation (through narrative) with a set of abstract terms that can be referenced in conversation. Indeed, one of the burdens of this kind of world-making, I think, is that the abstractions must be folded in with the concrete descriptions in equal measure so that the passage is something more than a Georgic description of a natural scene or a praise poem to nature. Note too that there is absolutely no irony in this passage from the Timaeus. That is not because Docuscope has a category that allows it to discern irony in its local environs and so rule out such an effect in the Timaeus: only a human being can make such a discrimination, by virtue of being able to look beyond the simple mentioning of words to assess their use. (For Docuscope, all counted words are mentionings of words whose single use has been classed a priori in the categories assigned to them.)

And yet, even in translation, Docuscope may be identifying the linguistic footings of irony: a necessary but not sufficient condition for its use.


Platonic Dialogues and the “Two Socrates”

Press to Start: Vlastos (1991) Groupings, PCA on Correlations

I have been thinking for a while now that Docuscope preserves, in its tagging structure, what a translator preserves — that this is a good definition of what it is looking to classify. One way to test this hypothesis would be to try Docuscope on a set of translations, which is what I’ve tried to do here.

The visualization above (press to rotate) shows the Platonic Corpus as translated by the nineteenth-century classicist Benjamin Jowett, rated by principal components on correlations and color coded by the divisions proposed by the great Plato scholar Gregory Vlastos (1991): early (red), middle (blue), and late (green). (The semitransparent ellipsoids are drawn to capture 50 percent of the items in the group.) Vlastos argued, on the basis of the types of arguments used in these texts, that the early dialogues represent a distinct group from those produced in the middle or later periods. The mode of argument in these earlier dialogues, he observes, is elenctic or adversative, which means that in these dialogues Socrates does not “defend a thesis of his own” but rather examines one held by an interlocutor (113). Socrates thus avoids making knowledge claims in these dialogues, instead forcing his interlocutors to enunciate them as the weakness of their own positions becomes apparent. Believing that there are two “Socrates” presented in these dialogues, Vlastos argues that the early Socrates, who likely represents the philosophical position of the historical Socrates rather than Plato, must rely on the “‘say what you believe’ rule” (113), this rule supplying the rough materials of his proofs. As epistemologist (which he is not in these dialogues), Socrates does not advance certain knowledge claims: the elenctic method will not support them.

The middle and later Socrates, by contrast, is fully willing to advance certain knowledge claims, which he seeks to present demonstratively (48). Rather than being simply a moral philosopher, he is now a “moral philosopher and metaphysician and epistemologist and philosopher of science and philosopher of language and philosopher of religion and philosopher of education and philosopher of art.” In these dialogues, Socrates advances a theory of knowledge as the recollection of separately existing Forms – a significant epistemological leap. This Socrates is now a spokesman for Plato, making the most important division of the corpus that between the early dialogues and all the rest.

Taking this division as a starting point, let’s look at how Docuscope divides the dialogues, which it does here simply on the basis of mean scores on all 101 of the Language Action Types. These scores are plotted in a hyperspace and then the least dissimilar items are paired using Ward’s method on unscaled data. The technique is the same as the one that produced the most effective genre clustering of Shakespeare’s plays. I am thus using what I know of a particular mathematical technique as it applies to historically accepted clusterings of Shakespeare’s plays and applying it to a body of works that is less familiar to me – not quite what Franco Moretti calls “the great unread,” but definitely a case of trying to understand the lesser known through the better known.

Ward’s Clustering on Translated Plato Dialogues

As you can see from the clustering of red or early period dialogues above, we can arrive at an arrangement of the dialogues using Docuscope data that is remarkably similar to the basic division in the dialogues that Vlastos argued for in 1991. But what is perhaps most interesting is that roughly the same division was arrived at stylometrically in the late nineteenth century, and that there has been at least some convergence within Plato studies of what we might call “intensive” techniques for sorting the dialogues (based on reactions of readers to the doctrines or manner of presentation) and “extensive” ones (built on groups that themselves represent the capture of stylometrically significant counted items). As Brandwood shows in The Chronology of Plato’s Dialogues (1990), it was already apparent to computationally unassisted readers of Plato such as L. Campbell that the later dialogues exhibited more technical and rare words, as well as a “peculiar, stately rhythm.” These claims were advanced with quantitative evidence (Campbell, 1867) but were grounded in an impression gathered through close and repeated reading. This line of inquiry was also taken up by the German classicist W. Dittenberger, who in 1896 argued that early and later dialogues could be discriminated by looking at the particles καὶ μήν, ἀλλὰ μήν, which co-occur in the early dialogues, and τί μήν, ἀλλὰ…μήν, and γε μήν, which co-occur in the later ones. This essentially multivariate pattern yielded the early grouping: Crito, Euthyphro, Protagoras, Charmides, Laches, Euthydemus, Meno, Gorgias, Cratylus, Phaedo. As you can see from the above, Vlastos’ groupings and those of Dittenberger overlap significantly. To this we might add the groupings derived from the Docuscope codings.

This convergence is interesting for a number of reasons. First, it shows us extensive and intensive techniques working in tandem, which raises the basic question of how these two things are related. Second, it shows us how a certain conversational style or dialogical setting connects with a philosophical position, and how both may themselves become available for analysis through the counting of seemingly inconsequential particles such as μήν. The Platonic corpus is an excellent one to work with because it has been well studied, and we have the advantage of pre-computational techniques to examine alongside actual readers’ responses. In my next post, I will examine those features in the translated dialogues that – once tagged by Docuscope – seem to be doing a good job of reproducing the scholarly divisions described above.


The Funniest Thing Shakespeare Wrote? 767 Pieces of the Plays

Press to Play: 767 Pieces of Shakespeare in Scaled PCA Space

Now for something a little different. I mentioned before that we can conduct similar analyses on pieces of the plays rather than the plays as a whole. In this experiment, I have been working with 1000 word chunks of Shakespeare plays, which allows me to use many more variables in the analysis. (This was the technique that Hope and I used in our 2007 article on Tragicomedy.) Obviously the plays weren’t written to be read, much less analyzed, in identically sized pieces: the procedure is artificial through and through. It does allow us, however, to see things that Shakespeare does consistently throughout different genres, things that happen repeatedly throughout an entire play rather than just the beginning or end. Another caveat: we partitioned the plays starting at the beginning of each text, making the first 1000 words the first “piece.” This results in a loss of some of the playtext at the end, since any remainder that is less than 1000 words is dropped. In future analyses, we will take evenly spaced 1000 word samples from beginning to end, partitioning losses in between. There are no perfect answers here when it comes to dividing the plays into working units. So this is a first installment.
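
The partitioning just described can be sketched as follows. This is a minimal illustration, assuming that words are whitespace-separated tokens; the function name `partition_into_pieces` is hypothetical and the tokenization used in the actual experiment may differ.

```python
def partition_into_pieces(text, piece_size=1000):
    """Split a text into consecutive pieces of `piece_size` words.

    Words are taken to be whitespace-separated tokens (an assumption).
    Any remainder shorter than `piece_size` at the end is dropped.
    """
    words = text.split()
    n_full = len(words) // piece_size
    return [words[i * piece_size:(i + 1) * piece_size] for i in range(n_full)]


# A toy "play" of 2,500 words yields two full pieces; the last 500 words are lost.
toy_play = " ".join(f"w{i}" for i in range(2500))
pieces = partition_into_pieces(toy_play)
print(len(pieces), len(pieces[0]))
```

The dropped remainder is exactly the loss at the end of each playtext mentioned above; evenly spaced sampling would spread that loss across the gaps between pieces instead.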

The video above (press to play) is a three dimensional JMP plot of 767 pieces of Shakespeare in a dataspace of three scaled Principal Components (1, 4, and 9) which I have chosen based on their power to sort the plays using the Tukey Test. (See Tukey results for PCs 1 and 4.) When you run the video capture, you’ll see a series of dots that are color coded based on generic differences: red is comedy, green is history, blue is late plays and orange is tragedy. Early in the capture, I move an offscreen slider that creates a series of chromatic “halos” or ellipsoid bubbles around neighboring dots: these halos envelop dot groupings as they meet certain contiguity thresholds. You see the two major clusters I am interested in here, histories and comedies, forming in the lower left and upper right respectively. (Green on lower left, red on upper right.) Interestingly enough, the see-saw effect we saw in our analysis of entire plays is repeated here: comedies and histories are the most easily separated, because whenever Shakespeare is using strings associated with comedy, he can’t or won’t simultaneously use strings associated with history (and vice versa). Linguistic weight cannot be placed on both sides of this particular generic fulcrum at once.

Now the resulting encrusted object, which I have rotated in three dimensions, is a lot less elegant than the object we would be contemplating were we to do a discriminant analysis of these groups. I am saving Discriminant Analysis for a later post. For all its imperfections, Principal Component Analysis is still going to give us some results or linguistic patterns we can make sense of, which is the ultimate measure of success here. I think it’s worth appreciating the spatial partitioning here in all of its messiness: the multicolored object presents both a pattern that we are familiar with (comedies and histories really do flock to opposite ends of the containing dataspace) and some jagged edges that show the imperfections of the analysis. Imperfections are good: we want to find exceptions to generic rules, not just confirmations of a pattern.

Looking at the upper right hand quadrant, we see the items that are high on both PC1 and PC4. In this analysis we are using Language Action Types or LATs, the finest grained categories that Docuscope uses (it has 101 of them). We will want to ask which specific LATs are pushing items into the different areas here, and to do so, I have produced the following loading biplot:

A loadings biplot gives information about components in spatial form, showing our different analytic categories (LATs such as “Common Authorities,” “DenyDisclaim,” “SelfDisclosure,” etc.) as red arrows or vectors. To read this diagram, consider the two components individually. What makes an item high on PC1? Since PC1 is rated on the horizontal axis, we scan left to right for the vectors or arrows that are at the extremes. To my eye, SelfDisclosure, FirstPer[son] and DirectAddress are the most strongly “loaded” on this component, which means that any piece that has a relatively high score on these variables will be favored by this component and thus pushed to the right-hand side of a scatterplot (see below). Conversely, any item that is relatively high in the words that fall under categories such as Motions, SenseProperty, SenseObject, and Inclusive will be pushed to the left. Notice that the two variables SelfDisclosure and SenseObject are almost directly opposed: the loadings biplot is telling us here that, statistically at least, the use of this one type of word (or string of words) seems to preclude the use of its opposite. This would be true of all the longer vector arrows in the diagram that extend from opposite sides of the origin.

We can then do the same thing with the vertical axis, which represents PC4. Here we see that LangRef [Language Reference], DenyDisclaim and Uncertainty strings are used in opposition to those classed under the LAT Common Authority. If an item scores high on PC4 (which most comedies do), it will be high in LangRef, Uncertainty and DenyDisclaim strings while simultaneously lacking Common Authority strings. So what about the vectors that bisect the axes, for example, DenyDisclaim, which appears to load positively on both PC1 and PC4? This LAT is shared by the two components: it does something for both. We can learn a lot by looking at this diagram, since, once we’ve decided that these components track a viable historical or critical distinction among texts, it shows us certain types of language “schooling together” in the process of making this distinction. DirectAddress and FirstPer [or, First Person], Autobio and Acknowledge thus tend to go together here (lower right), as do Motions, SenseProperties, and Sense Objects (upper left).

In fact, the designer of Docuscope saw these LATs as being related, which is why elsewhere he aggregated them together into larger “buckets” such as Dimensions or Clusters, the latter being the aggregate we used in our analysis of full plays. What we’re seeing here is a kind of “schooling of like LATs in the wild,” where words that are grouped together on theoretical grounds are associating with one another statistically in a group of texts. If the intellectual architecture of Docuscope’s categories is good, this schooling should happen with almost any biplot of components, no matter what types of texts they discriminate. The power of this combination of Principal Components, then, is that it aligns the filiations and exclusions of the underlying language architecture with genres that we recognize, and will hopefully suggest theatrical or narrative strategies that support these recognizable divisions.

The loadings biplot shows us how the variables in our analysis are pushing items in the corpus into different regions of a dataspace. We can now populate that dataspace with the 767 pieces of Shakespeare’s plays, rating each of them on the two components. Here is how the pieces appear in a plot of scaled Component 1 against Component 4, again, color coded with the scheme used above:

Notice the pattern we’ve seen before: comedies (here represented in red) are opposite histories (green) in diagonal quadrants. In general, they don’t mingle. The upper right hand quadrant, which is where the comedies tend to locate, contains the first item that I’d like to discuss: the red dot labelled Merry Wives (circa 2.1). This dot represents a piece of the first scene, second act of The Merry Wives of Windsor. As the item that rates highest on both PC1 and PC4 — components which the Tukey Test shows us to be best at discriminating comedy — this piece of The Merry Wives of Windsor is the most comic 1000 word passage that Shakespeare wrote. Here is an excerpt:

“I’ll entertain myself like one that I am not acquainted withal; for, sure, unless he know some strain in me, that I know not myself, he would never have boarded me in this fury.” In this color coded sentence we can see diagrammed the comic dance step. While I think there are funnier lines (“I had rather be a giantess, and lie under Mount Pelion”), the former is significant for what it does linguistically: it shows a speaker entertaining and then rejecting a perspective on her own situation (that of Falstaff) while comparing it with another (her own). The uncertainty strings (orange) such as “know not,” “doubt” and the indefinite “some” contribute to this mock searching rhetoric. Self-disclosure strings such as “myself” and “makes me” anchor the reality testing exercise to the speaker, who must make explicit her own place in the sentence as the object of doubt, while the oppositional reasoning strings such as “never” and “not” mark the mobility of this speaker’s perspective: I will try this toying perspective on my honesty, seeing myself as Jack Falstaff does, but will reject it soon enough. The reason that this passage is so highly rated on these two factors has something to do with the multiplication of perspectives that are being juggled onstage: there are two individuals here, Mistress Page and Mistress Ford, who are, as it were, rising above an imbedded perspective contained in Falstaff’s letter, commenting upon that perspective, and then rejecting it. Each time a partition in reality (a level) is broached in the stage action and dialogue, comic language appears.

We can oppose this most comic piece of writing — again, according to PCA — to its opposite in linguistic terms, a piece that contains what the comic one lacks and lacks what the comic one has. Here, then, is a portion of the “most historical” piece of Shakespeare, from Richard II 1.3:

Here we see the formal settings of royal display, a herald offering Mowbray’s formal challenge: no surprise that this exemplifies history, a genre in which the nation and its kings are front and center. Yet where the passage really begins to rack up points is in its use of descriptive words, which are underlined in yellow. Chairs, helmets, blood, earth, gentle sleep, drums, quiet confines…we don’t think of history as the genre of objects and adjectives, but linguistically it is. Inclusive strings, in the olive colored green, are perhaps less surprising given our previous analyses. We expect kings to speak about “our council” and what “we have done.” But notice that such language is quite difficult to use in comedy: even in a passage of collusion, where we would expect Mistress Page and Mistress Ford to be using first person plural pronouns, the language tends to pivot off of first person singular perspectives. The language of “we” really isn’t a part of comedy.

I am less surprised to find, at this finer grained level of analysis, words from official life (what Docuscope tags as Commonplace Authority, in bright green) associated with history, since these are context specific. More interesting is the presence of the purple words, which Docuscope tags as person properties. These are high in history, but show up in comedy as well, as you can see on the loading biplot above. This marked up passage is also useful because it shows us something we’d want to disagree with: you don’t have to be Saul Kripke to see that a proper name like Henry is an imperfect designator of persons, particularly because other proper names such as Richard do not get counted under this category by Docuscope. We live with the imperfections, unless it appears that there are so many mentions of the name Henry in the plays that this entire LAT category must be discounted.


Clustering the Plays Without Principal Components

Folio plays clustered using all Language Action Types, Non-Standardized Data


In comparison to the previous post, where we were using the plays’ scores on Principal Components to create clusters, here we are just using the percentage counts of the plays on all of the Language Action Types, the lowest level of aggregation in Docuscope’s taxonomy of words or strings of language. There are 101 Language Action Types or LATs, which is to say, buckets of words or strings of words that David Kaufer has classified as doing a certain kind of linguistic or rhetorical work in a text. I have made a table of examples of these types, taken from the George Eliot novel Middlemarch, which can be downloaded here.

I find this diagram more than a little unnerving. It is quite accurate in terms of received genre judgments — notice that almost all of the Folio history plays (in green) are correct — and there are nice clusterings of both tragedies (tan) and comedies (red). Henry VIII, which is here identified as a late play (blue), is placed in the cluster full of other late plays (including Coriolanus, which could just as easily have been coded blue). And plays with a similar tone — Titus, Lear, and Timon — are all grouped together as tragedies, separate from the other tragedies that are placed together further above. The strange pairing that repeats here from the Principal Component clusterings is Tempest plus Romeo and Juliet, something which merits further inquiry.

Why should a mechanical algorithm looking at distances between counts of things produce a diagram this accurate? I’m not really sure. The procedure involves arraying each of the 36 plays in a multidimensional space depending on its percentage score on each of the things being rated here — the LAT categories. So, if “Motion” strings are one category, you can imagine an X axis with the scores of all the plays on “Motion,” with a Y axis rating all the plays on “Direct Address” as below:

Direct Address and Motion Scores in two Dimensions


Now think about adding another score — First Person — to the third dimension, which will give us a spatial distribution of the plays and their scores on each of these three LATs:

Direct Address, First Person and Motion Scores of Folio Plays


Now, there are distances between all of the points here and various methods (single linkage, complete linkage, Ward’s) for expressing the degree to which items arranged in such a space can be grouped together in a hierarchy of filiation or likeness. If you multiply out all of the things being scored in this analysis — that is, all 101 Language Action Types — you end up with a multidimensional space that is unvisualizable. But there are still distances among items in this multidimensional space, distances that can be placed into the algorithms for producing the hierarchy of likeness. That is what is going on — using Ward’s procedure with non-standardized data — in the visualization at the beginning of this post.
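
To make the idea of a "hierarchy of likeness" concrete, here is a deliberately naive sketch of Ward-style agglomeration in plain Python. It is not the JMP routine that produced the diagram above: the scores are made up, only three LATs are used instead of 101, and `ward_cost` and `ward_merge_order` are hypothetical names. At each step it merges the two groups whose union adds the least within-group variance, which is the criterion Ward's method is built on.

```python
from itertools import combinations


def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def ward_cost(a, b):
    """Increase in within-cluster sum of squares if clusters a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist


def ward_merge_order(items):
    """Agglomerate `items` ({label: coordinates}) and return the merges in order."""
    clusters = {(label,): [coords] for label, coords in items.items()}
    merges = []
    while len(clusters) > 1:
        # Pick the pair whose merger adds the least within-cluster variance.
        left, right = min(
            combinations(clusters, 2),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[left + right] = clusters.pop(left) + clusters.pop(right)
        merges.append((left, right))
    return merges


# Made-up percentage scores on three LATs (Motion, Direct Address, First Person).
toy_plays = {
    "Comedy A": (0.9, 2.1, 3.0),
    "Comedy B": (1.0, 2.0, 3.2),
    "History A": (2.4, 0.8, 1.1),
    "History B": (2.5, 0.9, 1.0),
}
for left, right in ward_merge_order(toy_plays):
    print(left, "+", right)
```

Nothing in the procedure changes when the tuples have 101 entries rather than three; the space stops being visualizable, but the distances between centroids are computed in the same way.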

As I’ve said before, a picture is nice, but just because you can reproduce a human classification with an algorithm doesn’t mean you’ve made any progress. You have to be able to show what’s going on in a text — which words are doing what things some or most of the time — before you can call your work an analysis. Perhaps that’s another reason why I find a diagram like this unnerving: I cannot work back from it to a passage in a text.

By standardizing the data, we get the following re-arrangement. I am unsure how to categorize the benefits of data standardization in this case, but think this is a comparatively less compelling diagram:

Clustering of Folio Plays using Standardized Data

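
One mechanical consequence of standardization can be shown with a small sketch. The numbers below are invented for illustration (two LATs, four plays), and the helper names are hypothetical. In raw percentages a frequent LAT dominates the distances between plays; once each LAT is converted to z-scores, a rare LAT has an equal say, and a play's nearest neighbor can change.

```python
from statistics import mean, stdev


def standardize(columns):
    """Convert each variable (column) to z-scores: mean 0, standard deviation 1."""
    result = {}
    for name, values in columns.items():
        m, s = mean(values), stdev(values)
        result[name] = [(v - m) / s for v in values]
    return result


def sq_distance(columns, i, j):
    """Squared Euclidean distance between plays i and j across all variables."""
    return sum((vals[i] - vals[j]) ** 2 for vals in columns.values())


# Made-up percentage scores for four plays on a frequent LAT and a rare one.
raw = {
    "FirstPerson": [5.0, 5.5, 6.5, 9.0],
    "CommonAuthority": [0.10, 0.40, 0.11, 0.12],
}
z = standardize(raw)

# Raw data: play 0 sits nearer to play 1 than to play 2.
print(sq_distance(raw, 0, 1), sq_distance(raw, 0, 2))
# Standardized data: the ordering flips, because the rare LAT now counts as much.
print(sq_distance(z, 0, 1), sq_distance(z, 0, 2))
```

Whether that equal say is desirable is a separate question: if rare LATs are mostly noise, standardizing amplifies the noise, which may be one reason the second diagram is less compelling.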
