Tag: LATs

  • Visualizing Linguistic Variation with LATtice

    The transformation of literary texts into “data” – frequency counts, probability distributions, vectors – can often seem reductive to scholars trained to read closely, with an eye on the subtleties and slipperiness of language. But digital analysis, with its massive scale and its sheer inhuman capacity for repetitive computation, can register complex patterns and nuances that might be beyond even the most perceptive and industrious human reader. Detecting and interpreting these patterns, teasing them out from the quagmire of numbers without sacrificing the range and richness of the data a text analysis tool can accumulate, is a challenging task. A program like DocuScope can take an entire corpus of texts and sort every word and phrase into groups of rhetorical features. It produces a set of numbers for each text in the corpus, representing the relative frequency counts for 101 “language action types,” or LATs. Taken together, the numbers form a 101-dimensional vector that represents the rhetorical richness of a text, a literary “genetic signature” as it were.

    Once we have this data, however, how can we use it to compare texts, to explore how they are similar and how they differ? How can we return from the complex yet “opaque” collection of floating point numbers to the linguistic richness of the texts they represent? I wrote a program called LATtice that lets us explore and compare texts across entire corpora but also allows us to “drill down” to the level of individual LATs to ask exactly what rhetorical categories make texts similar or different. To visualize relations between texts or genres, we have to find ways to reduce the dimensionality of the vectors, to represent the entire “gene” of the text within a logical space in relation to other texts. But it is precisely this high dimensionality that accounts for the richness of the data that DocuScope produces, so it is important to be able to preserve it and to make comparisons at the level of individual LATs.

    LATtice addresses this problem by producing multiple visualizations in tandem, allowing us to explore the same underlying LAT data from as many perspectives and in as much detail as possible. It reads a data file from DocuScope and draws a grid, or heatmap, representing “similarity” or “difference” between texts. The heatmap, based on the Euclidean distance between vectors, uses a color coding scheme in which darker shades represent texts that are “closer,” or more similar, and lighter shades represent texts that are further apart, or less similar, according to DocuScope’s LAT counts. If there are N texts in the corpus, LATtice draws an “N x N” grid representing the distance of each text from every other text. Of course, this table is symmetrical around the diagonal, and the diagonal itself represents the intersection of each text with itself (a text is perfectly similar to itself, so the bottom right “difference” panel shows no bars for these cases). Moving the mouse around the heatmap allows one to quickly explore the LAT distribution for each text-pair.
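    The distance computation behind the heatmap can be sketched in a few lines of Python. This is a minimal illustration, not LATtice’s actual code (LATtice is a Java program), and the corpus here is a handful of invented 101-dimensional vectors:

    ```python
    import math
    import random

    random.seed(0)
    N_TEXTS, N_LATS = 5, 101
    # Hypothetical corpus: each row is one text's 101 LAT relative frequencies.
    corpus = [[random.random() for _ in range(N_LATS)] for _ in range(N_TEXTS)]

    def euclidean(a, b):
        """Euclidean distance between two LAT vectors."""
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # The N x N grid: symmetric around the diagonal, zero on the diagonal
    # (each text is perfectly similar to itself).
    dist = [[euclidean(corpus[i], corpus[j]) for j in range(N_TEXTS)]
            for i in range(N_TEXTS)]

    # Normalize to [0, 1] shades: 0 (darkest) = most similar, 1 (lightest) = least.
    max_d = max(max(row) for row in dist)
    shades = [[d / max_d for d in row] for row in dist]
    ```

    A real corpus would of course supply the frequency vectors from a DocuScope output file rather than a random number generator; only the distance and shading logic is the point here.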

    Screenshot of LATtice

    While the main grid can reveal interesting relationships between texts, it hides the underlying factors that account for differences or similarities, the linguistic richness that DocuScope counts and categorizes so meticulously. LATtice therefore provides multiple, overlapping visualizations to help us explore the relationship between any two texts in the corpus at the level of individual LATs. Any text-pair on the grid can be “locked” by clicking on it, allowing the user to explore its LATs in more detail. The top right panel shows how the LATs of the two texts relate to each other: in this histogram, the text on the X axis of the heatmap is represented in red and the one on the Y axis in blue for side by side comparison. All the other panels follow this red-blue color coding for the text-pair. The bottom panel displays only the LATs whose counts are most dissimilar. These are the LATs we will want to focus on in most cases, as they account for most of the “difference” between the texts in DocuScope’s analysis. A red bar signifies that for this LAT the text on the X axis (our ‘red’ text) had the higher relative frequency count, while a blue bar signals that the Y axis text (our ‘blue’ text) had the higher count. This panel lets us quickly see exactly where two texts differ from each other. Finally, LATtice also produces a scatterplot as a very quick way of gauging “similarity” between texts. It plots the LAT readings of the two texts against each other and color codes the dots to indicate which text has the higher relative frequency for a particular LAT (grey dots indicate that both texts have the same value). The “spread” of the dots gives a rough indication of difference or similarity: a larger spread indicates dissimilar texts, while dots clustering around the diagonal indicate very similar texts.
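    The logic of that bottom “difference” panel can be sketched as follows. This is an illustrative reconstruction with invented frequency values, not the program’s source; the number of bars shown (ten here) is also an assumption:

    ```python
    import random

    random.seed(1)
    N_LATS = 101
    text_x = [random.random() for _ in range(N_LATS)]  # the 'red' X-axis text
    text_y = [random.random() for _ in range(N_LATS)]  # the 'blue' Y-axis text

    # Signed per-LAT differences between the two texts.
    diffs = [x - y for x, y in zip(text_x, text_y)]

    # The LATs on which the texts differ most, largest gap first.
    top = sorted(range(N_LATS), key=lambda i: abs(diffs[i]), reverse=True)[:10]

    # Red bar: the X-axis text is higher for this LAT; blue bar: the Y-axis text is.
    colors = ["red" if diffs[i] > 0 else "blue" for i in top]
    ```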

    You can try LATtice out with two sample data-sets by clicking on the links below. The first is drawn from the plays of Shakespeare, which are in this case arranged in rough chronological order. As Hope and Witmore’s work has demonstrated, the possibilities opened up by applying DocuScope to the Shakespeare corpus are rich, and hopefully exploring the relationship between individual plays on the grid will produce new insights and new lines of inquiry. The second data-set is experimental – it tries to use DocuScope not to compare multiple texts but to explore a single text – Milton’s Paradise Lost – in detail. It might give us insights about how digital techniques can be applied on smaller scales with well-curated texts to complement literary close-reading. The poem was divided into sections based on the speakers (God, Satan, Angels, Devils, Adam, Eve) and the places being described (Heaven, Hell, Paradise). These chunks were then divided into sections of roughly three hundred lines each. As an example, we might notice straightaway that speakers and place descriptions seem to have very distinct characteristics. Speeches are broadly similar to each other, as are place descriptions. This is not unexpected, but what accounts for these similarities and differences? Exploring the LATs helps us approach this question with a finer lens. Paradise, for example, is full of “sense objects,” while Godly and angelic speech does not refer to them as often. Does Adam refer to “authority” more when he speaks to Eve? Does Satan’s defiance leave a linguistic trace that distinguishes him from unfallen angels? Hopefully LATtice will help us explore and answer such questions and let us bring DocuScope’s data closer to the nuances of literary reading.


    Finally, a few technical notes: The above links should load LATtice with the appropriate data-sets. Of course, you will need to have Java installed on your machine and to have applets enabled in your browser. You can also download LATtice and the sample data-sets, along with detailed instructions, as stand-alone applications for the following platforms:

    There are a few advantages to doing this. First, the standalone version offers an additional visualization panel which represents the distribution of LATs as box-and-whisker plots and shows where the text-pair’s frequency counts stand relative to the rest of the corpus. Secondly, the standalone application can make use of the entire screen, which can be a great advantage on larger, higher-resolution monitors.

  • The Funniest Thing Shakespeare Wrote? 767 Pieces of the Plays

    Press to Play: 767 Pieces of Shakespeare in Scaled PCA Space

    Now for something a little different. I mentioned before that we can conduct similar analyses on pieces of the plays rather than the plays as a whole. In this experiment, I have been working with 1000-word chunks of Shakespeare plays, which allows me to use many more variables in the analysis. (This was the technique that Hope and I used in our 2007 article on Tragicomedy.) Obviously the plays weren’t written to be read, much less analyzed, in identically sized pieces: the procedure is artificial through and through. It does allow us, however, to see things that Shakespeare does consistently throughout different genres, things that happen repeatedly throughout an entire play rather than just at the beginning or end. Another caveat: we partitioned the plays starting at the beginning of each text, making the first 1000 words the first “piece.” This results in the loss of some of the playtext at the end, since any remainder shorter than 1000 words is dropped. In future analyses, we will take evenly spaced 1000-word samples from beginning to end, distributing the losses in between. There are no perfect answers here when it comes to dividing the plays into working units. So this is a first installment.
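    The chunking procedure described above can be sketched in a few lines of Python. The function name is my own invention, but the start-at-the-beginning, drop-the-remainder behavior matches the description:

    ```python
    def chunk_words(text, size=1000):
        """Split a text into consecutive size-word pieces, starting at the
        beginning; any final remainder shorter than `size` is dropped."""
        words = text.split()
        return [words[i:i + size] for i in range(0, len(words) - size + 1, size)]

    # A hypothetical 2,500-word play yields two pieces; the last 500 words are lost.
    pieces = chunk_words(" ".join(["word"] * 2500))
    ```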

    The video above (press to play) is a three dimensional JMP plot of 767 pieces of Shakespeare in a dataspace of three scaled Principal Components (1, 4, and 9), which I have chosen based on their power to sort the plays in the Tukey test. (See Tukey results for PCs 1 and 4.) When you run the video capture, you’ll see a series of dots that are color coded by genre: red is comedy, green is history, blue is late plays, and orange is tragedy. Early in the capture, I move an offscreen slider that creates a series of chromatic “halos” or ellipsoid bubbles around neighboring dots: these halos envelop dot groupings as they meet certain contiguity thresholds. You see the two major clusters I am interested in here, histories and comedies, forming in the lower left and upper right respectively. (Green on lower left, red on upper right.) Interestingly enough, the see-saw effect we saw in our analysis of entire plays is repeated here: comedies and histories are the most easily separated, because whenever Shakespeare is using strings associated with comedy, he can’t or won’t simultaneously use strings associated with history (and vice versa). Linguistic weight cannot be placed on both sides of this particular generic fulcrum at once.

    Now the resulting encrusted object, which I have rotated in three dimensions, is a lot less elegant than the object we would be contemplating were we to do discriminant analysis of these groups. I am saving Discriminant Analysis for a later post. For all its imperfections, Principal Component Analysis is still going to give us results or linguistic patterns we can make sense of, which is the ultimate measure of success here. I think it’s worth appreciating the spatial partitioning here in all of its messiness: the multicolored object presents both a pattern that we are familiar with — comedies and histories really do flock to opposite ends of the containing dataspace — and some jagged edges that show the imperfections of the analysis. Imperfections are good: we want to find exceptions to generic rules, not just confirmations of a pattern.

    Looking at the upper right hand quadrant, we see the items that are high on both PC1 and PC4. In this analysis we are using Language Action Types or LATs, the finest grained categories that Docuscope uses (it has 101 of them). We will want to ask which specific LATs are pushing items into the different areas here, and to do so, I have produced the following loading biplot:

    A loadings biplot gives information about components in spatial form, showing our different analytic categories (LATs such as “Common Authorities,” “DenyDisclaim,” “SelfDisclosure,” etc.) as red arrows or vectors. To read this diagram, consider the two components individually. What makes an item high on PC1? Since PC1 is rated on the horizontal axis, we scan left to right for the vectors or arrows that are at the extremes. To my eye, SelfDisclosure, FirstPer[son] and DirectAddress are the most strongly “loaded” on this component, which means that any piece with a relatively high score on these variables will be favored by this component and thus pushed to the right-hand side of a scatterplot (see below). Conversely, any item that is relatively low in the words that fall under categories such as Motions, SenseProperty, SenseObject, and Inclusive will be pushed to the left. Notice that the two variables SelfDisclosure and SenseObject are almost directly opposed: the loadings biplot is telling us here that, statistically at least, the use of this one type of word (or string of words) seems to preclude the use of its opposite. This would be true of all the longer vector arrows in the diagram that extend from opposite sides of the origin.

    We can then do the same thing with the vertical axis, which represents PC4. Here we see that LangRef [Language Reference], DenyDisclaim and Uncertainty strings are used in opposition to those classed under the LAT Common Authority. If an item scores high on PC4 (which most comedies do), it will be high in LangRef, Uncertainty and DenyDisclaim strings while simultaneously lacking Common Authority strings. So what about the vectors that bisect the axes, for example, DenyDisclaim, which appears to load positively on both PC1 and PC4? This LAT is shared by the two components: it does something for both. We can learn a lot by looking at this diagram, since — once we’ve decided that these components track a viable historical or critical distinction among texts — it shows us certain types of language “schooling together” in the process of making this distinction. DirectAddress and FirstPer [or, First Person], Autobio and Acknowledge thus tend to go together here (lower right), as do Motions, SenseProperties, and SenseObjects (upper left).
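    The way loadings “push” a piece along a component can be made concrete with a toy calculation. The loading values and frequencies below are invented (a real analysis would derive them from the full pieces-by-LATs matrix, after centering and scaling); the point is simply that a component score is a weighted sum, so a piece high in the positively loaded LATs and low in the negatively loaded ones lands far to the right (or top) of the plot:

    ```python
    # Hypothetical miniature: five LATs instead of 101, with invented loadings.
    lats = ["SelfDisclosure", "FirstPer", "DirectAddress", "SenseObject", "CommonAuthority"]
    pc1_loadings = [0.6, 0.5, 0.4, -0.5, -0.1]   # invented values
    pc4_loadings = [0.1, 0.0, 0.2, 0.1, -0.7]    # invented values

    piece = [0.8, 0.7, 0.6, 0.1, 0.2]            # invented relative frequencies

    def score(freqs, loadings):
        """A piece's score on a component: the dot product of its LAT
        frequencies with the component's loading vector."""
        return sum(f * w for f, w in zip(freqs, loadings))

    pc1_score = score(piece, pc1_loadings)
    pc4_score = score(piece, pc4_loadings)
    # High on both PC1 and PC4: such a piece would sit in the upper right,
    # "comic" quadrant of the scatterplot.
    ```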

    In fact, the designer of Docuscope saw these LATs as being related, which is why elsewhere he aggregated them together into larger “buckets” such as Dimensions or Clusters, the latter being the aggregate we used in our analysis of full plays. What we’re seeing here is a kind of “schooling of like LATs in the wild,” where words that are grouped together on theoretical grounds are associating with one another statistically in a group of texts. If the intellectual architecture of Docuscope’s categories is good, this schooling should happen with almost any biplot of components, no matter what types of texts they discriminate. The power of this combination of Principal Components, then, is that it aligns the filiations and exclusions of the underlying language architecture with genres that we recognize, and will hopefully suggest theatrical or narrative strategies that support these recognizable divisions.

    The loadings biplot shows us how the variables in our analysis push items in the corpus into different regions of a dataspace. We can now populate that dataspace with the 767 pieces of Shakespeare’s plays, rating each of them on the two components. Here is how the plays appear in a plot of scaled Component 1 against Component 4, again color coded with the scheme used above:

    Notice the pattern we’ve seen before: comedies (here represented in red) are opposite histories (green) in diagonal quadrants. In general, they don’t mingle. The upper right hand quadrant, which is where the comedies tend to locate, contains the first item that I’d like to discuss: the red dot labelled Merry Wives (circa 2.1). This dot represents a piece of the first scene, second act of The Merry Wives of Windsor. As the item that rates highest on both PC1 and PC4 — components which the Tukey Test shows us to be best at discriminating comedy — this piece of The Merry Wives of Windsor is the most comic 1000 word passage that Shakespeare wrote. Here is an excerpt:

    “I’ll entertain myself like one that I am not acquainted withal; for, sure, unless he know some strain in me, that I know not myself, he would never have boarded me in this fury.” In this color coded sentence we can see diagrammed the comic dance step. While I think there are funnier lines — “I had rather be a giantess, and lie under Mount Pelion” — the former is significant for what it does linguistically: it shows a speaker entertaining and then rejecting a perspective on her own situation (that of Falstaff) while comparing it with another (her own). The uncertainty strings (orange) such as “know not,” “doubt” and the indefinite “some” contribute to this mock searching rhetoric. Self-disclosure strings such as “myself” and “makes me” anchor the reality testing exercise to the speaker, who must make explicit her own place in the sentence as the object of doubt, while the oppositional reasoning strings such as “never” and “not” mark the mobility of this speaker’s perspective: I will try this toying perspective on my honesty, seeing myself as Jack Falstaff does, but will reject it soon enough. The reason that this passage is so highly rated on these two factors has something to do with the multiplication of perspectives that are being juggled onstage: there are two individuals here — Mistress Page and Mistress Ford — who are, as it were, rising above an embedded perspective contained in Falstaff’s letter, commenting upon that perspective, and then rejecting it. Each time a partition in reality (a level) is broached in the stage action and dialogue, comic language appears.

    We can oppose this most comic piece of writing — again, according to PCA — to its opposite in linguistic terms, a piece that contains what the comic one lacks and lacks what the comic one has. Here, then, is a portion of the “most historical” piece of Shakespeare, from Richard II 1.3:

    Here we see the formal settings of royal display, a herald offering Mowbray’s formal challenge — no surprise this exemplifies history, a genre in which the nation and its kings are front and center. Yet where the passage really begins to rack up points is in its use of descriptive words, which are underlined in yellow. Chairs, helmets, blood, earth, gentle sleep, drums, quiet confines…we don’t think of history as the genre of objects and adjectives, but linguistically it is. Inclusive strings, in the olive-colored green, are perhaps less surprising given our previous analyses. We expect kings to speak about “our council” and what “we have done.” But notice that such language is quite difficult to use in comedy: even in a passage of collusion, where we would expect Mistress Page and Mistress Ford to be using first person plural pronouns, the language tends to pivot off of first person singular perspectives. The language of “we” really isn’t a part of comedy.

    I am less surprised to find, at this finer grained level of analysis, words from official life (what Docuscope tags as Commonplace Authority, in bright green) associated with history, since these are context specific. More interesting is the presence of the purple words, which Docuscope tags as person properties. These are high in history, but show up in comedy as well, as you can see on the loading biplot above. This marked up passage is also useful because it shows us something we’d want to disagree with: you don’t have to be Saul Kripke to see that a proper name like Henry is an imperfect designator of persons, particularly because other proper names such as Richard do not get counted under this category by Docuscope. We live with the imperfections, unless it appears that there are so many mentions of the name Henry in the plays that this entire LAT category must be discounted.