Tag: counting_things

  • Visualizing Linguistic Variation with LATtice

    The transformation of literary texts into “data” – frequency counts, probability distributions, vectors – can often seem reductive to scholars trained to read closely, with an eye on the subtleties and slipperiness of language. But digital analysis, in its massive scale and its sheer inhuman capacity for repetitive computation, can register complex patterns and nuances that might be beyond even the most perceptive and industrious human reader. Detecting and interpreting these patterns, teasing them out of the quagmire of numbers without sacrificing the range and richness of the data that a text analysis tool might accumulate, can be a challenging task. A program like DocuScope can easily take an entire corpus of texts and sort every word and phrase into groups of rhetorical features. It produces a set of numbers for each text in the corpus, representing the relative frequency counts for 101 “language action types,” or LATs. Taken together, the numbers form a 101-dimensional vector that represents the rhetorical richness of a text, a literary “genetic signature” as it were.
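    To make that representation concrete, here is a minimal Python sketch of the data structure involved; the LAT names and counts are invented for illustration, and DocuScope’s real output format differs:

        def lat_vector(lat_counts, total_words, lat_names):
            """Turn raw LAT hit counts into a relative-frequency vector."""
            return [lat_counts.get(name, 0) / total_words for name in lat_names]

        # Hypothetical LAT names, for illustration only; DocuScope defines 101.
        LAT_NAMES = ["FirstPerson", "SenseObject", "Authority"]
        text_a = lat_vector({"FirstPerson": 412, "SenseObject": 231}, 30000, LAT_NAMES)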

    Once we have this data, however, how can we use it to compare texts, to explore how they are similar and how they differ? How can we return from the complex yet “opaque” collection of floating point numbers to the linguistic richness of the texts they represent? I wrote a program called LATtice that lets us explore and compare texts across entire corpora but also allows us to “drill down” to the level of individual LATs to ask exactly what rhetorical categories make texts similar or different. To visualize relations between texts or genres, we have to find ways to reduce the dimensionality of the vectors, to represent the entire “gene” of the text within a logical space in relation to other texts. But it is precisely this high dimensionality that accounts for the richness of the data that DocuScope produces, so it is important to be able to preserve it and to make comparisons at the level of individual LATs.

    LATtice addresses this problem by producing multiple visualizations in tandem, allowing us to explore the same underlying LAT data from as many perspectives and in as much detail as possible. It reads a data file from DocuScope and draws a grid, or heatmap, representing “similarity” or “difference” between texts. The heatmap is based on the Euclidean distance between LAT vectors and uses a color coding scheme in which darker shades represent texts that are “closer” or more similar, and lighter shades represent texts that are further apart or less similar according to DocuScope’s LAT counts. If there are N texts in the corpus, LATtice draws an N x N grid in which the distance of each text from every other text is represented. Of course, this table is symmetrical around the diagonal, and the diagonal itself represents the intersection of each text with itself (a text is perfectly similar to itself, so the bottom right “difference” panel shows no bars for these cases). Moving the mouse around the heatmap allows one to quickly explore the LAT distribution for each text-pair.
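    LATtice itself runs as a Java applet, but the computation behind the heatmap is simple enough to sketch in a few lines of Python. This is a minimal illustration of the idea, not the applet’s actual code:

        import math

        def euclidean(u, v):
            """Euclidean distance between two LAT vectors."""
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

        def distance_matrix(vectors):
            """N x N table of pairwise distances: symmetric, zero on the diagonal."""
            n = len(vectors)
            return [[euclidean(vectors[i], vectors[j]) for j in range(n)]
                    for i in range(n)]

        def shade(d, d_max):
            """Map a distance to a grey level: 0 (black) for identical texts,
            255 (white) for the most distant pair, so darker = more similar."""
            return int(255 * d / d_max) if d_max else 0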

    Screenshot of LATtice

    While the main grid can reveal interesting relationships between texts, it hides the underlying factors that account for differences or similarities, the linguistic richness that DocuScope counts and categorizes so meticulously. LATtice therefore provides multiple, overlapping visualizations to help us explore the relationship between any two texts in the corpus at the level of individual LATs. Any text-pair on the grid can be “locked” by clicking on it, allowing the user to move to the LATs and explore them in more detail.

    The top right panel shows how the LATs of the two texts relate to each other. In the histogram, the text on the X axis of the heatmap is drawn in red and the one on the Y axis in blue for side-by-side comparison; all the other panels follow this red-blue color coding for the text-pair. The bottom panel displays only the LATs whose counts are most dissimilar. These are the LATs we will usually want to focus on, since they account for most of the “difference” between the texts in DocuScope’s analysis. A red bar in this panel signifies that the text on the X axis (our ‘red’ text) has the higher relative frequency count for that LAT, while a blue bar signals that the Y axis text (our ‘blue’ text) has the higher count. This panel lets us quickly see in exactly which respects the texts differ from each other. Finally, LATtice also produces a scatterplot as a very quick way of looking at “similarity” between texts. It plots the LAT readings of the two texts against each other and color codes the dots to indicate which text has the higher relative frequency for a particular LAT (grey dots indicate that the two texts have the same value for that LAT). The “spread” of the dots gives a rough indication of difference or similarity: a larger spread indicates dissimilar texts, while dots clustering around the diagonal indicate very similar texts.
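    As a hedged sketch of what the bottom panel amounts to (the ranking rule here is my guess at the obvious implementation, not necessarily LATtice’s exact one), one can sort the LATs by the absolute difference between the two texts and record which text has the higher count:

        def most_dissimilar_lats(red_vec, blue_vec, lat_names, top_n=10):
            """Rank LATs by how much the two texts disagree on them.
            'red' is the X-axis text, 'blue' the Y-axis text."""
            diffs = []
            for name, r, b in zip(lat_names, red_vec, blue_vec):
                colour = "red" if r > b else "blue" if b > r else "grey"
                diffs.append((abs(r - b), name, colour))
            diffs.sort(reverse=True)  # largest disagreements first
            return diffs[:top_n]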

    You can try LATtice out with two sample data-sets by clicking on the links below. The first is drawn from the plays of Shakespeare, which are in this case arranged in rough chronological order. As Hope and Witmore’s work has demonstrated, the possibilities opened up by applying DocuScope to the Shakespeare corpus are rich, and exploring the relationship between individual plays on the grid will hopefully produce new insights and new lines of inquiry. The second data-set is experimental – it uses DocuScope not to compare multiple texts but to explore a single text, Milton’s Paradise Lost, in detail. It might give us insights into how digital techniques can be applied on smaller scales, with well-curated texts, to complement literary close-reading. The poem was divided into sections based on the speakers (God, Satan, Angels, Devils, Adam, Eve) and the places being described (Heaven, Hell, Paradise); these chunks were then divided into sections of roughly three hundred lines. As an example, we might notice straightaway that speakers and place descriptions seem to have very distinct characteristics: speeches are broadly similar to each other, as are place descriptions. This is not unexpected, but what accounts for these similarities and differences? Exploring the LATs helps us approach this question with a finer lens. Paradise, for example, is full of “sense objects,” while Godly and angelic speech does not refer to them as often. Does Adam refer to “authority” more when he speaks to Eve? Does Satan’s defiance leave a linguistic trace that distinguishes him from unfallen angels? Hopefully LATtice will help us explore and answer such questions and let us bring DocuScope’s data closer to the nuances of literary reading.
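    The hand-curation – assigning lines to speakers and places – is the real work in preparing such a data-set; the mechanical part of the segmentation amounts to something like the following sketch (the section labels are my shorthand, not the data-set’s actual file names):

        def chunk(lines, size=300):
            """Split one labelled section into roughly equal runs of lines."""
            return [lines[i:i + size] for i in range(0, len(lines), size)]

        # e.g. sections = {"Satan": [...], "Paradise": [...]}  # hand-curated
        # chunks = {label: chunk(lines) for label, lines in sections.items()}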


    Finally, a few technical notes: The above links should load LATtice with the appropriate data-sets. Of course, you will need to have Java installed on your machine and to have applets enabled in your browser. You can also download LATtice and the sample data-sets, along with detailed instructions, as stand-alone applications for the following platforms:

    There are a few advantages to doing this. First, the standalone version offers an additional visualization panel, which represents the distribution of LATs across the corpus as box-and-whisker plots and shows where the text-pair’s frequency counts stand relative to the rest of the corpus. Second, the standalone application can make use of the entire screen, which can be a great advantage on larger, higher-resolution monitors.
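    The statistics behind that extra panel can also be sketched briefly in Python (again a minimal illustration of the idea, not the application’s code): for each LAT, the plot rests on a five-number summary of its frequencies across the corpus.

        import statistics

        def box_stats(values):
            """Five-number summary of one LAT's frequencies across the corpus:
            whiskers at min and max, box from Q1 to Q3, line at the median."""
            q1, median, q3 = statistics.quantiles(values, n=4)
            return min(values), q1, median, q3, max(values)

        # A text-pair's two readings can then be marked against these bounds.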

  • The Musical Mood of the Country

    This morning the New York Times published a story about a group of mathematicians who are counting types of words in popular songs in order to get a handle on something like the mood of the country. In trying to data-mine mood, they do what all people who count things do: move from something that you can quantify empirically to something that you can’t. We do this as well when we move from “types of words” or DocuScope strings in Shakespeare plays to “genre.” The strings are empirically countable — they are either there in an established corpus or they aren’t — but one must argue for any connection between what is counted and what such counts represent (genre, mood, etc.). The point I have tried to make on this blog is that the connection is interpretive, and so relies on the hermeneutic skills of the one proposing the link.

    In the abstract for the paper, recently published in the Journal of Happiness Studies, they write: “Among a number of observations, we find that the happiness of song lyrics trends downward from the 1960s to the mid 1990s while remaining stable within genres, and that the happiness of blogs has steadily increased from 2005 to 2009, exhibiting a striking rise and fall with blogger age and distance from the Earth’s equator.” This is an interesting finding, particularly the part about blogger age and distance from the equator. One of the selling points of their analysis is that the data they have obtained is voluntarily supplied, and so perhaps less subject to the social pressures that accompany surveying. I would want to know, on this score, whether a song title (for example) is subject to other types of pressures. The songwriter is not just “reporting” an inner state by naming a song in a particular way — take the Ramones song “I Wanna Be Sedated,” for example — but offering this title to an audience. Song names are rhetorical, and so subject to a different set of pressures than “reporting.” There is another kind of self-interference here that doesn’t seem to be taken into account.

    One of the lead researchers on the paper, Peter Sheridan Dodds, argues that data supplied voluntarily on the web can serve as a kind of “remote sensor of well-being.” (I remember hearing similar arguments made about baby names a while back: you don’t have to pay for them and they’re important, therefore they are a good measure of national feeling and trends.) For example, teenagers appear to be the least happy because they more frequently use words such as “sick,” “hate” and “stupid.” Wouldn’t it be more interesting to track how the use of these words (or the absence of them) compares with that of groups that teenagers themselves describe as “unhappy”? My inclination here would be to use data-mining techniques to assay and re-describe classifications made by a given social group in terms that the group may not necessarily be aware of. Then the factual claim would be: when teenagers describe someone as happy, that person is x% less likely to use words like “sick,” “hate” and “stupid.”
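    To show the shape such a claim would have to take, here is a hypothetical sketch in Python; the grouping, the corpora, and the per-thousand-words normalization are all my assumptions, not the study’s method:

        def rate_per_1000(texts, words):
            """Occurrences of the target words per 1,000 words across a group."""
            hits = total = 0
            for text in texts:
                tokens = text.lower().split()
                total += len(tokens)
                hits += sum(tokens.count(w) for w in words)
            return 1000 * hits / total if total else 0.0

        NEGATIVE = {"sick", "hate", "stupid"}
        # happy = rate_per_1000(texts_by_people_teens_call_happy, NEGATIVE)
        # others = rate_per_1000(texts_by_everyone_else, NEGATIVE)
        # pct_less = 100 * (1 - happy / others)   # the "x%" in the claim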

    I can imagine the authors of the Music-Mood study making the following set of claims:

    Claim 1) Research on web-logs, lyrics, and other sources of expression shows that words like “sick,” “hate” and “stupid” occur more frequently in a representative group of works by teenagers. This would be the empirical claim.

    Claim 2) People who are experiencing a mood such as “well-being” are less likely to mention words like “sick,” “hate,” and “stupid” in unprompted work such as songwriting or blogging. This is an interpretive claim that must be argued for.

    Claim 3) Teenagers are less likely than others to be experiencing a mood of well-being. This is logically true if you accept 1 and 2.

    Now, what’s interesting about 2 — the interpretive claim — is that it could be made without numbers. In a sense, you either believe this or you don’t. Which raises the question: what exactly are the numerical claims doing in this argument? What if claim 2 is “kind of true,” or “true only among certain people”? Would this mean that “kind of a lot” of teenagers are unhappy?

    I would be more comfortable saying that teenagers use more of the following words (“hate,” “stupid”), and that a close look at the contexts in which they use them (which can never be comprehensive) suggests that their use is connected to mood in the following way (e.g., their use allows teenagers to gain social attention by citing negative emotions, their use indicates depression, their use indexes the presence of Goth subculture, etc.). But I would want to know how the words are used rather than simply making inferences from the fact that they occur. The counter-argument here is that the law of large numbers guarantees that even if there is wide variation in the uses of the words (granting, in effect, that not all occurrences are “reports” of mood), there is nevertheless a broad enough pattern to make a generalization. Fair enough, but what numbers are you going to use to make the generalization?

    I’m all for the empirical investigation of abstract concepts like happiness, genre, and authorial intent. These higher-order concepts don’t come from outer space: we create them to capture some suite of characteristics we find in reality or in ourselves. But the Music-Mood analysis lacks a crucial ingredient: an explicit human judgment about the classes that are being measured by the tokens that are being counted. Unless you make that judgment explicit — saying something like “x% of people who experience what persons y and z would describe as ‘well-being’ also produce unprompted work containing these words” — you are really just saying that “a lot” of people who we think are happy do this.

    Naming something with a word is a way of creating a class of things (as long as that word is not a proper name), and it is classes of things that are correlated quantitatively using statistics: quantities of classes of words in classes of works, for example. In any such analysis, the classes themselves cannot be derived empirically. They have to be specified in advance by appealing to experience, common sense, expertise, or the like. What troubles me about the Musical Mood analysis here is that the rationale for membership in the class of words indicating “well-being” is not spelled out, and perhaps never could be. I would rather ask someone — an expert? a teenager? — to name people who experience well-being and then do one of Matt Jockers’ most-frequent-word analyses on their lyrics or blogs in order to get at the underlying pattern. It’s fine to begin with a set of words whose occurrence indicates (to you) a feeling of well-being, but without knowing quantitatively how indicative they are, the numbers are just another kind of adjective. You might as well read a bunch of web pages and decide for yourself.

    My guess is that you would conclude that teenagers write like teenagers rather quickly.
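    As a footnote: the most-frequent-word pass mentioned above takes only a few lines of Python. The tokenization and stopword handling in this sketch are my own simplifications, not Jockers’ actual method:

        from collections import Counter
        import re

        def most_frequent_words(texts, top_n=50, stopwords=frozenset()):
            """Tally the most frequent words across a group of texts."""
            counts = Counter()
            for text in texts:
                counts.update(w for w in re.findall(r"[a-z']+", text.lower())
                              if w not in stopwords)
            return counts.most_common(top_n)

        # e.g. most_frequent_words(lyrics_of_people_named_as_happy)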