Month: December 2009

  • Clustering the Plays Without Principal Components

    Folio plays clustered using all Language Action Types, Non-Standardized Data

    In contrast to the previous post, where we used the plays’ scores on Principal Components to create clusters, here we use only the plays’ percentage counts on all of the Language Action Types, the lowest level of aggregation in Docuscope’s taxonomy of words or strings of language. There are 101 Language Action Types, or LATs: buckets of words or strings of words that David Kaufer has classified as doing a certain kind of linguistic or rhetorical work in a text. I have made a table of examples of these types, taken from the George Eliot novel Middlemarch, which can be downloaded here.

    I find this diagram more than a little unnerving. It is quite accurate in terms of received genre judgments: notice that almost all of the Folio history plays (in green) are grouped together, and there are nice clusterings of both tragedies (tan) and comedies (red). Henry VIII, which is here identified as a late play (blue), is placed in the cluster full of other late plays (including Coriolanus, which could just as easily have been coded blue). And plays with a similar tone (Titus, Lear, and Timon) are all grouped together as tragedies, separate from the other tragedies that are placed together further above. The strange pairing that repeats here from the Principal Component clusterings is Tempest plus Romeo and Juliet, something which merits further inquiry.

    Why should a mechanical algorithm looking at distances between counts of things produce a diagram this accurate? I’m not really sure. The procedure involves arraying each of the 36 plays in a multidimensional space according to its percentage score on each of the things being rated here, namely the LAT categories. So, if “Motion” strings are one category, you can imagine an X axis carrying the scores of all the plays on “Motion” and a Y axis carrying their scores on “Direct Address,” as below:

    Direct Address and Motion Scores in two Dimensions

    Now think about adding another score, First Person, as a third dimension, which will give us a spatial distribution of the plays and their scores on each of these three LATs:

    Direct Address, First Person and Motion Scores of Folio Plays
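    To make the idea of distance in this space concrete, here is a minimal Python sketch. The play names and percentage scores are made up for illustration and are not real Docuscope output; `euclidean` and `distance_table` are hypothetical helper names, and straight-line (Euclidean) distance is assumed as the measure.

```python
import math

# Hypothetical percentage scores, NOT real Docuscope output.
# Each tuple is (Motion, Direct Address, First Person).
scores = {
    "Play A": (1.0, 2.0, 3.0),
    "Play B": (1.0, 2.0, 5.0),
    "Play C": (4.0, 6.0, 3.0),
}

def euclidean(p, q):
    """Straight-line distance between two plays in LAT space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_table(scores):
    """Distance for every unordered pair of plays, keyed by (name, name)."""
    names = sorted(scores)
    return {
        (a, b): euclidean(scores[a], scores[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

table = distance_table(scores)
for pair, dist in table.items():
    print(pair, round(dist, 3))
```

    The same function works unchanged for any number of dimensions, so nothing about it depends on there being three LATs rather than 101.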

    Now, there are distances between all of the points here, and there are various methods (single linkage, complete linkage, Ward’s) for expressing the degree to which items arranged in such a space can be grouped together in a hierarchy of filiation or likeness. If you extend this to all of the things being scored in this analysis, that is, all 101 Language Action Types, you end up with a multidimensional space that cannot be visualized. But there are still distances among items in this multidimensional space, distances that can be fed into the algorithms for producing the hierarchy of likeness. That is what is going on, using Ward’s procedure with non-standardized data, in the visualization at the beginning of this post.
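    The grouping step can be sketched in a few lines of Python. This is an illustrative version of Ward-style agglomerative clustering, not the statistical package used for the diagrams in this post: at each step it merges the two clusters whose union adds the least to the within-cluster sum of squares. The scores are made up, and `centroid`, `ward_cost` and `ward_cluster` are hypothetical names. Statistical packages may report merge heights on a different scale.

```python
from itertools import combinations

# Hypothetical scores on two LATs, NOT real Docuscope output.
scores = {
    "Play A": (5.0, 1.0),
    "Play B": (5.0, 2.0),
    "Play C": (1.0, 6.0),
    "Play D": (1.0, 8.0),
}

def centroid(points):
    """Mean position of a cluster, one value per dimension."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def ward_cost(a, b):
    """Increase in within-cluster sum of squares if a and b are merged."""
    ca, cb = centroid(a), centroid(b)
    sq_dist = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * sq_dist

def ward_cluster(scores):
    """Return the merge history as (members_1, members_2, cost) tuples."""
    clusters = {(name,): [vec] for name, vec in scores.items()}
    merges = []
    while len(clusters) > 1:
        # Pick the pair of clusters that is cheapest to merge.
        k1, k2 = min(
            combinations(clusters, 2),
            key=lambda pair: ward_cost(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((k1, k2, ward_cost(clusters[k1], clusters[k2])))
        clusters[k1 + k2] = clusters.pop(k1) + clusters.pop(k2)
    return merges

merges = ward_cluster(scores)
for left, right, cost in merges:
    print(left, "+", right, "cost", round(cost, 2))
```

    Read from first merge to last, the history is the tree: the earliest merges are the tightest pairs, and the final merge joins the two most dissimilar groups.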

    As I’ve said before, a picture is nice, but just because you can reproduce a human classification with an algorithm doesn’t mean you’ve made any progress. You have to be able to show what’s going on in a text — which words are doing what things some or most of the time — before you can call your work an analysis. Perhaps that’s another reason why I find a diagram like this unnerving: I cannot work back from it to a passage in a text.

    Standardizing the data produces the following rearrangement. I am not sure how to characterize what standardization contributes in this case, but I find this diagram less compelling than the first:

    Clustering of Folio Plays using Standardized Data
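    One way to see why standardization can rearrange a clustering is the sketch below. It assumes that standardizing means converting each LAT to z-scores (subtract the mean across plays, divide by the sample standard deviation), which the post does not specify. The scores are made up, and `standardize` and `nearest` are hypothetical names. With raw percentages, a LAT with large values dominates the distances. After standardization every LAT counts equally, and a play's nearest neighbour can change.

```python
import math
import statistics

# Hypothetical scores, NOT real Docuscope output.
# The first LAT has large values and spread, the second small ones.
raw = {
    "Play A": (10.0, 0.1),
    "Play B": (12.0, 0.3),
    "Play C": (13.0, 0.1),
}

def standardize(scores):
    """Convert each column (LAT) to z-scores across all plays.

    Assumes every column has non-zero spread.
    """
    names = list(scores)
    columns = list(zip(*(scores[n] for n in names)))
    means = [statistics.mean(c) for c in columns]
    sds = [statistics.stdev(c) for c in columns]
    return {
        n: tuple((v - m) / s for v, m, s in zip(scores[n], means, sds))
        for n in names
    }

def nearest(name, scores):
    """Name of the play closest to `name` by Euclidean distance."""
    others = [n for n in scores if n != name]
    return min(others, key=lambda n: math.dist(scores[name], scores[n]))

z = standardize(raw)
print("raw nearest to Play A:", nearest("Play A", raw))
print("standardized nearest to Play A:", nearest("Play A", z))
```

    Whether the standardized picture is better is a judgment call: it keeps high-frequency LATs from swamping the rest, but it also gives rare and possibly noisy LATs as much weight as common ones.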