In comparison to the previous post, where we were using the plays’ scores on Principal Components to create clusters, here we are just using the percentage counts of the plays on all of the Language Action Types, the lowest level of aggregation in Docuscope’s taxonomy of words or strings of language. There are 101 Language Action Types or LATs, which is to say, buckets of words or strings of words that David Kaufer has classified as doing a certain kind of linguistic or rhetorical work in a text. I have made a table of examples of these types, taken from the George Eliot novel *Middlemarch*, which can be downloaded here.

I find this diagram more than a little unnerving. It is quite accurate in terms of received genre judgments (notice that almost all of the Folio history plays, in green, cluster together), and there are nice clusterings of both tragedies (tan) and comedies (red). *Henry VIII*, which is here identified as a late play (blue), is placed in the cluster full of other late plays (including *Coriolanus*, which could just as easily have been coded blue). And plays with a similar tone, *Titus*, *Lear*, and *Timon*, are all grouped together as tragedies, separate from the other tragedies that are placed together further above. The strange pairing that repeats here from the Principal Component clusterings is *Tempest* plus *Romeo and Juliet*, something which merits further inquiry.

Why should a mechanical algorithm looking at distances between counts of things produce a diagram this accurate? I’m not really sure. The procedure involves arraying each of the 36 plays in a multidimensional space depending on its percentage score on each of the things being rated here — the LAT categories. So, if “Motion” strings are one category, you can imagine an X axis with the scores of all the plays on “Motion,” with a Y axis rating all the plays on “Direct Address” as below:

Now think about adding another score — First Person — to the third dimension, which will give us a spatial distribution of the plays and their scores on each of these three LATs:

Now, there are distances between all of the points here and various methods (single linkage, complete linkage, Ward’s) for expressing the degree to which items arranged in such a space can be grouped together in a hierarchy of filiation or likeness. If you multiply out all of the things being scored in this analysis — that is, all 101 Language Action Types — you end up with a multidimensional space that is unvisualizable. But there are still distances among items in this multidimensional space, distances that can be placed into the algorithms for producing the hierarchy of likeness. That is what is going on — using Ward’s procedure with non-standardized data — in the visualization at the beginning of this post.
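To make the procedure concrete, here is a minimal sketch in Python using SciPy. The six "plays" and their scores on three LATs are invented for illustration and are not Docuscope output; the real analysis uses 36 plays and 101 LATs, but the steps are the same: measure distances between rows, then hand the raw scores to Ward's method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: 6 "plays" scored on 3 LATs (percent of tokens).
# These numbers are invented; they are not Docuscope output.
plays = ["A", "B", "C", "D", "E", "F"]
scores = np.array([
    [7.9, 2.1, 4.0],
    [8.1, 2.0, 4.2],
    [7.8, 2.3, 3.9],
    [6.2, 3.4, 5.1],
    [6.0, 3.6, 5.3],
    [6.3, 3.5, 5.0],
])

# Pairwise Euclidean distances between plays in LAT space
distances = squareform(pdist(scores, metric="euclidean"))

# Ward's method on the raw (non-standardized) scores
tree = linkage(scores, method="ward")

# Cut the hierarchy into two groups and report membership
labels = fcluster(tree, t=2, criterion="maxclust")
for name, label in zip(plays, labels):
    print(name, label)
```

Passing `tree` to `scipy.cluster.hierarchy.dendrogram` would draw the kind of diagram shown at the top of the post.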

As I’ve said before, a picture is nice, but just because you can reproduce a human classification with an algorithm doesn’t mean you’ve made any progress. You have to be able to show what’s going on in a text — which words are doing what things some or most of the time — before you can call your work an analysis. Perhaps that’s another reason why I find a diagram like this unnerving: I cannot work back from it to a passage in a text.

By standardizing the data, we get the following re-arrangement. I am unsure how to categorize the benefits of data standardization in this case, but think this is a comparatively less compelling diagram:

## 4 Comments

Matthew Wilkins asks a good question over at Work Product:

‘In his earlier work using principal components, he found that Othello clustered with the comedies. Using the new method reported today (based on “language action types”), that’s not the case. Or is it? When Witmore “standardizes” the texts, Othello returns to the comedies (it’s closest to Twelfth Night and Measure for Measure). So my question is: What is “standardization,” and why should it have so great a negative effect on clustering accuracy? (Othello isn’t the only play that changes places under standardization; as Witmore observes, the standardized results are much less eerily perfect than the nonstandardized ones.)’

My answer: When the counts for the texts were left unstandardized in principal component analysis (the previous post), variables with comparatively high means and large standard deviations played a more prominent role in the first component (which tries to account for as much variation as possible). For example, in this data set, Description strings (“hand,” “hear,” “sweet,” and “blood” being the most frequent tokens in this category) comprise the largest percentage of words classed by Docuscope, with a mean percentage score (of a given text) of 7.46 and a standard deviation of 1.00. Emotion tokens, on the other hand (“love,” “death,” “heaven,” and “dead”), have a mean score of 4.4 and a standard deviation of 0.53. When the scores are standardized, there is a smoothing effect of sorts. Whereas in the unstandardized analysis, PC1 loaded Description strings much higher than it did Emotion strings, standardization “shortened” the Description vector and lengthened the Emotion vector, bringing them closer together in terms of direction. So what does this mean?

When standardization allows us to compare correlations of variables corrected for scale (big ranges with little ones), we see that certain variables like Emotion, whose presence in gross percentages is not as great as that of others, become more important. They claim a larger place in the landscape. And this balancing or scaling of different ranges of variation allows us to see more subtle relationships between variables that are connected, but in a “big with little, big against little” way.

Now with the Ward’s clustering above, I think what we are seeing is that when we let the high mean, high standard deviation variables dominate the analysis, we get groupings that reflect the genre judgments of critics (in this case, Shakespeare’s editors and those who believe in a class of plays called “late plays”). But when we standardize, more subtle forms of proximity appear, proximity that takes into account the relative presence or absence of low frequency tokens (for example, Reasoning or Narrating strings). When you make these low frequency items more prominent in the analysis, the dots in our multidimensional space will be pushed further apart in ways that might have been negligible when these low frequency items were considered as raw, unscaled frequencies.
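A small worked example may help here. It uses the Description and Emotion means and standard deviations quoted above; the two plays are hypothetical, chosen to sit exactly one standard deviation apart on each variable. The point is only to show how much each variable contributes to the squared Euclidean distance before and after standardization.

```python
# Means and standard deviations as quoted in the reply above
description = {"mean": 7.46, "sd": 1.00}
emotion = {"mean": 4.4, "sd": 0.53}

def z_score(value, stats):
    """Standardize a score: subtract the mean, divide by the standard deviation."""
    return (value - stats["mean"]) / stats["sd"]

# Two hypothetical plays, one standard deviation apart on each variable
play_a = {"description": 7.46, "emotion": 4.40}
play_b = {"description": 8.46, "emotion": 4.93}

# Contribution of each variable to the raw squared distance
raw_desc = (play_b["description"] - play_a["description"]) ** 2
raw_emo = (play_b["emotion"] - play_a["emotion"]) ** 2
raw_share = raw_desc / (raw_desc + raw_emo)

# The same contributions after standardization
std_desc = (z_score(play_b["description"], description)
            - z_score(play_a["description"], description)) ** 2
std_emo = (z_score(play_b["emotion"], emotion)
           - z_score(play_a["emotion"], emotion)) ** 2
std_share = std_desc / (std_desc + std_emo)

print(f"Description share of raw squared distance: {raw_share:.2f}")
print(f"Description share after standardization:   {std_share:.2f}")
```

On raw scores, Description accounts for about 78% of the squared distance between these two plays; after standardization the two variables count equally.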

Now here’s the catch. Both scaled and unscaled PCA place Othello with the comedies. The visualization I posted from August 20th happened to be unscaled, but essentially the results are the same on this score. Similarly, a hierarchical clustering of the plays using Ward’s on unscaled principal components (PCA done on the covariance matrix) again places Othello and Twelfth Night together. Almost the same thing happens when PCA is performed on the correlation matrix (i.e., scaled data): Othello attaches to Twelfth Night and Much Ado About Nothing, which were paired in a previous pass. So when we’re just rating the items in terms of proximity on scores using Ward’s, standardization approximates the results of PCA (standardized or not); only in the case where we don’t standardize do we get Othello popping out of the comedies and grouping with the tragedies, where we would expect it to.
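For readers who want to see the difference between the two kinds of PCA, here is a rough sketch with NumPy. The data are randomly generated stand-ins (one high-variance variable and two low-variance ones), not the Docuscope counts, and `principal_components` is a helper written for this example. PCA on the covariance matrix corresponds to unscaled data; PCA on the correlation matrix corresponds to scaled data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_plays = 36

# Hypothetical stand-in data: one high-variance variable, two low-variance ones
big = rng.normal(7.5, 1.0, n_plays)
small_1 = rng.normal(1.0, 0.1, n_plays)
small_2 = 0.5 * small_1 + rng.normal(0.5, 0.05, n_plays)
X = np.column_stack([big, small_1, small_2])

def principal_components(matrix):
    """Eigen-decompose a symmetric matrix; return values and vectors, largest first."""
    eigenvalues, eigenvectors = np.linalg.eigh(matrix)
    order = np.argsort(eigenvalues)[::-1]
    return eigenvalues[order], eigenvectors[:, order]

# Unscaled PCA: eigenvectors of the covariance matrix
cov_vals, cov_vecs = principal_components(np.cov(X, rowvar=False))

# Scaled PCA: eigenvectors of the correlation matrix
corr_vals, corr_vecs = principal_components(np.corrcoef(X, rowvar=False))

print("PC1 loadings (covariance): ", np.round(cov_vecs[:, 0], 3))
print("PC1 loadings (correlation):", np.round(corr_vecs[:, 0], 3))
```

In the covariance version the first component is almost entirely the high-variance variable; in the correlation version every variable starts with unit variance, so the low-variance variables can shape the first component.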

The answer, I think, is that Othello looks like a tragedy from far away but like a comedy when you look for more subtle correlations, which can be found either by boosting the low frequency variables like Reasoning or Narration through standardization or through PCA, which effectively emphasizes these low frequency items when it looks for its second, orthogonal component.

Another way of saying this: if you give the greatest weight to things you would see clearly from “far away,” as if you were looking at someone walking toward you in the distance and trying to figure out who it was, the similarities in Description and First Person usage alone, because visible from a distance, would suggest to you that you are approaching a tragedy. But if you got a little closer and could see other things (Emotion strings, Narration and Reasoning strings), you would assume fairly quickly that you were dealing with a comedy. What’s remarkable is how quickly Othello jumps out as a comedy once these “close up” items come into focus, as evidenced by its almost immediate pairing with Twelfth Night in most of the analyses using Ward’s.

In the absence of a known clustering, there’s really no way to find the “right” clustering measure.

Different clustering algorithms make different assumptions about the generative model involved. For instance, hierarchical clustering with complete link assumes a Gaussian centroid generating elements of a cluster close to that centroid (sometimes taking variance into account), whereas hierarchical clustering with single link assumes a random walk process where new elements are generated as being near any previous element (rather than near the centroid). Ward’s average link approach is somewhere in between. Which one is “right” depends on what you want and how you think the data’s generated.
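One way to see how the linkage choice matters is to run several methods on the same points and compare merge heights. This is a toy sketch with SciPy on five invented points spaced evenly along a line, not on the play data. Single link treats the chain as one continuous group, while complete link and Ward's report increasingly costly merges.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical one-dimensional data: five points evenly spaced along a line
points = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])

heights = {}
for method in ("single", "complete", "ward"):
    tree = linkage(points, method=method)
    # Column 2 of the linkage matrix holds the distance at which each merge happens
    heights[method] = tree[:, 2]
    print(method, np.round(heights[method], 2))
```

Under single link every merge happens at the nearest-neighbour spacing, which is the chaining behaviour described above; the other methods penalize joining groups whose members lie far apart.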

Typically, if you scale features based on something like z scores (converting each dimension to have zero mean and unit variance), then you get a different result than in the unscaled cases for exactly the reasons you mention.

If you convert the items being clustered to unit length, it removes scale considerations. You need to do this to get sensible results for inputs of different lengths.
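As a quick illustration of the unit-length point, here is a sketch with NumPy that assumes "unit length" means dividing each row by its Euclidean (L2) norm. The counts are invented: the second text is ten times longer than the first but has the same proportions. Percentage scores like those used in the post already divide by text length, which is a related kind of normalization.

```python
import numpy as np

# Hypothetical raw counts for two texts on three categories.
# The long text has the same proportions as the short one.
short_text = np.array([10.0, 20.0, 30.0])
long_text = np.array([100.0, 200.0, 300.0])

def to_unit_length(vector):
    """Scale a vector so its Euclidean (L2) norm is 1."""
    return vector / np.linalg.norm(vector)

raw_distance = np.linalg.norm(short_text - long_text)
unit_distance = np.linalg.norm(to_unit_length(short_text) - to_unit_length(long_text))

print("distance on raw counts:", round(raw_distance, 2))
print("distance after unit-length scaling:", round(unit_distance, 6))
```

On raw counts the two texts look far apart purely because of length; after scaling they coincide.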

If you really do take variance into account (e.g. by using something like KL distance over a posterior estimate such as a Dirichlet for multivariate count data), then you downweight poorly estimated features (typically those with low counts).

You might also want to check out some probabilistic multi-dimensional clusterers like latent Dirichlet allocation (LDA). They tend to be a little easier to interpret than PCA or SVD-based approaches.

Thanks, Michael Witmore, for an excellent and very knowledgeable article. I honestly enjoyed most of the posts and your different point of view. I am very interested in this. Thanks! Principal component analysis (PCA) is a widely used statistical technique for unsupervised dimension reduction. K-means clustering is a commonly used data clustering method for performing unsupervised learning tasks. Here we prove that principal components are the continuous solutions to the discrete cluster membership indicators for K-means clustering.

Yeah! The result provides new insights into the observed effectiveness of PCA-based data reductions, beyond the conventional noise-reduction explanation that PCA, via singular value decomposition, provides the best low-dimensional linear approximation of the data. On learning, the result suggests effective techniques for K-means data clustering. DNA gene expression and Internet newsgroups are analyzed to illustrate our results. Experiments indicate that the new bounds are within 0.5-1.5% of the optimal values. Thanks for the information.

## 4 Trackbacks

[…] Michael Witmore has a new post up at Wine Dark Sea on further clustering results using Docuscope on Shakespeare’s plays. I don’t have much […]

[…] using Ward’s method on unscaled data. The technique is the same as the one that produced the most effective genre clustering of Shakespeare’s plays. I am thus using what I know of a particular mathematical technique as […]

[…] Witmore’s similar clustering studies using Docuscope. See also this draft version of Witmore and Hope’s forthcoming piece in […]

Winedarksea…[…] something about winedarksea[…]…