Tag: pca

  • A Map of Early English Print

    Michael Witmore & Jonathan Hope

    [caption: PCA biplot of 61,315 texts from the TCP corpus, rated on features counted by Docuscope version 3.21 in an implementation created by the Mellon-funded “Visualizing English Print” project at the University of Wisconsin, Madison. Axis and quadrant labels shown here, along with the experiment that led to the color highlighting, are explained below. The full dataset for the analyses presented in this blog post can be found here.]

    Over 61,000 texts were transcribed by the TCP project, everything from hunting manuals to weapon inventories to lyric poems and plays. Important work is being done on this corpus, and it is clear that we are nowhere near exhausting the possible analyses that can be conducted on a dataset of this size (well over 1Bn words). One of the greatest challenges to working with the corpus is that the metadata for its contents — information about the texts that have been transcribed — is inconsistent or absent. If we want to characterize types of writing with the help of statistics, we must first label the items to be compared. To distinguish scientific writing from what was written for the stage, say, we must first ask someone to apply these labels to the relevant items. And labeling involves interpretation. A major challenge we face as researchers peering into this collection, then, is that of identifying what is being compared in the absence of human-curated groups.

    There is another way into the problem, which is to focus solely on the correlations or dependencies among variables — the fact that certain measured features track with and away from one another. We can do this in an unsupervised way, without reference to any human-generated metadata. Here we find techniques such as principal component analysis (PCA), which we have used in our own studies. Other unsupervised techniques that do not rely on human-generated ground truths include word embeddings (Word2Vec, GloVe) and cluster analysis (K-means, etc.). In our past work with Shakespeare, we used unsupervised techniques to see whether this “hands off” exploration of patterns in the data lined up with groupings that humans have already made. We conducted that research on dozens of items rather than a corpus of tens of thousands. That research suggested to us that dependencies among text features track reasonably well with the domain judgments of literary experts. We have always wanted to do this type of analysis on a larger scale, but have to this point been stymied by the lack of human-created metadata for genre.

    One solution to this lack of metadata would be to take a smaller sample of labeled data grouped by genre, date, or some other criterion, and train an algorithm to identify other, unlabeled members of the group. This technique, called semi-supervised machine learning, leverages human judgment and speeds search in spaces where full metadata is unavailable. Most people take advantage of this partial leveraging of human insights when they search the internet. This method is not appealing to us, however, because it introduces a circularity in the work. Yes, it would be useful to train a classifier that finds poetry in a corpus where no one thought to look. But our goal is not to improve search. Rather, we want to understand, quantitatively and rhetorically, what it is at the level of textual features that would lead a human being to apply a label, as literary critics inevitably do. It’s the behavior that is interesting.

    This extended blog post asks how our initial approach of comparing unsupervised analysis with independent human judgments would play out if we were to make such comparisons on a larger scale. Such comparisons present a challenge when not every item in the TCP corpus has yet been assigned a genre label by a human whose judgment we trust. A way forward presented itself, however, when the Mellon “Visualizing English Print” project at the University of Wisconsin funded the curation of a sub-corpus of 1080 texts, each of which was assigned a genre label by a team with domain expertise. These texts were drawn randomly from both the EEBO-TCP and ECCO-TCP, 40 per decade, beginning in 1530 and ending in 1799. Taking these 1080 texts as our starting point, we conducted an unsupervised analysis of the subcorpus using PCA on a group of pre-selected features we had used before — Docuscope’s Language Action Types (LATs).[1] Those principal components were highly interpretable, leading us to propose two statistically derived, feature-based oppositions that characterize the entire subcorpus. Each principal component expresses the tendency for texts to have some features while lacking others. A map of these tendencies appears at the top of this post. What follows in the next section is an analysis that supports the interpretation of this first, map-like diagram.

    Since these oppositions or axes are mathematically orthogonal, we take advantage of the further opportunity to explore a “four corner” distribution of text types across the two initial rhetorical oppositions. Here too we find that the corner combinations of paired traits (two from each axis) are also interpretable. In confirming those intuitions, we want to suggest again that — on a larger scale than before — unsupervised analysis of defined textual features aligns well with human judgments that never focused on those features. The fact that two independent ways of grouping texts converge is interesting in and of itself, and offers new opportunities to think about generic variation in early modern print texts.

    But we make a second finding that is more interesting. The patterns arrived at through an unsupervised analysis of 1080 texts are just as discernible, just as dominant, in the full corpus of 61,000 texts. The original four-corner distribution of the labeled 1080 texts is preserved when PCA is performed on the full corpus. This continuity suggests that the map we created of the smaller group — one that shows four basic types of writing distributed around two basic oppositions — can be used to characterize the rhetorical dynamics present in the full TCP.

    Each of these aspects of the exploration is taken up below in sections treating (1) unsupervised PCA of a subcorpus and our characterization of the PCs using examples; (2) the superimposition of labeled genres onto these derived components; (3) a study of combinations of the PCs that can be used to create a four quadrant rhetorical map of the corpus; and (4) an application of this map to the full TCP. Our goal in publishing this blog post is to demonstrate how one might use unsupervised techniques to redescribe dynamics in a corpus that lacks full metadata, and to do so “from the bottom up,” using examples. The resulting document is long, but we feel it captures every link in the chain of reasoning, including the examples that came to inform our interpretive choices.

    Principal Components 1 and 2: Abstract/Experiential, Intersubjective/Extrasubjective

    Principal component analysis or PCA is a well-understood statistical technique for describing the dominant directions of variance in a dataset. The technique is often viewed as rudimentary in comparison to other dimension-reduction techniques. This is because PCA assumes linear relationships among the features counted in texts: it draws straight lines in a dataspace, whereas more customized “classifiers” can curve around exceptions. But PCA’s requirement that components be independent (mathematically orthogonal) means we can understand the relationships among components in a way that is geometrically intuitive. In this section we try to understand the components derived from the 1080 corpus and then map them onto a space that we can interpret.

    PCA works by drawing a new axis or basis in a space containing as many dimensions as there are features counted. If we are measuring 115 different features, each of which is a Docuscope LAT, the new axis (PC1) is drawn in a 115 dimensional space, oriented in such a way as to maximize the “spread” of items across the component. A principal component, then, is a mathematical artifact that functions as a kind of binary recipe. It says, “Here are certain features that texts have in abundance (in order of prevalence) while simultaneously lacking others (in order of relative absence), described in purely mathematical terms.”
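    The mechanics just described can be sketched numerically. This is a minimal illustration, not the VEP pipeline: the matrix below is random stand-in data with the same shape as ours (texts by 115 LAT frequencies), and PCA is computed via the singular value decomposition of the centered matrix.

    ```python
    import numpy as np

    # Stand-in data: 200 hypothetical texts rated on 115 LAT features.
    rng = np.random.default_rng(0)
    X = rng.random((200, 115))

    # Center each feature, then take the SVD of the centered matrix;
    # the right singular vectors are the principal components.
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

    scores = Xc @ Vt.T               # each text's position on each component
    explained = S**2 / np.sum(S**2)  # proportion of variance per component

    # Components are mutually orthogonal unit vectors: PC1 and PC2 are
    # independent directions in the 115-dimensional feature space.
    print(round(abs(float(Vt[0] @ Vt[1])), 10))  # → 0.0
    ```

    The `explained` vector is where figures like the 7.1% and 5.2% reported below for PC1 and PC2 come from: each component accounts for a share of the total variance, in decreasing order.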

    The first principal component (PC1) produced in the analysis describes a pattern that we call Abstract/Experiential. Texts scoring high on PC1 tend to relate matters according to the necessity of argument, logical entailment, or moral obligation; we call these high-scoring texts Abstract, and those that score low Experiential. PC1 represents the difference between what is and what must be; it covers what gets related via the mediation of concepts and arguments (Abstract) versus a world of sensory experience (Experiential). Abstract texts work to guide readers through relationships of logical entailment or social obligation with some argumentative apparatus. Experiential texts, on the other hand, tend to relate person-to-object or person-to-event relationships in the first person, but not person-to-person relationships.

    The second principal component (PC2) spans a continuum we are calling Intersubjective/Extrasubjective. Dynamics along this span are uncorrelated with the action we see on the first (described by PC1). Intersubjective texts depict the inner lives of people who are coming into contact with one another, whereas Extrasubjective texts convey impersonal relationships among abstractions and/or physical objects — relationships that are not disclosed via the inner life of a specific person. Intersubjective texts tell us why people are doing what they are doing, and disclose information as it relates to evolving intentions and circumstances. Extrasubjective texts, by contrast, present a world whose existence sits at arm’s length from the inner life of any particular onlooker or speaker. Extrasubjective texts assume the givenness of concepts/objects that are then placed in some kind of logical/spatial relationship with one another.

    Both components are represented in the biplot below, which spreads items out according to their scores on each. So that readers can have a sense of the placement of the examples we are about to discuss, we highlight the positions of the example texts treated in this section:

    [caption: Scatterplot of 1080 corpus items as rated on Principal Components 1 and 2. Examples (treated below) of items scoring high and low on PC1 (x-axis) and PC2 (y-axis) are highlighted. Loadings of LATs associated with high and low scores on PC1 and PC2 are listed below the named poles of the component axes. PC1 and PC2 explain 7.1% and 5.2% of the variance in the corpus.]

    Readers will notice a series of names under each of the directions in the biplot (“Self Disclosure .22” under the Intersubjective pole of PC2, for example). These names identify LATs that contribute positively or negatively to an item’s score on a component. Position on the x-axis, for example, is a function of an item’s having and/or lacking the LATs listed at either end of the axis. The LATs contributing to a text’s high score on the Abstract pole of PC1 — the LATs that move it to the right of the plot — are Specifiers, GenericEvents, ReasonBackward, and CommonAuthorities. Conversely, LATs lowering a text’s score on PC1, moving it to the Experiential pole at left, are Motions, FirstPerson, SenseObject, and SenseProperty. So too, Intersubjective texts at the top of the plot are characterized by high scores on SelfDisclosure, SubjectivePercept, Autobiography, and FirstPerson, whereas Extrasubjective texts have high scores on AbstractConcepts, Numbers, SenseObjects, and CommonAuthorities.
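    The role of the loadings can be made concrete: an item’s score on a component is, in effect, a weighted sum of its (standardized) feature rates, with the loadings as weights. The sketch below is a toy calculation; the loading values and z-scores are invented, and only the LAT names echo the biplot’s Intersubjective pole.

    ```python
    # Hypothetical loadings for a few LATs on PC2 (invented values; positive
    # loadings pull toward the Intersubjective pole, negative toward the
    # Extrasubjective pole).
    loadings = {"SelfDisclosure": 0.22, "SubjectivePercept": 0.20,
                "Autobiography": 0.18, "AbstractConcepts": -0.21,
                "Numbers": -0.19}

    # Standardized LAT rates (z-scores, also invented) for one imaginary text.
    text = {"SelfDisclosure": 1.4, "SubjectivePercept": 0.9,
            "Autobiography": 0.3, "AbstractConcepts": -0.5, "Numbers": -1.1}

    # The component score: sum of rate x loading across the features.
    score = sum(loadings[lat] * text[lat] for lat in loadings)
    print(round(score, 3))  # → 0.856, i.e. toward the Intersubjective pole
    ```

    A text rich in SelfDisclosure and poor in Numbers thus lands high on PC2, which is exactly how the biplot positions the examples discussed below.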

    The names we have given to the component axes here are interpretive. We arrived at these interpretations by using a tool created by the VEP project — the SlimTV viewer — which offers color coded versions of the texts that correspond to the different features counted in the analysis. That tool allowed us to inspect the relevant (strongly loaded) LATs in context using example texts at the far ends of both axes. We move now to explore twelve of those examples. The views we present use color coding to call out the LATs (words and sequences of words) that are pushing items to the far end of the axes, making them good examples. The SlimTV viewer allows interactions through a browser, so readers can independently consult full, LAT-highlighted HTML versions of the twelve examples discussed below by following the links. We present an abundance of examples (and links to full tagged text) so that readers can understand how we arrived at these distinctions — Abstract/Experiential, Intersubjective/Extrasubjective. Readers are invited to skip ahead or explore further as necessary.

    Abstract example 1: George Berkeley, Passive Obedience (1712), tagged text, plain text.

    [caption: LATs with high positive loadings on PC1 include (blue) Specifiers, (red) GenericEvents, (green) ReasonBackward, and (purple) CommonAuthorities.]

    Berkeley’s treatise on the rational grounds for submitting to civil power guides its reader through a series of intellectual possibilities, using abstract nouns to shorthand the particular political situations (“cases,” “Occasions”) that he wants to subsume under general headings. This is the flow of logical argumentation, where the process of reasoning itself is managed through the use of (blue) Specifiers that show where the narrator is in the argument (“concerning a”, “in which every”). Red GenericEvents words detach a moral or political action from any specific actor (“to be” done, “Actions”) so that it can be related to broader obligations (“Doctrine,” “necessity,” “the Common Weal”), which Docuscope tags as (purple) CommonAuthority. A verb, “premised,” tagged as (green) ReasonBackward, connects a prior argument to a more recent one. An exemplary Abstract text, Berkeley’s treatise uses these characteristic words and phrases to coordinate a flow of ideas within and for a well-ordered mind.

    Abstract example 2: William Prynne [attributed], The Long Parliament twice defunct (1660), tagged text, plain text.

    [caption: LATs with high positive loadings on PC1 include (blue) Specifiers, (red) GenericEvents, (green) ReasonBackward, and (purple) CommonAuthorities.]

    In Prynne’s treatise on the dissolution of the Long Parliament, we see the author placing a set of (purple) CommonAuthority nouns — “Government,” “Fundamental laws” — in logical relation to one another via the (green) ReasonBackward words (“because”, “as it is”). Prynne provides a window onto the process of ideas unfolding in a well-regulated mind, where the flow is motivated by relations of logical entailment rather than a contingent sequence of historical events. The scope of reference is restricted or directed with specifiers (“the whole”, “all the rest”, “part of the”), words that are necessary here because not all reasons apply to all things. The language in this passage works to keep concepts from descending into particulars. That general elevation is accomplished through the use of (purple) CommonAuthority nouns on the one hand and — on the other — analogies that refer to no specific historical situation. He refers to housebuilding in general, for example, not the construction of a specific gatehouse in Blackfriars. Prynne’s text manages the attention of the reader by making sure particulars of any one person’s experience do not come to qualify his abstract claims.

    Abstract example 3: Edward Fowler, Certain Propositions, By which the Doctrine of the H[oly] Trinity Is So Explained (1694), tagged text, plain text.

    [caption: LATs with high positive loadings on PC1 include (blue) Specifiers, (red) GenericEvents, (green) ReasonBackward, and (purple) CommonAuthorities.]

    Edward Fowler was a Gloucestershire bishop who, in Certain Propositions, sets out to defend claims about the Trinity from criticism. In doing so, he coordinates the objections of his opponent with his own refutations, adducing distinctions from another text to warrant his response. Not surprisingly given the context, (purple) CommonAuthority words — “God” and “Scripture” — are prominent, as are (blue) Specifiers that identify precisely which parts of an earlier argument he is addressing (“in reference to,” “with the,” “each of which”, “the whole”). The most frequently tagged phrase under (red) GenericEvents is “to be,” a passive construction that allows the writer to introduce a judgment (“justly to be here Charged”) without himself taking ownership of that judgment.

    Experiential example 1: Jerningham, Yarico to Inkle (1766), tagged text, plain text.

    [caption: LATs with strong negative loadings on PC1 include (blue) Motions, (red) FirstPerson, (green) SenseObject, and (purple) SenseProperty.]

    Yarico to Inkle is a “heroic epistle” by a late 18th century English nobleman; it is delivered in the voice of Yarico, a female African slave who rescues a European mariner, only to have him turn on her and sell her (and their child) into slavery. This first textual example of the Experiential pole lacks the logical connectives and abstraction of its opposites above; what holds together the flow of items as expressed in the text is a consciousness that characterizes entities based on sensory properties. At one moment, we are focused on the ear and what it hears — “accents”, “waves”, the “limpid Stream.” These words co-occur with (blue) Motion words that generate the sensory layer being reported. Here is the stream that “glides”, the sail that flies, the “tempest-beaten” side, the “flow” of language from lips. The verbs highlighted in the example indicate a change in state of a physical thing, often a body, as opposed to some more generic reference to an event (as we saw in the examples on the other end of PC1). It should be no surprise that the (blue) motion verbs accompany other items tagged as (green) SenseObjects — “stream”, “bower”, “sea”, “lips”. Finally, the (red) FirstPerson tokens anchor this flow of sensation to the narrator who is the collecting center of that sensory experience.

    Experiential example 2: Kane O’Hara, The Golden Pippin (1773), tagged text, plain text.

    [caption: LATs with strong negative loadings on PC1 include (blue) Motions, (red) FirstPerson, (green) SenseObject, and (purple) SenseProperty.]

    The Golden Pippin is an eighteenth-century burletta written by the Irish composer Kane O’Hara. In Pippin, characters express outsized feelings about events in the narrative through individual arias, all reliant on the use of (red) FirstPerson LATs (here, “I” and “my”). As one might expect in a story about an apple (“pippin”), there are plenty of opportunities to name (green) SenseObjects and their (purple) SenseProperties in vivid language (“windows,” “wifes,” “puppets,” “Sky,” “Sultanas,” “Pippin,” “dripping,” “running,” “tripping” tagged with SenseProperty and SenseObject). But the language also conjures things and actions that are not being enacted onstage. The singer here, a fantastical character named Momus, describes how he will torment others — “On wires I dance ‘em all.” A text in which actions (past, imaginary, or future) are recited rather than enacted will require this form of sensorily rich language characteristic of the Experiential pole of PC1. The aria belongs at the Experiential pole of the distinction because, formally, it is a first-person super-narrative of sorts. In it, the singer sets out the chess pieces and then puts them into motion to illustrate what has happened, might happen, or will happen.

    Experiential example 3: H. Bate Dudley, Airs, Ballads…The Blackamoor Washed White (1776), tagged text, plain text.

    [caption: LATs with strong negative loadings on PC1 include (blue) Motions, (red) FirstPerson, (green) SenseObject, and (purple) SenseProperty.]

    The several airs in this text, once sung as part of a comic opera that included Sarah Siddons, express the vicissitudes of lovers as they progress toward nuptials. The pastoral scene rendered by words such as “lyre,” “tree,” “skies,” and “beechen” (SenseObject, SenseProperty) makes sense as part of the recital of plot events; but there is also sensory language being used to characterize the singer’s situation metaphorically — a schoolboy stealing sweet honey from a bee. The sensory language and (blue) Motions, then, are used not simply to communicate plot events that may not have been enacted on stage, but also to paint internal feelings by analogizing them to a sensory scene. The resources of (red) FirstPerson, which supports the sensitized reporting of actions, extend beyond a simple need to report events that advance a narrative; that language can also be used to strike an attitude toward plot developments by recasting them as another sensory scene.

    Intersubjective example 1: Thomas Holcroft, Anna St. Ives (1792). Tagged text, plain text.

    [caption: LATs with high positive loadings on PC2 include (blue) SelfDisclosure, (red) SubjectivePercept, (green) Autobiography, and (purple) FirstPerson.]

    Anna St. Ives, by Thomas Holcroft, is the first British Jacobin novel. In this passage, the narrator describes an encounter with a Mr. Henley, who is reluctant to begin a conversation about the narrator’s relationship with his daughter. Each of the two characters discloses something of his inner life; the entire exchange is related from the perspective of one person who both recounts and interprets it in (purple) FirstPerson. For example, the narrator wants the reader to know that he’s anxious for Mr. Henley to say what’s on his mind (“My own wish that he should be explicit was eager”), which then motivates the exchange from the standpoint of the narrator (“my own wish”). The meeting of minds in recounted dialogue is mediated through (red) SubjectivePercept words such as “dissuade,” “hesitated,” and “minds,” along with (blue) SelfDisclosure words (“my own”, “I desired”, “I think”.) In comparison with passages that exemplify the opposite pole of this pattern, these words mark the fact that the actions and pressures in the scene are internal to someone, not something. The (green) Autobiography features register a character’s awareness of a life history (“I had”, “when I”), implying a subject whose past states and relationships (“my daughter”) are recallable in the present. (Autobiographical features imply a social, developmental subject.) As a representative of the Intersubjective pole, then, this passage from Holcroft depicts the inner life of one “me” making contact with another.

    Intersubjective example 2: Samuel Richardson, Clarissa (1748). Tagged text, plain text.

    [caption: LATs with high positive loadings on PC2 include (blue) SelfDisclosure, (red) SubjectivePercept, (green) Autobiography, and (purple) FirstPerson.]

    In this passage from Richardson’s Clarissa, the eponymous narrator is telling Miss Howe about the hostile reception she received from her own family when she was called home from an earlier visit to Miss Howe’s house. In the recounted scene, Clarissa’s family forces her to justify the fact that she has been spending time in the presence of a Mr. Lovelace, her brother’s sworn enemy, while at the Howe residence. Clarissa must first narrate what happened in Miss Howe’s company and then state what her real intentions were, a move that forces her to alternate between (green) Autobiography (“I was”) and (blue) SelfDisclosure (“I would”). The alternation between “I was” and “I would” occurs frequently in the novel (twice in this passage); it demonstrates the narrator’s tendency to pivot between recollection of past events and vocalized reaction to those events. Words that imply judgments about, or interpretations of, a social situation — (red) SubjectivePercept (“like to have”, “even”, “voluntary”, “tacit”) — give insight into relationships among actors in a social situation. The abundance of accusative case “me” tagged as (purple) “FirstPerson”, moreover, marks the fact that the things being described are placed in relation to the narrator — to a narrating “me.” In using these features together, Richardson provides a rich view onto a consciousness that is both socially aware and able to render that awareness through recounted action.

    Intersubjective example 3: Fanny Burney, Cecilia, vol. 3 (1782). Tagged text, plain text.

    [caption: LATs with high positive loadings on PC2 include (blue) SelfDisclosure, (red) SubjectivePercept, (green) Autobiography, and (purple) FirstPerson.]

    In this passage from Fanny Burney’s Cecilia, Cecilia and Mrs. Belfield discuss Cecilia’s interaction with Mrs. Belfield’s daughter, Henny. This passage provides another example of a “meeting of minds,” one in which a speaker (Mrs. Belfield) speculates on the intentions that informed “the little accident that happened when I saw you before,” when an “odd thing” (red SubjectivePercept) happened. The (green) Autobiographical features are necessary, even in small amounts, because Mrs. Belfield needs to relate a past event (“when I saw”) to the thinking she now unfolds, a sequence that discloses her own perspective on the story (“I mean to say”). Inevitably, that process is colored by subjective judgments (in red, “got the upper hand”, “just as well”) which are themselves anchored in the (purple) FirstPerson pronouns used by the quoted speaker.

    Extrasubjective example 1: Edward Donovan, The Natural History of British Insects (1801). Tagged text, plain text.

    [caption: LATs with strong negative loadings on PC2 include (blue) AbstractConcepts, (red) Numbers, (green) SenseObject, and (purple) CommonAuthorities.]

    Edward Donovan’s Natural History is the first example of the Extrasubjective pole of PC2, and illustrates the tendency to relate objects of experience without explicitly locating them in a particular consciousness; those objects are available to any person whatever, who in the case of these texts is the reader. As one might expect in a natural history text, there are plenty of declarative sentences. Here (red) Numbers are used to coordinate bibliographic sources (“Vol. 111”) and to anchor reference points (“two others on oaks”), which has the effect of locating authority outside the narrator. SenseObject items (green) are picking out concrete things in the natural world (“Habitat,” “willows,” “moth,” “abdomen”). AbstractConcepts (blue) are abstract nouns (“species,” “English,” “country,” “descriptions”) as well as symbols (p.). The (purple) CommonAuthority phrase — “the general” — situates the quoted description outside the realm of private opinion, just as the “We” pulls the frame of authority outside of the embedded narrator. This passage is very much of a piece with the emerging rhetoric of the scientific report, hewing to Bishop Sprat’s rhetorical ideal (in his praise of the Royal Society) of trying to keep words tied to things.

    Extrasubjective example 2: Decree, Charles by the Grace of God (1633). Tagged text, plain text.

    [caption: LATs with strong negative loadings on PC2 include (blue) AbstractConcepts, (red) Numbers, (green) SenseObject, and (purple) CommonAuthorities.]

    This Extrasubjective passage differs from the first in that the decree relates official actions and information relevant to those actions; it exceeds the consciousness of any one narrator or individual because the affairs of state are by definition larger than any one person. The passage references only a few (green) SenseObjects (“penny”), and when it does, the writer uses them as a means of coordinating amounts of time and resources (years, pennies, Feast) with the authorities who control those resources (purple CommonAuthority words such as “Parliament,” “Sherrifs,” “regalities,” “Kingdom”). As is the case with many Royal decrees concerned with resources, amounts need to be spelled out so that they can be understood and honored, which is why there are so many (red) items tagged with Number.

    Extrasubjective example 3: Robert Norman, A discourse on the variation of the cumpas (1581). Tagged text, plain text.

    [caption: LATs with strong negative loadings on PC2 include (blue) AbstractConcepts, (red) Numbers, (green) SenseObject, and (purple) CommonAuthorities.]

    Robert Norman was a sixteenth-century navigator who discovered magnetic inclination. In his 1581 Discourse on the compass, he reports observations that show variations along the horizontal plane of the magnetic needle. The highlighted passage shows, once again, the quick succession of (red) Numbers and (blue) AbstractConcepts — “variation,” “account,” and “observation” — words that refer to abstract relationships (differences in angles) or intellectual actions (observing, accounting). References to specific measurements that are the subject of argument are then anchored to a (green) SenseObject — the “Sun” — and an AbstractConcept (“Horizon”), both of which are the phenomenal supports for the abstract reasoning on display here. Norman’s Discourse illustrates the Extrasubjective pole of PC2 because the concepts it appeals to are the objects of geometric demonstration and so by definition not unique to the consciousness of an individual at a given moment. This is not to say that a living person did not observe the things related by Norman; presumably they were. The passage is Extrasubjective, however, because Norman’s manner of relating his observations shows them to be products of a mind whose interests arise from the interaction of physical objects and concepts, not the social relations of unique moral beings.

    Distribution of Genres Across PC1 and PC2

    We have focused in the foregoing analysis on the specific language (LATs) whose distribution suggests two places where the energy in the system lives, energy being a metaphor for significant correlations or dependencies among multiple features. PC1 and PC2 explain 7.1% and 5.2% of the variance in the corpus (respectively), providing a feature-based map of two powerful dynamics or oppositions that separate texts in the 1080 subcorpus. No act of human labeling contributed to the positioning of items in this space. In this section, we pause to ask how a set of independent interpretive judgments about literary genre might map onto the statistically derived PCA space discussed in the last section. Do genres fall out along lines that correspond to the axes as we have presented them?

    The VEP project made this type of comparison possible, furnishing subcorpora that were labeled by domain experts according to recognized genres and subgenres. In the case of the 1080 Corpus, we are able to use a set of hand-curated labels for a chronologically balanced subset of the corpus.[2] All 1080 items are classified into 31 genre categories ranging from things such as “lists” to “drama” to “autobiography” to “narrative verse.” When we look at the mean score of those 31 genres on each of the two components, we find that 21 of the 31 groups of items score significantly higher or lower than the grand (group) mean on PCs 1 and 2. Two of the genres — “drama” and “lists” — had significantly higher or lower scores on two components at once, participating in two patterns simultaneously.[3] Below is a chart showing genres that fall significantly above or below the group mean on the two components:
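    The logic of that comparison — a genre’s mean component score against the grand mean — can be sketched as follows. Everything here is invented for illustration: the genre names, the simulated score distributions, and the two-standard-error flagging rule, which is only a rough stand-in for the statistical test used in the actual analysis.

    ```python
    import numpy as np

    # Simulated PC1 scores for four made-up genre groups of 40 texts each.
    rng = np.random.default_rng(1)
    genres = ["legal decree", "narrative verse", "science", "drama"]
    pc1 = {g: rng.normal(loc=mu, scale=1.0, size=40)
           for g, mu in zip(genres, [1.2, -1.0, 0.1, -0.8])}

    # Grand mean over all items, pooled across groups.
    grand_mean = np.mean(np.concatenate(list(pc1.values())))

    # Flag a genre when its mean sits more than ~2 standard errors from the
    # grand mean (a crude screen, not the original project's test).
    for g, scores in pc1.items():
        se = scores.std(ddof=1) / np.sqrt(len(scores))
        flagged = abs(scores.mean() - grand_mean) > 2 * se
        print(f"{g}: mean={scores.mean():+.2f} flagged={flagged}")
    ```

    Run over the real 1080-item corpus with 31 genre labels, a procedure of this shape is what yields the “21 of 31 genres differ significantly” style of result reported above.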

    And here, for reference, is a chart showing the LATs that correspond to each of the ends of the principal components:

    We have several observations to offer on these groupings of genres around the two axes. First, the placement of human-labeled groups along these axes makes a certain amount of sense. It is not hard to accept that “legal decrees” are Abstract — that they use CommonAuthority words in sequences that specify logical, legal, or moral obligations. Nor is it surprising that texts characterized as “narrative verse” are full of FirstPerson words. It is also unsurprising that “fictional prose” and “drama” texts would be Intersubjective, while “science” and “medical” texts are Extrasubjective. (Links to sample texts, labeled by genre and classed according to the four poles, can be found in an Appendix below.) There is, then, some intuitive overlap between the statistical oppositions and the arrangement of texts across those oppositions by genre.

    We also observe that verse forms of all kinds, regardless of subject, tend toward the Experiential pole. Even when the subject matter is largely the same — as with “religious verse” and “religious prose” — the use of verse throws an item to the Experiential side of PC1. It is not surprising to learn that early modern verse is full of images and is expressive (anchored in the first person). But it is interesting to see a close link between verse forms of all kinds and sensory language. That pairing tells us that, even when verse is relating concepts or interior states of mind, it reaches for sensuous objects and events in order to render them. Verse does not prosecute an argument; rather, it sets things out for a reader to experience.

    A third observation deals with narrative, which spans both sides of the continuum of PC2. Fictional prose, drama, and autobiography are all narrative forms. They favor the Intersubjective pole. But history and nonfictional prose also rely on narrative sequence, and they are Extrasubjective. The fact that narrative spans both ends of PC2 suggests that narrative sequence can be used to accomplish two pragmatically different tasks — relating interpersonal events in the social world (Intersubjective), or relating events that have a purely physical or “historic” and so impersonal character (Extrasubjective). The distinction represented by PC2, then, seems more basic than any distinction we might propose between texts that employ narrative sequence and those that do not.

    A final observation deals with the fact that so few genres score significantly higher or lower on both principal components. This is the case only with lists and drama, both of which are formally rigid in ways that perhaps other texts in the corpus are not. Leaving stage directions aside, drama texts consist entirely of speeches in the first person, whereas lists are table-like and are meant to be scanned discontinuously as well as read serially. It is not clear why two of the most formally constrained types of text in the subcorpus participate in both patterns whereas others participate in only one. It makes sense, however, that plays would be an extreme version of the mixture of Experiential and Intersubjective, while lists are a stark combination of the Extrasubjective and Experiential. The advantage of having a map of the whole space is that an item in one quadrant can be related to all the others. Drama is drama, for example, because it is not Abstract and not Extrasubjective (in the way we have been using these terms). Descriptions of sets of items, then, can be positioned within a larger cultural field.

    Texts in Four Quadrants: A Combinatory Look at Text Types

    As the prior section suggests, the corners of our PCA plot are worth interpreting as regions where two independent patterns intersect. We have chosen to name these quadrants, where items participate in two patterns at once and thus contain both sets of distinguishing LATs while lacking the sets belonging to the opposite corner. Our interpretation of the items in these corners, based on an inspection of the texts themselves, yields the following diagram, in which corners represent distinctive combinations:

    [PCA plot of 1080 corpus with a subset of selected items highlighted in the “corners” where a text participates in both patterns. Corner regions are labeled.]

    We characterize the corners of this PCA in terms of the stance or actions they take. Those actions can be “Urging” (Abstract and Intersubjective), “Explaining” (Abstract and Extrasubjective), “Describing” (Experiential and Extrasubjective), and “Imagining” (Experiential and Intersubjective). After surveying examples in each, we discuss the significance of this fourfold division for our thinking about early modern texts and ask how these labels are different from the genre or period labels we usually use to talk about texts in groups. Selected items that participate in two patterns at once are highlighted in different colors. In a subsequent section, those selections and corresponding colors will “carry over” into a projection of the entire corpus.

    Urging. Beginning at the upper right hand corner of the plot, texts we identify as “Urging” tend to make arguments, engaging the reader agonistically from a bounded point of view. (Rhetorically, they rely on both logos and ethos.) Three texts that represent “Urging” are found below, with LATs contributing to their location highlighted:

    George Berkeley, A defense of free thinking in mathematics… (1735), tagged text

    Jean Calvin, Sermons… (1560), tagged text

    George Halifax, Some cautions offered (1695), tagged text

    Explaining. Unlike “Urging” texts, “Explaining” texts locate their own authority in objects or impersonal concepts rather than an individual. Texts in this quarter tend to be scientific writings or legal decrees and proclamations — texts in which an impersonal authority or method (geometry, empiricism, theology, the crown) is making the connections between entities encountered by the reader.

    Thomas Hobbes, Elements of philosophy (1656), tagged text

    Benjamin Robins, A discourse concerning the nature of Newton’s method… (1735), tagged text

    Charles Hutton, The force of fired gunpowder… (1778), tagged text

    Describing. “Describing” texts tend to be more inert rhetorically, enumerating physical events and objects rather than subsuming them under organizing concepts. Texts in this quarter include natural histories and lists (location surveys, catalogues of military equipment, geographical antiquities), which tend to be dialectically unstructured. A cluster of Latin texts can be found in the lower left, placed there because of the high degree of “AbstractConcept” words that Docuscope recognizes in them.

    Hannah Woolley, The queen-like closet of rare Receipts (1670), tagged text

    Thomas Chaloner, A short discourse of…Nitre (1584), tagged text

    Cornelis Antoniszoon, The safegard of sailors or…common navigations (1605), tagged text

    Imagining. Texts that “Imagine” are mostly fiction. Plays, burlettas, and operas are located almost exclusively in this quarter. This clustering of fictional genres in the space is perhaps unsurprising, since each of these genres is a mix of the Intersubjective and Experiential tendencies surveyed above, including the tendency toward first person pronouns (in spoken dialogue) and the abundance of physical detail (essential when not all action can be shown):

    Susanna Centlivre, A wife well managed… (1715), tagged text, plain text

    J. G. Holman, Abroad at home: a comic opera… (1796), tagged text, plain text

    Isaac Bickerstaff, Love in a village (1763), tagged text, plain text

    To review before moving to the next section, we have used an unsupervised statistical technique to call attention to language (LATs) that can help us describe dependencies we see in the corpus. Having identified two basic patterns — Abstract/Experiential, Intersubjective/Extrasubjective — we have shown how these patterns can be used to sort genres derived independently by human study and expertise. The distribution of the human labeled items according to genre made intuitive sense when plotted according to the patterns (PCs 1 and 2) arrived at independently by unsupervised means. Returning to those two initial patterns, then, we explored the corpus in terms of combinations of each, defining four rhetorical tendencies in the corpus: “Urging,” “Explaining,” “Describing,” and “Imagining.” We now discuss how this interpretation of the corners of the PCA plot for 1080 texts might help us understand the dynamics of the full TCP corpus.

    1080 Versus the TCP Corpus, Dynamics Across Time in the TCP

    The principal components used to provide insight into the 1080 corpus also capture patterns that hold for all the texts in the TCP corpus. To demonstrate this, we have taken the 1080 PCA scatterplot, highlighted by color the items in its corners, and retained those color codings in a PCA biplot that is now built on the full corpus. While the order of components shifted — the Intersubjective/Extrasubjective opposition is now the first principal component rather than the second — the loading of LATs on both components is almost identical. The following diagram offers a visual intuition of how this pattern scales: it compares the position of items in the corners of the 1080 corpus scatterplot and preserves their color designation in a PCA scatterplot of the nearly 61K items in the full corpus:

    [caption: Left: A PCA plot for a subcorpus of 1080 items with selected items in quadrants highlighted. At right, a PCA plot of the full corpus, with color-codings from the subcorpus at left retained. Because components 1 and 2 switch, the “Explaining” and “Imagining” quadrants have exchanged positions.]

    While some of the loadings have shifted, rotating the pattern slightly clockwise, the original highlighted items still distribute into recognizably orthogonal sectors. The overall pattern persists.[4] The shift in rotation and in the order of components is due to the fact that the 1080 corpus randomly sampled 40 items from each decade spanning 1530-1799, which gave greater weight to decades with fewer items and reduced the weight of items in decades with more titles. The PCA plot at right captures all texts in every decade, which results in a stronger opposition between Intersubjective and Extrasubjective texts. But the overall set of oppositions remains the same; only the positions of the “Imagining” and “Explaining” quadrants have exchanged.
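That the loadings on both components are “almost identical” up to sign and order can be checked mechanically. The sketch below is a generic illustration with simulated data (none of the names or numbers come from the VEP analysis): because PCA is indifferent to a component’s sign, and component order can swap between fits, loading vectors from a subsample and from the full set are matched by absolute cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(2)
# Toy data with two strong latent directions, standing in for the LAT matrix.
n, p = 2000, 12
latent = rng.normal(size=(n, 2)) * np.array([3.0, 1.5])
W = rng.normal(size=(2, p))
X = latent @ W + 0.3 * rng.normal(size=(n, p))

def loadings(M, k=2):
    """Return the first k principal-component loading vectors (unit rows)."""
    Mc = M - M.mean(axis=0)
    _, _, Vt = np.linalg.svd(Mc, full_matrices=False)
    return Vt[:k]

full = loadings(X)
sub = loadings(X[rng.choice(n, 200, replace=False)])  # a 1080-style subsample

# PCA is indifferent to sign, and component order can swap between fits,
# so compare absolute cosine similarities between loading vectors.
sim = np.abs(sub @ full.T)
print(np.round(sim, 2))
```

A matrix with one value near 1 in each row says the subsample recovered the full corpus’s components, whatever their order or sign.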

    Having shown that the components are interpretable, we can discuss their movements as they fluctuate decade by decade. Recall that PC1 defines the Intersubjective/Extrasubjective opposition, while PC2 defines the Abstract/Experiential opposition. Here are the means and standard deviation measurements of those components by decade:

    [caption: means and standard deviations for items measured on PCs 1 and 2 by decade from 1530-1799. PC1, in blue, is now the Intersubjective/Extrasubjective pattern. PC2, in red, is the Abstract/Experiential pattern. Note that while the components were derived from data covering the full TCP date range from the 1470s to 1820, this view excludes the decades where sample size was quite small.]
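The chart just described is, in effect, a grouped summary of per-text scores. A minimal sketch of that computation follows; the decade labels and the drift term are simulated for illustration and stand in for the real per-text PC values.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical per-text records: one decade label and one PC score per text.
decades = rng.choice(np.arange(1530, 1800, 10), size=5000)
pc1 = rng.normal(size=5000) + (decades - 1660) / 500.0  # illustrative drift

# Mean, standard deviation, and count of the component score per decade.
summary = {}
for d in np.unique(decades):
    vals = pc1[decades == d]
    summary[int(d)] = (vals.mean(), vals.std(ddof=1), len(vals))

for d in sorted(summary)[:3]:
    mean, sd, n = summary[d]
    print(f"{d}s: mean={mean:+.2f} sd={sd:.2f} n={n}")
```

The standard deviation bands in the chart narrow as n grows, which is why the densely transcribed 1640-1699 decades read so differently from their neighbors.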

    What becomes clear immediately is that the corpus is unevenly curated. Some of the major time shifts in the components are caused by the different mixes of texts that were transcribed from the underlying bibliographies — Pollard and Redgrave’s STC 1, the Thomason Tracts, Wing’s STC 2, ECCO, Evans. These bibliographies cover different periods within the overall sequence of 1470-1820. Within each decade, for example, survival rates will differ based on whether that decade is early or late in the sequence. Selection principles governing which items within those constituent bibliographies were transcribed into the TCP are also inconsistent. And in at least one instance (the Thomason Tracts), texts survived because an individual made a concerted effort to collect and save certain types of materials.

    Looking at the period from 1640-1699, we see that these decades contain many more items than the prior decades, evidenced by the smaller standard deviation bands surrounding the mean measures during this period. This change is due partly to historical events, since the English Civil War led to a profusion of political pamphlets, and so, a larger number of discrete documents that could be transcribed and measured. The change in the number and types of texts captured during this period also reflects the fact that political tracts from the period were systematically collected by a bookseller named George Thomason. These “Thomason Tracts” (1640-1661) were transcribed as part of the EEBO-TCP project, and so represent a distinct curatorial tradition and bibliographical source. Major shifts in measurements on the principal components occur in these two decades (discussed below), after which point the corpus represents mainly items from the Wing STC catalogue (1641-1700), which does not favor political tracts and is comprehensive. Beginning in 1700, the corpus transitions to items drawn from the ECCO-TCP project (Eighteenth Century Collections Online) and items from Evans (early American literature).[5] The coverage and generic range of titles decreases suddenly in 1700, which we see in the pronounced shifts in PCs 1 and 2 at this point.

    Differences in curation cannot be the only factor at work in these time based shifts, however. Around 90% of the available unique titles printed during 1475-1700 were transcribed for the TCP. If the contributing bibliographies were reasonably comprehensive, then at least some of the movement we see during these years is due to underlying cultural factors. The transition between the 1630s and the 1640s, for example — bibliographically, from Pollard and Redgrave’s Short Title Catalogue to Thomason and Wing — is culturally significant because it marks the transition to open military hostilities connected with the Civil War. In the 1640s we see a sharp movement toward the Abstract and Extrasubjective patterns described above. Taken together, these shifts move texts toward the “Explaining” quarter of our map. There might be historical reasons for such a movement: the response to the Civil War in print was, in effect, to litigate the conflict through declarations and decrees, particularly in the first decade of 1640-49. The Licensing Order of 1643 effectively made Parliament a censor for print publications, and this ongoing censorship would have narrowed the range of what was published. That shift is quite visible in the plot above.

    During the second decade of the conflict (1650-59), texts remain at an all-time high in their measurement on the Abstract pattern (PC2), but they are now more Intersubjective (rising PC1), suggesting movement toward the “Urging” corner of the PCA plot. A crude interpretation of this sequence of shifts would be that the civil conflicts of the earlier decade (1640-49) needed to be defended “legalistically” in print — essentially, an explaining task. From 1650-59, however, arguments appear to become more subjective, grounded in moral exhortation rather than impersonal decrees. The regicidal sequence, one might say, begins with explanation and ends with exhortation.

    The sudden rise in explanation following a political upheaval returns in the immediate aftermath of the Revolution of 1688, when once again we see a shift toward “Explaining” (falling PC1 toward the extrasubjective; rising PC2 toward the abstract) during the decade 1690-99. Here we may be seeing the effects of the expiration of the Licensing Order in 1694, which allowed a greater diversity of items into print once Parliamentary control lapsed. The shift might also reflect changes in the political and cultural climate that followed the installation of William and Mary in 1689, a “bloodless” revolution that would have to be explained procedurally in print.

    A second noteworthy movement occurs at the beginning of the eighteenth century, just as the TCP corpus begins to be populated by a mix of American items (from Evans) and the texts gathered in ECCO. Here the change in inclusion criteria helps us account for the dramatic shifts in PCs 1 and 2, since only certain items from a much larger possible corpus were transcribed. Beginning in 1700, we see a sharp movement toward the “Imagining” quarter of the PCA space, with texts becoming simultaneously more Intersubjective and Experiential. Some of this pattern must be explained by the tendency of the corpus’s creators and sponsors to include famous works of fiction. But this period also coincides with the rise of the epistolary novel and fictional prose, forms that regularly relate lived action through the bounded perspective of social beings. The rise in this pattern may thus also reflect the increasing presence (and cultural success) of the novel over the course of the eighteenth century. That source is being added to the “signal,” even if we may not be seeing the full spectrum of printed texts from the period.

    Toward the end of the eighteenth century, the TCP texts retain a high level of Intersubjectivity with respect to earlier decades in the corpus, but become even more Experiential (i.e., the movement of PC2 in red continues down). By 1799, TCP texts are more likely to express the values and judgments of an individual speaker, but are also much more likely to express those judgments with respect to physical objects and actions in the world. The textual world in this part of our sample is perhaps more clearly one of social experience: actions follow each other causally as in a story or experimental trial, rather than concepts enumerated in sequence (first, second, third) as they would be in a more abstract presentation.

    Language stressing causes, effects, and consequences appears to increase over the course of the eighteenth century, reflecting perhaps an empiricism now being expressed in medical texts, natural philosophy, and psychology (also increasingly present in the corpus). Decade by decade measurements of a LAT called Consequence — a LAT that tags phrases such as “an effect of”, “in consequence of,” “resulting from” — suggest how this type of writing and thinking is manifesting itself in print:

    [caption: Mean measurements and standard deviations for the LAT Consequence in texts grouped by decade.]

    Manual inspection of those texts scoring high on this measure throughout the eighteenth century shows them to be practical texts about medicine, the human body, and human conduct (often warning of the connection between conduct and ailments).[6] A detailed study with new genre metadata might confirm or disprove the hypothesis that a strain of Consequence-rich writing is connected to science and the practical focus of American publishing, both of which figure prominently in the texts selected for transcription within these decades. Fictional prose texts — the newcomer on the scene in terms of imaginative writing — may also use this feature more than drama, its imaginative predecessor. It is equally possible that the rise in Consequence language simply reflects the fact that this LAT counts later expressions of causal or consequential thinking, whereas earlier instances go uncounted.

    Conclusion

    We have shown that we can apply what we learned from a small part of the TCP to the entire corpus, even when metadata for the whole does not exist. Part of this work was interpretive; we characterized and named the patterns PCA found in the smaller corpus after exploring relevant features in example texts. The work was also confirmatory. When we introduced genre labels produced independently (via reading) into a feature space derived from PCA, genres distributed intelligibly across that space. Our final task was to see if this “map” could be understood in light of bibliographical traditions and historical events. We saw, first, the effects of uneven corpus curation and competing bibliographic traditions in decade by decade comparisons. We also noted the possible effects of large scale political, intellectual, and cultural shifts — the Licensing Order of 1643, the English Civil War, and the Revolution of 1688; the rise of science and “practical” writing; the development of imaginative prose fiction in the epistolary novel.

    If cultural effects can indeed be seen with the aid of quantitative proxies, we should acknowledge that there is a pragmatic side to communicating in print that is by nature repeatable, but also adaptable. As the literary critic I. A. Richards argued almost a century ago, texts have particular means by which they create their effects within readers, and those means are stable enough to study. It is no accident that few writers in the seventeenth century chose to publish lists of military assets in the form of a dialogue, or that legal proclamations were not set in blank verse. Such choices would be impractical in the Richardsonian sense.[7] That is not to say they are impossible choices. But they rarely happen.

    When, on the other hand, a writer chooses what seems to us an unusual strategy — say, when Margaret Cavendish decides to convey metaphysical truths about nature in a verse romance, as she does in Natures picture drawn by fancies pencil (1671) — we can begin to understand her novelty or lack of novelty as a writer, doing so with an actionable vocabulary we could never have created through selected reading.

    [Caption: PCA plot of 61K items with two texts by Margaret Cavendish highlighted. Natures picture drawn by fancies pencil (1671) sits, predictably, in the Imaginative quadrant of the map, whereas Observations on experimental philosophy (1666), which includes The Blazing World, sits in the “Urging” quarter, showing differences in textual features and location for a single writer.]

    Individual writers employ strategies from different regions of the map, and those differences suggest the constrained diversity of approaches they bring to making meaning. That variety also speaks to the diversity of concerns driving them to write. Milton and Defoe, for example, produced texts that distribute into different corners of the PCA map. Shakespeare did not.

    The story of individual writers is one of greater and lesser variation, but the story of the corpus is one of stability and sameness. Early modern writers and publishers do certain things predictably, something we already knew, but that we can now describe more richly “at the level of the sentence.” Predictability is a function of constraints, some of which apply consistently, some of which loosen over time. Ideologies, for example, are slow to change. Biological limits to human attention are real and change with evolutionary pressures. Most economic and political practices transcend generations. These forces go to work on any act of composition, publication, and even consumption of print; they cannot be dodged. There should, then, be some stability and predictability on the level of the whole, which is what we believe we are seeing here.

    But constraints do shift, a fact that can be grasped with the study of many more texts than anyone can read. We have tried to create markers for such changes, and to offer interpretations (“urging,” “explaining”) that link those markers to bibliographic traditions and to changes in early modern cultural life (wars, literary fashions).[8] Getting to this point required us to build patterns from the ground up, first from unsupervised statistical analysis, then from contextual, sentence-level interpretation of examples. Now that the corpus is at least minimally interpretable, we do not believe the resulting map and its directions will change significantly. Others can interpret the patterns and the events they correspond to differently, but on a certain level, the findings we present here are descriptive. Certain words and phrases are used more often in the absence of others. Certain periods of time favor different distributions of those patterns. That story is in the numbers. But someone has to look at those words, give a name to those patterns, and explain what they might be accomplishing. We see the latter as the main contribution of this study. Any number of obstacles present themselves to scholars trying to do such work, but we hope we have surmounted at least some of them with the tools available to us.

    Appendix: Sample texts from 1080 corpus, labeled by genre, and grouped according to favored poles of PCs 1 and 2.

    Items from genres that score high on the Abstract pole of PC1 include:

    Balthazar Gerbier, To the honorable… (1646), tagged text (argument)

    Daniel Defoe, An enquiry into the danger… (1712), tagged text (argument)

    A proclamation that strangers… (1539), tagged text (legal decree)

    A declaration of the lords… (1642), tagged text (legal decree)

    Thomas Walcot, The Trial of Capt. Thomas… (1683), tagged text (legal prose)

    Thomas Pain[e], Definition of a Constitution… (1791), tagged text (legal prose)

    Niccolo Machiavelli, Machivael’s [sic] discourses… (1663), tagged text (nonfictional prose)

    Oliver Goldsmith, An enquiry into the present… (1759), tagged text (nonfictional prose)

    Cicero, Those five questions… (1561), tagged text (philosophy)

    Adam Smith, The theory of moral sentiments… (1759), tagged text (philosophy)

    Philipp Melanchthon, The confession of the faith… (1536), tagged text (religious prose)

    William Penn, A key opening a way… (1693), tagged text (religious prose)

    William Fulke, A sermon preached… (1571), tagged text (sermon)

    George Keith, A sermon preached… (1700), tagged text (sermon)

    Items from genres that score high on the Experiential pole of PC1 include:

    Richard Tarlton, A pretty new ballad… (1592), tagged text (ballad)

    A lover’s complaint… (1615), tagged text (ballad)

    William Shakespeare, The merry wives… (1630), tagged text (drama)

    Susanna Centlivre, The wonder: a woman keeps… (1714), tagged text (drama)

    William Drummond, Tears on the death of Meliades… (1613), tagged text (elegy)

    Thomas Holcroft, Elegies… (1777), tagged text (elegy)

    An extraordinary collection… (1693), tagged text (list)

    Thomas Gray, A supplement to the tour… (1787), tagged text (list)

    Alexander Montgomerie, The cherrie and the slaye… (1597), tagged text (narrative verse)

    Henry Carey, The grumbletonians… (1727), tagged text (narrative verse)

    John Lyly, A whip for an ape… (1589), tagged text (poetry)

    Alexander Pope, Eloisa to Abelard… (1719), tagged text (poetry)

    Matthew Parker, The whole Psalter…(1567), tagged text (religious verse)

    Philip Doddridge, Hymns founded on various texts (1755), tagged text (religious verse)

    Alexander Craig, The amorose song… (1606), tagged text (verse collection)

    Hannah Cowley, The poetry of Anna Matilda (1788), tagged text (verse collection)

    Items from genres that score high on the Intersubjective side of PC2 include:

    Charlotte Charke, A narrative of the life… (1755), tagged text (autobiography)

    Laetitia Pilkington, Memoirs… (1748), tagged text (autobiography)

    William Shakespeare, The merry wives… (1630), tagged text (drama)

    Susanna Centlivre, The wonder: a woman keeps… (1714), tagged text (drama)

    Samuel Richardson, Pamela… (1741), tagged text (fictional prose)

    Fanny Burney, Cecilia… (1782), tagged text (fictional prose)

    Items from genres that score high on the Extrasubjective side of PC2 include:

    John Smith, An accidence…. (1626), tagged text (education)

    Charles Mowet, A direction to the husbandman… (1634), tagged text (education)

    Joachim Camerarius, [The history of strange wonders], (1561), tagged text (history)

    Edmund Burke, A short account… (1766), tagged text (history)

    An extraordinary collection… (1693), tagged text (list)

    Thomas Gray, A supplement to the tour… (1787), tagged text (list)

    Thomas Chaloner, A short discourse… (1584), tagged text (medicine)

    Robert Boil [Boyle], Medicinal experiments… (1692), tagged text (medicine)

    Henry Chettle, A true bill…(1603), tagged text (nonfictional prose)

    Kinki Abenezrah, An everlasting prognostication… (1625), tagged text (nonfictional prose)

    Church of England, Articles to be enquired… (1554), tagged text (religious decree)

    Church of England, Orders set down… (1629), tagged text (religious decree)

    Robert Norman, The new attractive… (1581), tagged text (science)

    Oliver Goldsmith, An history of the earth… (1774), tagged text (science)

    1. The VEP project created an implementation of Docuscope that can be used as an online utility with user supplied texts; it also supplied pre-computed LAT percentage counts for all TCP texts, using a spelling-standardized SimpleText corpus.

    2. On the creation of the 1080 subcorpus, see http://graphics.cs.wisc.edu/WP/vep/vep-early-modern-1080/. The genre labels used here were the precursor to those deployed in the VEP 1080 subcorpus; they are not available for download on the VEP site. We have made them available via our master .csv file.

    3. Note that there are other ways to establish the sorting power of the principal components against received genres. The decision limit for our analysis of means (ANOM) was 0.05.

    4. The persistence of that pattern should not be surprising if the 1080 sample is a close-to-random sample of the full corpus.

    5. The EEBO portion of the corpus covered 90% of the available unique titles, with a principled exclusion of only a few classes of items — illegible, largely non-textual (pictures, music, math), largely numeric (most almanacs, which also tended to the illegible), and largely non-English (thus excluding many large, expensive works like multilingual dictionaries and the Walton Polyglot). ECCO-TCP texts (eighteenth century), on the other hand, tended to favor authors whose works straddled the 17th and 18th centuries. Individual ECCO-TCP partners also requested particular items — medical texts, and works of fiction with Irish connections. About a third of the Evans texts (early American) were transcribed and added to the TCP corpus based on recommendations of “important” works by the American Antiquarian Society. This account of the contents of the TCP was provided by Paul Schaffner, whose assistance the authors gratefully acknowledge.

    6. The frequent presence of medical texts post 1800, transcribed at the request of some of the TCP institutions, is clearly affecting this measurement.

    7. I.A. Richards, Practical Criticism (London: Kegan Paul, 1930). See also Stephen Best and Sharon Marcus, “Surface Reading: An Introduction,” Representations 108(1): 1-21.

    8. We have, in other words, taken a bounded set of features and made them proxies for constraining historical events or developments (cataloguing variations, differing object survival rates, genre developments, political upheaval). On the feature/proxy distinction, see Michael Witmore, “Latour, The Digital Humanities, and the Divided Kingdom of Knowledge,” DOI: 10.1353/nlh.2016.0018.

  • Finding “Distances” Between Shakespeare’s Plays 2: Projecting Distances onto New Bases with PCA

    It’s hard to conceive of distance measured in anything other than a straight line. The biplot below, for example, shows the scores of Shakespeare’s plays on the two Docuscope LATs discussed in the previous post, FirstPerson and AbstractConcepts:

    [Scatterplot of the plays’ scores on FirstPerson and AbstractConcepts, with red lines linking selected items.]

    Plotting the items in two dimensions gives the viewer some general sense of the shape of the data. “There are more items here, less there.” But when it comes to thinking about distances between texts, we often measure straight across, favoring either a simple line linking two items or a line that links the perceived centers of groups.

    The appeal of the line is strong, perhaps because it is one dimensional. And brutally so. We favor the simple line because we want to see less, not more. Even if we are looking at a biplot, we can narrow distances to one dimension by drawing lines athwart the axes. The red lines linking points above — each the diagonal of a right triangle whose sides are parallel to our axes — will be straight and relatively easy to find. The line is simple, but its meaning is somewhat abstract because it spans two distinct kinds of distance at once.
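That diagonal is just the Pythagorean combination of the two per-axis differences. A toy sketch, with two invented coordinate pairs standing in for actual plays:

```python
import math

# Two hypothetical plays measured on the two LATs from the plot:
# (FirstPerson, AbstractConcepts), in mean deviation form.
play_a = (1.8, -0.4)
play_b = (-0.6, 0.9)

# The red line in the biplot is the hypotenuse of a right triangle whose
# legs are the differences on each axis.
dx = play_a[0] - play_b[0]
dy = play_a[1] - play_b[1]
distance = math.hypot(dx, dy)
print(round(distance, 3))  # 2.729
```

The abstraction the paragraph worries over is visible here: the single number mixes a FirstPerson difference with an AbstractConcepts difference, two distances of different kinds.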

    Distances between items become slightly less abstract when things are represented in an ordered list. Scanning down the “text_name” column below, we know that items further down have less of the measured feature than those further up. There is a sequence here and, so, an order of sorts:

    Screen Shot 2015-07-03 at 9.49.01 AM

    If we understand what is being measured, an ordered list can be quite suggestive. This one, for example, tells me that The Comedy of Errors has more FirstPerson tokens than The Tempest. But it also tells me, by virtue of the way it arranges the plays along a single axis, that the more FirstPerson Shakespeare uses in a play, the more likely it is that this play is a comedy. There are statistically precise ways of saying what “more” and “likely” mean in the previous sentence, but you don’t need those measures to appreciate the pattern.

    What if I prefer the simplicity of an ordered list, but want nevertheless to work with distances measured in more than one dimension? To get what I want, I will have to find some meaningful way of associating the measurements on these two dimensions and, by virtue of that association, reducing them to a single measurement on a new (invented) variable. I want distances on a line, but I want to derive those distances from more than one type of measurement.

    My next task, then, will be to quantify the joint participation of these two variables in patterns found across the corpus. Instead of looking at both of the received measurements (scores on FirstPerson and AbstractConcepts), I want to “project” the information from these two axes onto a new, single axis, extracting relevant information from both. This projection would be a reorientation of the data on a single new axis, a change accomplished by Principal Components Analysis or PCA.

    To understand better how PCA works, let’s continue working with the two LATs plotted above. Recall from the previous post that these are the Docuscope scores we obtained from Ubiqu+ity and put into mean deviation form. A .csv file containing those scores can be found here. In what follows, we will be feeding those scores into an Excel spreadsheet and into the open source statistics package “R” using code repurposed from a post on PCA at Cross Validated by Antoni Parellada.
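    The post performs this step in R and Excel; for readers who prefer Python, here is a minimal sketch of what “mean deviation form” means. The numbers below are invented stand-ins, not the actual Ubiqu+ity scores:

```python
import numpy as np

# Hypothetical raw scores for four plays on two Docuscope LATs
# (FirstPerson, AbstractConcepts); the real data come from the .csv
# of Ubiqu+ity scores linked above.
X_raw = np.array([
    [5.2, 1.1],
    [4.8, 1.9],
    [3.1, 2.7],
    [2.0, 3.4],
])

# "Mean deviation form": subtract each column's mean from its entries,
# so every variable is centered on zero.
X = X_raw - X_raw.mean(axis=0)

print(X.mean(axis=0))  # each column now averages (numerically) to zero
```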

    A Humanist Learns PCA: The How and Why

    As Hope and I made greater use of unsupervised techniques such as PCA, I wanted a more concrete sense of how it worked. But to arrive at that sense, I had to learn things for which I had no visual intuition. Because I lack formal training in mathematics or statistics, I spent about two years (in all that spare time) learning the ins and outs of linear algebra, as well as some techniques from unsupervised learning. I did this with the help of a good textbook and a course on linear algebra at Khan Academy.

    Having learned to do PCA “by hand,” I have decided here to document that process for others wanting to try it for themselves. Over the course of this work, I came to a more intuitive understanding of the key move in PCA, which involves a change of basis via orthogonal projection of the data onto a new axis. I spent many months trying to understand what this means, and am now ready to try to explain or illustrate it to others.

    My starting point was an excellent tutorial on PCA by Jonathon Shlens. Shlens shows why PCA is a good answer to a good question. If I believe that my measurements only incompletely capture the underlying dynamics in my corpus, I should be asking what new orthonormal bases I can find to maximize the variance across those initial measurements and, so, provide better grounds for interpretation. If this post is successful, you will finish it knowing (a) why this type of variance-maximizing basis is a useful thing to look for and (b) what this very useful thing looks like.

    On the matrix algebra side, PCA can be understood as the projection of the original data onto a new set of orthogonal axes or bases. As documented in the Excel spreadsheet and the tutorial, the procedure is performed on our data matrix, X, where entries are in mean deviation form (spreadsheet item 1). Our task is then to create a 2×2 covariance matrix S for this original 38×2 matrix X (item 2); find the eigenvalues and eigenvectors for this covariance matrix S (item 3); then use this new matrix of orthonormal eigenvectors, P, to accomplish the rotation of X (item 4). This rotation of X gives us our new matrix Y (item 5), which is the linear transformation of X according to the new orthonormal bases contained in P. The individual steps are described in Shlens and reproduced on this spreadsheet in terms that I hope summarize his exposition. (I stand ready to make corrections.)
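    The same sequence of steps (items 1–5 above) can be sketched in Python with NumPy. This is an illustrative translation, not the post’s own R code, and the data matrix here is a synthetic stand-in for the real 38×2 matrix of play scores:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the 38x2 mean-centered data matrix X
X = rng.normal(size=(38, 2))
X = X - X.mean(axis=0)                 # item 1: mean deviation form

S = np.cov(X, rowvar=False)            # item 2: 2x2 covariance matrix S
eigvals, eigvecs = np.linalg.eigh(S)   # item 3: eigenvalues/eigenvectors of S

# Order eigenvectors by descending eigenvalue; the columns of P are the
# new orthonormal bases (item 4).
order = np.argsort(eigvals)[::-1]
P = eigvecs[:, order]

Y = X @ P                              # item 5: rotation of X onto the new bases

# The covariance of Y is diagonal: the new variables are uncorrelated,
# with variances equal to the eigenvalues in descending order.
print(np.round(np.cov(Y, rowvar=False), 6))
```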

    The Spring Analogy

    In addition to exploring the assumptions and procedures involved in PCA, Shlens offers a suggestive concrete frame or “toy example” for thinking about it. PCA can be helpful if you want to identify underlying dynamics that have been both captured and obscured by initial measurements of a system. He stages a physical analogy, proposing the made-up situation in which the true axis of movement of a spring must be inferred from haphazardly positioned cameras A, B and C. (That movement is along the X axis.)

    Screen Shot 2015-07-02 at 6.50.53 AM

    Shlens notes that “we often do not know which measurements best reflect the dynamics of our system in question. Furthermore, we sometimes record more dimensions than we actually need!” The idea that the axis of greatest variance is also the axis that captures the “underlying dynamics” of the system is an important one, particularly in a situation where measurements are correlated. This condition is called multicollinearity. We encounter it in text analysis all the time.

    If one is willing to entertain the thought that (a) language behaves like a spring across a series of documents and (b) that LATs are like cameras that only imperfectly capture those underlying linguistic “movements,” then PCA makes sense as a tool for dimension reduction. Shlens makes this point very clearly on page 7, where he notes that PCA works where it works because “large variances have important dynamics.” We need to spend more time thinking about what this linkage of variances and dynamics means when we’re talking about features of texts. We also need to think more about what it means to treat individual documents as observations within a larger system whose dynamics they are assumed to express.

    Getting to the Projections

    How might we go about picturing this mathematical process of orthogonal projection? Shlens’s tutorial focuses on matrix manipulation, which means that it does not help us visualize how the transformation matrix P assists in the projection of the original matrix onto the new bases. But we want to arrive at a more geometrically explicit, and so perhaps intuitive, way of understanding the procedure. So let’s use the code I’ve provided for this post to look at the same data we started with. These are the mean-subtracted values of the Docuscope LATs AbstractConcepts and FirstPerson in the Folger Editions of Shakespeare’s plays.

    Screen Shot 2015-06-22 at 9.43.18 PM

    To get started, you must place the .csv file containing the data above into your R working directory, a directory you can change using the Misc. tab. Paste the entire text of the code into the R prompt window and press enter. Within that window, you will now see several means of calculating the covariance matrix (S) from the initial matrix (X) and then deriving eigenvectors (P) and final scores (Y) using both the automated R functions and “longhand” matrix multiplication. If you’re checking, the results here match those derived from the manual operations documented in the Excel spreadsheet, albeit with an occasional sign change in P. In the Quartz graphics device (a separate window), you will find five different images corresponding to five different views of this data. You can step through these images by pressing control and an arrow key at the same time.

    The first view is a centered scatterplot of the measurements above on our received or “naive bases,” which are our two Docuscope LATs. These initial axes already give us important information about distances between texts. I repeat the biplot from the top of the post, which shows that according to these bases, Macbeth is the second “closest” play to Henry V (sitting down and to the right of Troilus and Cressida, which is first):

    Screen Shot 2015-06-22 at 10.01.43 PM

    Now we look at the second image, which adds to the plot above a line that is the eigenvector corresponding to the highest eigenvalue for the covariance matrix S. This is the line that, by definition, maximizes the variance in our two dimensional data:

    Screen Shot 2015-07-02 at 10.52.32 PM

    You can see that each point is projected orthogonally onto this new line, which will become the new basis or first principal component once the rotation has occurred. This maximum is calculated by summing the squared distances of each perpendicular intersection point (where gray meets red) from the mean value at the center of the graph. This red line is like the single camera that would “replace,” as it were, the haphazardly placed cameras in Shlens’s toy example. If we agree with the assumptions made by PCA, we infer that this axis represents the main dynamic in the system, a key “angle” from which we can view that dynamic at work.
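    The variance-maximizing property of that red line can be checked numerically: no other direction through the origin yields a larger variance of projections than the top eigenvector of S. A sketch, with synthetic correlated data standing in for the play scores:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic correlated two-dimensional data, mean-centered
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.6], [0.6, 0.5]])
X = X - X.mean(axis=0)

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
v1 = eigvecs[:, np.argmax(eigvals)]    # first principal direction (unit vector)

best = np.var(X @ v1)                  # variance of the projections onto v1

# No random unit vector does better than the eigenvector
for _ in range(1000):
    u = rng.normal(size=2)
    u /= np.linalg.norm(u)
    assert np.var(X @ u) <= best + 1e-9

print("no random direction exceeded the eigenvector's variance")
```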

    The orthonormal assumption makes it easy to plot the next line (black), which is the eigenvector corresponding to our second, lesser eigenvalue. The measured distances along this axis (where gray meets black) represent scores on the second basis or principal component, which by design eliminates correlation with the first. You might think of the variance along this line as the uncorrelated “leftover” from that which was captured along the first new axis. As you can see, intersection points cluster more closely around the mean point in the center of this line than they did around the first:

    Screen Shot 2015-07-02 at 11.09.16 PM

    Now we perform the change of basis, multiplying the initial matrix X by the transformation matrix P. This projection (using the gray guide lines above) onto the new axis is a rotation of the original data around the origin. For the sake of explication, I highlight the resulting projection along the first component in red, the axis that (as we remember) accounts for the largest amount of variance:

    Screen Shot 2015-07-02 at 11.18.41 PM

    If we now force all of our dots onto the red line along their perpendicular gray pathways, we eliminate the second dimension (Y axis, or PC2), projecting the data onto a single line, which is the new basis represented by the first principal component.

    Screen Shot 2015-07-02 at 11.44.42 PM

    We can now create a list of the plays ranked, in descending order, on this first and most principal component. This list of distances represents the reduction of the two initial dimensions to a single one, a reduction motivated by our desire to capture the most variance in a single direction.
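    The ranked list itself is just a descending sort of the PC1 scores. A sketch, with invented play names and scores in place of the Folger values:

```python
import numpy as np

plays = ["Play A", "Play B", "Play C", "Play D"]
pc1 = np.array([1.7, -0.4, 2.3, -1.1])   # hypothetical PC1 scores

# Descending order on the first (and most principal) component
order = np.argsort(pc1)[::-1]
ranked = [(plays[i], float(pc1[i])) for i in order]
for name, score in ranked:
    print(f"{name}\t{score:+.2f}")
```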

    How does this projection change the distances among our items? The comparison below shows the measurements, in rank order, of the far ends of our initial two variables (AbstractConcepts and FirstPerson) and of our new variable (PC1). You can see that the plays have been re-ordered and the distances between them changed:

    Screen Shot 2015-07-03 at 12.14.04 AM

    Our new basis, PC1, looks like it is capturing some dynamic that we might connect to what the creators of the First Folio (1623) labeled as “comedy.” When we look at similar ranked lists for our initial two variables, we see that individually they too seemed to be connected with “comedy,” in the sense that a relative lack of one (AbstractConcepts) and an abundance of the other (FirstPerson) both seem to contribute to a play’s being labelled a comedy. Recall that these two variables showed a negative covariance in the initial analysis, so this finding is unsurprising.

    But what PCA has done is combine these two variables into a new one, which is a linear combination of the scores according to weighted coefficients (found in the first eigenvector). If you are low on this new variable, you are likely to be a comedy. We might want to come up with a name for PC1, which represents the combined, re-weighted power of the first two variables. If we call it the “anti-comedy” axis — you can’t be comic if you have a lot of it! — then we’d be aligning the sorting power of this new projection with what literary critics and theorists call “genre.” Remember that aligning these two things is not the same as saying one is the cause of the other.
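    Concretely, each play’s PC1 score is just a weighted sum of its two centered LAT scores, with the weights taken from the first eigenvector. The weights and scores below are made up for illustration; the real ones come from the eigenvector matrix P:

```python
# Hypothetical first-eigenvector loadings for
# (AbstractConcepts, FirstPerson)
w_abstract, w_first = 0.83, -0.56

# One play's mean-centered scores on the two LATs (invented numbers)
abstract, first = -1.2, 2.0

# PC1 score = linear combination weighted by the eigenvector's entries
pc1_score = w_abstract * abstract + w_first * first
print(round(pc1_score, 3))  # → -2.116
```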

    With a sufficient sample size, this procedure for reducing dimensions could be performed on a dozen measurements or variables, transforming that naive set of bases into principal components that (a) maximize the variance in the data and, one hopes, (b) call attention to the dynamics expressed in texts conceived as “system.” If you see PCA performed on three variables rather than two, you should imagine the variance-maximizing projection above repeated with a plane in the three dimensional space:

    orthoregdemo_02

    Add yet another dimension, and you can still find the “hyperplane” which will maximize the variance along a new basis in that multidimensional space. But you will not be able to imagine it.
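    Even if the hyperplane cannot be imagined, the recipe generalizes mechanically: with a dozen variables, one simply keeps the top two or three columns of P. A sketch on synthetic 12-variable data:

```python
import numpy as np

rng = np.random.default_rng(2)
# Synthetic corpus: 100 texts measured on 12 variables
X = rng.normal(size=(100, 12))
X = X - X.mean(axis=0)

S = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]

# Keep only the top two principal directions: a 12 -> 2 reduction
P2 = eigvecs[:, order[:2]]
Y = X @ P2
print(Y.shape)  # → (100, 2)
```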

    Because principal components are mathematical artifacts — no one begins by measuring an imaginary combination of variables — they must be interpreted. In this admittedly contrived example from Shakespeare, the imaginary projection of our existing data onto the first principal component, PC1, happens to connect meaningfully with one of the sources of variation we already look for in cultural systems: genre. A corpus of many more plays, covering a longer period of time and more authors, could become the basis for still more projections that would call attention to other dynamics we want to study, for example, authorship, period style, social coterie or inter-company theatrical rivalry.

    I end by emphasizing the interpretability of principal components because we humanists may be tempted to see them as something other than mathematical artifacts, which is to say, something other than principled creations of the imagination. Given the data and the goal of maximizing variance through projection, many people could come up with the same results that I have produced here. But there will always be a question about what to call the “underlying dynamic” a given principal component is supposed to capture, or even about whether a component corresponds to something meaningful in the data. The ongoing work of interpretation, beginning with the task of naming what a principal component is capturing, is not going to disappear just because we have learned to work with mathematical — as opposed to literary critical — tools and terms.

    Axes, Critical Terms, and Motivated Fictions

    Let us return to the idea that a mathematical change of basis might call our attention to an underlying dynamic in a “system” of texts. If, per Shlens’s analogy, PCA works by finding the ideal angle from which to view the oscillations of the spring, it does so by finding a better proxy for the underlying phenomenon. PCA doesn’t give you the spring, it gives you a better angle from which to view the spring. There is nothing about the spring analogy or about PCA that contradicts the possibility that the system being analyzed could be much more complicated — could contain many more dynamics. Indeed, there is nothing to stop a dimension reduction technique like PCA from finding dynamics that we will never be able to observe or name.

    Part of what the humanities do is cultivate empathy and a lively situational imagination, encouraging us to ask, “What would it be like to be this kind of person in this kind of situation?” That’s often how we find our way into plays, how we discover “where the system’s energy is.” But the humanities is also a field of inquiry. The enterprise advances every time someone refines one of our explanatory concepts and critical terms, terms such as “genre,” “period,” “style,” “reception,” or “mode of production.”

    We might think of these critical terms as the humanities equivalent of a mathematical basis on which multidimensional data are projected. Saying that Shakespeare wrote “tragedies” reorients the data and projects a host of small observations on a new “axis,” as it were, an axis that somehow summarizes and so clarifies a much more complex set of comparisons and variances than we could ever state economically. Like geometric axes, critical terms such as “tragedy” bind observations and offer new ways of assessing similarity and difference. They also force us to leave things behind.

    The analogy between a mathematical change of basis and the application of critical terms might even help explain what we do to our colleagues in the natural and data sciences. Like someone using a transformation matrix to re-project data, the humanist introduces powerful critical terms in order to shift observation, drawing some of the things we study closer together while pushing others further apart. Such a transformation or change of basis can be accomplished in natural language with the aid of field-structuring analogies or critical examples. Think of the perspective opened up by Clifford Geertz’s notion of “deep play,” or his example of the Balinese cock fight, for example. We are also adept at making comparisons that turn examples into the bases of new critical taxonomies. Consider how the following sentence reorients a humanist problem space: “Hamlet refines certain tragic elements in The Spanish Tragedy and thus becomes a representative example of the genre.”

    For centuries, humanists have done these things without the aid of linear algebra, even if matrix multiplication and orthogonal projection now produce parallel results. In each case, the researcher seeks to replace what Shlens calls a “naive basis” with a motivated one, a projection that maps distances in a new and powerful way.

    Consider, as a final case study in projection, the famous speech of Shakespeare’s Jacques, who begins his Seven Ages of Man speech with the following orienting move: “All the world’s a stage, / And all the men and women merely players.” With this analogy, Jacques calls attention to a key dynamic of the social system that makes Shakespeare’s profession possible — the fact of pervasive play. Once he has provided that frame, the ordered list of life roles falls neatly into place.

    This ability to frame an analogy or find an orienting concept — the world is a stage, comedy is a pastoral retreat, tragedy is a fall from a great height, nature is a book — is something fundamental to humanities thinking, yet it is necessary for all kinds of inquiry. Improvising on a theme from Giambattista Vico, the intellectual historian Hans Blumenberg made this point in his work on foundational analogies that inspire conceptual systems, for example the Stoic theater of the universe or the serene Lucretian spectator looking out on a disaster at sea. In a number of powerful studies — Shipwreck with Spectator, Paradigms for a Metaphorology, Care Crosses the River — Blumenberg shows how analogies such as these come to define entire intellectual systems; they even open those systems to sudden reorientation.

    We certainly need to think more about why mathematics might allow us to appreciate unseen dynamics in social systems, and how critical terms in the humanities allow us to communicate more deliberately about our experiences. How startling that two very different kinds of fiction — a formal conceit of calculation and the enabling, partial slant of analogy — help us find our way among the things we study. Perhaps this should not be surprising. As artifacts, texts and other cultural forms are staggeringly complex.

    I am confident that humanists will continue to seek alternative views on the complexity of what we study. I am equally confident that our encounters with that complexity will remain partial. By nature, analogies and computational artifacts obscure some things in order to reveal other things: the motivation of each is expressed in such tradeoffs. And if there is no unmotivated view on the data, the true dynamics of the cultural systems we study will always withdraw, somewhat, from the lamplight of our descriptive fictions.

     

  • Adjacencies, Virtuous and Vicious, and the Forking Paths of Library Research

    Folger Secondary Stacks, western view

    Browsable stacks – shelves of books that you can actually look at, pull off the shelf, read a while, and put back. They’re wonderful. Folger readers regularly comment on the fact that they can walk freely through the stacks of the secondary collection, which in our case means books published after 1830. That collection is arranged by Library of Congress call number, and many know the system intuitively after years of library work. (I frequently find myself in the PRs and PNs.)

    Recently I was looking through section PN6420.T5 for books on early modern proverbs, a topic I have been writing about for years. I was looking for Morris Palmer Tilley’s collection, A Dictionary of Proverbs in England in the Sixteenth and Seventeenth Centuries (Ann Arbor: University of Michigan Press, 1950). There it was, right where it was supposed to be: a landmark piece of scholarship that is the first source for anyone interested in the topic. Yet this was only the first stop. On the shelves above and below this important source were about 30 other books on the subject, some of which I began to explore. Some very useful books turned up next to the one I had initially intended to find. Some of them have even turned up in my footnotes, the ultimate test, perhaps, of a book’s usefulness to a scholar.

    Stack browsers are on the lookout for this kind of happy accident. You go into the stacks looking for this book, but another one, more interesting, happens to be nearby. Now you can have a look, nibble around the edges of the promising title, which is an excellent form of procrastination if you are stuck or unready to begin writing. Having done my share of meandering in open stacks, I am intrigued when readers describe these moments of discovery – which after all are part of the natural progression of research – as happy accidents or the products of chance. Aren’t accidents things that you cannot, by definition, bring about or encourage?

    The fact remains that libraries are set up to make such accidents happen. They arrange books on the shelves in a certain way – not at random, but on a plan designed to increase the likelihood that, nearby the book you think you want, there will be others you also want to read. When someone says, “and then I happened upon this great book,” they may be describing the advantages of the library’s structured arrangement of books by (say) subject matter. Partly through their classification systems, partly through the physical arrangement of their spaces, libraries are designed to promote “lucky finds.”

    Such “encouragable accidents” are really the consequence of a simple principle that governs the entire space of the library: that of structured adjacency. As I will try to show in a moment, this principle can be seen at work in both the physical spaces of the stacks and the digital discovery spaces designed to give us access to the collection. The root of the word adjacency is the Latin verb jacere, which means to throw. When books appear side by side on a library shelf, their adjacency is not a product of chance: they have been placed (hopefully not thrown) together so that one is next to another of similar kind. How might one structure such adjacencies? One technique would be to shelve books by size. In some medieval monasteries, books of a similar size were placed on the same shelf. In addition to saving shelf space (think about it), this arrangement located collection access in the mind of the librarian or keeper who knew where different titles were. These collections weren’t designed to be browsed, so the principle made sense.

    Now think of a modern, browsable stack of books arranged along the Library of Congress call number model. Here the principle of access exists in two places: the launching point of the card catalogue (which tells you where in the stacks to start looking) and then on the shelves themselves, where books on similar subjects are grouped together. The idea here is to use the intellectual scaffolding of subject cataloguing to structure the physical space of the collection. With respect to subject, physical adjacencies on the shelf become virtuous instead of vicious.

    What is a virtuous adjacency? It is a collocation of two items likely to appeal to any-user-whatever whose item search is itself structured along principles which the cataloguing supports: usually author, date, title, subject, although there are many other forms of search. It doesn’t matter who you are or how deep your knowledge of the subject is: if you know enough to find one book on proverbs, you can find many in the Library of Congress system, because you are helped along by the arrangement in the physical space of the library. That arrangement is principled and intentional. It is virtuous.

    But every virtuous adjacency can quickly become vicious, and this is because virtue (as I’m calling it) resides in the principles that inform any given reader’s search for a book. Suppose I know about Tilley’s book on proverbs, and I know it by title. Once I am pointed to that book by the catalogue, I go and look at it, and I see some terrific proverbs about apes, for example, “To make her husband her ape.”  I start to think about this. Maybe what I’m really interested in is how the behavior of apes helps people think about the nature of mimicry and mimesis in the early modern period. (Early modern references to apes are often veiled references to the mimetic power of artists, who “ape” nature.)

    Proverb from Tilley's A Dictionary of Proverbs in England

    Now the principle that governs the space flips. What I need to do is go to H. W. Janson’s magnificent Apes and Ape Lore in the Middle Ages and Renaissance, which has the call number GR730.A6 J3. What made the first adjacency surrounding “books about proverbs” virtuous was the collocation of books in space by subject. That was where the manufactured serendipity happened. But now that very principle of adjacency has become an impediment – it has become vicious – because Tilley is not surrounded by books about apes. I could search again under the latter subject, but that would not be adjacency, it would be search. We advert to catalogues in order to re-orient ourselves within the physical universe of books-on-shelves, or the virtual space of digital collections. But we cannot simply wander into that next thing that meets our new interest. To do this, I really would have to be lucky: “Oh look, there’s Janson’s book on apes, just lying across the aisle….”

    The moral of this story – or is it the proverb? – is that “every virtuous adjacency is also vicious.” When it comes to the arrangement of books, virtue is relative: it depends upon what the researcher thinks he or she is looking for, a thinking that often changes in the course of research. Once you’ve flipped from proverbs to apes, the physical arrangement of books on shelves is not going to help you. The virtuous arrangement that allowed you to lay your hands on that first book (“hey, my favorite book on proverbs!”) is now working against you (“shouldn’t I be looking at books about apes?”).

    As we gain greater access to the contents of books; as digitized books and their machine actionable contents become more and more arrangeable with the assistance of mathematical principles like the topic model, the physical space of search is being transformed into something more plastic, even Borgesian. While the physical space of the library cannot be re-plotted whenever the research forks out onto another garden path, researchers have more options in the virtual space of text searching to find cut-throughs. There is a problem here, of course, which is that in such a virtual world of association, there are infinite pathways for association. It becomes more challenging to figure out where to go next when you could go anywhere.

    But there may be other ways to multiply virtuous pairings given the tools that librarians of the future will create. Instead of starting with Tilley and then hoping that I’ll be lucky enough to bump into Janson, I might rely on my mobile device to reach into the contents of the book I’m interested in now and, based on a principle of adjacency I supply, rearrange all the books in the library around that first book in concentric layers of immediacy of different types – layers that might allow readers to move from one virtuous adjacency to the next. There is no way around the virtuous/vicious symmetry, since it is precisely that symmetry which makes research necessary: in exploring the connection between these five books on proverbs, you are giving up the opportunity to think about that other, really, really good book about apes. (You can tell I wish I’d found Janson earlier.) What makes an adjacency for one research task virtuous makes that adjacency vicious for the next.

    That’s why answers to research questions do not turn up instantly. You have to decide when to shift directions, and the physical layout of library stacks according to a single principle of adjacency (e.g., subject cataloguing) is going to sustain some inquiries while simultaneously shutting down others. No amount of dynamic text search is going to put an end to the virtuous/vicious circle: their pairing represents a real constraint on knowledge – the fact that thinking is progressive, and moves on discrete pathways – rather than a technological or physical limitation to be overcome.

    That is not to say that there aren’t new ways of mapping adjacencies among digitized texts. Abstract models of the contents of books such as topic models, however, do offer us other pathways in the research process; they are an additional principle of adjacency that we can invoke if we don’t want to “jump the hedge” by consulting a book’s footnotes (say) and then searching for new items based on the titles referenced there. (On topic models, see Ted Underwood’s very helpful blog post.) We have been using topic models in the Wisconsin VEP project to look at our collections of texts, and they do seem to open up adjacencies that we would never have thought about. (An upcoming blog post will deal with the relationship between the novel and English moral philosophy.) A topic model can suggest, for any given book or passage, another book or passage that might be relevant for reasons only a user could recognize (but might not be able spontaneously to supply). As with other techniques of dimension reduction (e.g., PCA, factor analysis), there may be more topics than we can name or recognize: a topic does not become a principle of association until a human being recognizes and affirms that principle in action.
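    A minimal sketch of how such a “principle of adjacency” might work in practice: represent each book as a vector of topic proportions and rank the other books by cosine similarity in topic space. Every title and topic vector here is invented for illustration:

```python
import numpy as np

# Hypothetical topic proportions for four books over three topics
books = {
    "Tilley, Dictionary of Proverbs":  np.array([0.7, 0.2, 0.1]),
    "Janson, Apes and Ape Lore":       np.array([0.1, 0.6, 0.3]),
    "A treatise on mimesis":           np.array([0.1, 0.5, 0.4]),
    "A hunting manual":                np.array([0.3, 0.1, 0.6]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def neighbors(title):
    """Rank the other books by similarity to `title` in topic space."""
    query = books[title]
    others = [(t, cosine(query, v)) for t, v in books.items() if t != title]
    return sorted(others, key=lambda pair: pair[1], reverse=True)

for title, score in neighbors("Janson, Apes and Ape Lore"):
    print(f"{score:.3f}  {title}")
```

    Unlike a fixed shelf order, this adjacency is recomputed around whichever book the reader is holding, which is the “concentric layers” idea in miniature.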

    If libraries are gardens with many forking paths, the hedges that separate those paths are absolutely real. Even a fully virtual, instantly re-arrangeable virtual rendering of our shelf spaces will not put an end to vicious adjacencies, since they too will become virtuous if research takes a new turn.  Our challenge is not a physical one; it’s not even computational. In a future library where any two books could be placed alongside one another in an instant, we might never find anything we want to read.

    The task of library research is not simply that of poking around clusters of items on a shelf, or more grandly, finding ways of reclustering books continuously in hopes of finding the ultimate, virtuous arrangement. There is no Leibnizian, maximally virtuous arrangement of books, and never will be. (Leibniz must have hit upon this melancholy thought when he was librarian at Wolfenbüttel.)

    But there are more or less definite lines of thought, each on its way to becoming other, equally definite, lines of thought. There is no point in celebrating the fact that such lines can fork off in an infinite number of directions. We know already that a researcher can only follow one of them at a time.