Category: Visualizing English Print (VEP)

  • Visualizing English Print, 1530-1800, Genre Contents of the Corpus

    Screen Shot 2013-12-10 at 10.57.35 AM


    Some features of the corpus, visualized here over time. Many of the linguistic and topical trends that we find in this data set will express the state of the corpus at a given moment in time. I have divided the time series into groups of three decades apiece. The visualization above displays the relative number of titles in each genre class (what we are calling “Derived Genre”) on the Y axis. Note that the size of the segments in the stacked bar graphs does not represent the length of the texts involved: vertical size represents the number of items in that time period that have been so labeled. The names at left give the order in which the different genres have been stacked, from bottom to top; at right is an alphabetical list of the genres designated by their colors.
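
    The three-decade binning behind the stacked bars can be sketched as follows. The records, field layout, and bin width here are illustrative assumptions, not the project's actual code; the real metadata lives in the corpus spreadsheet's “Genre” column.

```python
from collections import Counter

# Hypothetical (title, year, genre) records standing in for rows of
# the corpus spreadsheet.
records = [
    ("A Sermon", 1542, "Religious Prose"),
    ("A Godly Ballad", 1555, "Ballad"),
    ("The History of Tom", 1745, "Fictional Prose"),
    ("A Comic Opera", 1796, "Drama"),
]

def period(year, start=1530, width=30):
    """Bin a publication year into a three-decade period label."""
    lo = start + ((year - start) // width) * width
    return f"{lo}-{lo + width - 1}"

# Count items per (period, genre) pair; each count is the height of
# one segment in the stacked bar graph.
counts = Counter((period(year), genre) for _, year, genre in records)
print(counts[("1530-1559", "Religious Prose")])  # → 1
```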

    An obvious trend that will be borne out in the analyses that follow: the corpus contains a lot of what we are calling Religious Prose items from 1530-1709, but after this point the number of Religious Prose items declines. After 1709, we see more drama (after a compression during the closure of the theaters at mid-century) in the aqua color, along with more Nonfictional Prose (cyan), Fictional Prose (rose), and perhaps Verse Collections (magenta) as well. These shifts in the relative proportions of different genres should not be taken as representative of everything printed during each of these periods. The corpus itself is a small sample of that larger field. But because the selection of texts was random, excluding only items of fewer than 500 words, we expect some trends within the corpus to be representative of larger trends in print culture. As the project develops, we should have a better sense of just what you can learn from 1080 texts in a field that is much larger.

    Since we measured word types (according to Docuscope Junior LATs or topics) as a proportion of all words in a given text, the length of individual texts should not significantly affect the distribution of those types of words. Nevertheless, it helps to see the lengths of different kinds of texts. Our text tagger treats the spaces between words as tokens so that word combinations can be accommodated or excluded; the sum of tokens on the y-axis thus includes these spaces alongside the words and punctuation:

    Screen Shot 2013-12-10 at 11.16.56 AM
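
    As a minimal sketch of how such proportions behave when spaces count as tokens: the tokenizer and the tag set below are illustrative assumptions, not DSJ itself.

```python
import re

def tokenize(text):
    """Split text into word, punctuation, and single-space tokens;
    counting the spaces mimics a tagger that treats inter-word gaps
    as tokens in their own right."""
    return re.findall(r"\w+|[^\w\s]|\s", text)

def tag_proportion(text, tagged_words):
    """Share of ALL tokens (words, punctuation, spaces) whose form
    belongs to the tagged word set."""
    tokens = tokenize(text)
    hits = sum(1 for t in tokens if t.lower() in tagged_words)
    return hits / len(tokens)

# "God save the king." -> 8 tokens (4 words, 3 spaces, 1 period),
# two of which fall in the toy tag set.
print(tag_proportion("God save the king.", {"god", "king"}))  # → 0.25
```

    Because the denominator is the full token count rather than the word count alone, a long text and a short text with the same rate of tagged words receive the same score.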

    Unsurprisingly, Religious Prose texts tend to be quite long, whereas Ballads are almost always short. Fictional Prose texts are reasonably long, as are Biographical texts. All of this is in keeping with what I, at least, would expect from texts belonging to these genres. It should be noted that the genre designations you see in our spreadsheet are subjective and so arguable. We asked a member of our team, Jason Whitt, to apply them. Someone else looking at this corpus might come up with different designations, and we understand that there will be debate on this score. You can see the full list of our classifications by consulting the “Genre” column in the spreadsheet of the corpus.

    Noting the relative decline in the number of items designated as Religious Prose in the corpus over time, I became curious about a seemingly parallel decline in words that bear the DSJ tag “Common Authorities”: words mentioning entities invested with some type of communally sanctioned power (God, king, church, etc.). You can see the declining proportion of such words in texts as the decades roll on. In the chart below, the position of the dots on the y-axis shows the percentage of tokens in each text (each dot) that were given the “Common Authorities” tag. Please note that dramatic texts in this graph have been colored red.

    Screen Shot 2013-12-10 at 10.18.31 AM
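
    One way to summarize the decline the chart shows is to average the per-text percentages within each decade. A sketch, assuming hypothetical (year, percent) pairs rather than the project's actual .csv values:

```python
from statistics import mean

# Hypothetical per-text values: (publication year, percent of tokens
# tagged "Common Authorities"); real figures come from the corpus .csv.
texts = [(1535, 1.8), (1538, 2.2), (1650, 1.1), (1792, 0.5), (1795, 0.3)]

def decade_means(records):
    """Average the per-text tag percentage within each decade."""
    by_decade = {}
    for year, pct in records:
        by_decade.setdefault(year // 10 * 10, []).append(pct)
    return {decade: mean(vals) for decade, vals in sorted(by_decade.items())}

print(decade_means(texts))
```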

    As always, we like to see the tokens in action, so I offer a sample text that is high in this variable. The text is The Epiphanie of the Church (1590), and it can be consulted below.

    A00748_1590_Theepiphanieofthechu.txt

  • Visualizing English Print, 1530-1800: The Corpus, Tag Sets, and Topics

    Screen Shot 2013-11-27 at 4.28.21 PM
    Visualization of the corpus using a topic model. A prototype of this corpus exploration and visualization tool, Serendip, was in use this fall at the Folger Shakespeare Library. It was designed by Eric Alexander.

    Here begins a series of posts on a larger dataset we have been studying at Wisconsin under the auspices of “Visualizing English Print, 1530-1800,” a Mellon-funded research project that brings together computer scientists and literary scholars from several institutions: UW-Madison, Strathclyde University (U.K.), and the Folger Shakespeare Library. A profile of the project members appears here.

    I would like to begin by posting the data set that we are working with, which consists of texts drawn from the EEBO-TCP corpus. We assembled the corpus by drawing 40 texts at random from 27 decades within the corpus, beginning with the decade 1530-1539 and ending with 1790-1799. Texts under 500 words in length were excluded from selection. We knew that this selection could not be truly random, since the TCP project selected texts for transcription that it felt would be of interest to scholars. (Full disclosure: I am on the TCP Executive Board.) But we did want to frustrate the natural urge to pick texts we knew and liked: that would limit the kind of lexical and generic variation we want to study.
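
    The sampling procedure can be sketched as follows. The metadata records, field names, and seed here are illustrative assumptions; the real draw was made against the EEBO-TCP catalogue, with 40 texts per decade rather than the toy quota shown.

```python
import random

def sample_by_decade(metadata, per_decade=40, min_words=500, seed=0):
    """Draw a fixed number of texts at random from each decade,
    excluding texts below the word-count floor."""
    rng = random.Random(seed)
    by_decade = {}
    for rec in metadata:
        if rec["words"] >= min_words:
            by_decade.setdefault(rec["year"] // 10 * 10, []).append(rec)
    sample = []
    for decade in sorted(by_decade):
        pool = by_decade[decade]
        sample.extend(rng.sample(pool, min(per_decade, len(pool))))
    return sample

# Toy metadata: three eligible texts and one too short to qualify.
metadata = [
    {"id": "A1", "year": 1532, "words": 1200},
    {"id": "A2", "year": 1535, "words": 450},   # excluded: under 500 words
    {"id": "A3", "year": 1538, "words": 9000},
    {"id": "B1", "year": 1791, "words": 700},
]
picked = sample_by_decade(metadata, per_decade=2)
print(sorted(rec["id"] for rec in picked))  # → ['A1', 'A3', 'B1']
```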

    These texts are now being tagged and analyzed according to several different schemes. Members of our team have created tools for visualizing patterns of variation within the corpus, tools that we want eventually to share with others. There are two techniques I want to discuss: analysis of texts using a tagging scheme derived from the original Docuscope (what we call “Docuscope Junior”) and analysis of texts using a topic model. The former was implemented by Mike Gleicher, the latter by Eric Alexander.

    A .csv.zip file showing the texts and their scores on both the DSJ variables and the topics (each in ascending alphabetical order) can be downloaded at the following location: 1080forWDS. A zipped copy of a folder containing HTML pages of all the documents in the dataset, tagged with DSJ, can be downloaded here. (Note that some of these files, for example Ralegh’s History of the World, are very large and may swamp your browser.) These pages were generated by a utility called Ubiqu+ity, a Wisconsin-designed tagging tool that implements the DSJ tagging scheme. Ubiqu+ity can also tag a corpus with a user-defined tagging scheme, which should enable us to try different schemes on the same corpus.

    Now a first picture of the corpus, showing a Docuscope LAT (Language Action Type) that increases steadily over time in the corpus:

    Screen Shot 2013-11-27 at 3.28.49 PM

    Perhaps this is not a surprising trend, but we were pleased to see something right away that made sense. Red dots in this graph are dramatic texts; blue dots are all other texts. Here is an example of a text encoded by DSJ that is high on this variable. In the HTML page, you can scroll down the tag list at left, click on SubjectivePercept, and all of the items so tagged will be highlighted. The text is Abroad and at home: A comic opera, in three acts. Now performing at the Theatre-Royal, Covent-Garden. By J. G. Holman, with a title page publication date of 1796; it is the red dot elevated well above the trend line at the far right. You can view it here:

    K035834.000_1796_AbroadandathomeAcomi_Docuscope

    Please note that the texts passed through DSJ were modernized using VARD 2, developed at Lancaster. The topic model was implemented on texts that were not modernized, although data from both trials are included in the .csv file above.