Guest Post: Don’t go breaking (up) my genre: visualizing genre against attributes

Danielle Griffin is a research assistant on her third co-op term at The Life of Words. This is the first of a few posts based on her last work-term report, “Comparative Data Visualizations of Textual Features in the OED and the Life of Words Genre 3.0 Tagging System”. Danielle’s report won the Quarry Integrated Communication Co-op English Award.

During my last work placement as a full-time research assistant at The Life of Words last fall, I began toying with ways to situate our genre categorization system (known to us as G3.0) within genre theory more generally. Drawing on work by David Lee (see his extensive analysis of genre and register characteristics of the BNC) and Ted Underwood’s work with HathiTrust’s corpus, I took a bottom-up approach to genre, describing and characterizing texts rather than categorizing them.

To do this, I identified a set of 22 evenly weighted, (mostly) non-exclusive features, or attributes, that might apply to any text. Each text in my data set, derived from a list of OED sources, was then marked according to these criteria, so that each text has its own 22-feature combination (represented as True or False for each). From there I built some data visualizations, to have a look at the relationships between us human taggers and G3.0 tagging conventions, among the G3.0 genre categories themselves, and between the OED and textual genre generally.
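To make the idea of a feature combination concrete, here is a minimal sketch (not the code used for the report). It assumes each text is stored as a tuple of True/False values and that "evenly weighted" means every feature counts equally when two texts are compared, so the distance between two texts is simply the number of features on which they disagree (a Hamming distance). The text names and the five-feature vectors are hypothetical stand-ins for the real 22-feature data.

```python
from typing import Sequence


def hamming_distance(a: Sequence[bool], b: Sequence[bool]) -> int:
    """Count the features on which two texts disagree (each feature weighs 1)."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Hypothetical texts with 5 features each (the real data set uses 22).
texts = {
    "text_a": (True, True, False, False, True),
    "text_b": (True, True, False, True, True),
    "text_c": (False, False, True, True, False),
}

d_ab = hamming_distance(texts["text_a"], texts["text_b"])
d_ac = hamming_distance(texts["text_a"], texts["text_c"])
d_bc = hamming_distance(texts["text_b"], texts["text_c"])
```

Under this assumption, two texts with identical feature combinations sit at distance 0, and two texts that disagree on everything sit at distance equal to the number of features.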

Here’s what looks a bit like a family tree, twisted around into a circle. It’s a branched diagram representing the implied “family” relationship among items, based upon their similarities and differences. Really, this is just a made-up hierarchical dendrogram, rotated on the point of the first node for easy reading. (It’s unidirectional, running clockwise, meaning the end of the tree, at 3 o’clock, is actually the farthest point from the beginning, just below it.) [You can click on the image to make it bigger, or download a pdf]
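For readers curious how a tree like this can be grown from True/False features, here is a small sketch under stated assumptions. The post does not say which clustering method or distance measure produced the dendrogram, so single-linkage agglomerative clustering over Hamming distance is my assumption here, and the IDs and four-feature vectors are hypothetical. The result is a nested tuple, where each pair of parentheses is one branch point.

```python
from itertools import combinations


def hamming(a, b):
    """Number of features on which two texts disagree."""
    return sum(x != y for x, y in zip(a, b))


def leaves(tree):
    """Return the text IDs at the endpoints of a (sub)tree."""
    if isinstance(tree, str):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def build_tree(features):
    """Repeatedly merge the two closest clusters (single linkage)."""
    clusters = sorted(features)  # every text starts as its own leaf

    def dist(t1, t2):
        # Single linkage: distance between the closest pair of members.
        return min(
            hamming(features[i], features[j])
            for i in leaves(t1)
            for j in leaves(t2)
        )

    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: dist(clusters[p[0]], clusters[p[1]]),
        )
        merged = (clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0]


# Hypothetical IDs and 4-feature vectors (the real data uses 22 features).
features = {
    "t1": (True, True, False, False),
    "t2": (True, True, False, True),
    "t3": (False, False, True, True),
}

tree = build_tree(features)
```

In this toy case "t1" and "t2" differ on only one feature, so they join first, and "t3" attaches to that pair at the last step. A plotting library would then draw the nested structure as branches.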

For now, it’s best to ignore the colours of the branches [see my next post], and focus on the individual endpoints of the dendrogram. Each is one text in the data set, represented by a unique ID number around the circumference. Those IDs are colour coded, with each colour corresponding to the G3.0 tag we assigned to that text in the course of tagging OED quotations. For example, the pink chunk of IDs at about 3 o’clock on the dendrogram has been tagged entirely in the G3.0 “Verse Drama” category. From 4 to 5 o’clock on the dendrogram, the chunk of faded blue IDs are all tagged “Poetry.”

However, if you squint, you can see one faded blue guy kicking around within that pink “Verse Drama” chunk (#191 if you’ve got the PDF open.)

How Indigenous American words came into English

I’ve been deep in the OED documentation of borrowings and loanwords for my look at “tramlines” [see my previous post, and look out for a few more to come] and OED’s treatment of foreign, about to be naturalized, and naturalized words. I got curious about some of the Indigenous American words in my dataset, and […]

||-Tripping over tramlines-||

“Tramlines”, icydk, are those upright parallel bars that OED1 and OED2 editors used to indicate that a word was “alien or not fully naturalized”. So, for instance, zeitgeist you may recognize as a word of German origin, not infrequently heard in English. In OED1 (1928) it appeared as ||Zeitgeist, and this mark was preserved on […]

Guest Post: Cataloguing the Catalogue

Cosmin Dszurdsza is a research assistant at The Life of Words. In my last guest post I discussed problematic magazine classifications. Now, once again, a periodical publication proves to be an exciting and difficult genre identification challenge. The kind of text I will be dealing with today is the “catalogue” (filtered out of our data […]

Three conferences this summer

After a baby-related travelling hiatus of a couple three years, TLOW is hitting the road this summer, with stops at Ryerson University in Toronto (just barely down the road, really) at the end of May, for the Canadian Society for Digital Humanities meeting at CFHSS Congress; then off to Barbados and the University of the […]

One last round with metadata from Hathi and Underwood

In “Hathi’s Automatic Genre Classifier” and “Hathi Genre Again – Zero Recall”, I ran a couple of experiments comparing genre categories assigned by human taggers working on the Life of Words OED mark-up project to two sources of genre metadata associated with the HathiTrust Digital Library. The first post looked at data from the automatic […]

Poetry Competition Time

As part of our OMRI funding, LOW runs an annual poetry competition, open to all high school students in Ontario. Last year’s pilot run had a few dozen submissions, from which we picked one winner, two runners up, and twelve honorable mentions, all collected in our 2016 Anthology. Last year’s theme was “write a poem […]

Shakespeare’s Earliest Citations in the OED

No author’s representation in the OED has received more comment than Shakespeare’s: if you ever come across a mention of OED citation evidence, more than likely it’s being used to substantiate (sometimes challenge or qualify) a claim that Shakespeare invented the most English words, or made up the most new meanings for existing words, or […]

OED Subject Matter

In my last post I described using HathiTrust’s Solr Proxy API to fetch Hathi genre metadata for OED quotations. But genre is not the only metadata that Hathi sends back down the intertubes when I ask it a question. For most works, I also get a Library of Congress Classification code for the volume. This […]

Hathi Genre Again – Zero Recall

In “Hathi’s Automatic Genre Classifier” [17.01.06] I compared the consolidated automatic genre metadata for a subset of HathiTrust Digital Library texts (available here) to the genre classifications arrived at for human-inspected works as part of the OED quotation tagging project underway at The Life of Words. My process there was pretty closely supervised, but the […]