Guest Post: Don’t go breaking (up) my genre: visualizing genre against attributes

Danielle Griffin is a research assistant on her third co-op term at The Life of Words. This is the first of a few posts based on her last work-term report, “Comparative Data Visualizations of Textual Features in the OED and the Life of Words Genre 3.0 Tagging System”. Danielle’s report won the Quarry Integrated Communication Co-op English Award.

During my last work placement as a full-time research assistant at The Life of Words last fall, I began toying with ways to situate our genre categorization system (known to us as G3.0) within genre theory more generally. Drawing on David Lee’s work (see his extensive analysis of genre and register characteristics of the BNC) and Ted Underwood’s work with HathiTrust’s corpus, I took a bottom-up approach to genre, describing and characterizing texts rather than categorizing them.

To do this, I identified a set of 22 evenly weighted, (mostly) non-exclusive features, or attributes, that might apply to any text. Each text in my data set, derived from a list of OED sources, was then marked according to these criteria, so that each text has its own 22-feature combination (represented as True or False for each). From there I built some data visualizations, to have a look at the relationships between us human taggers and G3.0 tagging conventions, among the G3.0 genre categories themselves, and between the OED and textual genre generally.
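To make the marking scheme concrete, here is a minimal sketch of how such a data set could be stored and compared. Everything specific in it is an assumption for illustration: the text IDs, the True/False values, and the choice of Hamming distance (a count of the features on which two texts disagree) are placeholders, not the actual data or measure used in the report. The 22 features are left unnamed and are simply numbered 0 to 21.

```python
from typing import Dict, Set, Tuple

N_FEATURES = 22  # feature count from the post; the individual attributes are not named here

def make_text(true_positions: Set[int]) -> Tuple[bool, ...]:
    """Build a 22-value True/False profile from the positions that are True."""
    return tuple(i in true_positions for i in range(N_FEATURES))

# Hypothetical text IDs with made-up feature values (illustration only).
texts: Dict[int, Tuple[bool, ...]] = {
    101: make_text({0, 3, 7}),
    102: make_text({0, 3, 8}),
    103: make_text({1, 12, 20, 21}),
}

def hamming(a: Tuple[bool, ...], b: Tuple[bool, ...]) -> int:
    """Count the features on which two texts disagree.

    Every feature counts once, which matches the "evenly weighted" idea.
    """
    return sum(x != y for x, y in zip(a, b))

print(hamming(texts[101], texts[102]))  # texts that share most features
print(hamming(texts[101], texts[103]))  # texts that share none of their True features
```

Because the features are evenly weighted, a plain count of disagreements is enough here; a weighted scheme would need a per-feature multiplier instead.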

Here’s what looks a bit like a family tree, twisted around into a circle. It’s a branched diagram representing the implied “family” relationship among items, based upon their similarities and differences. Really, this is just a made-up hierarchical dendrogram, rotated on the point of the first node for easy reading. (It’s unidirectional, running clockwise, meaning the end of the tree, at 3 o’clock, is actually the farthest point from the beginning, just below it.) [You can click on the image to make it bigger, or download a PDF.]
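For readers curious how a tree like this can be grown from True/False profiles, the following is a rough sketch of one common approach, average-linkage agglomerative clustering. It is not necessarily how the dendrogram above was produced: the linkage rule, the distance measure, the four-feature profiles (shortened from 22 for readability) and the text IDs are all assumptions made for the example.

```python
from itertools import combinations

def hamming(a, b):
    """Count the features on which two True/False profiles disagree."""
    return sum(x != y for x, y in zip(a, b))

def leaves(tree):
    """Flatten a nested-tuple tree into its leaf IDs, left to right."""
    if isinstance(tree, tuple):
        return leaves(tree[0]) + leaves(tree[1])
    return [tree]

def cluster(vectors):
    """Repeatedly merge the two closest clusters until one tree remains."""
    trees = list(vectors)  # start with every text ID as its own cluster

    def avg_distance(pair):
        # Average distance between every leaf on the left and every leaf on the right.
        left, right = pair
        dists = [hamming(vectors[i], vectors[j])
                 for i in leaves(left) for j in leaves(right)]
        return sum(dists) / len(dists)

    while len(trees) > 1:
        a, b = min(combinations(trees, 2), key=avg_distance)
        trees = [t for t in trees if t != a and t != b] + [(a, b)]
    return trees[0]

# Hypothetical texts with made-up, shortened feature profiles.
vectors = {
    1: (True, True, False, False),
    2: (True, True, False, True),
    3: (False, False, True, True),
    4: (False, False, True, False),
}

tree = cluster(vectors)
print(tree)          # nested pairs: the most similar texts are merged first
print(leaves(tree))  # leaf order, i.e. the order of IDs around the circle
```

The leaf order returned by `leaves` is what ends up around the circumference of a circular dendrogram, which is why texts with similar profiles sit next to each other in chunks.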

For now, it’s best to ignore the colours of the branches [see my next post], and focus on the individual endpoints of the dendrogram. Each is one text in the data set, represented by a unique ID number around the circumference. Those IDs are colour coded, with each colour corresponding to the G3.0 tag we assigned to that text in the course of tagging OED quotations. For example, the pink chunk of IDs at about 3 o’clock on the dendrogram has been tagged entirely in the G3.0 “Verse Drama” category. From 4 to 5 o’clock on the dendrogram, the chunk of faded blue IDs are all tagged “Poetry.”

However, if you squint, you can see one faded blue guy kicking around within that pink “Verse Drama” chunk (#191 if you’ve got the PDF open).

