Lecture
Lecture as part of the research focus area "Text & Edition - Editorik"
Mark Faulkner (Dublin)
Towards Medieval Big Data: corpora, metadata and methodologies for early English
When: 24 October 2024, 17:00 c.t.
Where: Hörsaal 34 (Lecture Hall 34)
Zoom:
https://univienna.zoom.us/j/61043846425?pwd=R3B0TVZDb1NBa24xc3E5NENwNHlwQT09
Following the lecture and discussion (which will again be documented on our blog), we invite you to join us for wine and sandwiches.
Abstract
Philology laid the groundwork for our understanding of the grammar of Old and Middle English, and thereby how these varieties are understood today. The laborious collection of examples by generations of scholars, often nineteenth-century German PhD students, from individual (often long) texts, provided the fundamental evidence base on which the generalisations of grammars and handbooks still rest.
Yet as Jenset and McGillivray (2017: 8-10) have argued, philology is fundamentally an 'example-based' approach, in which argumentation typically rests not on a full or even representative array of the relevant forms a text or corpus offers, but on a small subset of those forms, collected by an expert slowly reading through a text. Taking its cue from the observation of Crane, Bamman and Jones (2013: 53) that 'all philological inquiry, whether classical or otherwise, is now a special case of corpus linguistics', this paper outlines innovations in corpora, metadata and methodologies that allow us to enhance the small datasets typical of traditional philology, assessing, say, Old English h- not through four spellings (as Jordan 1974: §195 does), but through 380,000.
The paper begins by outlining methodological developments over the last five years that now permit us to speak of a "corpus philology". These techniques allow for assessing the language of a single text against the Dictionary of Old English Corpus on the basis of a predetermined set of unusual spellings; assessing the language of a whole manuscript against the same corpus on the basis of predictable spelling variation; assessing the language of a whole century using tagged corpora; and assessing the chronology of specific variants in the Dictionary of Old English Corpus to help better date undated texts. It then turns to current attempts to develop semi-automated techniques for extracting all spellings of particular phonological segments from the entire Dictionary of Old English Corpus.
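As a rough illustration of the first technique, comparing a single text with a reference corpus on a predetermined set of spellings, the sketch below computes, for each variant pair, the share of the innovative spelling. It is a minimal sketch under stated assumptions, not the speaker's method: the variant pairs, the function name `innovative_share`, and both token lists are invented for illustration and are not drawn from the Dictionary of Old English Corpus.

```python
from collections import Counter

# Hypothetical variant pairs: (conservative spelling, innovative spelling).
# Chosen for illustration only; a real study would use a curated list.
VARIANT_PAIRS = [("hlaford", "laford"), ("hring", "ring")]


def innovative_share(tokens, pairs):
    """For each pair, return the innovative spelling's share of all
    occurrences of either form in the token list."""
    counts = Counter(tokens)
    shares = {}
    for old, new in pairs:
        total = counts[old] + counts[new]
        if total:  # skip pairs that never occur in this token list
            shares[(old, new)] = counts[new] / total
    return shares


# Invented toy token lists standing in for one text and a reference corpus.
text_tokens = "se laford and se hlaford geaf laford ring".split()
corpus_tokens = "hlaford hlaford hlaford laford hring hring hring ring".split()

text_shares = innovative_share(text_tokens, VARIANT_PAIRS)
corpus_shares = innovative_share(corpus_tokens, VARIANT_PAIRS)

for pair in VARIANT_PAIRS:
    print(pair, "text:", text_shares.get(pair), "corpus:", corpus_shares.get(pair))
```

A text whose shares sit well above those of the reference corpus would be a candidate for closer philological inspection; with real data, the comparison would also need to account for sample size.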
Further refinement of this "corpus philology", the paper argues, also requires developments in metadata and corpora. The paper describes Searobend: Linked Metadata for English Language Texts 1000-1300, which aims to provide a new approach to, and model for, metadata about works, texts, manuscripts and scribes. It then shows how machine-learning-based Handwritten Text Recognition (HTR) can facilitate the speedy construction of new corpora that, by allowing texts to be transcribed with far more granularity than has been traditional when editing medieval works, will allow a raft of currently unanswerable questions to be addressed.
The paper finishes by returning to methodologies, describing attempts to use hierarchical cluster analysis of linguistic profiles of different Old English texts to identify texts whose language is alike, suggesting the clusters that emerge may have value as an exploratory tool and also in the dating of undated texts. It closes with some reflections on size, and how bigger data might challenge some of the core assumptions of traditional philology and the way we see Old and Middle English.
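To make the idea of hierarchical cluster analysis concrete, here is a minimal sketch of agglomerative clustering with average linkage over toy linguistic profiles. Everything in it is an assumption for illustration: the text names, the three-feature profiles, the Euclidean distance and the linkage criterion are invented, and the abstract does not say which distance measure or linkage the speaker uses.

```python
from itertools import combinations
from math import sqrt

# Invented toy profiles: relative frequency of three hypothetical spelling
# features per text. These numbers are illustrative, not corpus data.
PROFILES = {
    "text_A": [0.90, 0.10, 0.20],
    "text_B": [0.85, 0.15, 0.25],
    "text_C": [0.10, 0.80, 0.70],
    "text_D": [0.20, 0.75, 0.65],
    "text_E": [0.50, 0.50, 0.10],
}


def euclidean(u, v):
    """Straight-line distance between two equal-length profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    dists = [euclidean(profiles[a], profiles[b]) for a in c1 for b in c2]
    return sum(dists) / len(dists)


def agglomerate(profiles, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain.

    Returns the final clusters and the merge history in order."""
    clusters = [(name,) for name in sorted(profiles)]
    merges = []
    while len(clusters) > n_clusters:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], profiles),
        )
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)]
        clusters.append(tuple(sorted(c1 + c2)))
    return clusters, merges


clusters, merges = agglomerate(PROFILES, n_clusters=2)
for step, (left, right) in enumerate(merges, start=1):
    print(f"merge {step}: {left} + {right}")
print("final clusters:", clusters)
```

The merge history plays the role of a dendrogram read from the bottom up: texts that join early have the most similar profiles, which is what would make such clusters useful as an exploratory aid when considering undated texts.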
Biography
Mark Faulkner is Ussher Associate Professor of Medieval Literature at Trinity College Dublin. He is the author of A New Literary History of the Long Twelfth Century: Language and Literature between Old and Middle English (Cambridge University Press, 2022), and the forthcoming Critical Anthology of Twelfth-Century English: Writing the Vernacular in the Transitional Period (Arc Humanities Press, 2026), and one of the editors of the three-volume, one-million-word History of Punctuation in English Literature (Cambridge University Press, 2025), dedicated to his former supervisor, M. B. Parkes. Another strand of his work, developed in a long series of articles over the last eight years, centres on bringing new quantitative precision to our understanding of the medieval textual record. He collaborates frequently with computer scientists, including linked data specialists, NLP experts, machine learning experts and computational statisticians.
His funded projects include Searobend: Linked Metadata for English Language Texts, 1000-1300 (www.searobend.ie); Wandering Books, a collaboration with historians and a geneticist to develop new techniques for localising and dating early medieval manuscripts; and Ansund: Using Machine Learning to Develop a New, Exhaustive, Open Access Corpus of Old English. He is also a member of the new Erasmus+ network, Antidote, based in Reykjavik, which offers training in advanced techniques for digital editing. At Trinity, he established and directed first the M. Phil in Medieval Studies, then the Trinity Centre for the Book, which he currently directs. He was elected a Fellow of Trinity College Dublin in 2023.