5 Data
Before any data can be analysed, it needs to be imported and converted into the proper structure. In QuaDramA, we process dramatic texts in multiple stages, described below.
5.1 Origin: TEI-XML
The base format that we use (and in which we put all our annotations) is an XML format known as TEI. This format is used by most researchers doing quantitative drama analysis. An excellent source for dramas in the proper format is DraCor, maintained by Frank Fischer.
While we are using GerDraCor as a basis, we have added linguistic annotations to a number of plays, and integrated more plays (e.g., translations) into the corpus. This corpus can be found here.
5.2 Preprocessed data
As a first step, we process all dramatic texts using our DramaNLP pipeline. The result of this processing is a set of CSV files for each play that contain the information in the play in a format suitable for analysis with R. This repository contains two plays in this format.
File | Description |
---|---|
ID.Metadata.csv | Metadata for the play (author, title, language, …) |
ID.Characters.csv | Characters of the play, with some character-specific information |
ID.Entities.csv | All discourse entities (including characters, but also all other coreference chains) |
ID.Mentions.csv | Mentions associated with characters |
ID.StageDirections.csv | The stage directions of the play |
ID.UtterancesWithTokens.csv | All character utterances of the play |
ID.Segments.csv | Information about acts and scenes |
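As a rough illustration of this layout, the file names belonging to one play can be derived from its id. The sketch below uses only base R. The helper `csv_files_for()` and its `dir` argument are hypothetical and not part of DramaAnalysis. It also assumes that `ID` in the table stands for the play id, such as `rksp.0`.

```r
# Hypothetical helper (not part of DramaAnalysis): build the names of the
# CSV files listed in the table above for a single play id.
csv_files_for <- function(id, dir = ".") {
  parts <- c("Metadata", "Characters", "Entities", "Mentions",
             "StageDirections", "UtterancesWithTokens", "Segments")
  file.path(dir, paste0(id, ".", parts, ".csv"))
}

# Assumption: "ID" in the table is the play id, e.g. rksp.0
files <- csv_files_for("rksp.0")
print(basename(files))
```

In practice the package functions take care of locating and reading these files, so a helper like this is only useful for inspecting a data directory by hand.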
In addition to the test corpus mentioned above, we provide others as well. They are all stored in git repositories in the quadrama organization on GitHub. Repositories whose names start with data_ are corpora.
The part after the underscore (test in the above example) is considered to be the corpus prefix. Within a corpus, a play is identified by a unique id. Thus, test:rksp.0 is a full identifier containing the corpus prefix and play id.
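The naming scheme can be sketched in a few lines of base R. The helper `split_id()` is hypothetical and only illustrates how a full id breaks down into corpus prefix, play id, and the repository name implied by the data_ convention described above.

```r
# Hypothetical helper: split a full id of the form "prefix:playid".
split_id <- function(full_id) {
  parts <- strsplit(full_id, ":", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 2)
  list(
    corpus     = parts[1],                 # corpus prefix
    play       = parts[2],                 # play id, unique within the corpus
    repository = paste0("data_", parts[1]) # repository name by convention
  )
}

id <- split_id("test:rksp.0")
id$corpus      # "test"
id$play        # "rksp.0"
id$repository  # "data_test"
```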
5.2.1 Sample data
For demo and test purposes, the DramaAnalysis R package contains the two plays of the test corpus. For technical reasons, the plays that are included in the package are restricted to ASCII characters: umlauts and other non-ASCII characters (ä, ö, ü, ß etc.) have been replaced by ASCII characters.
# Load Emilia Galotti
data(rksp.0)
# Load Miß Sara Sampson
data(rjmw.0)
# Combine both plays into a single object
text <- combine(rksp.0, rjmw.0)
5.3 Installing corpora
Installing a corpus that is available on github.com/quadrama is straightforward and can be done by entering the command installData() into the R console.
# installation of the test corpus
installData("test")
# installation of the quadrama corpus
installData("qd")
Corpora do not necessarily have to be provided by us, however. If a compatible set of CSV files is available from another source, the function installData() allows finer control to install data from anywhere, as long as it is a git repository that can be cloned. See ?installData for details on the options.
5.3.1 Data sources and credits
The qd corpus that you have downloaded above is the result of many hours of work by different people and projects. Originally, the files were assembled and converted to XML in the TextGrid project. As the XML contained a number of issues, Peer Trilcke, Frank Fischer and Dario Kampkaspar cleaned, converted and generally enhanced the files in order to do network analysis (Trilcke, Fischer, and Kampkaspar 2015). We in QuaDramA built on top of that and re-added several translations to the corpus. For several plays, we also added coreference annotation (not yet released/published properly), which is available as part of the test corpus. In addition, we added automatically produced linguistic annotation (parts of speech and lemmas).
5.4 Collection data
In addition to the corpora introduced above, we also support smaller groups of plays called collections. A collection is just a set of texts, and can include texts from multiple corpora. Typically, these sets have names, such as “comedies”, but the reason why texts are in a collection does not matter. Technically, these collections are just vectors of ids.
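Because a collection is nothing more than a vector of ids, it can be built and inspected with base R alone. The sketch below is illustrative only: the variable names are made up, and it assumes that the ids in a collection carry their corpus prefix so that texts from several corpora can be told apart.

```r
# A collection as a plain character vector of full ids.
# The selection of plays is illustrative only.
my_collection <- c("test:rksp.0", "test:rjmw.0")

# Corpus prefix of each member (the part before the colon)
prefixes <- sub(":.*$", "", my_collection)

# Play ids grouped by the corpus they come from
by_corpus <- split(sub("^[^:]*:", "", my_collection), prefixes)
print(by_corpus)
```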
Pre-defined collections can be downloaded with the function installCollectionData(). This function clones a git repository (this one), which contains a number of plain text files that in turn contain drama ids. As before, users can supply other sources for the collection data. Enter ?installCollectionData in the R console to get more information about options and parameters.
installCollectionData()
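To make the idea of "plain text files that contain drama ids" concrete, the following base R sketch reads such a file into a vector. It does not reproduce the package's own loading code. The file is a temporary stand-in written by the example itself, and the layout of one id per line is an assumption rather than something stated above.

```r
# Stand-in for a collection file; one id per line is an assumption.
collection_file <- tempfile(fileext = ".txt")
writeLines(c("test:rksp.0", "test:rjmw.0", ""), collection_file)

# Read the ids, trim whitespace and drop empty lines
ids <- trimws(readLines(collection_file))
ids <- ids[nzchar(ids)]
print(ids)

# With the corpus data installed, such a vector could then be passed on:
# plays <- DramaAnalysis::loadDrama(ids)
```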
Once a collection (i.e., a vector with ids) has been defined, it can be passed as an argument to the function loadDrama(). The returned QDDrama object contains all loadable plays. Many functions work similarly for single texts and text collections, but some do not. The descriptions below contain information about this.