5 Data

Before any data can be analysed, it needs to be imported and converted into the proper structure. In QuaDramA, we process dramatic texts in multiple stages, described below.

5.1 Origin: TEI-XML

The base format that we use (and in which we store all our annotations) is an XML format known as TEI. This format is used by most researchers doing quantitative drama analysis. An excellent source for dramas in the proper format is DraCor, maintained by Frank Fischer.

While we are using GerDraCor as a basis, we have added linguistic annotations to a number of plays, and integrated more plays (e.g., translations) into the corpus. This corpus can be found here.

5.2 Preprocessed data

As a first step, we process all dramatic texts using our DramaNLP pipeline. The result of this processing is a set of CSV files for each play that contain the information in the play in a format suitable for analysis with R. This repository contains two plays in this format.

Different CSV files used in the analysis. ID is a placeholder for a unique identifier for the play.

File                          Description
ID.Metadata.csv               Metadata for the play (author, title, language, …)
ID.Characters.csv             Characters of the play, with some character-specific information
ID.Entities.csv               All discourse entities (including characters, but also all other coreference chains)
ID.Mentions.csv               Mentions associated with characters
ID.StageDirections.csv        The stage directions of the play
ID.UtterancesWithTokens.csv   All character utterances of the play
ID.Segments.csv               Information about acts and scenes
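
The naming scheme in the table above can be sketched in a few lines of base R. The helper csvFilesForPlay() is hypothetical and not part of DramaAnalysis; it only illustrates how the ID placeholder is combined with the file kinds listed in the table.

```r
# Hypothetical helper (not part of DramaAnalysis): build the CSV file names
# expected for one play id, following the naming scheme in the table.
csvFilesForPlay <- function(id) {
  kinds <- c("Metadata", "Characters", "Entities", "Mentions",
             "StageDirections", "UtterancesWithTokens", "Segments")
  paste0(id, ".", kinds, ".csv")
}

# File names for the play with the id rksp.0
files <- csvFilesForPlay("rksp.0")
print(files)
```

Each of these files could then be read with read.csv(), although in practice the package functions take care of loading.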

In addition to the test corpus mentioned above, we provide others as well. They are all stored in git repositories in the quadrama organization on GitHub. Repositories whose names start with data_ are corpora.

The part after the underscore (test in the above example) is considered to be the corpus prefix. Within a corpus, a play is identified by a unique id. Thus, test:rksp.0 is a full identifier containing the corpus prefix and the play id.
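
As a minimal sketch of this id scheme, the following base R snippet takes a full identifier apart. The function splitPlayId() is hypothetical and not part of DramaAnalysis; the repository name is derived only from the data_ convention described above.

```r
# Hypothetical helper (not part of DramaAnalysis): split a full identifier
# of the form "prefix:playId" into its parts.
splitPlayId <- function(fullId) {
  parts <- strsplit(fullId, ":", fixed = TRUE)[[1]]
  # Exactly one colon is expected, giving a prefix and a play id.
  stopifnot(length(parts) == 2)
  list(corpus = parts[1],
       play = parts[2],
       # Corpus repositories are named data_ followed by the prefix.
       repository = paste0("data_", parts[1]))
}

id <- splitPlayId("test:rksp.0")
str(id)
```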

5.2.1 Sample data

For demo and test purposes, the DramaAnalysis R package contains the two plays of the test corpus. For technical reasons, the plays that are included in the package do not contain umlauts or other special characters (ä, ö, ü, ß etc.), but are restricted to ASCII characters. In these versions, all non-ASCII characters have been replaced by ASCII characters.

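# The sample plays ship with the DramaAnalysis package
library(DramaAnalysis)

# Load Emilia Galotti
data(rksp.0)

# Load Miß Sara Sampson
data(rjmw.0)

# Combine both plays into a single object
text <- combine(rksp.0, rjmw.0)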

5.3 Installing corpora

Installing a corpus that is available on github.com/quadrama is straightforward and can be done by entering the command installData() into the R console.

# installation of the test corpus
installData("test")

# installation of the quadrama corpus
installData("qd")

Corpora do not necessarily have to be provided by us, however. If a compatible set of CSV files is available from another source, the function installData() allows finer control to install data from anywhere, as long as it’s a git repository and can be cloned. See ?installData for details on the options.

5.3.1 Data sources and credits

The qd corpus that you have downloaded above is the result of many hours of work by different people and projects. The files were originally assembled and converted to XML in the TextGrid project. As the XML contained a number of errors, Peer Trilcke, Frank Fischer and Dario Kampkaspar have cleaned, converted and generally enhanced the files in order to do network analysis (Trilcke, Fischer, and Kampkaspar 2015). We in QuaDramA built on top of that, and re-added several translations to the corpus. For several plays, we also added coreference annotation (not yet properly released or published), which is available as part of the test corpus. In addition, we added automatically produced linguistic annotation (parts of speech and lemmas).

5.4 Collection data

In addition to the corpora introduced above, we also support smaller groups of plays called collections. A collection is just a set of texts, and can include texts from multiple corpora. Typically, these sets have names, such as “comedies”, but it does not matter why texts are in a collection. Technically, these collections are just vectors of ids.

Pre-defined collections can be downloaded with the function installCollectionData(). This function clones a git repository (this one), which contains a number of plain text files that in turn contain drama ids. As before, users can supply other sources for the collection data. Enter ?installCollectionData in the R console to get more information about options and parameters.

installCollectionData()
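
To illustrate how a plain text file of drama ids turns into a collection vector, here is a small sketch in base R. It assumes one id per line, which is an assumption made for this example and not a documented property of the collection files; the file itself is a temporary stand-in.

```r
# Write a small stand-in collection file to a temporary location.
# Assumption: one play id per line; the real files may be organised differently.
collectionFile <- tempfile(fileext = ".txt")
writeLines(c("test:rksp.0", "test:rjmw.0", ""), collectionFile)

# Read the ids, dropping surrounding whitespace and empty lines.
ids <- trimws(readLines(collectionFile))
ids <- ids[nzchar(ids)]
print(ids)
```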

Once a collection (i.e., a vector with ids) has been defined, it can be passed as an argument to the function loadDrama(). The returned QDDrama object contains all loadable plays. Many functions work similarly for single texts and text collections, but some do not. The descriptions below contain information about this.

5.4.1 Defining collections

Before processing, it’s necessary to define a collection of texts by assembling their ids in a vector. Collections are considered to be sets of plays without internal structure (e.g., no play is marked as prototypical).
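
Because a collection is only a set of ids, it can be defined and manipulated with ordinary vector operations. The sketch below uses the two ids of the test corpus; the variable names are made up for illustration, and the loadDrama() call is left as a comment because it requires the package and the test corpus to be installed.

```r
# Two hypothetical collections, each a character vector of full play ids.
collectionA <- c("test:rksp.0", "test:rjmw.0")
collectionB <- c("test:rksp.0")

# Collections have no internal structure, so set operations are enough
# to combine or compare them.
both <- union(collectionA, collectionB)
shared <- intersect(collectionA, collectionB)

# With DramaAnalysis and the test corpus installed, the vector could then
# be passed on for loading:
# dramas <- DramaAnalysis::loadDrama(both)
```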