11 Word Field Analysis

What we call a word field here may differ from how the term is used in linguistics. In this context, a word field is a list of words that belong to a common theme, topic, or semantic group. Multiple word fields can be assembled into a dictionary. Technically, a semantic relation between the words is not required, so the following pieces of code work with arbitrary word lists.

For demo purposes (this is really a toy example), we will define the word field of Love as containing the words Liebe (love) and Herz (heart). In R, we can put them in a character vector, and we use lemmas:

wf_love <- c("liebe", "herz")

We will test this word field on Emilia Galotti, which should be about love.

data("rksp.0")

11.1 Single word field

The core of the word field analysis is collecting statistics about a dictionary. To this end, we use the function dictionaryStatisticsSingle() (single, because we only want to analyse a single word field):

dictionaryStatisticsSingle(
  rksp.0,              # the text we want to process
  wordfield=wf_love    # the word field
)
##    corpus  drama        character  x
## 1    test rksp.0           angelo  0
## 2    test rksp.0          appiani  0
## 3    test rksp.0         battista  0
## 4    test rksp.0     camillo_rota  0
## 5    test rksp.0  claudia_galotti  2
## 6    test rksp.0            conti  2
## 7    test rksp.0 der_kammerdiener  0
## 8    test rksp.0        der_prinz 10
## 9    test rksp.0           emilia  5
## 10   test rksp.0        marinelli  8
## 11   test rksp.0          odoardo  7
## 12   test rksp.0           orsina  5
## 13   test rksp.0            pirro  0

This table shows the number of times each character in the play uses words that appear in this list. By default, these are absolute numbers.
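
To make the counting step concrete, the following base R sketch reproduces the idea on a toy token table. The table, its column names (character, lemma) and the word list are invented for illustration; this is not how the package stores or processes plays internally.

```r
# Toy data (invented): one row per spoken token, with speaker and lemma
tokens <- data.frame(
  character = c("emilia", "emilia", "emilia", "odoardo", "odoardo", "pirro"),
  lemma     = c("liebe", "vater", "herz", "herz", "tochter", "herr")
)
wf_toy <- c("liebe", "herz")

# Mark every token whose lemma is in the word field ...
tokens$hit <- tokens$lemma %in% wf_toy

# ... and sum the hits per character
counts <- tapply(tokens$hit, tokens$character, sum)
counts
```

Characters that never use a word from the list (pirro in this toy table) still show up with a count of zero, just as in the table above.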

We can visualise these counts in a simple bar chart:

# retrieve counts and replace character ids by names
dstat <- dictionaryStatisticsSingle(rksp.0, wordfield = wf_love) %>% 
  characterNames(rksp.0)

# remove characters not using these words at all
dstat <- dstat[dstat$x>0,]

# create a bar plot
barplot(dstat$x,                      # what to plot
        names.arg = dstat$character,  # x axis labels
        las=2                         # turn axis labels
)

In absolute terms, the prince (10) and Marinelli (8) use these words most often, closely followed by Odoardo (7).

When comparing characters, it often (but not always) makes sense to normalize the counts by the total number of words a character speaks. This can be enabled by setting the argument normalizeByCharacter=TRUE, which divides the number of words in this field by the total number of words spoken by the character.

dstat <- dictionaryStatisticsSingle(rksp.0, wordfield=wf_love,
  normalizeByCharacter = TRUE   # apply normalization
) %>% characterNames(rksp.0)         # reformat character names

# remove figures not using these words at all
dstat <- dstat[dstat$x>0,]

barplot(dstat$x, 
        names.arg = dstat$character, 
        las=2            
)
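
The effect of this normalization can be checked by hand. The sketch below takes the two absolute counts for der_prinz and emilia from the table above, but the total word counts are invented numbers, chosen only to show how a ranking can flip once counts are divided by the amount of speech.

```r
# Absolute word field counts (taken from the table above)
hits <- c(der_prinz = 10, emilia = 5)

# Total number of spoken words per character (INVENTED for illustration)
total <- c(der_prinz = 5000, emilia = 1000)

# Relative frequency: share of a character's words that belong to the field
rel <- hits / total
rel

names(which.max(hits))  # character with the highest absolute count
names(which.max(rel))   # character with the highest relative count
```

With these made-up totals, the character with fewer absolute hits ends up with the higher relative frequency, which is exactly the situation in which normalization matters.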

11.2 Multiple Word Fields

The function dictionaryStatistics() can be used to analyse multiple dictionaries at once. To this end, dictionaries are represented as lists of character vectors. The names of the outer list are the keywords; each vector contains the words associated with its keyword.

New dictionaries can be easily created like this:

wf <- list(Liebe=c("liebe","herz","schmerz"), Hass=c("hass","hassen"))

This dictionary contains the two entries Liebe (love) and Hass (hate), with 3 and 2 words, respectively. Dictionaries can be created in code, as shown above. In addition, the function loadFields() can be used to download dictionary content from a URL or to read it from a directory. By default, the function loads the dictionary that we used in publications from GitHub, for the keywords Liebe and Familie (family). Since version 2.3.0, this dictionary is included in the package as base_dictionary and can be used right away (without an internet connection). It is also the default dictionary used by dictionaryStatistics().
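
Since a dictionary is an ordinary named list, it can be inspected and extended with base R functions. The following lines are a small sketch; the added word "kuss" is made up for the example.

```r
wf <- list(Liebe = c("liebe", "herz", "schmerz"), Hass = c("hass", "hassen"))

names(wf)      # the keywords
lengths(wf)    # number of words per keyword
wf[["Hass"]]   # the words of a single field

# add a (made-up) word to an existing field
wf$Liebe <- c(wf$Liebe, "kuss")
lengths(wf)
```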

The function loadFields() offers parameters to load from different URLs via http or to load from plain text files that are stored locally. The latter can be achieved by specifying the directory as baseurl. Entries for each keyword should then be stored in a file that is named like the keyword and ends in txt (the default suffix, which can be overridden). See ?loadFields for details. Some of the options can also be specified through dictionaryStatistics(), as exemplified below.
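
As a sketch of the local file layout described above, the following lines write one text file per keyword into a temporary directory. That each file holds one word per line is an assumption made for this example (check ?loadFields for the expected format); the directory name fields is arbitrary, and loadFields() itself is not called here.

```r
wf <- list(Liebe = c("liebe", "herz", "schmerz"), Hass = c("hass", "hassen"))

# a scratch directory for the dictionary files (name chosen arbitrarily)
dict_dir <- file.path(tempdir(), "fields")
dir.create(dict_dir, showWarnings = FALSE)

# one file per keyword: <keyword>.txt, one word per line (assumed format)
for (keyword in names(wf)) {
  writeLines(wf[[keyword]], file.path(dict_dir, paste0(keyword, ".txt")))
}

list.files(dict_dir)
readLines(file.path(dict_dir, "Hass.txt"))
```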

The following examples use the base_dictionary, i.e., a specific version of the fields we have been using in QuaDramA.

dstat <- dictionaryStatistics(
  rksp.0,  # the text
  fieldnames = # fields we want to measure (from base_dictionary)
    c("Liebe", "Familie", "Ratio", "Krieg", "Religion"),
  normalizeByCharacter = TRUE,   # normalization by figure
  normalizeByField = TRUE     # normalization by field
)
dstat
##    corpus  drama        character        Liebe      Familie        Ratio
## 1    test rksp.0           angelo 4.707211e-05 6.061341e-05 6.828362e-05
## 2    test rksp.0          appiani 8.450546e-05 1.450870e-04 6.537871e-05
## 3    test rksp.0         battista 0.000000e+00 1.349619e-04 0.000000e+00
## 4    test rksp.0     camillo_rota 1.003613e-04 1.292324e-04 4.367575e-04
## 5    test rksp.0  claudia_galotti 7.467219e-05 2.307678e-04 3.899548e-05
## 6    test rksp.0            conti 1.531692e-04 5.379043e-05 3.635835e-05
## 7    test rksp.0 der_kammerdiener 0.000000e+00 0.000000e+00 0.000000e+00
## 8    test rksp.0        der_prinz 6.896790e-05 6.660598e-05 5.002301e-05
## 9    test rksp.0           emilia 9.004061e-05 3.304367e-04 4.702121e-05
## 10   test rksp.0        marinelli 7.894143e-05 1.524759e-04 4.580552e-05
## 11   test rksp.0          odoardo 5.635355e-05 2.257573e-04 6.267303e-05
## 12   test rksp.0           orsina 5.028230e-05 1.109950e-04 5.314227e-05
## 13   test rksp.0            pirro 0.000000e+00 1.871398e-04 5.059705e-05
##           Krieg     Religion
## 1  5.267594e-05 0.000000e+00
## 2  4.728281e-05 0.000000e+00
## 3  0.000000e+00 0.000000e+00
## 4  8.423181e-05 0.000000e+00
## 5  4.595895e-05 8.209574e-05
## 6  2.337322e-05 0.000000e+00
## 7  0.000000e+00 0.000000e+00
## 8  2.894188e-05 3.791218e-05
## 9  3.022792e-05 9.651721e-05
## 10 3.943715e-05 1.859773e-05
## 11 4.992433e-05 5.679295e-05
## 12 3.014373e-05 1.184596e-05
## 13 2.439500e-05 9.586809e-05

The variable dstat now contains multiple columns, one for each word field. We have been using the option normalizeByCharacter before. When comparing multiple fields, it often happens that they differ in size (i.e., in the number of words they contain). In this case, it makes sense to also normalize by the number of words in the word field. This is achieved by normalizeByField=TRUE. This makes the numbers very small, but they are meant to be compared with each other rather than read as absolute values.
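
The arithmetic behind the two normalizations can be sketched with a few made-up numbers. All values below are invented, and the order of the two divisions is an assumption for the sake of the example (mathematically it does not matter).

```r
# INVENTED numbers for a single character
counts      <- c(Liebe = 12, Familie = 30)   # hits per word field
field_size  <- c(Liebe = 40, Familie = 150)  # number of words in each field
total_words <- 2000                          # words spoken by the character

# normalization by character: share of the character's speech
by_character <- counts / total_words
by_character

# additional normalization by field: divide by the size of the field
by_both <- by_character / field_size
by_both
```

In this toy case, Familie is ahead after the first step (0.015 vs. 0.006), but once the larger field size is taken into account, Liebe comes out on top (0.00015 vs. 0.0001).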

11.2.1 Bar plot by character

It is now straightforward to show the distribution of fields for a single character:

mat <- as.matrix(dstat)

barplot(mat[9,],     # we select Emilia's line
        main="Emilia's speech",
        names.arg=colnames(mat)
)

11.2.2 Bar plot by field

Conversely, we can also show who uses words of a certain field how often:

dstat <- dictionaryStatistics(
  rksp.0,  # the text
  fieldnames = # fields we want to measure (from base_dictionary)
    c("Liebe", "Familie", "Ratio", "Krieg", "Religion"),
  normalizeByCharacter = TRUE,   # normalization by figure
  normalizeByField = TRUE     # normalization by field
) %>% 
  characterNames(rksp.0)
  
mat <- as.matrix(dstat)

par(mar=c(9,4,1,1))

barplot(mat[,1],                  # select the column for 'Liebe' (love)
        main="Use of love words", # title for plot
        beside = TRUE,            # not stacked
        names.arg = dstat$character,   # x axis labels
        las=2                     # rotation for labels
)

11.3 Dictionary Based Distance

Technically, the output of dictionaryStatistics() is a data.frame. This is suitable for most uses. In some cases, however, it is more convenient to work with a matrix that only contains the numbers (e.g., the number of family words), for instance when calculating character distances based on dictionaries. For these cases, the package provides an S3 method for as.matrix() that extracts the numeric part of the data.frame and creates a matrix. We have already used this function above.

The matrix doesn’t have row names by default. The snippet below can be used to add row names.

ds <- dictionaryStatistics(rksp.0, 
                            fieldnames=c("Liebe", "Familie", "Ratio", "Krieg", "Religion"),
                            normalizeByCharacter=TRUE)
m <- as.matrix(ds)
rownames(m) <- ds$character
m
##                        Liebe     Familie       Ratio       Krieg     Religion
## angelo           0.004424779 0.004424779 0.007374631 0.005899705 0.0000000000
## appiani          0.007943513 0.010591350 0.007060900 0.005295675 0.0000000000
## battista         0.000000000 0.009852217 0.000000000 0.000000000 0.0000000000
## camillo_rota     0.009433962 0.009433962 0.047169811 0.009433962 0.0000000000
## claudia_galotti  0.007019186 0.016846046 0.004211511 0.005147403 0.0046794572
## conti            0.014397906 0.003926702 0.003926702 0.002617801 0.0000000000
## der_kammerdiener 0.000000000 0.000000000 0.000000000 0.000000000 0.0000000000
## der_prinz        0.006482982 0.004862237 0.005402485 0.003241491 0.0021609941
## emilia           0.008463817 0.024121879 0.005078290 0.003385527 0.0055014812
## marinelli        0.007420495 0.011130742 0.004946996 0.004416961 0.0010600707
## odoardo          0.005297234 0.016480283 0.006768687 0.005591524 0.0032371984
## orsina           0.004726536 0.008102633 0.005739365 0.003376097 0.0006752194
## pirro            0.000000000 0.013661202 0.005464481 0.002732240 0.0054644809

Every character is now represented by five numbers, which can be interpreted as a vector in five-dimensional space. This means we can easily apply the distance metrics supplied by the function dist() (from the stats package, which is loaded by default). By default, dist() calculates Euclidean distance.

cdist <- dist(m)
# output not shown
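
To see what dist() returns, here is a minimal sketch on a toy matrix with three invented characters (a, b, c) and two fields. The numbers are chosen so that the distances are easy to verify by hand.

```r
# Toy matrix (invented values): rows are characters, columns are fields
m_toy <- rbind(a = c(0, 0),
               b = c(3, 4),
               c = c(6, 8))
colnames(m_toy) <- c("Liebe", "Familie")

# Euclidean distance (the default): sqrt of the sum of squared differences
d <- dist(m_toy)
as.matrix(d)   # full symmetric matrix with zeros on the diagonal

# other metrics can be selected via the method argument
d_manhattan <- dist(m_toy, method = "manhattan")
as.matrix(d_manhattan)
```

The Euclidean distance between a and b is sqrt(3^2 + 4^2) = 5, while the Manhattan distance is 3 + 4 = 7.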

The resulting data structure is similar to the one in the weighted configuration matrix, which means everything from Section 5.2.2 can be applied here. In particular, we can convert these distances into a network:

library(igraph)   # fails with an error if igraph is not installed
g <- graph_from_adjacency_matrix(as.matrix(cdist), 
                                 weighted=TRUE,     # weighted graph
                                 mode="undirected", # no direction
                                 diag=FALSE         # no looping edges
                                )

This network can of course be visualised again.

# Now we plot
plot.igraph(g, 
            layout=layout_with_fr,       # how to lay out the graph
            main="Dictionary distance network",  # title
            vertex.label.cex=0.6,         # label size
            vertex.label.color="black",   # font color
            vertex.color=qd.colors[5],    # vertex color
            vertex.frame.color=NA,        # no vertex border
            edge.width=E(g)$weight*100    # scale edges according to their weight 
                                          # (since the distances are so small, we multiply)
            )  

Although this network is similar to the one shown in Section 5.2.2 (both are undirected and weighted), it displays a totally different kind of information. The networks based on copresence connect characters that appear together on stage, while this network relates characters by the thematic profile of their speech (within the limits of being able to detect thematic profiles via word fields). Note that the edge weights are distances here: the smaller the weight, the more similar the two profiles.

11.4 Development over the course of the play

Finally, the function dictionaryStatistics() can be used to track word fields across segments of the play. To do that, we use the parameter segment and set it to either “Act” or “Scene”.

11.4.1 Individual characters

We can now plot the theme progress over the course of the play. This can be done for specific characters, as shown below.

dsl <- dictionaryStatistics(rksp.0, 
                            fieldnames=c("Liebe", "Familie", "Ratio", "Krieg", "Religion"),
                            normalizeByCharacter=TRUE,
                            segment="Act")

mat <- as.matrix(dsl[dsl$character=="marinelli",])

matplot(mat, type="l", col="black")
legend(x="topleft", legend=colnames(mat), lty=1:ncol(mat))

Depending on the use case, it might be easier to interpret if the numbers are cumulatively added up. The snippet below shows how this works.

dsl <- dictionaryStatistics(rksp.0, 
                            fieldnames=c("Liebe", "Familie", "Ratio", "Krieg", "Religion"),
                            normalizeByCharacter=TRUE,
                            segment="Act") 

mat <- as.matrix(dsl[dsl$character=="marinelli",])

mat <- apply(mat,2,cumsum)

matplot(mat, type="l", col="black")

# Add act lines
abline(v=1:nrow(mat))

legend(x="topleft", legend=colnames(mat), lty=1:5)
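
The cumulative step relies on apply() with cumsum() over the columns. A small sketch with an invented three-act matrix shows what happens to the numbers:

```r
# Toy matrix (invented): one row per act, one column per word field
per_act <- matrix(c(1, 0, 2,    # Liebe in acts 1-3
                    0, 3, 1),   # Familie in acts 1-3
                  ncol = 2,
                  dimnames = list(NULL, c("Liebe", "Familie")))

# cumulative sum down each column (MARGIN = 2)
cum <- apply(per_act, 2, cumsum)
cum
```

Each row now holds the running total up to and including that act, so the last row equals colSums(per_act).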

11.4.2 Comparing characters

Analogously to the setting above, we now want to compare the development of several characters for a single word field. This unfortunately requires some reshuffling of the data, using the function reshape() (from the stats package).

dsl <- dictionaryStatistics(rksp.0, 
                            fieldnames=c("Liebe"),
                            normalizeByCharacter=TRUE,
                            segment="Act") %>% 
  filterCharacters(rksp.0,
         n = 6) %>%
  characterNames(rksp.0)

dsl <- reshape(dsl, direction = "wide",  # the table becomes wider
               timevar = "Number.Act",   # the column that specifies multiple readings
               v.names = "Liebe",        # the column with the values to distribute
               idvar=c("corpus","drama","character")   # what identifies a character
)

mat <- as.matrix.data.frame(dsl[,4:ncol(dsl)])
rownames(mat) <- dsl$character
mat <- apply(mat,1,cumsum)

matplot(mat, type="l", lty = 1:ncol(mat), col="black", main="Liebe")
legend(x="topleft", legend=colnames(mat), lty=1:ncol(mat))
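
Because reshape() can be hard to read, here is a minimal sketch of the same long-to-wide conversion on an invented table with two characters and three acts. Only the column names Number.Act and Liebe mirror the ones used above; the values are made up.

```r
# Toy long table (invented): one row per character and act
long <- data.frame(
  character  = rep(c("emilia", "odoardo"), each = 3),
  Number.Act = rep(1:3, times = 2),
  Liebe      = c(1, 0, 2, 0, 1, 1)
)

# one row per character, one column per act (Liebe.1, Liebe.2, Liebe.3)
wide <- reshape(long, direction = "wide",
                idvar   = "character",   # what identifies a row
                timevar = "Number.Act",  # what distinguishes the readings
                v.names = "Liebe")       # the values to spread out
wide

# cumulative sums per character; apply() over rows returns acts x characters
mat_toy <- as.matrix(wide[, -1])
rownames(mat_toy) <- wide$character
cum_toy <- apply(mat_toy, 1, cumsum)
cum_toy
```

Note that apply() with MARGIN = 1 transposes the result, which is why the characters end up as columns and can be passed directly to matplot().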

11.5 Corpus Analysis

Finally, we will do word field analysis with a small corpus. The following snippet creates a vector with ids for plays from the Sturm und Drang period. Providing this vector as an argument for the loadDrama() function loads them all as a single QDDrama object. To reproduce this, you will need to install the quadrama corpus first, which can be done by executing installData("qd").

sturm_und_drang.ids <- c("qd:11f81.0", "qd:11g1d.0", "qd:11g9w.0", 
                         "qd:11hdv.0", "qd:nds0.0", "qd:r12k.0", 
                         "qd:r12v.0", "qd:r134.0",  "qd:r13g.0", 
                         "qd:rfxf.0",  "qd:rhtz.0", "qd:rhzq.0", 
                         "qd:rj22.0", "qd:tx4z.0", "qd:tz39.0", 
                         "qd:tzgk.0", "qd:v0fv.0", "qd:wznj.0", 
                         # note: the next two ids repeat entries listed above
                         "qd:tx4z.0", "qd:rfxf.0")

sturm_und_drang.plays <- loadDrama(sturm_und_drang.ids)
## Warning in loadDrama(sturm_und_drang.ids): 59 spoken words are not assigned to a
## character (NA values). They have been removed to prevent subsequent issues.

The meta data table of the loaded plays is reproduced here in readable formatting:

knitr::kable(sturm_und_drang.plays$meta)
|corpus |drama   |documentTitle                                    |language |Name                              |Pnd       |Translator.Name |Translator.Pnd |Date.Written |Date.Printed |Date.Premiere |Date.Translation |
|:------|:-------|:------------------------------------------------|:--------|:---------------------------------|:---------|:---------------|:--------------|------------:|------------:|-------------:|:----------------|
|qd     |11f81.0 |Clavigo                                          |de       |Goethe, Johann Wolfgang           |118540238 |NA              |NA             |         1774|         1774|          1774|NA               |
|qd     |11g1d.0 |Götz von Berlichingen mit der eisernen Hand      |de       |Goethe, Johann Wolfgang           |118540238 |NA              |NA             |           NA|         1773|            NA|NA               |
|qd     |11g9w.0 |Egmont                                           |de       |Goethe, Johann Wolfgang           |118540238 |NA              |NA             |         1787|         1788|          1791|NA               |
|qd     |11hdv.0 |Stella                                           |de       |Goethe, Johann Wolfgang           |118540238 |NA              |NA             |         1776|         1776|          1776|NA               |
|qd     |nds0.0  |Ugolino                                          |de       |Gerstenberg, Heinrich Wilhelm von |118690949 |NA              |NA             |           NA|         1768|          1769|NA               |
|qd     |r12k.0  |Sturm und Drang                                  |de       |Klinger, Friedrich Maximilian     |118563319 |NA              |NA             |         1776|           NA|            NA|NA               |
|qd     |r12v.0  |Die Zwillinge                                    |de       |Klinger, Friedrich Maximilian     |118563319 |NA              |NA             |           NA|         1776|          1776|NA               |
|qd     |r134.0  |Die neue Arria                                   |de       |Klinger, Friedrich Maximilian     |118563319 |NA              |NA             |           NA|           NA|            NA|NA               |
|qd     |r13g.0  |Simsone Grisaldo                                 |de       |Klinger, Friedrich Maximilian     |118563319 |NA              |NA             |           NA|           NA|            NA|NA               |
|qd     |rfxf.0  |Julius von Tarent                                |de       |Leisewitz, Johann Anton           |118571370 |NA              |NA             |           NA|           NA|            NA|NA               |
|qd     |rhtz.0  |Die Soldaten                                     |de       |Lenz, Jakob Michael Reinhold      |118571656 |NA              |NA             |         1775|         1776|          1863|NA               |
|qd     |rhzq.0  |Der Hofmeister oder Vorteile der Privaterziehung |de       |Lenz, Jakob Michael Reinhold      |118571656 |NA              |NA             |         1772|         1774|            NA|NA               |
|qd     |rj22.0  |Der neue Menoza                                  |de       |Lenz, Jakob Michael Reinhold      |118571656 |NA              |NA             |           NA|           NA|            NA|NA               |
|qd     |tx4z.0  |Don Carlos, Infant von Spanien                   |de       |Schiller, Friedrich               |118607626 |NA              |NA             |         1787|         1787|          1787|NA               |
|qd     |tz39.0  |Kabale und Liebe                                 |de       |Schiller, Friedrich               |118607626 |NA              |NA             |         1783|           NA|            NA|NA               |
|qd     |tzgk.0  |Die Verschwörung des Fiesco zu Genua             |de       |Schiller, Friedrich               |118607626 |NA              |NA             |         1782|           NA|            NA|NA               |
|qd     |v0fv.0  |Die Räuber                                       |de       |Schiller, Friedrich               |118607626 |NA              |NA             |         1780|         1781|          1882|NA               |
|qd     |wznj.0  |Die Kindermörderin                               |de       |Wagner, Heinrich Leopold          |11862833X |NA              |NA             |         1776|           NA|          1777|NA               |

For the sake of demo, we will use the base_dictionary that is included in the R package. It contains entries for the fields Familie, Krieg, Ratio, Liebe, Religion. Typing base_dictionary in the R console shows all words in all five fields. For loading other dictionaries, see above.

Counting word frequencies on a corpus works exactly as on a single text.

dictionaryStatistics(sturm_und_drang.plays, 
                     fieldnames=names(base_dictionary), 
                     byCharacter = FALSE)
##    corpus   drama Familie Krieg Ratio Liebe Religion
## 1      qd 11f81.0     169    48   100   161       48
## 2      qd 11g1d.0     158   184   107   189      142
## 3      qd 11g9w.0     140   184   120   174       80
## 4      qd 11hdv.0     119    27    41   208       77
## 5      qd  nds0.0     293    47    51   126      107
## 6      qd  r12k.0     217    87    68   215       59
## 7      qd  r12v.0     349   143    49   192       65
## 8      qd  r134.0     136    96    92   387       75
## 9      qd  r13g.0      87   164   108   358       77
## 10     qd  rfxf.0     506   168   220   430      160
## 11     qd  rhtz.0     184    55   106    66       46
## 12     qd  rhzq.0     362    77   126   100      115
## 13     qd  rj22.0     218    51    96   116      102
## 14     qd  tx4z.0     604   380   554   748      420
## 15     qd  tz39.0     463   123   196   284      199
## 16     qd  tzgk.0     220   231   159   210      115
## 17     qd  v0fv.0     399   206   227   304      245
## 18     qd  wznj.0     270    37   132    81       95

To visualize this on a timeline, we need to merge this table with the meta data table. This can be done easily with the merge() function, which is quite handy in our use cases, as it joins tables based on the values in shared columns. In our case, we mostly want to merge tables that both have a corpus and a drama column. If the two tables have columns with the same names, merge() uses them automatically. Otherwise, one can specify the columns using the arguments by, by.x and/or by.y.
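
A small sketch with two invented tables illustrates how merge() behaves. The drama ids and all values are made up; only the column names corpus and drama mirror the ones used in this section.

```r
# Toy tables (invented values)
counts_toy <- data.frame(corpus = "qd",
                         drama  = c("a.0", "b.0", "c.0"),
                         Liebe  = c(10, 20, 30))
meta_toy   <- data.frame(corpus = "qd",
                         drama  = c("c.0", "a.0"),
                         Date.Printed = c(1776, 1774))

# the shared columns (corpus, drama) are used as keys automatically
merged <- merge(counts_toy, meta_toy)
merged

# by default, rows without a partner are dropped (here: b.0);
# all.x = TRUE keeps them and fills the missing values with NA
merged_all <- merge(counts_toy, meta_toy, all.x = TRUE)
merged_all
```

The row order of the second table does not matter, since rows are matched by the key values and not by position.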

As the data contains three different types of date (written, printed, premiere), and not all plays have all dates, we create an artificial reference date by taking the earliest date available. This is done using the apply() function in the code below, by taking the minimum value in each row. For plays without any date, min() returns Inf and issues a warning; these plays are filtered out before plotting.

After that, the table is ordered by this reference date, and the plotting itself can be done with the regular plot() function provided by R.

# count the words (as before)
dstat <- dictionaryStatistics(sturm_und_drang.plays, 
                     fieldnames=names(base_dictionary), 
                     byCharacter = FALSE, 
                     normalizeByCharacter = TRUE)

# merge them with the meta object
dstat <- merge(dstat, sturm_und_drang.plays$meta)

# for each play, take the earliest date available
# (not all plays have all kinds of date)
dstat$Date.Ref <- apply(dstat[,c("Date.Printed", "Date.Written", "Date.Premiere")], 
                        1, min, na.rm=TRUE)
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf

## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
# order them by this reference date
dstat <- dstat[order(dstat$Date.Ref),]

# plot them
plot(Liebe ~ Date.Ref,  # y ~ x
     data = dstat[dstat$Date.Ref!=Inf,],  # the data set, filtering Inf values
     pch = 4,           # we print a cross (see ?points for other options)
     xlab="Year"        # label of the x axis
)
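
The warnings above come from rows in which all three dates are missing. As a possible variation, the sketch below computes the reference date with a small helper that returns NA for such rows instead of Inf. The helper name earliest and the toy dates are made up for this example and are not part of the package.

```r
# Toy dates (invented); the third play has no date at all
dates <- data.frame(Date.Written  = c(1774, NA, NA),
                    Date.Printed  = c(1774, 1773, NA),
                    Date.Premiere = c(1775, NA, NA))

# hypothetical helper: earliest non-missing value, NA if there is none
earliest <- function(x) {
  if (all(is.na(x))) NA_real_ else min(x, na.rm = TRUE)
}

ref <- apply(dates, 1, earliest)
ref

# rows without any date can then be dropped with is.na()
dates[!is.na(ref), ]
```

This avoids both the warnings and the comparison against Inf, at the cost of one extra helper function.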

The resulting plot shows the share of love words in each play, ordered by reference date. In 1776, a very “lovely” play was published, in which love words make up over 1.8 percent of the words (it’s Stella by Goethe). Plays are easier to identify in this plot if we plot not only crosses/points but also some kind of identifier. In the plot below, we use the TextGrid id of the play (which we also use in QuaDramA, because it’s relatively short and still memorable).

# it makes use of the text() function much easier if we have
# a new variable for this filtered data set
dstat.filtered <- dstat[dstat$Date.Ref!=Inf,]

plot(Liebe ~ Date.Ref,  # y ~ x
     data = dstat.filtered, # the data set
     pch = 4,           # we print a cross (see ?points for other options)
     xlab="Year"        # label of the x axis
)

text(x = dstat.filtered$Date.Ref+1, 
     y = dstat.filtered$Liebe,
     labels=dstat.filtered$drama,
     cex=0.7)

For publication, one might want to replace the ids with actual names. This can be done with the function dramaNames() (not shown here).

11.6 Enriching the word fields distributionally

(coming soon)