11 Word Field Analysis
What we consider a word field here may differ from specific uses in linguistics. In this context, a word field is a list of words that belong to a common theme, topic, or semantic group. Multiple word fields can be assembled into a dictionary. On a technical level, what we describe in the following works for arbitrary lists of words; a semantic relation between the words is technically not required.
For demo purposes (this is really a toy example), we will define the word field of Love
as containing the words Liebe
(love) and Herz
(heart).
In R, we can put them in a character vector, and we use lemmas:
<- c("liebe", "herz") wf_love
We will test this word field on Emilia Galotti, which should be about love.
data("rksp.0")
11.1 Single word field
The core of the word field analysis is collecting statistics about a dictionary. To this end, we use the function dictionaryStatisticsSingle() (single, because we only want to analyse a single word field):
dictionaryStatisticsSingle(
  rksp.0,              # the text we want to process
  wordfield = wf_love  # the word field
)
## corpus drama character x
## 1 test rksp.0 angelo 0
## 2 test rksp.0 appiani 0
## 3 test rksp.0 battista 0
## 4 test rksp.0 camillo_rota 0
## 5 test rksp.0 claudia_galotti 2
## 6 test rksp.0 conti 2
## 7 test rksp.0 der_kammerdiener 0
## 8 test rksp.0 der_prinz 10
## 9 test rksp.0 emilia 5
## 10 test rksp.0 marinelli 8
## 11 test rksp.0 odoardo 7
## 12 test rksp.0 orsina 5
## 13 test rksp.0 pirro 0
This table shows the number of times each character in the play uses words that appear in the list. By default, these are absolute numbers.
We can visualise these counts in a simple bar chart:
# retrieve counts and replace character ids by names
dstat <- dictionaryStatisticsSingle(rksp.0, wordfield = wf_love) %>%
  characterNames(rksp.0)

# remove characters not using these words at all
dstat <- dstat[dstat$x > 0, ]
# create a bar plot
barplot(dstat$x, # what to plot
names.arg = dstat$character, # x axis labels
las=2 # turn axis labels
)
Apparently, the prince and Marinelli use these words considerably more often than the other characters.
When comparing characters, it often (but not always) makes sense to normalize the counts by the total number of words spoken by a character. This can be enabled by setting the argument normalizeByCharacter=TRUE. This will divide the number of words from the field by the total number of words a character speaks.
dstat <- dictionaryStatisticsSingle(rksp.0, wordfield = wf_love,
           normalizeByCharacter = TRUE  # apply normalization
         ) %>%
  characterNames(rksp.0)               # reformat character names

# remove figures not using these words at all
dstat <- dstat[dstat$x > 0, ]
barplot(dstat$x,
names.arg = dstat$character,
las=2
)
11.2 Multiple Word Fields
The function dictionaryStatistics()
can be used to analyse multiple dictionaries at once. To this end, dictionaries are represented as lists of character vectors. The names of the (outer) list are the keywords; the vectors contain the words associated with each keyword.
New dictionaries can be easily created like this:
wf <- list(Liebe = c("liebe", "herz", "schmerz"), Hass = c("hass", "hassen"))
This dictionary contains the two keywords Liebe (love) and Hass (hate), with three and two words, respectively. Dictionaries can be created in code, as shown above. In addition, the function loadFields() can be used to download dictionary content from a URL or a directory. By default, the function loads the dictionary we have used in publications from GitHub, for the keywords Liebe and Familie (family). Since version 2.3.0, this dictionary is included in the package as base_dictionary and can be used right away (without an internet connection). It is also the default dictionary used by dictionaryStatistics().
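Because a dictionary is just a named list of character vectors (as described above), the base_dictionary can be inspected with plain base R. A quick sketch:
names(base_dictionary)               # list the keywords (fields) in base_dictionary
sapply(base_dictionary, length)      # number of words per field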
The function loadFields()
offers parameters to load from different URLs via HTTP, or to load from plain text files stored locally. The latter can be achieved by specifying the directory as baseurl. The entries for each keyword should then be stored in a file named after the keyword and ending in .txt (the default suffix, which can be overridden). See ?loadFields for details. Some of the options can also be specified through dictionaryStatistics(), as exemplified below.
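As a rough sketch of the local variant: assuming a directory my_fields/ containing a file Liebe.txt with one word per line (the directory name is illustrative, and the fieldnames argument is an assumption here; see ?loadFields for the authoritative argument list), the call could look like this:
wf_local <- loadFields(fieldnames = c("Liebe"),  # keyword(s) to load; one file per keyword
                       baseurl = "my_fields/")   # local directory instead of a URL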
The following examples use the base_dictionary
, i.e., a specific version of the fields we have been using in QuaDramA.
dstat <- dictionaryStatistics(
  rksp.0,                       # the text
  fieldnames =                  # fields we want to measure (from base_dictionary)
    c("Liebe", "Familie", "Ratio", "Krieg", "Religion"),
  normalizeByCharacter = TRUE,  # normalization by character
  normalizeByField = TRUE       # normalization by field
)
dstat
## corpus drama character Liebe Familie Ratio
## 1 test rksp.0 angelo 4.707211e-05 6.061341e-05 6.828362e-05
## 2 test rksp.0 appiani 8.450546e-05 1.450870e-04 6.537871e-05
## 3 test rksp.0 battista 0.000000e+00 1.349619e-04 0.000000e+00
## 4 test rksp.0 camillo_rota 1.003613e-04 1.292324e-04 4.367575e-04
## 5 test rksp.0 claudia_galotti 7.467219e-05 2.307678e-04 3.899548e-05
## 6 test rksp.0 conti 1.531692e-04 5.379043e-05 3.635835e-05
## 7 test rksp.0 der_kammerdiener 0.000000e+00 0.000000e+00 0.000000e+00
## 8 test rksp.0 der_prinz 6.896790e-05 6.660598e-05 5.002301e-05
## 9 test rksp.0 emilia 9.004061e-05 3.304367e-04 4.702121e-05
## 10 test rksp.0 marinelli 7.894143e-05 1.524759e-04 4.580552e-05
## 11 test rksp.0 odoardo 5.635355e-05 2.257573e-04 6.267303e-05
## 12 test rksp.0 orsina 5.028230e-05 1.109950e-04 5.314227e-05
## 13 test rksp.0 pirro 0.000000e+00 1.871398e-04 5.059705e-05
## Krieg Religion
## 1 5.267594e-05 0.000000e+00
## 2 4.728281e-05 0.000000e+00
## 3 0.000000e+00 0.000000e+00
## 4 8.423181e-05 0.000000e+00
## 5 4.595895e-05 8.209574e-05
## 6 2.337322e-05 0.000000e+00
## 7 0.000000e+00 0.000000e+00
## 8 2.894188e-05 3.791218e-05
## 9 3.022792e-05 9.651721e-05
## 10 3.943715e-05 1.859773e-05
## 11 4.992433e-05 5.679295e-05
## 12 3.014373e-05 1.184596e-05
## 13 2.439500e-05 9.586809e-05
The variable dstat now contains multiple columns, one for each word field. We have been using the option normalizeByCharacter before. When comparing multiple fields, it often happens that they differ in size (i.e., in the number of words they contain). In this case, it makes sense to also normalize by the number of words in the word field. This is achieved by normalizeByField=TRUE. The resulting numbers are very small, but they are meant to be compared with each other rather than interpreted in isolation.
11.2.1 Bar plot by character
It is now straightforward to show the distribution of fields for a single character:
mat <- as.matrix(dstat)
barplot(mat[9,], # we select Emilia's line
main="Emilia's speech",
names.arg=colnames(mat)
)
11.2.2 Bar plot by field
Conversely, we can also show who uses words of a certain field how often:
dstat <- dictionaryStatistics(
  rksp.0,                       # the text
  fieldnames =                  # fields we want to measure (from base_dictionary)
    c("Liebe", "Familie", "Ratio", "Krieg", "Religion"),
  normalizeByCharacter = TRUE,  # normalization by character
  normalizeByField = TRUE       # normalization by field
) %>%
  characterNames(rksp.0)

mat <- as.matrix(dstat)
par(mar=c(9,4,1,1))
barplot(mat[,1], # select the column for 'Liebe' (love)
main="Use of love words", # title for plot
beside = TRUE, # not stacked
names.arg = dstat$character, # x axis labels
las=2 # rotation for labels
)
11.3 Dictionary Based Distance
Technically, the output of dictionaryStatistics() is a data.frame, which is suitable for most uses. In some cases, however, it is more convenient to work with a matrix that contains only the raw numbers (e.g., the number of family words), for instance when calculating character distances based on dictionaries. For these cases, the package provides an S3 method for as.matrix() that extracts the numeric part of the data.frame and creates a matrix. We have already used this function above.
The matrix doesn’t have row names by default. The snippet below can be used to add row names.
ds <- dictionaryStatistics(rksp.0,
        fieldnames = c("Liebe", "Familie", "Ratio", "Krieg", "Religion"),
        normalizeByCharacter = TRUE)
m <- as.matrix(ds)
rownames(m) <- ds$character
m
## Liebe Familie Ratio Krieg Religion
## angelo 0.004424779 0.004424779 0.007374631 0.005899705 0.0000000000
## appiani 0.007943513 0.010591350 0.007060900 0.005295675 0.0000000000
## battista 0.000000000 0.009852217 0.000000000 0.000000000 0.0000000000
## camillo_rota 0.009433962 0.009433962 0.047169811 0.009433962 0.0000000000
## claudia_galotti 0.007019186 0.016846046 0.004211511 0.005147403 0.0046794572
## conti 0.014397906 0.003926702 0.003926702 0.002617801 0.0000000000
## der_kammerdiener 0.000000000 0.000000000 0.000000000 0.000000000 0.0000000000
## der_prinz 0.006482982 0.004862237 0.005402485 0.003241491 0.0021609941
## emilia 0.008463817 0.024121879 0.005078290 0.003385527 0.0055014812
## marinelli 0.007420495 0.011130742 0.004946996 0.004416961 0.0010600707
## odoardo 0.005297234 0.016480283 0.006768687 0.005591524 0.0032371984
## orsina 0.004726536 0.008102633 0.005739365 0.003376097 0.0006752194
## pirro 0.000000000 0.013661202 0.005464481 0.002732240 0.0054644809
Every character is now represented by five numbers, which can be interpreted as a vector in five-dimensional space. This means we can easily apply distance metrics supplied by the function dist() (from the default package stats). By default, dist() calculates Euclidean distance.
cdist <- dist(m)
# output not shown
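Other metrics are available through the method argument of dist(); for instance, Manhattan distance can be used instead of the Euclidean default (which metric is appropriate depends on the analysis, not on the package):
# Manhattan distance as an alternative to the Euclidean default
cdist_manhattan <- dist(m, method = "manhattan")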
The resulting data structure is similar to the one in the weighted configuration matrix, which means everything from Section 5.2.2 can be applied here. In particular, we can convert these distances into a network:
require(igraph)
g <- graph_from_adjacency_matrix(as.matrix(cdist),
       weighted = TRUE,      # weighted graph
       mode = "undirected",  # no direction
       diag = FALSE          # no looping edges
)
This network can of course be visualised again.
# Now we plot
plot.igraph(g,
layout=layout_with_fr, # how to lay out the graph
main="Dictionary distance network", # title
vertex.label.cex=0.6, # label size
vertex.label.color="black", # font color
vertex.color=qd.colors[5], # vertex color
vertex.frame.color=NA, # no vertex border
edge.width=E(g)$weight*100 # scale edges according to their weight
# (since the distances are so small, we multiply)
)
Although this network is similar to the one shown in Section 5.2.2 (both are undirected and weighted), it displays a totally different kind of information. The networks based on copresence connect characters that appear together on stage, while this network connects characters with a similar thematic profile in their speech (within the limits of detecting thematic profiles via word fields).
11.4 Development over the course of the play
The function dictionaryStatistics() can also be used to track word fields across segments of the play. To do that, we use the parameter segment and set it to either "Act" or "Scene".
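As a sketch, the scene-level variant only differs in the segment argument (the fieldnames are the same ones used below):
# dictionary statistics resolved per scene instead of per act
dss <- dictionaryStatistics(rksp.0,
         fieldnames = c("Liebe", "Familie", "Ratio", "Krieg", "Religion"),
         normalizeByCharacter = TRUE,
         segment = "Scene")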
11.4.1 Individual characters
We can now plot the theme progress over the course of the play. This can be done for specific characters, as shown below.
dsl <- dictionaryStatistics(rksp.0,
         fieldnames = c("Liebe", "Familie", "Ratio", "Krieg", "Religion"),
         normalizeByCharacter = TRUE,
         segment = "Act")
mat <- as.matrix(dsl[dsl$character == "marinelli", ])
matplot(mat, type="l", col="black")
legend(x="topleft",legend=colnames(mat), lty=1:ncol(dsl))
Depending on the use case, it might be easier to interpret the numbers if they are added up cumulatively. The snippet below shows how this works.
dsl <- dictionaryStatistics(rksp.0,
         fieldnames = c("Liebe", "Familie", "Ratio", "Krieg", "Religion"),
         normalizeByCharacter = TRUE,
         segment = "Act")
mat <- as.matrix(dsl[dsl$character == "marinelli", ])
mat <- apply(mat, 2, cumsum)
matplot(mat, type="l", col="black")
# Add act lines
abline(v=1:nrow(mat))
legend(x="topleft", legend=colnames(mat), lty=1:5)
11.4.2 Comparing characters
Complementary to the setting above, we now want to compare the development of several characters for a single word field. This unfortunately requires some reshuffling of the data, using the function reshape() (from the stats package).
dsl <- dictionaryStatistics(rksp.0,
         fieldnames = c("Liebe"),
         normalizeByCharacter = TRUE,
         segment = "Act") %>%
  filterCharacters(rksp.0, n = 6) %>%
  characterNames(rksp.0)

dsl <- reshape(dsl, direction = "wide",          # the table becomes wider
         timevar = "Number.Act",                 # the column that specifies multiple readings
         times = "Liebe",                        # the number to distribute
         idvar = c("corpus", "drama", "character") # what identifies a character
)

mat <- as.matrix.data.frame(dsl[, 4:ncol(dsl)])
rownames(mat) <- dsl$character
mat <- apply(mat, 1, cumsum)
matplot(mat, type="l", lty = 1:ncol(mat), col="black", main="Liebe")
legend(x="topleft", legend=colnames(mat), lty=1:ncol(mat))
11.5 Corpus Analysis
Finally, we will do word field analysis on a small corpus. The following snippet creates a vector with ids of plays from the Sturm und Drang period. Providing this vector as an argument to the loadDrama() function loads them all as a single QDDrama object. To reproduce this, you will need to install the QuaDramA corpus first, which can be done by executing installData("qd").
<- c("qd:11f81.0", "qd:11g1d.0", "qd:11g9w.0",
sturm_und_drang.ids "qd:11hdv.0", "qd:nds0.0", "qd:r12k.0",
"qd:r12v.0", "qd:r134.0", "qd:r13g.0",
"qd:rfxf.0", "qd:rhtz.0", "qd:rhzq.0",
"qd:rj22.0", "qd:tx4z.0", "qd:tz39.0",
"qd:tzgk.0", "qd:v0fv.0", "qd:wznj.0",
"qd:tx4z.0", "qd:rfxf.0")
<- loadDrama(sturm_und_drang.ids) sturm_und_drang.plays
## Warning in loadDrama(sturm_und_drang.ids): 59 spoken words are not assigned to a
## character (NA values). They have been removed to prevent subsequent issues.
The resulting table is reproduced here in readable formatting:
knitr::kable(sturm_und_drang.plays$meta)
corpus | drama | documentTitle | language | Name | Pnd | Translator.Name | Translator.Pnd | Date.Written | Date.Printed | Date.Premiere | Date.Translation |
---|---|---|---|---|---|---|---|---|---|---|---|
qd | 11f81.0 | Clavigo | de | Goethe, Johann Wolfgang | 118540238 | NA | NA | 1774 | 1774 | 1774 | NA |
qd | 11g1d.0 | Götz von Berlichingen mit der eisernen Hand | de | Goethe, Johann Wolfgang | 118540238 | NA | NA | NA | 1773 | NA | NA |
qd | 11g9w.0 | Egmont | de | Goethe, Johann Wolfgang | 118540238 | NA | NA | 1787 | 1788 | 1791 | NA |
qd | 11hdv.0 | Stella | de | Goethe, Johann Wolfgang | 118540238 | NA | NA | 1776 | 1776 | 1776 | NA |
qd | nds0.0 | Ugolino | de | Gerstenberg, Heinrich Wilhelm von | 118690949 | NA | NA | NA | 1768 | 1769 | NA |
qd | r12k.0 | Sturm und Drang | de | Klinger, Friedrich Maximilian | 118563319 | NA | NA | 1776 | NA | NA | NA |
qd | r12v.0 | Die Zwillinge | de | Klinger, Friedrich Maximilian | 118563319 | NA | NA | NA | 1776 | 1776 | NA |
qd | r134.0 | Die neue Arria | de | Klinger, Friedrich Maximilian | 118563319 | NA | NA | NA | NA | NA | NA |
qd | r13g.0 | Simsone Grisaldo | de | Klinger, Friedrich Maximilian | 118563319 | NA | NA | NA | NA | NA | NA |
qd | rfxf.0 | Julius von Tarent | de | Leisewitz, Johann Anton | 118571370 | NA | NA | NA | NA | NA | NA |
qd | rhtz.0 | Die Soldaten | de | Lenz, Jakob Michael Reinhold | 118571656 | NA | NA | 1775 | 1776 | 1863 | NA |
qd | rhzq.0 | Der Hofmeister oder Vorteile der Privaterziehung | de | Lenz, Jakob Michael Reinhold | 118571656 | NA | NA | 1772 | 1774 | NA | NA |
qd | rj22.0 | Der neue Menoza | de | Lenz, Jakob Michael Reinhold | 118571656 | NA | NA | NA | NA | NA | NA |
qd | tx4z.0 | Don Carlos, Infant von Spanien | de | Schiller, Friedrich | 118607626 | NA | NA | 1787 | 1787 | 1787 | NA |
qd | tz39.0 | Kabale und Liebe | de | Schiller, Friedrich | 118607626 | NA | NA | 1783 | NA | NA | NA |
qd | tzgk.0 | Die Verschwörung des Fiesco zu Genua | de | Schiller, Friedrich | 118607626 | NA | NA | 1782 | NA | NA | NA |
qd | v0fv.0 | Die Räuber | de | Schiller, Friedrich | 118607626 | NA | NA | 1780 | 1781 | 1882 | NA |
qd | wznj.0 | Die Kindermörderin | de | Wagner, Heinrich Leopold | 11862833X | NA | NA | 1776 | NA | 1777 | NA |
For demonstration purposes, we will use the base_dictionary that is included in the R package. It contains entries for the fields Familie, Krieg, Ratio, Liebe, and Religion. Typing base_dictionary in the R console shows all words in all five fields. For loading other dictionaries, see above.
Counting word frequencies on a corpus works exactly as on a single text.
dictionaryStatistics(sturm_und_drang.plays,
fieldnames=names(base_dictionary),
byCharacter = FALSE)
## corpus drama Familie Krieg Ratio Liebe Religion
## 1 qd 11f81.0 169 48 100 161 48
## 2 qd 11g1d.0 158 184 107 189 142
## 3 qd 11g9w.0 140 184 120 174 80
## 4 qd 11hdv.0 119 27 41 208 77
## 5 qd nds0.0 293 47 51 126 107
## 6 qd r12k.0 217 87 68 215 59
## 7 qd r12v.0 349 143 49 192 65
## 8 qd r134.0 136 96 92 387 75
## 9 qd r13g.0 87 164 108 358 77
## 10 qd rfxf.0 506 168 220 430 160
## 11 qd rhtz.0 184 55 106 66 46
## 12 qd rhzq.0 362 77 126 100 115
## 13 qd rj22.0 218 51 96 116 102
## 14 qd tx4z.0 604 380 554 748 420
## 15 qd tz39.0 463 123 196 284 199
## 16 qd tzgk.0 220 231 159 210 115
## 17 qd v0fv.0 399 206 227 304 245
## 18 qd wznj.0 270 37 132 81 95
In order to visualize this in a timeline, we need to merge this table with the meta data table. This can be done easily with the merge() function, which is quite handy for our use cases, as it merges tables based on shared column values. In our case, we mostly want to merge tables that both have a corpus and a drama column. If the two tables have columns with the same name, this is done automatically. Otherwise, one can specify the columns using the arguments by, by.x, and/or by.y.
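For illustration, the merge used below could also be written with an explicit by argument (same result in this case, since the shared columns would otherwise be detected automatically; dstat refers to the statistics table created in the next snippet):
# explicit variant of the merge below, naming the shared columns
merged <- merge(dstat, sturm_und_drang.plays$meta, by = c("corpus", "drama"))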
As the data contains three different types of date (written, printed, premiere), and not all plays have all dates, we create an artificial reference date by taking the earliest date possible. This is done using the apply
function in the code below, and by taking the minimum value in each row.
After that, the table is ordered by this reference date, and the plotting itself can be done with the regular plot() function provided by R.
# count the words (as before)
dstat <- dictionaryStatistics(sturm_und_drang.plays,
           fieldnames = names(base_dictionary),
           byCharacter = FALSE,
           normalizeByCharacter = TRUE)

# merge them with the meta object
dstat <- merge(dstat, sturm_und_drang.plays$meta)

# for each play, take the earliest date available
# (not all plays have all kinds of date)
dstat$Date.Ref <- apply(dstat[, c("Date.Printed", "Date.Written", "Date.Premiere")],
                        1, min, na.rm = TRUE)
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
## Warning in FUN(newX[, i], ...): no non-missing arguments to min; returning Inf
# order them by this reference date
dstat <- dstat[order(dstat$Date.Ref), ]
# plot them
plot(Liebe ~ Date.Ref, # y ~ x
data = dstat[dstat$Date.Ref!=Inf,], # the data set, filtering Inf values
pch = 4, # we print a cross (see ?points for other options)
xlab="Year" # label of the x axis
)
The resulting plot shows the percentage of love words in each play, organized by reference date. Thus, in 1776, a very "lovely" play was published, reaching over 1.8 percent love words (it's Stella by Goethe). The identification of plays in this plot can be simplified if we plot not only crosses/points, but some kind of identifier. In the plot below, we use the TextGrid id of the play (which we also use in QuaDramA, because it's relatively short and still memorable).
# using the text() function is much easier if we have
# a new variable for this filtered data set
dstat.filtered <- dstat[dstat$Date.Ref != Inf, ]
plot(Liebe ~ Date.Ref, # y ~ x
data = dstat.filtered, # the data set
pch = 4, # we print a cross (see ?points for other options)
xlab="Year" # label of the x axis
)
text(x = dstat.filtered$Date.Ref+1,
y = dstat.filtered$Liebe,
labels=dstat.filtered$drama,
cex=0.7)
For publication, one might want to replace the ids with actual names. This can be done with the function dramaNames()
(not shown here).