12 Advanced Text Analysis
This chapter contains analysis steps that are advanced. Concretely, this means that they are not as easy to reproduce as the others.
12.1 When are characters mentioned?
The identification of character mentions is not done within the R package, but in the NLP pipeline DramaNLP, and is still in development (the NLP name for this task is coreference resolution). For demo purposes, we have included manually annotated coreferences in this R package. Thus, to reproduce the analyses presented here on new texts, one would need to annotate coreferences manually, for the time being.
When characters are speaking on stage, they are actively present. But they can also be passively present, if other characters refer to them. Both levels of presence can be extracted with the presence() function:
# Load Emilia Galotti
data(rksp.0)
presence(rksp.0)
## corpus drama character scenes actives passives presence
## 1 test rksp.0 angelo 43 2 2 0.00000000
## 2 test rksp.0 appiani 43 5 20 -0.34883721
## 3 test rksp.0 battista 43 4 7 -0.06976744
## 4 test rksp.0 camillo_rota 43 1 1 0.00000000
## 5 test rksp.0 claudia_galotti 43 13 0 0.30232558
## 6 test rksp.0 conti 43 2 2 0.00000000
## 7 test rksp.0 der_kammerdiener 43 2 0 0.04651163
## 8 test rksp.0 der_prinz 43 17 20 -0.06976744
## 9 test rksp.0 emilia 43 7 29 -0.51162791
## 10 test rksp.0 marinelli 43 19 10 0.20930233
## 11 test rksp.0 odoardo 43 12 12 0.00000000
## 12 test rksp.0 orsina 43 6 10 -0.09302326
## 13 test rksp.0 pirro 43 4 2 0.04651163
As we can see, each character has a few numbers associated with it. The column actives shows the number of scenes in which the character is actively present. This is equivalent to the information in the configuration matrix. The column passives shows the number of scenes in which a character is mentioned. By default, this excludes the scenes in which they are present themselves, because such mentions can be assumed to be addresses to the character (this behaviour can be changed by adding the parameter passiveOnlyWhenNotActive = FALSE to the call of the presence() function).
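The presence column is not described above. Judging from the printed values, it appears to be the difference between active and passive scenes, divided by the total number of scenes. The following sketch checks this reading against three rows copied from the output; the formula is inferred from the table and is an assumption, not taken from the package documentation.

```r
# Three rows copied by hand from the presence(rksp.0) output above
pres_rows <- data.frame(
  character = c("appiani", "emilia", "marinelli"),
  scenes    = c(43, 43, 43),
  actives   = c(5, 7, 19),
  passives  = c(20, 29, 10),
  presence  = c(-0.34883721, -0.51162791, 0.20930233)
)

# Assumed formula (inferred from the table): (actives - passives) / scenes
pres_rows$recomputed <- with(pres_rows, (actives - passives) / scenes)

# Compare the printed and the recomputed values
print(pres_rows[, c("character", "presence", "recomputed")])
```

Under this reading, positive values mark characters who speak in more scenes than they are mentioned in, and negative values mark the opposite.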
A simple visualisation that shows the characters' active and passive presence in one plot can be generated like this: The first call (plot()) is responsible for the plotting of the symbols, the second call (text()) adds the character names or ID numbers.
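library(magrittr) # provides the %>% operator
# Compute presence and replace character ids by names
pres <- presence(rksp.0) %>%
  characterNames(rksp.0)
# Plot the symbols: share of active vs. passive scenes
plot(x=pres$actives/pres$scenes,
     y=pres$passives/pres$scenes,
     xlim=c(0,1),
     ylim=c(0,1),
     xlab="Active",
     ylab="Passive",
     sub=dramaNames(rksp.0),
     main="Character Presence")
# Add the character labels above the symbols
text(x=pres$actives/pres$scenes,
     y=pres$passives/pres$scenes,
     labels=substr(pres$character,1,20),
     pos=3,
     cex=0.8)
# Dotted line: equal active and passive presence
lines(x=seq(0,0.5,0.1), y=seq(0,0.5,0.1), lty=3)
# Dashed line: active and passive presence add up to all scenes
lines(x=1:0, y=0:1, lty=2)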
In addition to the characters and their position, the plot shows a dashed and a dotted line. The dashed line indicates the maximum for one character: as each scene is counted towards either the active or the passive presence, a character cannot be placed in the top right triangle. The dotted line separates the characters that have more active scenes (below the line) from those that have more passive scenes (above the line). Going by the table above, five characters (Appiani, Battista, Der Prinz, Emilia and Orsina) lie above the dotted line, i.e., for each of them, there are more scenes in which they are mentioned than scenes in which they speak. Four characters (Claudia Galotti, Der Kammerdiener, Marinelli and Pirro) lie below it, and the remaining four lie on it.
The three characters Der Prinz, Emilia and Marinelli are all relatively close to the dashed line. This indicates that they are the most present characters of the play.
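The closeness to the dashed line can also be expressed as a number: the share of scenes in which a character is either active or passive. A minimal sketch, using the counts copied from the table above rather than a call to presence(); the column names total and side are introduced here for illustration only.

```r
# Counts copied by hand from the presence(rksp.0) output above
pres_tab <- data.frame(
  character = c("angelo", "appiani", "battista", "camillo_rota",
                "claudia_galotti", "conti", "der_kammerdiener",
                "der_prinz", "emilia", "marinelli", "odoardo",
                "orsina", "pirro"),
  scenes    = 43,
  actives   = c(2, 5, 4, 1, 13, 2, 2, 17, 7, 19, 12, 6, 4),
  passives  = c(2, 20, 7, 1, 0, 2, 0, 20, 29, 10, 12, 10, 2)
)

# Share of scenes with any presence; 1 would lie on the dashed line
pres_tab$total <- (pres_tab$actives + pres_tab$passives) / pres_tab$scenes

# Position relative to the dotted line
pres_tab$side <- ifelse(pres_tab$passives > pres_tab$actives, "more passive",
                 ifelse(pres_tab$passives < pres_tab$actives, "more active",
                        "equal"))

# Characters ordered by total presence, highest first
print(pres_tab[order(-pres_tab$total), c("character", "total", "side")])
```

With these counts, Der Prinz, Emilia and Marinelli come out on top, which matches the reading of the plot.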
12.2 Who mentions whom?
The above analysis shows mentions of characters by all (other) characters. To uncover the social relations in a play, it is of great interest who talks about whom, i.e., who produces the mentions of Emilia, for instance.
There are multiple ways to represent this visually, of course. To get a general overview of the play, a directed network might be suitable. This will be created in the following.
library(reshape2) # provides acast()
library(igraph)   # provides graph.adjacency(), degree(), E()
# Count how often each speaker mentions each other character
mc <- aggregate.data.frame(
  rksp.0$mentions[, "corpus", with = FALSE],
  by = rksp.0$mentions[, c("utteranceSpeakerId", "entityId"),
                       with = FALSE],
  FUN=length)
# Get top 7 characters based on the amount of tokens they utter
top_chars <- filterCharacters(
  dictionaryStatistics(rksp.0),
  rksp.0, n = 7)$character
# Only keep top 7 speaking chars
mc <- mc[mc$utteranceSpeakerId %in% top_chars,]
# Only keep mentions of top 7 chars
mc <- mc[mc$entityId %in% top_chars,]
# The aggregated column now holds the counts, so rename it
names(mc)[names(mc)=="corpus"] <- "N"
# Turn the counts into a speaker-by-entity matrix
am <- acast(mc, utteranceSpeakerId~entityId,
            value.var = "N", fill = 0, drop = FALSE)
rownames(am) <- stringr::str_to_title(gsub("_", " ", rownames(am)))
colnames(am) <- stringr::str_to_title(gsub("_", " ", colnames(am)))
# Plot in a directed graph
g <- graph.adjacency(am, mode = "directed",
                     weighted = TRUE, diag = FALSE)
plot(g,
     main="Mention Network",
     sub=dramaNames(rksp.0),
     vertex.size= degree(g)*3,
     edge.width = E(g)$weight*0.02,
     edge.arrow.size = 0.4,
     edge.curved = 0.2,
     vertex.color=qd.colors[6],
     vertex.label.color="black",
     layout = layout.circle)
The graph shows how often the displayed characters refer to each other. The wider an arrow, the more often the character mentions the character the arrow is pointing to. Node size is derived from degree(g), so larger nodes indicate characters that are connected to more of the other characters, either by mentioning them or by being mentioned. Emilia is often mentioned by her mother and father as well as by Der Prinz and Marinelli. Der Prinz and Marinelli refer to each other more than any other character in the play.
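To make the counting step more tangible, here is a minimal sketch in base R on invented toy data. The data frame toy_mentions and the character ids a, b and c are hypothetical; only the column names utteranceSpeakerId and entityId are taken from the code above. It builds the same kind of speaker-by-entity matrix and derives how often each character mentions others and is mentioned.

```r
# Hypothetical toy data: one row per mention (who speaks, who is mentioned)
toy_mentions <- data.frame(
  utteranceSpeakerId = c("a", "a", "b", "c", "c", "c"),
  entityId           = c("b", "b", "a", "a", "b", "b")
)

# Use the same levels for rows and columns so that the matrix is square
chars <- c("a", "b", "c")
am <- unclass(table(
  speaker   = factor(toy_mentions$utteranceSpeakerId, levels = chars),
  mentioned = factor(toy_mentions$entityId, levels = chars)
))

# Rows are speakers, columns are mentioned characters
print(am)

# Mentions produced by each character (outgoing edge weights)
print(rowSums(am))

# Mentions received by each character (incoming edge weights)
print(colSums(am))
```

The column sums could be used in place of degree(g) if node size is meant to reflect how often a character is mentioned rather than how many connections it has.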