Given a frequency table (with texts as rows and words as columns), this function calculates the log-likelihood or log ratio of one set of rows against the other rows. The return value is a list containing scores for each word. If the method is loglikelihood, the returned scores are unsigned G2 values. To estimate the direction of the keyness, the log ratio is more informative. A nice introduction to log ratio can be found here.
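To make the difference between the two scores concrete, the following is a minimal sketch for a single word. It uses the two-term log-likelihood formula from the UCREL wizard referenced in this page and the usual definition of log ratio as the binary log of the ratio of relative frequencies. The counts and all variable names are made up for illustration; this is not the package's internal code.

```r
# Toy counts (assumed for illustration only)
word_target <- 30    # occurrences of the word in the target rows
word_ref    <- 10    # occurrences of the word in the remaining rows
n_target    <- 1000  # total tokens in the target rows
n_ref       <- 1000  # total tokens in the remaining rows

# Expected frequencies if the word were equally likely in both sets
e_target <- n_target * (word_target + word_ref) / (n_target + n_ref)
e_ref    <- n_ref    * (word_target + word_ref) / (n_target + n_ref)

# Log-likelihood (G2): never negative, so it carries no direction
g2 <- 2 * (word_target * log(word_target / e_target) +
           word_ref    * log(word_ref    / e_ref))

# Log ratio: positive if the word is overused in the target rows,
# negative if it is underused
lr <- log2((word_target / n_target) / (word_ref / n_ref))

# Swapping target and reference flips the sign of the log ratio,
# while G2 stays the same
lr_swapped <- log2((word_ref / n_ref) / (word_target / n_target))
```

Here `lr` equals `log2(3)`, because the word is three times as frequent in the target rows, and `lr_swapped` is its negative. A zero count would make `log(0)` appear in the G2 sum, which is the situation the `epsilon` argument is meant to avoid.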
keyness(ft, categories = c(1, rep(2, nrow(ft) - 1)), epsilon = 1e-100, siglevel = 0.05, method = c("loglikelihood", "logratio"), minimalFrequency = 10)
Argument | Description |
---|---|
ft | The frequency table |
categories | A factor or numeric vector that assigns each row of ft to a category. By default, the first row is compared against all other rows. |
epsilon | Zero values are replaced by this value to avoid division by zero |
siglevel | Return only the keywords that are significant at this level. Set to 1 to get all words |
method | Either "logratio" or "loglikelihood" (default) |
minimalFrequency | Words less frequent than this value are not considered at all |
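The arguments can be illustrated with base R alone. The sketch below builds a toy frequency table (made-up counts), shows what the default `categories` value expands to, and shows how a significance level maps to a critical G2 value. It assumes the usual chi-squared approximation with one degree of freedom and reads `minimalFrequency` as a threshold on the total count per word; whether `keyness()` does exactly this internally is not confirmed here.

```r
# Toy frequency table (assumed data): 4 texts as rows, 3 words as columns
ft <- matrix(c(12, 0, 3,
                4, 7, 1,
                5, 2, 9,
                6, 1, 0),
             nrow = 4, byrow = TRUE,
             dimnames = list(c("t1", "t2", "t3", "t4"),
                             c("love", "sword", "king")))

# Default assignment: first row is category 1, all other rows category 2
categories <- c(1, rep(2, nrow(ft) - 1))

# Summing the rows per category gives the two profiles that are compared
profiles <- rowsum(ft, group = categories)

# Assumed reading of minimalFrequency: drop words whose total count
# falls below the threshold
minimalFrequency <- 11
kept <- colnames(ft)[colSums(ft) >= minimalFrequency]

# Critical G2 value for siglevel = 0.05 under a chi-squared
# distribution with one degree of freedom (about 3.84)
critical <- qchisq(1 - 0.05, df = 1)
```

With these counts, `profiles` has one row per category, and only "love" and "king" survive the frequency filter.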
A list of keywords, sorted by their log-likelihood or log ratio value, calculated according to http://ucrel.lancs.ac.uk/llwizard.html.
data("rksp.0")
ft <- frequencytable(rksp.0, byCharacter = TRUE, normalize = FALSE)

# Calculate log ratio for all words
genders <- factor(c("m", "m", "m", "m", "f", "m", "m", "m", "f", "m", "m", "f", "m"))
keywords <- keyness(ft, method = "logratio", categories = genders, minimalFrequency = 5)

# Remove words that are not significantly different
keywords <- keywords[names(keywords) %in% names(keyness(ft, siglevel = 0.01))]