This function calculates a variant of TF-IDF. The input is assumed to contain relative frequencies. IDF is calculated as \(idf_t = \log\frac{N+1}{n_t}\) (natural logarithm), where \(N\) is the total number of documents (i.e., rows) and \(n_t\) is the number of documents containing term \(t\). One is added to the numerator so that the IDF of a term that appears in every document does not become 0.
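As a small worked illustration of why the added one matters (the numbers are chosen here for illustration, assuming \(N = 3\) documents): a term that occurs in all three documents keeps a small positive weight, while a term that occurs in only one document is weighted much more heavily.

\[
n_t = 3:\quad idf_t = \log\frac{3+1}{3} = \log\frac{4}{3} \approx 0.288
\qquad\qquad
n_t = 1:\quad idf_t = \log\frac{3+1}{1} = \log 4 \approx 1.386
\]

Without the added one, the first case would give \(\log\frac{3}{3} = 0\), and the term would vanish from every row of the result.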
tfidf(ftable)
ftable | A matrix containing "documents" as rows and "terms" as columns. Values are assumed to be normalized by document, i.e., to contain relative frequencies. |
---|---|
A matrix containing TF*IDF values instead of relative frequencies.
data(rksp.0)
ftable <- frequencytable(rksp.0, byCharacter=TRUE, normalize=TRUE)
rksp.0.tfidf <- tfidf(ftable)
mat <- matrix(c(0.10,0.2, 0, 0, 0.2, 0, 0.1, 0.2, 0.1, 0.8, 0.4, 0.9), nrow=3,ncol=4)
mat2 <- tfidf(mat)
print(mat2)
#>            [,1]      [,2]       [,3]      [,4]
#> [1,] 0.06931472 0.0000000 0.02876821 0.2301457
#> [2,] 0.13862944 0.2772589 0.05753641 0.1150728
#> [3,] 0.00000000 0.0000000 0.02876821 0.2589139
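To make the computation behind these numbers easier to follow, here is a minimal base-R sketch of the formula from the description. The name `tfidf_sketch` is hypothetical, and this is not the package's source code. In particular, the handling of terms that occur in no document is an assumption made here to avoid `NaN` values.

```r
# Hypothetical illustration of idf_t = log((N + 1) / n_t); not the package's code.
tfidf_sketch <- function(ftable) {
  n_docs   <- nrow(ftable)               # N: number of documents (rows)
  doc_freq <- colSums(ftable > 0)        # n_t: documents containing each term
  idf      <- log((n_docs + 1) / doc_freq)
  # Assumption: a term found in no document gets weight 0
  # (otherwise log(x / 0) is Inf and 0 * Inf yields NaN).
  idf[doc_freq == 0] <- 0
  # Multiply every column (term) by its IDF weight.
  sweep(ftable, 2, idf, `*`)
}

# matrix() fills column by column, so each group of three values is one term.
mat <- matrix(c(0.10, 0.2, 0,
                0,    0.2, 0,
                0.1,  0.2, 0.1,
                0.8,  0.4, 0.9), nrow = 3, ncol = 4)
print(tfidf_sketch(mat))
```

For this matrix, the second column is non-zero in one row only, so its weight is \(\log 4\) and the entry in row 2 becomes \(0.2 \cdot \log 4 \approx 0.277\), which matches the printed output of `tfidf(mat)`.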