19 文本数据分析
R 语言官网的任务视图中有自然语言处理(Natural Language Processing)视图,它涵盖文本数据分析(Text Analysis)的内容。R 语言社区中有两本文本分析相关的著作,分别是《Text Mining with R》(Silge 和 Robinson 2017)和《Supervised Machine Learning for Text Analysis in R》(Hvitfeldt 和 Silge 2021)。
本文获取 CRAN 上发布的 R 包元数据中的描述字段,利用文本分析工具,实现 R 包主题分类。首先加载后续用到的一些文本分析和建模的 R 包。
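下面给出一个加载示意,所列 R 包按后文用到的功能推断,具体以实际用到的为准:
library(quanteda)            # 语料、分词、文档特征矩阵
library(quanteda.textstats)  # 词语搭配等统计
library(quanteda.textplots)  # 词云图、网络图
library(quanteda.textmodels) # 潜在语义分析
library(text2vec)            # LDA 主题模型、GloVe 词向量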
接着,调用 tools 包的函数 CRAN_package_db() 获取 R 包元数据;为了方便后续重复使用,将结果保存到本地。
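示意代码如下,其中本地文件路径是假设的:
# 获取 CRAN 上发布的 R 包元数据
pdb <- tools::CRAN_package_db()
# 保存到本地,方便重复使用
saveRDS(pdb, file = "data/cran-package-db.rds")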
去除重复的记录,保留 Package 和 Title 字段,该数据集共含有 22509 个 R 包的元数据。
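这一步的处理大致如下(示意):
# 读取本地缓存的元数据
pdb <- readRDS(file = "data/cran-package-db.rds")
# 去除重复记录,保留 Package 和 Title 字段
pdb <- subset(pdb, subset = !duplicated(Package), select = c("Package", "Title"))
head(pdb)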
#> Package
#> 1 AalenJohansen
#> 2 aamatch
#> 3 AATtools
#> 4 ABACUS
#> 5 abasequence
#> 6 abbreviate
#> Title
#> 1 Conditional Aalen-Johansen Estimation
#> 2 Artless Automatic Multivariate Matching for Observational\nStudies
#> 3 Reliability and Scoring Routines for the Approach-Avoidance Task
#> 4 Apps Based Activities for Communicating and Understanding\nStatistics
#> 5 Coding 'ABA' Patterns for Sequence Data
#> 6 Readable String Abbreviation
19.1 语料预处理
- 去掉换行符 \n 和单引号 ' 等。
- 名词和动词还原,如 models / modeling 还原为 model, methods 还原为 method 等等。
# pdb$Title_stem <- SnowballC::wordStem(pdb$Title)
SnowballC::wordStem(c("methods", "models"))
#> [1] "method" "model"
tokenizers::tokenize_word_stems(x = c("methods", "models"))
#> [[1]]
#> [1] "method"
#>
#> [[2]]
#> [1] "model"
接下来,把该数据集整理成文本分析工具可以使用的数据类型,即制作语料。
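以 quanteda 包的函数 corpus() 制作语料,示意如下(字段参数按数据集的列名推断):
# Package 字段作为文档 ID,Title 字段作为文档内容
pdb_doc <- corpus(pdb, docid_field = "Package", text_field = "Title")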
19.2 关键词检索
关键词检索就是从语料中查询某个字词及其出现的位置,函数 kwic()(函数名是 keywords-in-context 的简写)就是做这件事的。下面举两个例子:第一个例子查询语料中包含 Stan 的条目,采用精确匹配,返回检索词前后各 3 个词;第二个例子查询语料中包含 text mining 的条目,采用正则方式匹配,返回检索词前后各 2 个词。
pdb_toks <- pdb_doc |>
tokens(remove_punct = TRUE, remove_symbols = TRUE) |>
tokens_remove(stopwords("en"))
# 查询词 精确匹配
kwic(pdb_toks, pattern = "Stan", valuetype = "fixed", window = 3)
#> Keyword-in-context with 36 matches.
#> [bayesdfa, 6] Factor Analysis DFA | Stan |
#> [bayesforecast, 5] Time Series Modeling | Stan |
#> [BayesGmed, 6] Mediation Analysis using | Stan |
#> [bayesvl, 9] Networks Performing MCMC | Stan |
#> [blmeco, 13] using R BUGS | Stan |
#> [bmscstan, 7] Case Models using | Stan |
#> [bnns, 4] Bayesian Neural Network | Stan |
#> [brms, 5] Regression Models using | Stan |
#> [CARME, 3] CAR-MM Modelling | Stan |
#> [edstan, 1] | Stan | Models Item Response
#> [flocker, 4] Flexible Occupancy Estimation | Stan |
#> [gptoolsStan, 5] Processes Graphs Lattices | Stan |
#> [greencrab.toolkit, 2] Run | Stan | Models Interpret Green
#> [hbamr, 6] Aldrich-McKelvey Scaling via | Stan |
#> [hbsaems, 8] Estimation Model using | Stan |
#> [hsstan, 3] Hierarchical Shrinkage | Stan | Models Biomarker Selection
#> [measr, 5] Psychometric Measurement Using | Stan |
#> [MetaStan, 4] Bayesian Meta-Analysis via | Stan |
#> [mlts, 7] Series Models R | Stan |
#> [pcFactorStan, 1] | Stan | Models Paired Comparison
#> [prome, 5] Outcome Data Analysis | Stan |
#> [rstan, 3] R Interface | Stan |
#> [rstanarm, 6] Regression Modeling via | Stan |
#> [rstanemax, 4] Emax Model Analysis | Stan |
#> [rstantools, 6] R Packages Interfacing | Stan |
#> [RStanTVA, 3] TVA Models | Stan | using R StanTVA
#> [ssMousetrack, 7] Mouse-Tracking Experiments via | Stan |
#> [StanHeaders, 4] C Header Files | Stan |
#> [staninside, 3] Facilitating Use | Stan | Within Packages
#> [StanMoMo, 4] Bayesian Mortality Modelling | Stan |
#> [survstan, 6] Regression Models via | Stan |
#> [tdcmStan, 3] Automating Creation | Stan | Code TDCMs
#> [tmbstan, 7] Model Object using | Stan |
#> [trps, 6] Position Models using | stan |
#> [truncnormbayes, 7] Normal Distribution using | Stan |
#> [ubms, 7] Unmarked Animals using | Stan |
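第二个例子的查询代码大致如下(示意,假设多词检索用函数 phrase() 组合检索词):
# 查询词组 正则匹配
kwic(pdb_toks, pattern = phrase("text mining"), valuetype = "regex", window = 2)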
#> Keyword-in-context with 18 matches.
#> [MadanText, 2:3] Persian | Text Mining | Tool Frequency
#> [MadanTextNetwork, 2:3] Persian | Text Mining | Tool Co-Occurrence
#> [malaytextr, 1:2] | Text Mining | Bahasa Malaysia
#> [miRetrieve, 2:3] miRNA | Text Mining | Abstracts
#> [pubmed.mineR, 1:2] | Text Mining | PubMed Abstracts
#> [RcmdrPlugin.temis, 3:4] Graphical Integrated | Text Mining | Solution
#> [text2vec, 2:3] Modern | Text Mining | Framework R
#> [textmineR, 2:3] Functions | Text Mining | Topic Modeling
#> [TextMiningGUI, 1:2] | Text Mining | GUI Interface
#> [tidytext, 1:2] | Text Mining | using dplyr
#> [tm, 1:2] | Text Mining | Package
#> [tm.plugin.alceste, 8:9] Using tm | Text Mining | Framework
#> [tm.plugin.dc, 1:2] | Text Mining | Distributed Corpus
#> [tm.plugin.europresse, 6:7] Using tm | Text Mining | Framework
#> [tm.plugin.factiva, 6:7] Using tm | Text Mining | Framework
#> [tm.plugin.lexisnexis, 6:7] Using tm | Text Mining | Framework
#> [tm.plugin.mail, 1:2] | Text Mining | E-Mail Plug-in
#> [tmcn, 1:2] | Text Mining | Toolkit Chinese
19.3 高频词、词云
统计高频出现的词语或短语,这些高频词往往暗示语料的主要信息。
library(quanteda.textplots)
word1 <- pdb_toks |>
dfm() |>
dfm_trim(min_termfreq = 100, termfreq_type = "count", verbose = FALSE)
# 高频词
colSums(word1) |>
sort(decreasing = TRUE) |>
head(50)
#> data analysis models r using
#> 3704 2188 1655 1368 1069
#> model regression estimation functions tools
#> 943 903 819 812 705
#> bayesian interface time methods api
#> 686 534 533 513 510
#> linear based inference multivariate series
#> 467 408 408 401 387
#> statistical package generalized tests multiple
#> 386 369 351 351 345
#> modeling test spatial selection distribution
#> 344 333 333 330 319
#> shiny clustering learning algorithm statistics
#> 316 311 287 283 282
#> network via files random create
#> 282 270 269 260 257
#> robust distributions fast testing datasets
#> 253 253 248 240 235
#> survival method networks plots visualization
#> 234 230 226 221 217
R 语言作为一门主要用于数据获取、处理、分析、建模和可视化的统计语言,从 data、analysis、model 和 regression 等高频词可以看出其面向领域的特点。在文本分析领域,词云图很常见,用来展示文本中的突出信息。下图展示了词频不小于 100 的词。
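词云图可用 quanteda.textplots 包的函数 textplot_wordcloud() 绘制,示意如下:
# 词云:word1 已按词频不小于 100 过滤
textplot_wordcloud(word1)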
19.4 关联词、短语
函数 textstat_collocations() 可以挖掘词与词之间的关联度,下面统计关联度很高的两个词的搭配。
library(quanteda.textstats)
# 2个词 最少出现 50 次
word2 <- pdb_toks |>
tokens_select(
pattern = "^[A-Z]", valuetype = "regex",
case_insensitive = FALSE, padding = TRUE
) |>
textstat_collocations(min_count = 50, size = 2, tolower = FALSE)
word2 |>
subset(select = c("collocation", "count", "lambda", "z")) |>
(\(x) x[order(x$count, decreasing = TRUE), ])()
#> collocation count lambda z
#> 1 Time Series 358 8.886186 42.809101
#> 14 Data Analysis 249 1.949685 27.197032
#> 11 Regression Models 141 2.881471 29.106196
#> 10 Linear Models 129 3.160155 29.624159
#> 4 R Interface 121 4.285118 35.368242
#> 9 Machine Learning 119 8.403830 30.567338
#> 2 Generalized Linear 113 5.002594 39.092864
#> 15 Data Sets 111 4.988835 25.707833
#> 5 Sample Size 106 8.768460 35.117050
#> 3 Variable Selection 96 5.971678 38.190803
#> 18 Mixed Models 72 3.528405 23.567913
#> 22 Linear Regression 67 2.990861 21.924406
#> 8 Maximum Likelihood 65 7.893072 31.679061
#> 12 Structural Equation 65 8.790503 28.735260
#> 7 Confidence Intervals 63 8.051147 31.998266
#> 19 R Markdown 62 5.698013 22.955469
#> 6 Clinical Trials 60 6.932847 33.217354
#> 31 Monte Carlo 60 16.935814 8.450454
#> 13 Linear Mixed 59 4.772902 28.553353
#> 21 Least Squares 59 10.919997 22.517626
#> 30 Data Frames 57 7.463849 9.021014
#> 24 R Package 55 3.074332 20.339168
#> 28 Functional Data 55 2.529960 15.444194
#> 25 Component Analysis 54 4.535866 19.299476
#> 23 Read Write 53 9.105342 21.212857
#> 26 Missing Data 53 3.047534 16.573393
#> 17 Quantile Regression 52 4.748810 23.939178
#> 16 Density Estimation 51 4.676067 24.988171
#> 27 Survival Analysis 51 2.622638 16.378899
#> 29 Survival Data 51 2.052135 12.870873
#> 20 Logistic Regression 50 5.127532 22.921914
下面统计关联度很高的 3 个词的搭配。
# 3 个词 最少出现 20 次
word3 <- pdb_toks |>
tokens_select(
pattern = "^[A-Z]", valuetype = "regex",
case_insensitive = FALSE, padding = TRUE
) |>
textstat_collocations(min_count = 20, size = 3, tolower = FALSE)
word3 |>
subset(select = c("collocation", "count", "lambda", "z")) |>
(\(x) x[order(x$count, decreasing = TRUE), ])()
#> collocation count lambda z
#> 13 Generalized Linear Models 69 -0.13517119 -0.43244858
#> 7 Time Series Analysis 42 1.12016111 0.74313086
#> 14 Linear Mixed Models 40 -0.18809004 -0.47728468
#> 4 Item Response Theory 36 1.90614701 1.12419922
#> 2 Small Area Estimation 35 3.11549146 1.79055963
#> 19 Principal Component Analysis 35 -2.79211006 -4.46229723
#> 8 Time Series Data 32 0.55233463 0.55552234
#> 12 Time Series Forecasting 32 -0.32631385 -0.16076916
#> 5 Hidden Markov Models 31 0.97456547 0.94494983
#> 15 Maximum Likelihood Estimation 31 -0.65446423 -0.70053361
#> 6 Structural Equation Modeling 28 1.66357871 0.81427939
#> 3 Graphical User Interface 25 2.86180134 1.37821333
#> 11 Structural Equation Models 24 -0.12463547 -0.12036645
#> 10 Kernel Density Estimation 23 -0.07627044 -0.09465991
#> 16 Sample Size Calculation 22 -1.72052709 -1.02470025
#> 9 Power Sample Size 21 0.98730566 0.48506557
#> 18 Partial Least Squares 21 -4.94125356 -4.13543783
#> 1 Mixed Effects Models 20 1.93861312 2.93500118
#> 17 Generalized Additive Models 20 -0.77558511 -1.56629285
#> 20 Generalized Linear Mixed 20 -2.67221340 -5.22754061
19.5 特征共现网络
在文档 Token 化之后,函数 dfm() 构建文档特征(词)矩阵,函数 topfeatures() 提取文档-词矩阵中的主要特征(词)。
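示意代码如下,展示个数与下面的输出对齐:
# 构建文档特征矩阵
pdb_dfm <- dfm(pdb_toks)
# 提取前 10 个主要特征(词)
topfeatures(pdb_dfm, n = 10)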
#> data analysis models r using model regression
#> 3704 2188 1655 1368 1069 943 903
#> estimation functions tools
#> 819 812 705
在文档 Token 化之后,函数 fcm() 构建特征(词)共现矩阵;统计词频,挑选前 30 个高频词,示意代码见下。
下图以网络图方式展示词与词之间的关联度,关联度越高,词与词之间的边越宽。
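这部分的示意代码如下,其中窗口大小、最小边频次等参数是假设的:
# 构建特征(词)共现矩阵:窗口上下文
pdb_fcm <- fcm(pdb_toks, context = "window", window = 5, tri = FALSE)
# 挑选前 30 个高频词
feat <- names(topfeatures(pdb_fcm, 30))
# 网络图:关联度越高,边越宽
fcm_select(pdb_fcm, pattern = feat) |>
  textplot_network(min_freq = 0.8)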
19.6 词频文档统计
TF-IDF 即词频-逆文档频率加权(term frequency-inverse document frequency weighting),它降低在大量文档中普遍出现的词的权重,突出对区分文档更有价值的词。
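quanteda 包的函数 dfm_tfidf() 可将文档特征矩阵转化为 TF-IDF 加权矩阵,示意如下,具体的加权方案与展示的文档、特征子集按输出推断:
# TF-IDF 加权,查看前 6 个文档
dfm_tfidf(pdb_dfm) |> head(n = 6)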
#> Document-feature matrix of: 6 documents, 6 features (83.33% sparse) and 0 docvars.
#> features
#> docs automatic multivariate matching observational studies
#> AalenJohansen 0 0 0 0 0
#> aamatch 1.362135 0.970265 1.365238 1.473917 1.197015
#> AATtools 0 0 0 0 0
#> ABACUS 0 0 0 0 0
#> abasequence 0 0 0 0 0
#> abbreviate 0 0 0 0 0
#> features
#> docs reliability
#> AalenJohansen 0
#> aamatch 0
#> AATtools 1.834562
#> ABACUS 0
#> abasequence 0
#> abbreviate 0
函数 docfreq() 统计每个特征的文档频率。
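示意如下,展示的特征个数是为了与输出对齐而假设的:
# 每个特征出现在多少个文档中
docfreq(pdb_dfm) |> head(n = 11)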
#> automatic multivariate matching observational
#> 78 398 77 49
#> studies reliability scoring routines
#> 155 39 26 39
#> approach-avoidance task apps
#> 1 10 42
函数 dfm_weight() 可以把特征频次作为权重。
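示意如下,这里假设权重方案为 count,即原始频次:
# 特征频次作为权重,查看前 6 个文档
dfm_weight(pdb_dfm, scheme = "count") |> head(n = 6)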
#> Document-feature matrix of: 6 documents, 6 features (83.33% sparse) and 0 docvars.
#> features
#> docs automatic multivariate matching observational studies
#> AalenJohansen 0 0 0 0 0
#> aamatch 1 1 1 1 1
#> AATtools 0 0 0 0 0
#> ABACUS 0 0 0 0 0
#> abasequence 0 0 0 0 0
#> abbreviate 0 0 0 0 0
#> features
#> docs reliability
#> AalenJohansen 0
#> aamatch 0
#> AATtools 1
#> ABACUS 0
#> abasequence 0
#> abbreviate 0
19.7 潜在语义分析
文档特征矩阵通过 SVD 分解将高维特征(约 1.5 万维)降至低维(10 维),获得文档-主题矩阵与主题-特征矩阵两个低秩矩阵。SVD 分解的作用在于去掉大量的噪声,保留主要的信号部分(奇异值较大的部分)。
library(quanteda.textmodels)
# 演示目的,减少特征量
pdb_dfm_core <- dfm_trim(pdb_dfm, min_termfreq = 100, termfreq_type = "count")
# 文档特征矩阵 SVD 分解
tmod <- textmodel_lsa(pdb_dfm_core[1:2000, ], nd = 10)
# 10 维的特征
head(tmod$docs)
#> [,1] [,2] [,3] [,4]
#> AalenJohansen -0.002812327 -0.0037868556 -0.0041880899 -0.0057464591
#> aamatch -0.001691310 -0.0004421003 0.0010473841 -0.0016673045
#> AATtools 0.000000000 0.0000000000 0.0000000000 0.0000000000
#> ABACUS -0.001490313 0.0003064756 0.0008741824 -0.0005064783
#> abasequence -0.029499643 0.0385875953 -0.0244375049 -0.0034771015
#> abbreviate 0.000000000 0.0000000000 0.0000000000 0.0000000000
#> [,5] [,6] [,7] [,8] [,9]
#> AalenJohansen 0.002155197 0.013780943 0.0067557309 0.0114430627 -0.020865398
#> aamatch -0.001127767 -0.002070709 0.0016154237 -0.0005329059 -0.001647039
#> AATtools 0.000000000 0.000000000 0.0000000000 0.0000000000 0.000000000
#> ABACUS 0.003010378 0.001609039 0.0019505319 0.0056205304 -0.010698728
#> abasequence -0.006236351 -0.004469552 0.0003145782 -0.0021933245 0.001462097
#> abbreviate 0.000000000 0.000000000 0.0000000000 0.0000000000 0.000000000
#> [,10]
#> AalenJohansen 0.026209453
#> aamatch -0.001527705
#> AATtools 0.000000000
#> ABACUS -0.004437714
#> abasequence 0.002961663
#> abbreviate 0.000000000
# 将新的数据代入
pred <- predict(tmod, newdata = pdb_dfm_core[2001:2010, ])
# 查看前两个特征向量
pred$docs_newspace[, 1:2]
#> 10 x 2 Matrix of class "dgeMatrix"
#>
#> docs [,1] [,2]
#> BSGW -0.0472541490 -5.188679e-02
#> bshazard -0.0007970512 -6.088738e-04
#> bSi -0.0016681456 -1.570781e-03
#> bsicons 0.0000000000 0.000000e+00
#> bSims -0.0002998020 3.677796e-05
#> bsitar -0.0460884818 -3.184701e-02
#> bskyr 0.0000000000 0.000000e+00
#> BSL -0.0272898457 -3.673816e-02
#> bslib -0.0002362428 2.423737e-04
#> BsMD -0.0050720550 -4.815169e-03
19.8 文本主题探索
text2vec 包实现了 LDA(Latent Dirichlet Allocation)算法,可用于文本主题建模。
library(text2vec)
# 创建 Tokens
pdb_tokens <- word_tokenizer(tolower(pdb$Title))
pdb_itokens <- itoken(pdb_tokens, ids = pdb$Package, progressbar = FALSE)
# 去掉停止词
pdb_v <- create_vocabulary(pdb_itokens, stopwords = stopwords("en"))
# 词频不小于 10 的词
# 文档比例不大于 0.2
pdb_v <- prune_vocabulary(pdb_v, term_count_min = 10, doc_proportion_max = 0.2)
pdb_v
#> Number of docs: 22509
#> 175 stopwords: i, me, my, myself, we, our ...
#> ngram_min = 1; ngram_max = 1
#> Vocabulary:
#> term term_count doc_count
#> <char> <int> <int>
#> 1: 2019 10 7
#> 2: 4 10 10
#> 3: active 10 10
#> 4: along 10 10
#> 5: alternating 10 10
#> ---
#> 1837: using 1069 1069
#> 1838: r 1408 1388
#> 1839: models 1657 1642
#> 1840: analysis 2306 2288
#> 1841: data 3729 3634
vectorizer <- vocab_vectorizer(pdb_v)
pdb_dtm <- create_dtm(pdb_itokens, vectorizer, type = "dgTMatrix")
lda_model <- LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr <- lda_model$fit_transform(
x = pdb_dtm, n_iter = 1000,
convergence_tol = 0.001, n_check_convergence = 25,
progressbar = FALSE
)
抽取每个主题的 Top 10 个词。
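按 text2vec 包 LDA 模型的接口,示意如下,这里假设 lambda = 1,即按主题内词概率排序:
# 每个主题抽取 Top 10 词
lda_model$get_top_words(n = 10, topic_number = 1:10, lambda = 1)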
#> [,1] [,2] [,3] [,4] [,5]
#> [1,] "data" "data" "analysis" "models" "data"
#> [2,] "r" "analysis" "estimation" "regression" "analysis"
#> [3,] "api" "functions" "data" "linear" "statistical"
#> [4,] "interface" "statistics" "high" "model" "tests"
#> [5,] "access" "tools" "using" "generalized" "control"
#> [6,] "files" "network" "functions" "selection" "methods"
#> [7,] "client" "meta" "matrix" "bayesian" "tools"
#> [8,] "wrapper" "gene" "correlation" "estimation" "quality"
#> [9,] "read" "visualization" "based" "mixed" "plot"
#> [10,] "package" "tool" "response" "variable" "two"
#> [,6] [,7] [,8] [,9] [,10]
#> [1,] "models" "learning" "data" "data" "r"
#> [2,] "data" "based" "models" "time" "shiny"
#> [3,] "estimation" "model" "time" "series" "functions"
#> [4,] "distribution" "analysis" "model" "analysis" "package"
#> [5,] "distributions" "bayesian" "analysis" "tools" "create"
#> [6,] "sample" "using" "inference" "using" "code"
#> [7,] "analysis" "clustering" "based" "r" "interface"
#> [8,] "multivariate" "machine" "bayesian" "spatial" "ggplot2"
#> [9,] "single" "multi" "multiple" "text" "using"
#> [10,] "model" "models" "survival" "fast" "interactive"
TODO: 使用交叉验证配合 perplexity 度量获取最佳的主题个数 (Zhang, Li, 和 Zhang 2023)
19.9 文本相似性
在互联网 App 中,计算文本之间的相似性有很多应用,如搜索、推荐和广告的召回阶段,根据用户输入的文本召回相关的内容。
text2vec 包实现了 GloVe 模型,这是一种学习词向量表示的无监督算法:先从语料中生成词共现矩阵,以此为训练数据,学到词向量空间中的线性子结构。GloVe 度量了词与词共现的可能性,词与词之间抽象的概念差异可以表示成向量的差异。下面继续基于这份文本数据,生成词共现矩阵,计算文本相似度。
可参考 quanteda 包官网对 text2vec 包 GloVe 模型的介绍。
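先以窗口上下文构建词共现矩阵,示意如下;窗口大小与权重是假设的,原文可能还筛除了低频词(最终保留 3620 个特征):
# 词共现矩阵:窗口上下文,按与中心词的距离加权
pdb_fcm <- fcm(pdb_toks, context = "window", window = 5,
  count = "weighted", weights = 1 / (1:5), tri = TRUE)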
GloVe 模型
library(text2vec)
# GloVe 模型 50 维词向量表示
glove <- GlobalVectors$new(rank = 50, x_max = 10)
# 在词共现矩阵 pdb_fcm 上训练,得到主词向量
wv_main <- glove$fit_transform(
pdb_fcm,
n_iter = 10,
convergence_tol = 0.01,
n_threads = 8,
progressbar = FALSE
)
dim(wv_main)
#> [1] 3620 50
# 上下文词向量,最终词向量取两者之和(text2vec 官网教程的做法)
word_vectors <- wv_main + t(glove$components)
dim(glove$components)
#> [1] 50 3620
下面做词向量的类比检索:计算 rcpp - stan + python 对应的向量,再用余弦相似度召回最接近的词。
cpp <- word_vectors["rcpp", , drop = FALSE] -
word_vectors["stan", , drop = FALSE] +
word_vectors["python", , drop = FALSE]
# 文档的余弦相似性 取一个小数据
library(quanteda.textstats)
cos_sim <- textstat_simil(
x = as.dfm(word_vectors), y = as.dfm(cpp),
margin = "documents", method = "cosine")
# 召回的相关词
head(sort(cos_sim[, 1], decreasing = TRUE), 5)
#> rcpp python Dimension ConText Online
#> 0.5180990 0.4928317 0.4536143 0.4365528 0.4194676
19.10 习题
- 以 text2vec 包内置的电影评论数据集 movie_review 中的 sentiment 列(表示正面或负面评价)作为响应变量,构建二分类模型,对用户的一段评论分类。(提示:词向量化后,采用 glmnet 包做交叉验证,调整参数、选择模型)
- 根据任务视图对 R 包的标记,建立有监督的多分类模型,评估模型的分类效果,并对尚未标记的 R 包分类。(提示:一个 R 包可能同时属于多个任务视图,考虑使用 xgboost 包)