19  Text Data Analysis

Among the task views on the R project website there is a Natural Language Processing view, which covers text analysis. The R community also has two books on text analysis: Text Mining with R (Silge and Robinson 2017) and Supervised Machine Learning for Text Analysis in R (Hvitfeldt and Silge 2021).

This chapter retrieves the metadata of R packages published on CRAN and applies text analysis tools to the package titles to explore topic classification of R packages. First, load the text analysis and modeling packages used later.

library(quanteda) # corpora, tokenization
library(quanteda.textplots) # word clouds, co-occurrence networks, etc.
library(quanteda.textstats) # keyword search, statistics
library(quanteda.textmodels) # LSA
library(ggplot2) # plotting
library(text2vec) # LDA algorithm

Next, call the function CRAN_package_db() from the tools package to obtain the R package metadata and save it locally for later reuse.
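
The snapshot read below may have been created with something like the following sketch (requires network access; the file name matches the one used in the next chunk).

# download the CRAN package metadata and save a local snapshot
pdb <- tools::CRAN_package_db()
saveRDS(pdb, file = "data/cran-package-db-20250801.rds")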

# read the locally saved metadata snapshot
pdb <- readRDS(file = "data/cran-package-db-20250801.rds")
# drop duplicated records, keep the Package and Title fields
pdb <- subset(
  x = pdb, subset = !duplicated(Package), select = c("Package", "Title")
)

After removing duplicate records and keeping only the Package and Title fields, the dataset contains metadata for 22509 R packages.

head(pdb)
#>         Package
#> 1 AalenJohansen
#> 2       aamatch
#> 3      AATtools
#> 4        ABACUS
#> 5   abasequence
#> 6    abbreviate
#>                                                                   Title
#> 1                                 Conditional Aalen-Johansen Estimation
#> 2    Artless Automatic Multivariate Matching for Observational\nStudies
#> 3      Reliability and Scoring Routines for the Approach-Avoidance Task
#> 4 Apps Based Activities for Communicating and Understanding\nStatistics
#> 5                               Coding 'ABA' Patterns for Sequence Data
#> 6                                          Readable String Abbreviation

19.1 Corpus Preprocessing

  • Remove newline characters \n, single and double quotes, and the like.
pdb$Title <- gsub(pattern = "\n", replacement = " ", x = pdb$Title, fixed = TRUE)
pdb$Title <- gsub(pattern = "'", replacement = "", x = pdb$Title, fixed = TRUE)
pdb$Title <- gsub(pattern = '"', replacement = "", x = pdb$Title, fixed = TRUE)
  • Reduce nouns and verbs to their stems, e.g. models / modeling to model, methods to method, and so on; a small stemming sketch follows the output below.
# applied after tokenization
# requires the SnowballC and tokenizers packages
# stem nouns
SnowballC::wordStem(words = c("methods", "models"))
#> [1] "method" "model"
# pdb$Title_stem <- SnowballC::wordStem(pdb$Title)
tokenizers::tokenize_word_stems(x = c("methods", "models"))
#> [[1]]
#> [1] "method"
#> 
#> [[2]]
#> [1] "model"

Next, convert the dataset into a data type that the text analysis tools can work with, that is, build a corpus.

pdb_doc <- corpus(pdb, docid_field = "Package", text_field = "Title")
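
A quick check of the corpus object (output omitted): there should be one document per package, i.e. 22509 documents.

ndoc(pdb_doc)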

19.2 Keyword Search

Keyword search finds a given word or phrase in the corpus together with its position; the function kwic() (short for keywords-in-context) does exactly this. Two examples follow: the first searches for entries containing Stan with exact matching and returns 3 words on each side of the keyword; the second searches for entries containing text mining with regular-expression matching and returns 2 words on each side.

pdb_toks <- pdb_doc |>
  tokens(remove_punct = TRUE, remove_symbols = TRUE) |>
  tokens_remove(stopwords("en"))
# single keyword, exact match
kwic(pdb_toks, pattern = "Stan", valuetype = "fixed", window = 3)
#> Keyword-in-context with 36 matches.                                                               
#>           [bayesdfa, 6]            Factor Analysis DFA | Stan |
#>      [bayesforecast, 5]           Time Series Modeling | Stan |
#>          [BayesGmed, 6]       Mediation Analysis using | Stan |
#>            [bayesvl, 9]       Networks Performing MCMC | Stan |
#>            [blmeco, 13]                   using R BUGS | Stan |
#>           [bmscstan, 7]              Case Models using | Stan |
#>               [bnns, 4]        Bayesian Neural Network | Stan |
#>               [brms, 5]        Regression Models using | Stan |
#>              [CARME, 3]               CAR-MM Modelling | Stan |
#>             [edstan, 1]                                | Stan |
#>            [flocker, 4]  Flexible Occupancy Estimation | Stan |
#>        [gptoolsStan, 5]      Processes Graphs Lattices | Stan |
#>  [greencrab.toolkit, 2]                            Run | Stan |
#>              [hbamr, 6]   Aldrich-McKelvey Scaling via | Stan |
#>            [hbsaems, 8]         Estimation Model using | Stan |
#>             [hsstan, 3]         Hierarchical Shrinkage | Stan |
#>              [measr, 5] Psychometric Measurement Using | Stan |
#>           [MetaStan, 4]     Bayesian Meta-Analysis via | Stan |
#>               [mlts, 7]                Series Models R | Stan |
#>       [pcFactorStan, 1]                                | Stan |
#>              [prome, 5]          Outcome Data Analysis | Stan |
#>              [rstan, 3]                    R Interface | Stan |
#>           [rstanarm, 6]        Regression Modeling via | Stan |
#>          [rstanemax, 4]            Emax Model Analysis | Stan |
#>         [rstantools, 6]         R Packages Interfacing | Stan |
#>           [RStanTVA, 3]                     TVA Models | Stan |
#>       [ssMousetrack, 7] Mouse-Tracking Experiments via | Stan |
#>        [StanHeaders, 4]                 C Header Files | Stan |
#>         [staninside, 3]               Facilitating Use | Stan |
#>           [StanMoMo, 4]   Bayesian Mortality Modelling | Stan |
#>           [survstan, 6]          Regression Models via | Stan |
#>           [tdcmStan, 3]            Automating Creation | Stan |
#>            [tmbstan, 7]             Model Object using | Stan |
#>               [trps, 6]          Position Models using | stan |
#>     [truncnormbayes, 7]      Normal Distribution using | Stan |
#>               [ubms, 7]         Unmarked Animals using | Stan |
#>                            
#>                            
#>                            
#>                            
#>                            
#>                            
#>                            
#>                            
#>                            
#>                            
#>  Models Item Response      
#>                            
#>                            
#>  Models Interpret Green    
#>                            
#>                            
#>  Models Biomarker Selection
#>                            
#>                            
#>                            
#>  Models Paired Comparison  
#>                            
#>                            
#>                            
#>                            
#>                            
#>  using R StanTVA           
#>                            
#>                            
#>  Within Packages           
#>                            
#>                            
#>  Code TDCMs                
#>                            
#>                            
#>                            
#> 
# phrase, regular-expression match
kwic(pdb_toks, pattern = phrase("text mining"), valuetype = "regex", window = 2)
#> Keyword-in-context with 18 matches.                                                                 
#>             [MadanText, 2:3]              Persian | Text Mining |
#>      [MadanTextNetwork, 2:3]              Persian | Text Mining |
#>            [malaytextr, 1:2]                      | Text Mining |
#>            [miRetrieve, 2:3]                miRNA | Text Mining |
#>          [pubmed.mineR, 1:2]                      | Text Mining |
#>     [RcmdrPlugin.temis, 3:4] Graphical Integrated | Text Mining |
#>              [text2vec, 2:3]               Modern | Text Mining |
#>             [textmineR, 2:3]            Functions | Text Mining |
#>         [TextMiningGUI, 1:2]                      | Text Mining |
#>              [tidytext, 1:2]                      | Text Mining |
#>                    [tm, 1:2]                      | Text Mining |
#>     [tm.plugin.alceste, 8:9]             Using tm | Text Mining |
#>          [tm.plugin.dc, 1:2]                      | Text Mining |
#>  [tm.plugin.europresse, 6:7]             Using tm | Text Mining |
#>     [tm.plugin.factiva, 6:7]             Using tm | Text Mining |
#>  [tm.plugin.lexisnexis, 6:7]             Using tm | Text Mining |
#>        [tm.plugin.mail, 1:2]                      | Text Mining |
#>                  [tmcn, 1:2]                      | Text Mining |
#>                    
#>  Tool Frequency    
#>  Tool Co-Occurrence
#>  Bahasa Malaysia   
#>  Abstracts         
#>  PubMed Abstracts  
#>  Solution          
#>  Framework R       
#>  Topic Modeling    
#>  GUI Interface     
#>  using dplyr       
#>  Package           
#>  Framework         
#>  Distributed Corpus
#>  Framework         
#>  Framework         
#>  Framework         
#>  E-Mail Plug-in    
#>  Toolkit Chinese
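
One more query of the same kind (illustrative, output omitted): glob matching picks up variants such as network and networks.

kwic(pdb_toks, pattern = "network*", valuetype = "glob", window = 2) |> head()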

19.3 High-Frequency Words and Word Clouds

Frequently occurring words and phrases hint at what the corpus is about; start by finding them.

library(quanteda.textplots)
word1 <- pdb_toks |>
  dfm() |>
  dfm_trim(min_termfreq = 100, termfreq_type = "count", verbose = FALSE)
# high-frequency words
colSums(word1) |>
  sort(decreasing = TRUE) |>
  head(50)
#>          data      analysis        models             r         using 
#>          3704          2188          1655          1368          1069 
#>         model    regression    estimation     functions         tools 
#>           943           903           819           812           705 
#>      bayesian     interface          time       methods           api 
#>           686           534           533           513           510 
#>        linear         based     inference  multivariate        series 
#>           467           408           408           401           387 
#>   statistical       package   generalized         tests      multiple 
#>           386           369           351           351           345 
#>      modeling          test       spatial     selection  distribution 
#>           344           333           333           330           319 
#>         shiny    clustering      learning     algorithm    statistics 
#>           316           311           287           283           282 
#>       network           via         files        random        create 
#>           282           270           269           260           257 
#>        robust distributions          fast       testing      datasets 
#>           253           253           248           240           235 
#>      survival        method      networks         plots visualization 
#>           234           230           226           221           217
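
As a side note, quanteda.textstats offers an equivalent, tidier summary of the top features (illustrative, output omitted):

textstat_frequency(word1, n = 10)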

R is a statistical language mainly used for data acquisition, analysis, processing, modeling, and visualization, and high-frequency words such as data, analysis, model, and regression reflect the domains R serves. In text analysis, word clouds are a common way to display the prominent information in a corpus. The figure below shows the words with frequency of at least 100.

Code
# word cloud
set.seed(20252025)
textplot_wordcloud(word1, min_size = 1, max_size = 5)
Figure 19.1: Word cloud of high-frequency words

19.4 Collocations and Phrases

The function textstat_collocations() measures the strength of association between words. Below we compute the two-word collocations with high association.

library(quanteda.textstats)
# two-word collocations occurring at least 50 times
word2 <- pdb_toks |>
  tokens_select(
    pattern = "^[A-Z]", valuetype = "regex",
    case_insensitive = FALSE, padding = TRUE
  ) |>
  textstat_collocations(min_count = 50, size = 2, tolower = FALSE)
word2 |>
  subset(select = c("collocation", "count", "lambda", "z")) |>
  (\(x) x[order(x$count, decreasing = TRUE), ])()
#>             collocation count    lambda         z
#> 1           Time Series   358  8.886186 42.809101
#> 14        Data Analysis   249  1.949685 27.197032
#> 11    Regression Models   141  2.881471 29.106196
#> 10        Linear Models   129  3.160155 29.624159
#> 4           R Interface   121  4.285118 35.368242
#> 9      Machine Learning   119  8.403830 30.567338
#> 2    Generalized Linear   113  5.002594 39.092864
#> 15            Data Sets   111  4.988835 25.707833
#> 5           Sample Size   106  8.768460 35.117050
#> 3    Variable Selection    96  5.971678 38.190803
#> 18         Mixed Models    72  3.528405 23.567913
#> 22    Linear Regression    67  2.990861 21.924406
#> 8    Maximum Likelihood    65  7.893072 31.679061
#> 12  Structural Equation    65  8.790503 28.735260
#> 7  Confidence Intervals    63  8.051147 31.998266
#> 19           R Markdown    62  5.698013 22.955469
#> 6       Clinical Trials    60  6.932847 33.217354
#> 31          Monte Carlo    60 16.935814  8.450454
#> 13         Linear Mixed    59  4.772902 28.553353
#> 21        Least Squares    59 10.919997 22.517626
#> 30          Data Frames    57  7.463849  9.021014
#> 24            R Package    55  3.074332 20.339168
#> 28      Functional Data    55  2.529960 15.444194
#> 25   Component Analysis    54  4.535866 19.299476
#> 23           Read Write    53  9.105342 21.212857
#> 26         Missing Data    53  3.047534 16.573393
#> 17  Quantile Regression    52  4.748810 23.939178
#> 16   Density Estimation    51  4.676067 24.988171
#> 27    Survival Analysis    51  2.622638 16.378899
#> 29        Survival Data    51  2.052135 12.870873
#> 20  Logistic Regression    50  5.127532 22.921914
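
The collocation table can also be filtered by the association statistics rather than by count, for example (illustrative threshold, output omitted):

subset(word2, z > 35, select = c("collocation", "count", "lambda", "z"))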

Next, compute the three-word collocations with high association.

# three-word collocations occurring at least 20 times
word3 <- pdb_toks |>
  tokens_select(
    pattern = "^[A-Z]", valuetype = "regex",
    case_insensitive = FALSE, padding = TRUE
  ) |>
  textstat_collocations(min_count = 20, size = 3, tolower = FALSE)
word3 |>
  subset(select = c("collocation", "count", "lambda", "z")) |>
  (\(x) x[order(x$count, decreasing = TRUE), ])()
#>                      collocation count      lambda           z
#> 13     Generalized Linear Models    69 -0.13517119 -0.43244858
#> 7           Time Series Analysis    42  1.12016111  0.74313086
#> 14           Linear Mixed Models    40 -0.18809004 -0.47728468
#> 4           Item Response Theory    36  1.90614701  1.12419922
#> 2          Small Area Estimation    35  3.11549146  1.79055963
#> 19  Principal Component Analysis    35 -2.79211006 -4.46229723
#> 8               Time Series Data    32  0.55233463  0.55552234
#> 12       Time Series Forecasting    32 -0.32631385 -0.16076916
#> 5           Hidden Markov Models    31  0.97456547  0.94494983
#> 15 Maximum Likelihood Estimation    31 -0.65446423 -0.70053361
#> 6   Structural Equation Modeling    28  1.66357871  0.81427939
#> 3       Graphical User Interface    25  2.86180134  1.37821333
#> 11    Structural Equation Models    24 -0.12463547 -0.12036645
#> 10     Kernel Density Estimation    23 -0.07627044 -0.09465991
#> 16       Sample Size Calculation    22 -1.72052709 -1.02470025
#> 9              Power Sample Size    21  0.98730566  0.48506557
#> 18         Partial Least Squares    21 -4.94125356 -4.13543783
#> 1           Mixed Effects Models    20  1.93861312  2.93500118
#> 17   Generalized Additive Models    20 -0.77558511 -1.56629285
#> 20      Generalized Linear Mixed    20 -2.67221340 -5.22754061
Code
ggplot(data = word2, aes(x = count, y = reorder(collocation, count))) +
  geom_col() +
  labs(x = NULL, y = NULL)
ggplot(data = word3, aes(x = count, y = reorder(collocation, count))) +
  geom_col() +
  labs(x = NULL, y = NULL)
(a) High-frequency two-word phrases
(b) High-frequency three-word phrases
Figure 19.2: High-frequency phrases

19.5 Feature Co-occurrence Network

After the documents are tokenized, the function dfm() builds the document-feature (word) matrix, and the function topfeatures() extracts the most prominent features (words) from it.

# document-feature matrix
pdb_dfm <- pdb_toks |>
  dfm()
# top features
topfeatures(pdb_dfm)
#>       data   analysis     models          r      using      model regression 
#>       3704       2188       1655       1368       1069        943        903 
#> estimation  functions      tools 
#>        819        812        705

After tokenization, the function fcm() builds the feature (word) co-occurrence matrix; we then count feature frequencies and keep the top 30 high-frequency words.

# feature co-occurrence matrix
pdb_fcm <- pdb_toks |>
  fcm(context = "window", window = 5)
# high-frequency features
pdb_feat <- colSums(pdb_fcm) |>
  sort(decreasing = TRUE) |>
  head(30) |>
  names()

The figure below displays the word-word associations as a network: the stronger the association, the wider the edge between the two words.

Code
fcm_select(pdb_fcm, pattern = pdb_feat) |>
  textplot_network(min_freq = 0.5)
Figure 19.3: Feature co-occurrence network

19.6 Term and Document Frequency

TF-IDF, term frequency-inverse document frequency weighting, combines a term's within-document frequency with the inverse of its document frequency, so terms that appear in many documents are down-weighted.

x <- dfm_tfidf(pdb_dfm, base = 2, scheme_tf = "prop")

head(x[, 5:10])
#> Document-feature matrix of: 6 documents, 6 features (83.33% sparse) and 0 docvars.
#>                features
#> docs            automatic multivariate matching observational  studies
#>   AalenJohansen  0            0        0             0        0       
#>   aamatch        1.362135     0.970265 1.365238      1.473917 1.197015
#>   AATtools       0            0        0             0        0       
#>   ABACUS         0            0        0             0        0       
#>   abasequence    0            0        0             0        0       
#>   abbreviate     0            0        0             0        0       
#>                features
#> docs            reliability
#>   AalenJohansen    0       
#>   aamatch          0       
#>   AATtools         1.834562
#>   ABACUS           0       
#>   abasequence      0       
#>   abbreviate       0
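
A manual check of one entry above (assuming the aamatch title keeps six tokens after removing punctuation and stopwords): with scheme_tf = "prop" and base = 2, each weight is the within-document proportion times log2 of the number of documents over the document frequency.

# should be close to the aamatch / multivariate entry (0.970265) shown above
(1 / 6) * log2(ndoc(pdb_dfm) / docfreq(pdb_dfm)["multivariate"])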

The document frequency of each feature:

docfreq(pdb_dfm)[5:15]
#>          automatic       multivariate           matching      observational 
#>                 78                398                 77                 49 
#>            studies        reliability            scoring           routines 
#>                155                 39                 26                 39 
#> approach-avoidance               task               apps 
#>                  1                 10                 42

Feature counts as weights:

head(dfm_weight(pdb_dfm, scheme = "count")[, 5:10])
#> Document-feature matrix of: 6 documents, 6 features (83.33% sparse) and 0 docvars.
#>                features
#> docs            automatic multivariate matching observational studies
#>   AalenJohansen         0            0        0             0       0
#>   aamatch               1            1        1             1       1
#>   AATtools              0            0        0             0       0
#>   ABACUS                0            0        0             0       0
#>   abasequence           0            0        0             0       0
#>   abbreviate            0            0        0             0       0
#>                features
#> docs            reliability
#>   AalenJohansen           0
#>   aamatch                 0
#>   AATtools                1
#>   ABACUS                  0
#>   abasequence             0
#>   abbreviate              0

19.7 Latent Semantic Analysis

The document-feature matrix is reduced from a high-dimensional feature space (on the order of 15K features) to a low-dimensional one (10) by SVD, yielding two low-rank matrices: a document-topic matrix and a topic-feature matrix. The role of the SVD is to strip out much of the noise and keep the main signal, the part associated with the leading nonzero singular values.

# for demonstration, reduce the number of features
pdb_dfm_core <- dfm_trim(pdb_dfm, min_termfreq = 100, termfreq_type = "count")
# SVD of the document-feature matrix
tmod <- textmodel_lsa(pdb_dfm_core[1:2000,], nd = 10)
# 10-dimensional document vectors
head(tmod$docs)
#>                       [,1]          [,2]          [,3]          [,4]
#> AalenJohansen -0.002812327 -0.0037868556 -0.0041880899 -0.0057464591
#> aamatch       -0.001691310 -0.0004421003  0.0010473841 -0.0016673045
#> AATtools       0.000000000  0.0000000000  0.0000000000  0.0000000000
#> ABACUS        -0.001490313  0.0003064756  0.0008741824 -0.0005064783
#> abasequence   -0.029499643  0.0385875953 -0.0244375049 -0.0034771015
#> abbreviate     0.000000000  0.0000000000  0.0000000000  0.0000000000
#>                       [,5]         [,6]         [,7]          [,8]         [,9]
#> AalenJohansen  0.002155197  0.013780943 0.0067557309  0.0114430627 -0.020865398
#> aamatch       -0.001127767 -0.002070709 0.0016154237 -0.0005329059 -0.001647039
#> AATtools       0.000000000  0.000000000 0.0000000000  0.0000000000  0.000000000
#> ABACUS         0.003010378  0.001609039 0.0019505319  0.0056205304 -0.010698728
#> abasequence   -0.006236351 -0.004469552 0.0003145782 -0.0021933245  0.001462097
#> abbreviate     0.000000000  0.000000000 0.0000000000  0.0000000000  0.000000000
#>                      [,10]
#> AalenJohansen  0.026209453
#> aamatch       -0.001527705
#> AATtools       0.000000000
#> ABACUS        -0.004437714
#> abasequence    0.002961663
#> abbreviate     0.000000000
# project new documents into the learned space
pred <- predict(tmod, newdata = pdb_dfm_core[2001:2010, ])
# look at the first two dimensions
pred$docs_newspace[, 1:2]
#> 10 x 2 Matrix of class "dgeMatrix"
#>           
#> docs                [,1]          [,2]
#>   BSGW     -0.0472541490 -5.188679e-02
#>   bshazard -0.0007970512 -6.088738e-04
#>   bSi      -0.0016681456 -1.570781e-03
#>   bsicons   0.0000000000  0.000000e+00
#>   bSims    -0.0002998020  3.677796e-05
#>   bsitar   -0.0460884818 -3.184701e-02
#>   bskyr     0.0000000000  0.000000e+00
#>   BSL      -0.0272898457 -3.673816e-02
#>   bslib    -0.0002362428  2.423737e-04
#>   BsMD     -0.0050720550 -4.815169e-03
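
A minimal sketch of how the reduced space can be used (assuming the package brms falls within the first 2000 documents modeled above): cosine similarity between documents in the 10-dimensional LSA space retrieves thematically related packages.

lsa_sim <- textstat_simil(
  x = as.dfm(tmod$docs), y = as.dfm(tmod$docs["brms", , drop = FALSE]),
  margin = "documents", method = "cosine"
)
# documents closest to brms in the latent space
head(sort(lsa_sim[, 1], decreasing = TRUE), 5)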

19.8 Topic Exploration

The text2vec package implements the LDA (Latent Dirichlet Allocation) algorithm for topic modeling of text.

library(text2vec)
# create tokens
pdb_tokens <- word_tokenizer(tolower(pdb$Title))
pdb_itokens <- itoken(pdb_tokens, ids = pdb$Package, progressbar = FALSE)
# remove stopwords
pdb_v <- create_vocabulary(pdb_itokens, stopwords = stopwords("en"))
# keep terms occurring at least 10 times
# and appearing in at most 20% of documents
pdb_v <- prune_vocabulary(pdb_v, term_count_min = 10, doc_proportion_max = 0.2)
pdb_v
#> Number of docs: 22509 
#> 175 stopwords: i, me, my, myself, we, our ... 
#> ngram_min = 1; ngram_max = 1 
#> Vocabulary: 
#>              term term_count doc_count
#>            <char>      <int>     <int>
#>    1:        2019         10         7
#>    2:           4         10        10
#>    3:      active         10        10
#>    4:       along         10        10
#>    5: alternating         10        10
#>   ---                                 
#> 1837:       using       1069      1069
#> 1838:           r       1408      1388
#> 1839:      models       1657      1642
#> 1840:    analysis       2306      2288
#> 1841:        data       3729      3634
vectorizer <- vocab_vectorizer(pdb_v)
pdb_dtm <- create_dtm(pdb_itokens, vectorizer, type = "dgTMatrix")
lda_model <- LDA$new(n_topics = 10, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr <- lda_model$fit_transform(
  x = pdb_dtm, n_iter = 1000,
  convergence_tol = 0.001, n_check_convergence = 25,
  progressbar = FALSE
)
Code
barplot(doc_topic_distr[1, ],
  xlab = "topic", ylab = "proportion", ylim = c(0, 1),
  names.arg = 1:ncol(doc_topic_distr)
)
Figure 19.4: Document-topic distribution

Extract the top 10 words for each topic:

lda_model$get_top_words(n = 10, topic_number = 1L:10L, lambda = 1)
#>       [,1]        [,2]            [,3]          [,4]          [,5]         
#>  [1,] "data"      "data"          "analysis"    "models"      "data"       
#>  [2,] "r"         "analysis"      "estimation"  "regression"  "analysis"   
#>  [3,] "api"       "functions"     "data"        "linear"      "statistical"
#>  [4,] "interface" "statistics"    "high"        "model"       "tests"      
#>  [5,] "access"    "tools"         "using"       "generalized" "control"    
#>  [6,] "files"     "network"       "functions"   "selection"   "methods"    
#>  [7,] "client"    "meta"          "matrix"      "bayesian"    "tools"      
#>  [8,] "wrapper"   "gene"          "correlation" "estimation"  "quality"    
#>  [9,] "read"      "visualization" "based"       "mixed"       "plot"       
#> [10,] "package"   "tool"          "response"    "variable"    "two"        
#>       [,6]            [,7]         [,8]        [,9]       [,10]        
#>  [1,] "models"        "learning"   "data"      "data"     "r"          
#>  [2,] "data"          "based"      "models"    "time"     "shiny"      
#>  [3,] "estimation"    "model"      "time"      "series"   "functions"  
#>  [4,] "distribution"  "analysis"   "model"     "analysis" "package"    
#>  [5,] "distributions" "bayesian"   "analysis"  "tools"    "create"     
#>  [6,] "sample"        "using"      "inference" "using"    "code"       
#>  [7,] "analysis"      "clustering" "based"     "r"        "interface"  
#>  [8,] "multivariate"  "machine"    "bayesian"  "spatial"  "ggplot2"    
#>  [9,] "single"        "multi"      "multiple"  "text"     "using"      
#> [10,] "model"         "models"     "survival"  "fast"     "interactive"

TODO: use cross-validation together with the perplexity metric to choose the best number of topics (Zhang, Li, and Zhang 2023); a rough sketch follows.
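
An illustrative sketch of that idea, using a single hold-out split rather than full cross-validation; the candidate topic numbers and iteration counts are arbitrary.

set.seed(2025)
idx <- sample(nrow(pdb_dtm), size = floor(0.8 * nrow(pdb_dtm)))
dtm_train <- pdb_dtm[idx, ]
dtm_test <- pdb_dtm[-idx, ]
# drop held-out documents that lost all terms after vocabulary pruning
dtm_test <- dtm_test[Matrix::rowSums(dtm_test) > 0, ]
sapply(c(5, 10, 20), function(k) {
  m <- LDA$new(n_topics = k, doc_topic_prior = 0.1, topic_word_prior = 0.01)
  m$fit_transform(dtm_train, n_iter = 500, progressbar = FALSE)
  # perplexity on held-out documents: lower is better
  perplexity(dtm_test, m$topic_word_distribution, m$transform(dtm_test))
})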

19.9 Text Similarity

Computing similarity between texts has many applications in Internet apps, such as the recall stage of search, recommendation, and advertising, where content related to the user's input text is retrieved.

The text2vec package implements the GloVe model, an unsupervised learning algorithm for word vector representations. A word co-occurrence matrix is built from the corpus, and training on it yields linear substructure in the word vector space. GloVe models how likely words are to co-occur, so abstract conceptual differences between words can be expressed as differences between vectors. Below we continue with the same text data: build the word co-occurrence matrix and compute similarities.

See the introduction on the quanteda website to the GloVe module of the text2vec package.

# feature selection
feats <- dfm(pdb_toks, verbose = FALSE) |>
  dfm_trim(min_termfreq = 5) |>
  featnames()
# token processing
pdb_toks <- tokens_select(pdb_toks, feats, padding = TRUE)
# feature co-occurrence matrix
pdb_fcm <- fcm(pdb_toks, context = "window", count = "weighted", weights = 1 / (1:5), tri = TRUE)

Fit the GloVe model:

library(text2vec)
# GloVe model with 50-dimensional word vectors
glove <- GlobalVectors$new(rank = 50, x_max = 10)
# fit on the word co-occurrence matrix
wv_main <- glove$fit_transform(
  pdb_fcm,
  n_iter = 10,
  convergence_tol = 0.01,
  n_threads = 8,
  progressbar = FALSE
)

dim(wv_main)
#> [1] 3620   50
wv_context <- glove$components

dim(wv_context)
#> [1]   50 3620
word_vectors <- wv_main + t(wv_context)

Compute distances: a bit of word-vector arithmetic, then cosine similarity.

cpp <- word_vectors["rcpp", , drop = FALSE] -
  word_vectors["stan", , drop = FALSE] +
    word_vectors["python", , drop = FALSE]

# cosine similarity; each word vector is treated as a small "document"
library(quanteda.textstats)
cos_sim <- textstat_simil(
  x = as.dfm(word_vectors), y = as.dfm(cpp),
  margin = "documents", method = "cosine")
# related words retrieved
head(sort(cos_sim[, 1], decreasing = TRUE), 5)
#>      rcpp    python Dimension   ConText    Online 
#> 0.5180990 0.4928317 0.4536143 0.4365528 0.4194676
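
The same machinery also answers plain nearest-neighbour queries without the arithmetic (illustrative): which words sit closest to stan in the learned space?

stan_sim <- textstat_simil(
  x = as.dfm(word_vectors), y = as.dfm(word_vectors["stan", , drop = FALSE]),
  margin = "documents", method = "cosine"
)
head(sort(stan_sim[, 1], decreasing = TRUE), 5)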

19.10 Exercises

  1. Using the sentiment column (positive or negative review) of the movie_review dataset bundled with the text2vec package as the response variable, build a binary classification model that classifies a user's review. (Hint: after vectorizing the text, use the glmnet package with cross-validation to tune the parameters and the model.)

  2. Based on the task-view labels of R packages, build a supervised multi-class model, evaluate its classification performance, and classify the R packages that are not yet labeled. (Hint: an R package may belong to several task views at once; consider using the xgboost package.)