Structural Topic Models (Part 1): The stm Package Workflow
Preface

This post walks through the workflow of the stm code from the paper (stm: An R Package for Structural Topic Models). The overall structure follows the original paper, but I offer my own thoughts on the order in which some of the code should be run. Due to limited time, some problems remain unsolved (such as choosing an appropriate number of topics), and some later parts of the paper are not yet covered in detail; I will add them when time permits. If anyone can suggest useful fixes or solutions, I will respond as soon as possible. I hope this helps those working with structural topic models 😁

Issues encountered while reproducing the paper are collected in:

Structural Topic Models (Part 2): Reproduction

Original paper, data, and code:
stm: An R Package for Structural Topic Models
stm庫(kù)官方文檔
3.0 讀取數(shù)據(jù)
樣例數(shù)據(jù)poliblogs2008.csv為一個(gè)關(guān)于美國(guó)政治的博文集,來(lái)自CMU2008年政治博客語(yǔ)料庫(kù):American Thinker, Digby, Hot Air, Michelle Malkin, Think Progress, and Talking Points Memo。每個(gè)博客論壇都有自己的政治傾向,所以每篇博客都有寫(xiě)作日期和政治意識(shí)形態(tài)的元數(shù)據(jù)。
建議讀取xlsx,因?yàn)閏sv文件以逗號(hào)作為分隔符,有時(shí)會(huì)出現(xiàn)問(wèn)題。pandas:csv-excel文件相互轉(zhuǎn)換
```r
library(readxl)  # read_excel() comes from the readxl package

# data <- read.csv("./poliblogs2008.csv", sep = ",", quote = "", header = TRUE, fileEncoding = "UTF-8")
data <- read_excel(path = "./poliblogs2008.xlsx", sheet = "Sheet1", col_names = TRUE)
```

If your data are in Chinese, first apply word segmentation and similar preprocessing to the text before proceeding with the later steps.

Numbering starts at 3.0 to stay consistent with the original paper.
3.1 Ingest: Reading and processing text data
提取數(shù)據(jù):將原始數(shù)據(jù)處理成STM可以分析的三塊內(nèi)容(分別是documents,vocab ,meta),用到的是textProcessor或readCorpus這兩個(gè)函數(shù)。
textProcessor()函數(shù)旨在提供一種方便快捷的方式來(lái)處理相對(duì)較小的文本,以便使用軟件包進(jìn)行分析。它旨在以簡(jiǎn)單的形式快速攝取數(shù)據(jù),例如電子表格,其中每個(gè)文檔都位于單個(gè)單元格中。
# 調(diào)用textProcessor算法,將 data$document、data 作為參數(shù) processed <- textProcessor(documents = data$documents, metadata = data, wordLengths = c(1, Inf))textProcessor()函數(shù)中的參數(shù)wordLengths = c(3, Inf)表示:短于最小字長(zhǎng)(默認(rèn)為3字符)或長(zhǎng)于最大字長(zhǎng)(默認(rèn)為inf)的字?jǐn)?shù)將被丟棄,[用戶(hù)@qq_39172034]建議設(shè)置該參數(shù)為wordLengths = c(1, Inf),以避免避免單個(gè)漢字被刪除
論文中提到,textProcessor()可以處理多種語(yǔ)言,需設(shè)置變量language = "en", customstopwords = NULL,。截至0.5支持的版本“丹麥語(yǔ)、荷蘭語(yǔ)、英語(yǔ)、芬蘭語(yǔ)、法語(yǔ)、德語(yǔ)、匈牙利語(yǔ)、意大利語(yǔ)、挪威語(yǔ)、葡萄牙語(yǔ)、羅馬尼亞語(yǔ)、俄語(yǔ)、瑞典語(yǔ)、土耳其語(yǔ)”,不支持中文
詳見(jiàn):textProcessor function - RDocumentation
3.2 Prepare: Associating text with metadata
數(shù)據(jù)預(yù)處理:轉(zhuǎn)換數(shù)據(jù)格式,根據(jù)閾值刪除低頻單詞等,用到的是prepDocuments()和plotRemoved()兩個(gè)函數(shù)
plotRemoved()函數(shù)可繪制不同閾值下刪除的document、words、token數(shù)量
pdf("output/stm-plot-removed.pdf") plotRemoved(processed$documents, lower.thresh = seq(1, 200, by = 100)) dev.off()根據(jù)此pdf文件的結(jié)果(output/stm-plot-removed.pdf),確定prepDocuments()中的參數(shù)lower.thresh的取值,以此確定變量docs、vocab、meta
論文中提到如果在處理過(guò)程中發(fā)生任何更改,PrepDocuments還將重新索引所有元數(shù)據(jù)/文檔關(guān)系。例如,當(dāng)文檔因?yàn)楹械皖l單詞而在預(yù)處理階段被完全刪除,那么PrepDocuments()也將刪除元數(shù)據(jù)中的相應(yīng)行。因此在讀入和處理文本數(shù)據(jù)后,檢查文檔的特征和相關(guān)詞匯表以確保它們已被正確預(yù)處理是很重要的。
```r
# Remove words with frequency below 15
out <- prepDocuments(documents = processed$documents, vocab = processed$vocab,
                     meta = processed$meta, lower.thresh = 15)
docs <- out$documents
vocab <- out$vocab
meta <- out$meta
```

- docs: the documents, a list in which each document holds word indices and their associated counts
- vocab: a character vector containing the words associated with the word indices
- meta: a metadata matrix containing the document covariates
The following shows the documents object for two short documents. The first contains five words, located at positions 21, 23, 87, 98, and 112 of the vocab vector; the first word occurs twice and the others once. The second document contains three words, read the same way.
```
[[1]]
     [,1] [,2] [,3] [,4] [,5]
[1,]   21   23   87   98  112
[2,]    2    1    1    1    1

[[2]]
     [,1] [,2] [,3]
[1,]   16   61   90
[2,]    1    1    1
```
3.3 Estimate: Estimating the structural topic model
STM的關(guān)鍵創(chuàng)新是它將元數(shù)據(jù)合并到主題建??蚣苤?/strong>。在STM中,元數(shù)據(jù)可以通過(guò)兩種方式輸入到主題模型中:**主題流行度(topical prevalence)**和主題內(nèi)容(topical content)。主題流行度中的元數(shù)據(jù)協(xié)變量允許觀察到的元數(shù)據(jù)影響被討論主題的頻率。主題內(nèi)容中的協(xié)變量允許觀察到的元數(shù)據(jù)影響給定主題內(nèi)的詞率使用——即如何討論特定主題。對(duì)主題流行率和主題內(nèi)容的估計(jì)是通過(guò)stm()函數(shù)進(jìn)行的。
主題流行度(topical prevalence)表示每個(gè)主題對(duì)某篇文檔的貢獻(xiàn)程度,因?yàn)椴煌奈臋n來(lái)自不同的地方,所以自然地希望主題流行度能隨著元數(shù)據(jù)的變化而變化。
具體而言,論文將變量rating(意識(shí)形態(tài),Liberal,Conservative)作為主題流行度的協(xié)變量,除了意識(shí)形態(tài),還可以通過(guò)+號(hào)增加其他協(xié)變量,如增加原始數(shù)據(jù)中的day”變量(表示發(fā)帖日期)
s(day)中的s()為spline function,a fairly flexible b-spline basis
day這個(gè)變量是從2008年的第一天到最后一天,就像panel data一樣,如果帶入時(shí)序設(shè)置為天(365個(gè)penal),則會(huì)損失300多個(gè)自由度,所以引入spline function解決自由度損失的問(wèn)題。
The stm package also includes a convenience functions(), which selects a fairly flexible b-spline basis. In the current example we allow for the variabledayto be estimated with a spline.
```r
poliblogPrevFit <- stm(documents = out$documents, vocab = out$vocab, K = 20,
                       prevalence = ~rating + s(day), max.em.its = 75,
                       data = out$meta, init.type = "Spectral")
```

In R, the prevalence argument accepts a formula containing multiple covariates, factorial or continuous. Other standard transformation functions from the splines package can also be used: log(), ns(), bs().

As the iterations proceed, the model is considered to have converged once the change in the bound becomes small enough.
3.4 Evaluate: Model selection and search
Because the posterior of mixed-membership topic models is often non-convex and intractable, the model that is reached depends on the starting values of its parameters (for example, the word distribution of each topic). There are two ways to initialize the model:

- spectral initialization: init.type = "Spectral". Prefer this option.
- a collapsed Gibbs sampler for LDA
selectModel() first builds a net of candidate models and runs each of them briefly (fewer than 10 iterations) through the E and M steps, discarding the low-likelihood models; it then runs only the top 20% of models by likelihood until convergence or the maximum number of EM iterations (max.em.its) is reached.
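The poliblogSelect object used below is never created in this post. Following the paper's example, the call would look roughly like this (the runs and seed values are illustrative, not prescribed):

```r
# Run 20 short candidate models; selectModel() keeps the high-likelihood
# ones and runs them to convergence or max.em.its
poliblogSelect <- selectModel(out$documents, out$vocab, K = 20,
                              prevalence = ~rating + s(day), max.em.its = 75,
                              data = out$meta, runs = 20, seed = 8458159)
```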
通過(guò)plotModels()函數(shù)顯示的語(yǔ)義一致性(semantic coherence)和排他性(exclusivity)選擇合適的模型,semcoh和exclu越大則模型越好
# 繪制圖形平均得分每種模型采用不同的圖例 plotModels(poliblogSelect, pch=c(1,2,3,4), legend.position="bottomright") # 選擇模型3 selectedmodel <- poliblogSelect$runout[[3]]對(duì)比兩種或多個(gè)主題數(shù),通過(guò)對(duì)比語(yǔ)義連貫性SemCoh和排他性Exl確定合適的主題數(shù)
3.5 Understand: Interpreting the STM by plotting and inspecting results
Once a model is selected, the functions in the stm package present its results. To stay consistent with the original paper, the sections below use the initial model poliblogPrevFit as the argument rather than the model chosen via selectModel().

Ranking the high-frequency words of each topic: labelTopics(), sageLabels()

Both functions output the words associated with each topic. sageLabels() applies only to models with a content covariate; it also produces more detailed output than labelTopics() and by default prints the high-frequency words of every topic.

```r
# labelTopics(): label topics by listing top words, here for topics 1 to 5
labelTopicsSel <- labelTopics(poliblogPrevFit, c(1:5))
sink("output/labelTopics-selected.txt", append = FALSE, split = TRUE)
print(labelTopicsSel)
sink()

# sageLabels() gives more detailed output than labelTopics()
sink("stm-list-sagelabel.txt", append = FALSE, split = TRUE)
print(sageLabels(poliblogPrevFit))
sink()
```

TODO: the outputs of the two functions differ.
列出與某個(gè)主題高度相關(guān)的文檔:findthoughts()
shortdoc <- substr(out$meta$documents, 1, 200) # 參數(shù) 'texts=shortdoc' 表示輸出每篇文檔前200個(gè)字符,n表示輸出相關(guān)文檔的篇數(shù) thoughts1 <- findThoughts(poliblogPrevFit, texts=shortdoc, n=2, topics=1)$docs[[1]] pdf("findThoughts-T1.pdf") plotQuote(thoughts1, width=40, main="Topic 1") dev.off()# how about more documents for more of these topics? thoughts6 <- findThoughts(poliblogPrevFit, texts=shortdoc, n=2, topics=6)$docs[[1]] thoughts18 <- findThoughts(poliblogPrevFit, texts=shortdoc, n=2, topics=18)$docs[[1]] pdf("stm-plot-find-thoughts.pdf") # mfrow=c(2, 1)將會(huì)把圖輸出到2行1列的表格中 par(mfrow = c(2, 1), mar = c(.5, .5, 1, .5)) plotQuote(thoughts6, width=40, main="Topic 6") plotQuote(thoughts18, width=40, main="Topic 18") dev.off()估算元數(shù)據(jù)和主題/主題內(nèi)容之間的關(guān)系:estimateEffect
out$meta$rating<-as.factor(out$meta$rating) # since we're preparing these coVariates by estimating their effects we call these estimated effects 'prep' # we're estimating Effects across all 20 topics, 1:20. We're using 'rating' and normalized 'day,' using the topic model poliblogPrevFit. # The meta data file we call meta. We are telling it to generate the model while accounting for all possible uncertainty. Note: when estimating effects of one covariate, others are held at their mean prep <- estimateEffect(1:20 ~ rating+s(day), poliblogPrevFit, meta=out$meta, uncertainty = "Global") summary(prep, topics=1) summary(prep, topics=2) summary(prep, topics=3) summary(prep, topics=4)uncertainty有"Global", “Local”, "None"三個(gè)選擇,The default is “Global”, which will incorporate estimation uncertainty of the topic proportions into the uncertainty estimates using the method of composition. If users do not propagate the full amount of uncertainty, e.g., in order to speed up computational time, they can choose uncertainty = “None”, which will generally result in narrower confidence intervals because it will not include the additional estimation uncertainty.
Output of summary(prep, topics = 1):
```
Call:
estimateEffect(formula = 1:20 ~ rating + s(day), stmobj = poliblogPrevFit,
    metadata = meta, uncertainty = "Global")

Topic 1:

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    0.068408   0.011233   6.090 1.16e-09 ***
ratingLiberal -0.002513   0.002588  -0.971  0.33170    
s(day)1       -0.008596   0.021754  -0.395  0.69276    
s(day)2       -0.035476   0.012314  -2.881  0.00397 ** 
s(day)3       -0.002806   0.015696  -0.179  0.85813    
s(day)4       -0.030237   0.013056  -2.316  0.02058 *  
s(day)5       -0.026256   0.013791  -1.904  0.05695 .  
s(day)6       -0.010658   0.013584  -0.785  0.43269    
s(day)7       -0.005835   0.014381  -0.406  0.68494    
s(day)8        0.041965   0.016056   2.614  0.00897 ** 
s(day)9       -0.101217   0.016977  -5.962 2.56e-09 ***
s(day)10      -0.024237   0.015679  -1.546  0.12216    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

3.6 Visualize: Presenting STM results
Summary visualization
Bar chart of topic proportions

```r
# See the proportion of each topic in the entire corpus; just insert your STM output
pdf("top-topic.pdf")
plot(poliblogPrevFit, type = "summary", xlim = c(0, .3))
dev.off()
```

Metadata/topic relationship visualization
主題關(guān)系對(duì)比圖
pdf("stm-plot-topical-prevalence-contrast.pdf") plot(prep, covariate = "rating", topics = c(6, 13, 18),model = poliblogPrevFit, method = "difference",cov.value1 = "Liberal", cov.value2 = "Conservative",xlab = "More Conservative ... More Liberal",main = "Effect of Liberal vs. Conservative",xlim = c(-.1, .1), labeltype = "custom",custom.labels = c("Obama/McCain", "Sarah Palin", "Bush Presidency")) dev.off()主題6、13、18自定義標(biāo)簽為"Obama/McCain"、“Sarah Palin”、“Bush Presidency”,主題6、主題13的意識(shí)形態(tài)偏中立,既不是保守,也不是自由,主題18的意識(shí)形態(tài)偏向于保守。
主題隨著時(shí)間變化的趨勢(shì)圖
pdf("stm-plot-topic-prevalence-with-time.pdf") plot(prep, "day", method = "continuous", topics = 13, model = z, printlegend = FALSE, xaxt = "n", xlab = "Time (2008)") monthseq <- seq(from = as.Date("2008-01-01"), to = as.Date("2008-12-01"), by = "month") monthnames <- months(monthseq) # There were 50 or more warnings (use warnings() to see the first 50) axis(1, at = as.numeric(monthseq) - min(as.numeric(monthseq)), labels = monthnames) dev.off()運(yùn)行報(bào)錯(cuò),但可以輸出以下圖片,原因不明
Topical content

Shows which words within a topic are more strongly associated with one covariate value than with another.

```r
# TOPICAL CONTENT.
# STM can plot the influence of a covariate included as a topical content covariate.
# A topical content variable allows the vocabulary used to talk about a particular topic to vary.
# First, the STM must be fit with a variable specified in the content option.
# Instead of looking at how prevalent a topic is in a class of documents categorized
# by a metadata covariate, we look at how the words of the topic are emphasized
# differently in documents of each category of the covariate.
# We estimate a new STM. It is the same as before, including the prevalence option,
# but adds a content option.
poliblogContent <- stm(out$documents, out$vocab, K = 20,
                       prevalence = ~rating + s(day), content = ~rating,
                       max.em.its = 75, data = out$meta, init.type = "Spectral")
pdf("stm-plot-content-perspectives.pdf")
plot(poliblogContent, type = "perspectives", topics = 10)
dev.off()
```

Topic 10 concerns Cuba. Its most common words are "detainee, prison, court, illegal, torture, enforce, Cuba". The plot shows how liberals and conservatives discuss the topic differently: liberals emphasize "torture", while conservatives emphasize typical courtroom words such as "illegal" and "legal".

On the original paper's wording, "Its top FREX words were 'detaine, prison, court, illeg, tortur, enforc, guantanamo'": forms such as "tortur" are not typos but word stems produced by the stemming step of preprocessing (e.g., "torture" is stemmed to "tortur").
Plotting vocabulary differences between two topics

```r
pdf("stm-plot-content-perspectives-16-18.pdf")
plot(poliblogPrevFit, type = "perspectives", topics = c(16, 18))
dev.off()
```

Plotting covariate interactions

```r
# Interactions between covariates can be examined, such that one variable may "moderate"
# the effect of another variable.
# First, we estimate an STM with the interaction.
poliblogInteraction <- stm(out$documents, out$vocab, K = 20,
                           prevalence = ~rating * day, max.em.its = 75,
                           data = out$meta, init.type = "Spectral")
# Prep the covariates using estimateEffect(), this time including the interaction
# variable, then plot the results and save them as pdf files.
prep <- estimateEffect(c(16) ~ rating * day, poliblogInteraction,
                       metadata = out$meta, uncertainty = "None")
pdf("stm-plot-two-topic-contrast.pdf")
plot(prep, covariate = "day", model = poliblogInteraction,
     method = "continuous", xlab = "Days", moderator = "rating",
     moderator.value = "Liberal", linecol = "blue", ylim = c(0, 0.12),
     printlegend = FALSE)
plot(prep, covariate = "day", model = poliblogInteraction,
     method = "continuous", xlab = "Days", moderator = "rating",
     moderator.value = "Conservative", linecol = "red", add = TRUE,
     printlegend = FALSE)
legend(0, 0.06, c("Liberal", "Conservative"), lwd = 2, col = c("blue", "red"))
dev.off()
```

The plot shows the relationship between time (the day a blog post was written) and rating (Liberal vs. Conservative). Topic 16 prevalence is plotted as a linear function of time, with rating held at Liberal or Conservative.
3.7 Extend: Additional tools for interpretation and visualization
Word cloud

```r
pdf("stm-plot-wordcloud.pdf")
cloud(poliblogPrevFit, topic = 13, scale = c(2, 0.25))
dev.off()
```

Topic correlations

```r
# topicCorr().
# STM permits correlations between topics. A positive correlation between two topics
# indicates that both are likely to be discussed within the same document. A graphical
# network display shows how closely related the topics are to one another (i.e., how
# likely they are to appear in the same document). This function requires the 'igraph' package.
mod.out.corr <- topicCorr(poliblogPrevFit)
pdf("stm-plot-topic-correlations.pdf")
plot(mod.out.corr)
dev.off()
```

stmCorrViz

The stmCorrViz package provides a different d3 visualization environment, one that focuses on visualizing topic correlations by grouping topics with a hierarchical clustering method.

There is a character-encoding (mojibake) problem in its output.

```r
# stmCorrViz() generates an interactive visualization of topic hierarchy/correlations
# in a structural topic model. The package performs a hierarchical clustering of topics,
# which is exported to a JSON object and visualized using D3.
stmCorrViz(poliblogPrevFit, "stm-interactive-correlation.html",
           documents_raw = data$documents, documents_matrix = out$documents)
```

4 Changing basic estimation defaults

This section explains how to change the default settings of the stm package's estimation commands.

It first discusses how to choose among the different methods for initializing model parameters, then how to set and evaluate convergence criteria, then describes a way to speed up convergence when analyzing tens of thousands of documents or more, and finally covers some variations of the content covariate model that let users control model complexity.
問(wèn)題
ems.its和run的區(qū)別是什么?ems.its表示的組大迭代數(shù),每次迭代run=20?
3.4-3中如何根據(jù)四個(gè)圖確定合適的主題數(shù)?
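On the topic-number question, the stm package itself offers searchK(), which fits the model for several candidate values of K and reports diagnostics (held-out likelihood, residuals, semantic coherence, bound). A sketch with illustrative K values:

```r
# Fit a model for each candidate K and compute diagnostics;
# plot() on the result draws one panel per diagnostic across K
kResult <- searchK(out$documents, out$vocab, K = c(7, 10, 15, 20),
                   prevalence = ~rating + s(day), data = out$meta)
plot(kResult)
```

A reasonable choice of K is one with high held-out likelihood and semantic coherence and low residuals, though the trade-off is ultimately a judgment call.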
補(bǔ)充
在Ingest部分,作者提到其他用于文本處理的quanteda包,該包可以方便地導(dǎo)入文本和相關(guān)元數(shù)據(jù),準(zhǔn)備要處理的文本,并將文檔轉(zhuǎn)換為文檔術(shù)語(yǔ)矩陣(document-term matrix)。另一個(gè)包,readtext包含非常靈活的工具,用于讀取多種文本格式,如純文本、XML和JSON格式,可以輕松地從中創(chuàng)建語(yǔ)料庫(kù)。
為從其他文本處理程序中讀取數(shù)據(jù),可使用txtorg,此程序可以創(chuàng)建三種獨(dú)立的文件:a metadata file, a vocabulary file, and a file with the original documents。默認(rèn)導(dǎo)出格式為L(zhǎng)DA-C sparse matrix format,可以用readCorpus()設(shè)置"ldac"option以讀取
Paper: stm: An R Package for Structural Topic Models (harvard.edu)

Reference article: R软件 STM package实操 - 哔哩哔哩 (bilibili.com)

Related GitHub repositories:
JvH13/FF-STM: Web Appendix - Methodology for Structural Topic Modeling (github.com)
dondealban/learning-stm: Learning structural topic modeling using the stm R package. (github.com)
bstewart/stm: An R Package for the Structural Topic Model (github.com)