Topic Models: A Brief Look at Several Newer Topic Models (SentenceLDA, CopulaLDA, TWE) with Implementations
[Introduction] Baidu recently open-sourced a new topic-model project. It provides a document topic-inference tool, a semantic-matching tool, and three topic models trained on industrial-scale corpora: Latent Dirichlet Allocation (LDA), SentenceLDA, and Topical Word Embedding (TWE).
1. Familia Overview
A quick plug for Familia: see Familia's GitHub.
Industrial applications of topic models fall into two broad paradigms: semantic representation and semantic matching.
Semantic Representation
Topic models reduce a document to a low-dimensional topic representation; this semantic representation can feed downstream applications such as text classification, content analysis, and CTR estimation.
Semantic Matching
Computing the semantic similarity between texts. Two text-pair settings are provided:
- Short-text vs. long-text similarity, used for document keyword extraction, scoring the relevance of a search query against a web page, and so on.
- Long-text vs. long-text similarity, used for document-to-document similarity, matching a user profile against news articles, and so on.
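Both matching modes ultimately compare topic distributions. As a minimal illustrative sketch (not Familia's actual metric, which may use a different distance), long-text vs. long-text similarity can be taken as one minus the Jensen-Shannon divergence of the two documents' topic distributions:

```python
import numpy as np

def js_similarity(p, q, eps=1e-12):
    """Similarity of two topic distributions: 1 minus the Jensen-Shannon
    divergence in base 2, so the result lies in [0, 1]."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))  # KL divergence in bits
    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return 1.0 - jsd

# Identical topic distributions are maximally similar.
print(js_similarity([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))  # → 1.0
```

Short-text vs. long-text matching needs an extra step (inferring a topic posterior for the short text first), but the final comparison can use the same kind of measure.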
The demos bundled with Familia cover:
Semantic representation
Run topic inference on an input document to obtain its reduced topic representation.
Semantic matching
Compute similarity between texts, both short-text vs. long-text and long-text vs. long-text.
Model content display
Show each topic's top words and nearest-neighbor words, giving users an intuitive feel for what the model's topics mean.
2. Topical Word Embedding (TWE)
This model comes from Zhiyuan Liu's group; the paper download and the GitHub repository are both available.
In this way, contextual word embeddings can be flexibly obtained to measure contextual word similarity. We can also build document representations.
There are three variants: TWE-1, TWE-2, and TWE-3. Here is how their architectures differ from the traditional skip-gram model:
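The core idea of TWE-1 is that each word occurrence carries both a word vector and the vector of its assigned topic, so a context-dependent representation is the concatenation of the two. A minimal sketch with toy 2-d vectors (`contextual_similarity` is a hypothetical helper, not the repository's API):

```python
import numpy as np

def contextual_similarity(word_vecs, topic_vecs, w1, z1, w2, z2):
    """TWE-1-style contextual similarity: a word occurrence is represented
    by concatenating its word vector with its assigned topic's vector, and
    two occurrences are compared by cosine similarity."""
    a = np.concatenate([word_vecs[w1], topic_vecs[z1]])
    b = np.concatenate([word_vecs[w2], topic_vecs[z2]])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The same word under two different topic assignments gets two
# different contextual embeddings.
word_vecs = {"bank": np.array([1.0, 0.0])}
topic_vecs = {0: np.array([0.0, 1.0]), 1: np.array([0.0, -1.0])}
print(contextual_similarity(word_vecs, topic_vecs, "bank", 0, "bank", 1))  # → 0.0
```

This is what the paper means by measuring "contextual word similarity": polysemous words can score differently depending on the topic they occur under.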
Accuracy on multi-label text classification:
Topic display from the TWE model in Baidu's open-source Familia project:

Please enter a topic id (0-10000): 105
Embedding Result        Multinomial Result
------------------------------------------------
对话                    对话
磋商                    合作
合作                    中国
非方                    磋商
探讨                    交流
对话会议                联合
议题                    国家
中方                    讨论
对话会                  支持
交流                    包括
The first column is the embedding-based result, the second the multinomial-based result; both are sorted by each word's importance within the topic, descending.
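A sketch of how the two columns could be produced, assuming the embedding column ranks vocabulary words by cosine similarity to the topic's vector and the multinomial column ranks them by p(word | topic); the helper names here are hypothetical, not Familia's API:

```python
import numpy as np

def top_words_by_embedding(topic_vec, word_vecs, k=10):
    """Rank vocabulary words by cosine similarity to the topic vector
    (the 'Embedding Result' column)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    scored = sorted(word_vecs.items(), key=lambda kv: -cos(kv[1], topic_vec))
    return [w for w, _ in scored[:k]]

def top_words_by_multinomial(phi_k, id2word, k=10):
    """Rank words by p(word | topic) (the 'Multinomial Result' column)."""
    order = np.argsort(phi_k)[::-1][:k]
    return [id2word[int(i)] for i in order]
```

The two rankings usually overlap heavily (as in the table above) but need not agree exactly, since one lives in embedding space and the other in probability space.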
A quick look at the train script (Python 2; `CombinedSentence`, `train_topic`, `save_topic`, and `save_wordvector` come from the modified gensim version bundled with the TWE repository):

```python
import gensim        # modified gensim version
import pre_process   # read the wordmap and the tassign file and create the sentences
import sys

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print "Usage : python train.py wordmap tassign topic_number"
        sys.exit(1)
    reload(sys)
    sys.setdefaultencoding('utf-8')
    wordmapfile = sys.argv[1]
    tassignfile = sys.argv[2]
    topic_number = int(sys.argv[3])
    id2word = pre_process.load_id2word(wordmapfile)
    pre_process.load_sentences(tassignfile, id2word)
    sentence_word = gensim.models.word2vec.LineSentence("tmp/word.file")
    print "Training the word vector..."
    w = gensim.models.Word2Vec(sentence_word, size=400, workers=20)
    sentence = gensim.models.word2vec.CombinedSentence("tmp/word.file", "tmp/topic.file")
    print "Training the topic vector..."
    w.train_topic(topic_number, sentence)
    print "Saving the topic vectors..."
    w.save_topic("output/topic_vector.txt")
    print "Saving the word vectors..."
    w.save_wordvector("output/word_vector.txt")
```
3. SentenceLDA
Paper link, plus GitHub: balikasg/topicModelling
What is SentenceLDA?
an extension of LDA whose goal is to overcome this limitation by incorporating the structure of the text in the generative and inference processes.
How do SentenceLDA and LDA differ?
LDA and senLDA differ in that the second assumes a very strong dependence of the latent topics between the words of sentences, whereas the first assumes independence between the words of documents in general
A comparison experiment between SentenceLDA and LDA:
We illustrate the advantages of sentenceLDA by comparing it with LDA using both intrinsic (perplexity) and extrinsic (text classification) evaluation tasks on different text collections
Results from the original author's GitHub:
https://github.com/balikasg/topicModelling/tree/master/senLDA
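The modeling difference can be made concrete with a toy generative sketch (hypothetical helper names, not the repository's code): senLDA draws one topic per sentence and emits every word of that sentence from it, while standard LDA draws a fresh topic for each word:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_senlda_doc(theta, phi, sentence_lengths):
    """senLDA sketch: draw ONE topic per sentence, then emit every word
    of that sentence from the chosen topic's word distribution."""
    doc = []
    for n in sentence_lengths:
        z = int(rng.choice(len(theta), p=theta))  # topic for the whole sentence
        words = rng.choice(phi.shape[1], size=n, p=phi[z])
        doc.append((z, [int(w) for w in words]))
    return doc

def generate_lda_doc(theta, phi, n_words):
    """Standard LDA sketch: draw a FRESH topic for every word."""
    out = []
    for _ in range(n_words):
        z = int(rng.choice(len(theta), p=theta))
        out.append((z, int(rng.choice(phi.shape[1], p=phi[z]))))
    return out
```

Here `theta` is the document's topic distribution and `phi[k]` the word distribution of topic k; the sentence-level tying is what lets senLDA exploit text structure during inference.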
An excerpt of the code (Python 2; `vocabulary_sentenceLayer`, `lda_sentenceLayer`, and `functions` are modules from the repository):

```python
import numpy as np, vocabulary_sentenceLayer, string, nltk.data, sys, codecs, json, time
from nltk.tokenize import sent_tokenize
from lda_sentenceLayer import lda_gibbs_sampling1
from sklearn.cross_validation import train_test_split, StratifiedKFold
from nltk.stem import WordNetLemmatizer
from sklearn.utils import shuffle
from functions import *

path2training = sys.argv[1]
training = codecs.open(path2training, 'r', encoding='utf8').read().splitlines()
topics = int(sys.argv[2])
alpha, beta = 0.5 / float(topics), 0.5 / float(topics)

voca_en = vocabulary_sentenceLayer.VocabularySentenceLayer(set(nltk.corpus.stopwords.words('english')), WordNetLemmatizer(), excluds_stopwords=True)

ldaTrainingData = change_raw_2_lda_input(training, voca_en, True)
ldaTrainingData = voca_en.cut_low_freq(ldaTrainingData, 1)
iterations = 201

classificationData, y = load_classification_data(sys.argv[3], sys.argv[4])
classificationData = change_raw_2_lda_input(classificationData, voca_en, False)
classificationData = voca_en.cut_low_freq(classificationData, 1)

final_acc, final_mif, final_perpl, final_ar, final_nmi, final_p, final_r, final_f = [], [], [], [], [], [], [], []
start = time.time()
for j in range(5):
    perpl, cnt, acc, mif, ar, nmi, p, r, f = [], 0, [], [], [], [], [], [], []
    lda = lda_gibbs_sampling1(K=topics, alpha=alpha, beta=beta, docs=ldaTrainingData, V=voca_en.size())
    for i in range(iterations):
        lda.inference()
        if i % 5 == 0:
            print "Iteration:", i, "Perplexity:", lda.perplexity()
            features = lda.heldOutPerplexity(classificationData, 3)
            print "Held-out:", features[0]
            scores = perform_class(features[1], y)
            acc.append(scores[0][0])
            mif.append(scores[1][0])
            perpl.append(features[0])
    final_acc.append(acc)
    final_mif.append(mif)
    final_perpl.append(perpl)
```
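Perplexity, the intrinsic evaluation printed in the training loop above, follows the standard definition exp(-log-likelihood / word count); a generic sketch (the repository's `lda.perplexity()` may differ in its details):

```python
import numpy as np

def perplexity(docs, theta, phi):
    """Generic LDA perplexity: exp(-L / N), where L is the corpus
    log-likelihood under p(w | d) = sum_k theta[d, k] * phi[k, w]
    and N is the total number of word tokens."""
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_lik += np.log(theta[d] @ phi[:, w])
            n_words += 1
    return float(np.exp(-log_lik / n_words))
```

Lower is better: a model that assigns probability 0.5 to every token has perplexity 2, as if it were choosing uniformly among two equally likely words.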
Finally, here is the topic display from Baidu's open-source project, comparing LDA and SentenceLDA:
LDA result:

Please enter a topic id (0-1999): 105
--------------------------------------------
对话    0.189676
合作    0.0805558
中国    0.0276284
磋商    0.0269797
交流    0.021069
联合    0.0208559
国家    0.0183163
讨论    0.0154165
支持    0.0146714
包括    0.014198
The second column gives each word's importance under the topic.
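These values are the topic-word multinomial p(word | topic). From Gibbs-sampling counts they are typically estimated with Dirichlet smoothing; a minimal sketch:

```python
import numpy as np

def topic_word_probs(counts_kw, beta):
    """Dirichlet-smoothed estimate of p(word | topic) from Gibbs counts:
    phi[k, w] = (n_kw + beta) / (n_k + V * beta), where n_kw is how often
    word w was assigned topic k, n_k the topic's total count, V the
    vocabulary size, and beta the symmetric Dirichlet prior."""
    counts_kw = np.asarray(counts_kw, dtype=float)
    V = counts_kw.shape[1]
    return (counts_kw + beta) / (counts_kw.sum(axis=1, keepdims=True) + V * beta)
```

Each row of the result sums to 1, so sorting a row in descending order yields exactly the kind of ranked list shown above.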
SentenceLDA result:

Please enter a topic id (0-1999): 105
--------------------------------------------
浙江    0.0300595
浙江省  0.0290975
宁波    0.0195277
记者    0.0174735
宁波市  0.0132504
长春市  0.0123353
街道    0.0107271
吉林省  0.00954326
金华    0.00772971
公安局  0.00678163
4. CopulaLDA
CopulaLDA is by the same author as SentenceLDA; see GitHub: balikasg/topicModelling
I have not studied it in detail, so I will simply show its reported results: