Topic Models: A Brief Look at Several Newer Topic Models (SentenceLDA, CopulaLDA, TWE) with Implementations
[Introduction] Baidu recently open-sourced a new topic-model project. It provides a document topic-inference tool, a semantic-matching tool, and three topic models trained on industrial-scale corpora: Latent Dirichlet Allocation (LDA), SentenceLDA, and Topical Word Embedding (TWE).
1. Familia Overview
A quick plug for Familia: see Familia's GitHub.
Industrial applications of topic models fall into two broad paradigms: semantic representation and semantic matching.
Semantic Representation
Topic models reduce a document to a low-dimensional topic representation; this semantic representation can feed downstream applications such as text classification, content analysis, and CTR estimation.
Semantic Matching
Computing the semantic similarity between texts. Two text-pair settings are provided:
- Short-text vs. long-text similarity, used for document keyword extraction, scoring the relevance of a search query against a web page, and so on.
- Long-text vs. long-text similarity, used for document-to-document similarity, matching a user profile against news articles, and so on.
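Both matching modes ultimately compare topic distributions. As a minimal illustrative sketch (not Familia's actual metric, which may use a different distance), long-text vs. long-text similarity can be taken as one minus the Jensen-Shannon divergence of the two documents' topic distributions:

```python
import numpy as np

def js_similarity(p, q, eps=1e-12):
    """Similarity of two topic distributions: 1 minus the Jensen-Shannon
    divergence in base 2, so the result lies in [0, 1]."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))  # KL divergence in bits
    jsd = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    return 1.0 - jsd

# Identical topic distributions are maximally similar.
print(js_similarity([0.7, 0.2, 0.1], [0.7, 0.2, 0.1]))  # → 1.0
```

Short-text vs. long-text matching needs an extra step (inferring a topic posterior for the short text first), but the final comparison can use the same kind of measure.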
The demos bundled with Familia cover:
Semantic representation
Run topic inference on an input document to obtain its reduced topic representation.
Semantic matching
Compute similarity between texts, both short-text vs. long-text and long-text vs. long-text.
Model content display
Show each topic's top words and nearest-neighbor words, giving users an intuitive feel for what the model's topics mean.
2. Topical Word Embedding (TWE)
This model comes from Zhiyuan Liu's group; the paper download and the GitHub repository are both available.
In this way, contextual word embeddings can be flexibly obtained to measure contextual word similarity. We can also build document representations.
There are three variants: TWE-1, TWE-2, and TWE-3. Here is how their architectures differ from the traditional skip-gram model:
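The core idea of TWE-1 is that each word occurrence carries both a word vector and the vector of its assigned topic, so a context-dependent representation is the concatenation of the two. A minimal sketch with toy 2-d vectors (`contextual_similarity` is a hypothetical helper, not the repository's API):

```python
import numpy as np

def contextual_similarity(word_vecs, topic_vecs, w1, z1, w2, z2):
    """TWE-1-style contextual similarity: a word occurrence is represented
    by concatenating its word vector with its assigned topic's vector, and
    two occurrences are compared by cosine similarity."""
    a = np.concatenate([word_vecs[w1], topic_vecs[z1]])
    b = np.concatenate([word_vecs[w2], topic_vecs[z2]])
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# The same word under two different topic assignments gets two
# different contextual embeddings.
word_vecs = {"bank": np.array([1.0, 0.0])}
topic_vecs = {0: np.array([0.0, 1.0]), 1: np.array([0.0, -1.0])}
print(contextual_similarity(word_vecs, topic_vecs, "bank", 0, "bank", 1))  # → 0.0
```

This is what the paper means by measuring "contextual word similarity": polysemous words can score differently depending on the topic they occur under.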
Accuracy on multi-label text classification:
Topic display from the TWE model in Baidu's open-source Familia project:

Please enter a topic id (0-10000): 105
Embedding Result        Multinomial Result
------------------------------------------------
对话                    对话
磋商                    合作
合作                    中国
非方                    磋商
探讨                    交流
对话会议                联合
议题                    国家
中方                    讨论
对话会                  支持
交流                    包括
The first column is the embedding-based result, the second the multinomial-based result; both are sorted by each word's importance within the topic, descending.
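A sketch of how the two columns could be produced, assuming the embedding column ranks vocabulary words by cosine similarity to the topic's vector and the multinomial column ranks them by p(word | topic); the helper names here are hypothetical, not Familia's API:

```python
import numpy as np

def top_words_by_embedding(topic_vec, word_vecs, k=10):
    """Rank vocabulary words by cosine similarity to the topic vector
    (the 'Embedding Result' column)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    scored = sorted(word_vecs.items(), key=lambda kv: -cos(kv[1], topic_vec))
    return [w for w, _ in scored[:k]]

def top_words_by_multinomial(phi_k, id2word, k=10):
    """Rank words by p(word | topic) (the 'Multinomial Result' column)."""
    order = np.argsort(phi_k)[::-1][:k]
    return [id2word[int(i)] for i in order]
```

The two rankings usually overlap heavily (as in the table above) but need not agree exactly, since one lives in embedding space and the other in probability space.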
A quick look at the train script (Python 2; `CombinedSentence`, `train_topic`, `save_topic`, and `save_wordvector` come from the modified gensim version bundled with the TWE repository):

```python
import gensim        # modified gensim version
import pre_process   # read the wordmap and the tassign file and create the sentences
import sys

if __name__ == "__main__":
    if len(sys.argv) != 4:
        print "Usage : python train.py wordmap tassign topic_number"
        sys.exit(1)
    reload(sys)
    sys.setdefaultencoding('utf-8')
    wordmapfile = sys.argv[1]
    tassignfile = sys.argv[2]
    topic_number = int(sys.argv[3])
    id2word = pre_process.load_id2word(wordmapfile)
    pre_process.load_sentences(tassignfile, id2word)
    sentence_word = gensim.models.word2vec.LineSentence("tmp/word.file")
    print "Training the word vector..."
    w = gensim.models.Word2Vec(sentence_word, size=400, workers=20)
    sentence = gensim.models.word2vec.CombinedSentence("tmp/word.file", "tmp/topic.file")
    print "Training the topic vector..."
    w.train_topic(topic_number, sentence)
    print "Saving the topic vectors..."
    w.save_topic("output/topic_vector.txt")
    print "Saving the word vectors..."
    w.save_wordvector("output/word_vector.txt")
```
3. SentenceLDA
Paper link, plus GitHub: balikasg/topicModelling
What is SentenceLDA?
an extension of LDA whose goal is to overcome this limitation by incorporating the structure of the text in the generative and inference processes.
How do SentenceLDA and LDA differ?
LDA and senLDA differ in that the second assumes a very strong dependence of the latent topics between the words of sentences, whereas the first assumes independence between the words of documents in general
A comparison experiment between SentenceLDA and LDA:
We illustrate the advantages of sentenceLDA by comparing it with LDA using both intrinsic (perplexity) and extrinsic (text classification) evaluation tasks on different text collections
Results from the original author's GitHub:
https://github.com/balikasg/topicModelling/tree/master/senLDA
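The modeling difference can be made concrete with a toy generative sketch (hypothetical helper names, not the repository's code): senLDA draws one topic per sentence and emits every word of that sentence from it, while standard LDA draws a fresh topic for each word:

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_senlda_doc(theta, phi, sentence_lengths):
    """senLDA sketch: draw ONE topic per sentence, then emit every word
    of that sentence from the chosen topic's word distribution."""
    doc = []
    for n in sentence_lengths:
        z = int(rng.choice(len(theta), p=theta))  # topic for the whole sentence
        words = rng.choice(phi.shape[1], size=n, p=phi[z])
        doc.append((z, [int(w) for w in words]))
    return doc

def generate_lda_doc(theta, phi, n_words):
    """Standard LDA sketch: draw a FRESH topic for every word."""
    out = []
    for _ in range(n_words):
        z = int(rng.choice(len(theta), p=theta))
        out.append((z, int(rng.choice(phi.shape[1], p=phi[z]))))
    return out
```

Here `theta` is the document's topic distribution and `phi[k]` the word distribution of topic k; the sentence-level tying is what lets senLDA exploit text structure during inference.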
An excerpt of the code (Python 2; `vocabulary_sentenceLayer`, `lda_sentenceLayer`, and `functions` are modules from the repository):

```python
import numpy as np, vocabulary_sentenceLayer, string, nltk.data, sys, codecs, json, time
from nltk.tokenize import sent_tokenize
from lda_sentenceLayer import lda_gibbs_sampling1
from sklearn.cross_validation import train_test_split, StratifiedKFold
from nltk.stem import WordNetLemmatizer
from sklearn.utils import shuffle
from functions import *

path2training = sys.argv[1]
training = codecs.open(path2training, 'r', encoding='utf8').read().splitlines()
topics = int(sys.argv[2])
alpha, beta = 0.5 / float(topics), 0.5 / float(topics)

voca_en = vocabulary_sentenceLayer.VocabularySentenceLayer(set(nltk.corpus.stopwords.words('english')), WordNetLemmatizer(), excluds_stopwords=True)

ldaTrainingData = change_raw_2_lda_input(training, voca_en, True)
ldaTrainingData = voca_en.cut_low_freq(ldaTrainingData, 1)
iterations = 201

classificationData, y = load_classification_data(sys.argv[3], sys.argv[4])
classificationData = change_raw_2_lda_input(classificationData, voca_en, False)
classificationData = voca_en.cut_low_freq(classificationData, 1)

final_acc, final_mif, final_perpl, final_ar, final_nmi, final_p, final_r, final_f = [], [], [], [], [], [], [], []
start = time.time()
for j in range(5):
    perpl, cnt, acc, mif, ar, nmi, p, r, f = [], 0, [], [], [], [], [], [], []
    lda = lda_gibbs_sampling1(K=topics, alpha=alpha, beta=beta, docs=ldaTrainingData, V=voca_en.size())
    for i in range(iterations):
        lda.inference()
        if i % 5 == 0:
            print "Iteration:", i, "Perplexity:", lda.perplexity()
            features = lda.heldOutPerplexity(classificationData, 3)
            print "Held-out:", features[0]
            scores = perform_class(features[1], y)
            acc.append(scores[0][0])
            mif.append(scores[1][0])
            perpl.append(features[0])
    final_acc.append(acc)
    final_mif.append(mif)
    final_perpl.append(perpl)
```
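Perplexity, the intrinsic evaluation printed in the training loop above, follows the standard definition exp(-log-likelihood / word count); a generic sketch (the repository's `lda.perplexity()` may differ in its details):

```python
import numpy as np

def perplexity(docs, theta, phi):
    """Generic LDA perplexity: exp(-L / N), where L is the corpus
    log-likelihood under p(w | d) = sum_k theta[d, k] * phi[k, w]
    and N is the total number of word tokens."""
    log_lik, n_words = 0.0, 0
    for d, doc in enumerate(docs):
        for w in doc:
            log_lik += np.log(theta[d] @ phi[:, w])
            n_words += 1
    return float(np.exp(-log_lik / n_words))
```

Lower is better: a model that assigns probability 0.5 to every token has perplexity 2, as if it were choosing uniformly among two equally likely words.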
Finally, here is the topic display from Baidu's open-source project, comparing LDA and SentenceLDA:
LDA result:

Please enter a topic id (0-1999): 105
--------------------------------------------
对话    0.189676
合作    0.0805558
中国    0.0276284
磋商    0.0269797
交流    0.021069
联合    0.0208559
国家    0.0183163
讨论    0.0154165
支持    0.0146714
包括    0.014198
The second column gives each word's importance under the topic.
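These values are the topic-word multinomial p(word | topic). From Gibbs-sampling counts they are typically estimated with Dirichlet smoothing; a minimal sketch:

```python
import numpy as np

def topic_word_probs(counts_kw, beta):
    """Dirichlet-smoothed estimate of p(word | topic) from Gibbs counts:
    phi[k, w] = (n_kw + beta) / (n_k + V * beta), where n_kw is how often
    word w was assigned topic k, n_k the topic's total count, V the
    vocabulary size, and beta the symmetric Dirichlet prior."""
    counts_kw = np.asarray(counts_kw, dtype=float)
    V = counts_kw.shape[1]
    return (counts_kw + beta) / (counts_kw.sum(axis=1, keepdims=True) + V * beta)
```

Each row of the result sums to 1, so sorting a row in descending order yields exactly the kind of ranked list shown above.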
SentenceLDA result:

Please enter a topic id (0-1999): 105
--------------------------------------------
浙江    0.0300595
浙江省  0.0290975
宁波    0.0195277
记者    0.0174735
宁波市  0.0132504
长春市  0.0123353
街道    0.0107271
吉林省  0.00954326
金华    0.00772971
公安局  0.00678163
4. CopulaLDA
CopulaLDA is by the same author as SentenceLDA; see GitHub: balikasg/topicModelling
I have not studied it in detail, so I will simply show its reported results: