Learning R: Text Mining with R (the tm Package)

Published: 2023/12/19

First, install and load the tm package.
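
The code for this step does not appear in the post. A minimal sketch, assuming tm is installed from CRAN and a default mirror is configured, could look like this:

```r
# Install tm from CRAN only if it is not already available, then attach it.
if (!requireNamespace("tm", quietly = TRUE)) {
  install.packages("tm")
}
library(tm)
```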



1. Read the text

x = readLines("222.txt")
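
To show what readLines() returns, here is a small self-contained sketch. The temporary file and its two lines are made up and only stand in for 222.txt, whose contents are not shown in the post.

```r
# Write a small stand-in file so the example does not depend on 222.txt.
path <- tempfile(fileext = ".txt")
writeLines(c("stem cells therapy", "breast cancer trial"), path)

# readLines() returns a character vector with one element per line.
x <- readLines(path, encoding = "UTF-8")
length(x)
```

Each line of the file later becomes one document in the corpus, so the file is expected to hold one record per line.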

2. Build the corpus

> r = Corpus(VectorSource(x))
> r
A corpus with 7012 text documents
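
A toy version of this step, with three made-up documents in place of the 7012 lines of the real file. The sketch uses VCorpus() rather than Corpus(); this is an assumption made for predictability, since in recent tm releases Corpus() may return a different corpus class and print differently from the output shown above.

```r
library(tm)

# Hypothetical stand-in for the lines read from 222.txt.
x <- c("stem cells therapy", "breast cancer trial", "lung cancer survival")

# VectorSource() treats each element of x as one document.
r <- VCorpus(VectorSource(x))
length(r)  # number of documents in the corpus
```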

3. Export the corpus and save it to disk

> writeCorpus(r)
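
Called with no other arguments, writeCorpus() writes one file per document into the current working directory. The sketch below writes to a temporary directory instead; the directory name, file names and documents are all hypothetical.

```r
library(tm)

docs <- c("stem cells therapy", "breast cancer trial")  # made-up documents
r <- VCorpus(VectorSource(docs))

# Write each document to its own file in a temporary directory.
out_dir <- file.path(tempdir(), "corpus_out")  # hypothetical output location
dir.create(out_dir, showWarnings = FALSE)
writeCorpus(r, path = out_dir, filenames = c("doc1.txt", "doc2.txt"))

list.files(out_dir)
```

With 7012 documents, pointing `path` at a dedicated directory keeps the working directory from filling up with small files.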


4. Inspect the corpus

> print(r)
A corpus with 7012 text documents
> summary(r)
A corpus with 7012 text documents

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

> inspect(r[2:2])
A corpus with 1 text document

The metadata consists of 2 tag-value pairs and a data frame
Available tags are:
  create_date creator
Available variables in the data frame are:
  MetaID

[[1]]
Female; Genital Neoplasms, Female/*therapy; Humans

> r[[2]]
Female; Genital Neoplasms, Female/*therapy; Humans
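
The two indexing forms used above differ in what they return. A short sketch on made-up documents, assuming a VCorpus (the printed output of newer tm versions differs from the transcript above):

```r
library(tm)

docs <- c("first made-up document", "second made-up document")
r <- VCorpus(VectorSource(docs))

one_doc_corpus <- r[2]     # single brackets: a corpus holding one document
doc <- r[[2]]              # double brackets: the document itself
text <- as.character(doc)  # plain character content of that document
```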

5. Build the document-term matrix

> dtm = DocumentTermMatrix(r)
> head(dtm)
A document-term matrix (6 documents, 16381 terms)

Non-/sparse entries: 110/98176
Sparsity           : 100%
Maximal term length: 81
Weighting          : term frequency (tf)

6. Inspect the document-term matrix

> inspect(dtm[1:2,1:4])
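
To make the structure concrete, the following sketch builds a document-term matrix from three made-up documents and converts it to an ordinary matrix. Rows are documents, columns are terms, and each cell is a term count.

```r
library(tm)

docs <- c("stem cells therapy", "breast cancer therapy", "stem cells growth")
r <- VCorpus(VectorSource(docs))
dtm <- DocumentTermMatrix(r)

# Dense copy: fine for toy data, but costly for a large, sparse matrix.
m <- as.matrix(dtm)
dim(m)
m[1:2, ]
```

For the real corpus (7012 documents), inspecting a small slice as in `inspect(dtm[1:2,1:4])` is the safer choice, since the full dense matrix would be very large.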

7. Find terms that occur at least 200 times

> findFreqTerms(dtm,200)
 [1] "acute"        "adjuvant"     "advanced"     "after"
 [5] "and"          "breast"       "cancer"       "cancer:"
 [9] "carcinoma"    "cell"         "chemotherapy" "clinical"
[13] "colorectal"   "factor"       "for"          "from"
[17] "group"        "growth"       "iii"          "leukemia"
[21] "lung"         "lymphoma"     "metastatic"   "non-small-cell"
[25] "oncology"     "patients"     "phase"        "plus"
[29] "prostate"     "randomized"   "receptor"     "response"
[33] "results"      "risk"         "study"        "survival"
[37] "the"          "therapy"      "treatment"    "trial"
[41] "tumor"        "with"
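
The second argument of findFreqTerms() is a lower bound on a term's total count across all documents. The sketch below, on made-up documents with a threshold of 2 instead of 200, compares it with the same selection computed by hand from the column sums.

```r
library(tm)

docs <- c("cancer therapy trial", "cancer therapy", "cancer survival")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Terms whose total count over all documents is at least 2.
frequent <- findFreqTerms(dtm, lowfreq = 2)

# The same selection computed by hand from total term counts.
totals <- colSums(as.matrix(dtm))
by_hand <- names(totals[totals >= 2])
```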

7. Remove sparse terms (terms that appear in few documents)

inspect(removeSparseTerms(dtm, 0.4))
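
The threshold 0.4 is the maximum sparsity allowed: a term is kept only if the share of documents that lack it stays below that value. A sketch on five made-up documents, with the per-term sparsity computed by hand for comparison:

```r
library(tm)

docs <- c("cancer therapy trial", "cancer therapy", "cancer survival",
          "cancer therapy growth", "cancer therapy response")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))

# Fraction of documents in which each term is absent (its sparsity).
sparsity <- colMeans(as.matrix(dtm) == 0)

# Keep only the terms whose sparsity is below the 0.4 threshold.
small <- removeSparseTerms(dtm, 0.4)
Terms(small)
```

Here "cancer" (in every document) and "therapy" (in four of five) are kept, while terms that occur in a single document are dropped. A value as strict as 0.4 removes most of the vocabulary of a real corpus, so it is worth checking how many terms remain.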

8. Find terms whose correlation with "stem" is at least 0.5

> findAssocs(dtm, "stem", 0.5)
 stem cells
 1.00  0.61
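
The association score is the correlation between the count columns of two terms. The sketch below uses four made-up documents and computes two such correlations directly with cor(). The return format of findAssocs() differs between tm versions (newer ones return a list), so the sketch does not depend on its printed form.

```r
library(tm)

docs <- c("stem cells therapy", "stem cells growth",
          "breast cancer therapy", "lung cancer trial")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(docs)))
m <- as.matrix(dtm)

# Pearson correlation between the count columns of two terms.
cor_cells  <- cor(m[, "stem"], m[, "cells"])   # always occur together
cor_cancer <- cor(m[, "stem"], m[, "cancer"])  # never occur together

# Terms whose correlation with "stem" is at least 0.5.
assoc <- findAssocs(dtm, "stem", 0.5)
```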

9. Compute document dissimilarity (cosine distance)

> dist_dtm <- dissimilarity(dtm, method = 'cosine')
> head(dist_dtm)
[1] 1.0000000 0.7958759 0.8567770 0.9183503 0.9139337 0.9309934
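
dissimilarity() belongs to the older tm releases this post was written against. As far as I know, it was dropped around tm 0.6 in favour of proxy::dist(). The sketch below therefore computes cosine dissimilarity in base R on a made-up count matrix. It is an illustration of the measure, not the tm implementation.

```r
# Toy document-term counts (made up): rows are documents, columns are terms.
m <- rbind(doc1 = c(1, 1, 0),
           doc2 = c(2, 2, 0),
           doc3 = c(0, 0, 1))

norms <- sqrt(rowSums(m^2))                # Euclidean length of each row
sim <- (m %*% t(m)) / outer(norms, norms)  # cosine similarity between rows
dist_dtm <- as.dist(1 - sim)               # dissimilarity as a "dist" object

# Note: a document with no terms has norm 0 and would produce NaN here.
dist_dtm
```

doc1 and doc2 point in the same direction, so their dissimilarity is 0; doc3 shares no terms with either, so its dissimilarity to both is 1.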

10. Clustering

> hc <- hclust(dist_dtm, method = 'ave')
> plot(hc, xlab = '')
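
A self-contained sketch of this step on four made-up documents, using base R only. The cosine dissimilarities are computed by hand, and cutree() is added here (it is not in the original post) to turn the dendrogram into explicit cluster labels.

```r
# Toy document-term counts (made up): two documents per topic.
m <- rbind(doc1 = c(2, 1, 0, 0),
           doc2 = c(1, 1, 0, 0),
           doc3 = c(0, 0, 1, 2),
           doc4 = c(0, 0, 1, 1))

# Cosine dissimilarity between rows, as a "dist" object.
norms <- sqrt(rowSums(m^2))
dist_dtm <- as.dist(1 - (m %*% t(m)) / outer(norms, norms))

# Average-linkage hierarchical clustering ('ave' abbreviates "average").
hc <- hclust(dist_dtm, method = "average")

# Cut the tree into two clusters; returns one label per document.
groups <- cutree(hc, k = 2)

# Draw the dendrogram only in an interactive session.
if (interactive()) plot(hc, xlab = "")
```

With 7012 documents the dendrogram is hard to read, so cutting the tree at a chosen number of clusters is often more practical than plotting it.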


Reposted from: https://www.cnblogs.com/todoit/archive/2012/07/13/2589741.html
