Can MATLAB do word-cloud analysis? Generating a word cloud from Douban short-review data

Published: 2023/12/10

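In the previous post, the short reviews scraped from Douban were collected into a single file. Data that never gets used is meaningless, though.

As mentioned before, a WeChat article documented a hands-on word-cloud build, and this post tries to reproduce it.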

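Approach

Reading the file

Read the file in with with open() as .... Keep the size of the file in mind here: a small file can be read in one go, while a large one is better read in pieces.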
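
A minimal sketch of what reading in pieces could look like, assuming a UTF-8 text file. The helper name read_in_chunks and the default chunk size are illustrative assumptions, not something from the original post.

```python
from typing import Iterator


def read_in_chunks(path: str, chunk_chars: int = 1024 * 1024) -> Iterator[str]:
    """Yield successive pieces of a UTF-8 text file (hypothetical helper)."""
    with open(path, "r", encoding="utf-8") as f:
        while True:
            # In text mode, read(n) returns at most n characters, not bytes
            chunk = f.read(chunk_chars)
            if not chunk:
                break
            yield chunk
```

Each chunk can then be cleaned and segmented on its own. Iterating over the file object line by line is another option when the reviews are stored one per line.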

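Word segmentation

The scraped data is a large body of short-review text, whereas a word cloud is built from individual words. No words, no word cloud, so the first requirement is to split the review text into individual Chinese words.

This is where jieba comes in, a library I had heard of that splits Chinese sentences into words. I use its lcut() method, which takes a Chinese string and returns a list in which each item is a word.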

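Removing non-Chinese characters

The analysis only needs Chinese text, so every non-Chinese character has to be stripped out. A regular expression does the job: the compact pattern [\u4e00-\u9fa5]+ matches runs of Chinese characters.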
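
As a small illustration of that pattern, the sketch below runs it over a made-up review string (the sample text is an assumption, not data from the post). Note that the range covers the common CJK ideographs only, so punctuation, Latin letters, digits, and rarer characters outside it are all dropped.

```python
import re

# Same character range as the pattern used in the post
CHINESE_RUN = re.compile(r"[\u4e00-\u9fa5]+")

sample = "秘密森林 Stranger, 2017!! 真的很好看~~ 10/10"  # hypothetical review text
runs = CHINESE_RUN.findall(sample)

print(runs)           # ['秘密森林', '真的很好看']
print("".join(runs))  # 秘密森林真的很好看
```

Joining the runs with an empty string, as the full listing does, glues text from different sentences together; that may occasionally produce words spanning the old boundary during segmentation.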

When working with regular expressions, my habit is to test them first in one of the online regex tools. The tool site from oschina is well made and also provides some examples.

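Handling stop words

Many of these words carry no analytical value, so a stop-word file is used to drop the unnecessary ones.

The reference article reads the stop-word file with read_csv() from the pandas library and then filters the stop words out with the isin() method.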
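
The filtering idea itself can be sketched without pandas, using a plain set. This is a swapped-in alternative to the read_csv()/isin() approach, shown only to make the logic visible; the words and stop words below are made up.

```python
# Hypothetical data: in the post these come from jieba.lcut() and stopwords.txt
words = ["这部", "剧", "的", "剧情", "真的", "很", "紧凑", "的"]
stopwords = {"的", "很", "这部"}

# Keep only the words that are not in the stop-word set (order is preserved)
kept = [w for w in words if w not in stopwords]

print(kept)  # ['剧', '剧情', '真的', '紧凑']
```

In the pandas version, isin() builds a boolean column marking the stop words and the ~ operator inverts it, so the row selection keeps exactly the same words.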

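Aggregation

The words are now split apart and mostly cleaned, so the next thing to think about is building the word cloud.

The goal is to highlight the core opinions that matter across all the reviews, which is why the text was segmented into words.

This is admittedly a somewhat "out of context" approach: the cloud relies on the word-frequency distribution and emphasizes the high-frequency words.

So what remains is simply to aggregate the words, that is, to count word frequencies.

The reference article uses the DataFrame class's grouping method groupby() and aggregation method agg().

There, the reference article passes an explicit dict to agg() (see the reference article) that maps to numpy.size. That usage is apparently going to be removed. According to what I read, agg() itself can still be used this way; you just can no longer supply your own dict.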

FutureWarning: using a dict on a Series for aggregation is deprecated and will be removed in a future version
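
One way to count the words without triggering this warning is sketched below. It assumes a DataFrame with a words column, as in the post; the sample words are made up, and groupby().size() is a substitute for the dict-based agg() call in the reference article, not that article's own code.

```python
import pandas as pd

# Hypothetical cleaned words
words_df = pd.DataFrame({"words": ["好看", "剧情", "好看", "演技", "好看", "剧情"]})

# size() counts the rows in each group;
# reset_index(name=...) turns the result into a DataFrame and names the count column
words_stat = (
    words_df.groupby("words")
    .size()
    .reset_index(name="size")
    .sort_values("size", ascending=False)
)

print(words_stat)
```

As far as I know, passing a dict to agg() on a single column in order to rename the output was removed in pandas 1.0, so code written against newer versions needs a form like this one.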

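Word cloud

The third-party library wordcloud is used here. Installing it directly with pip install wordcloud failed for me with an error about Microsoft build tools. After a lot of fiddling, I ended up downloading its whl file from a site that hosts a very rich collection of third-party packages and installing that with pip install.

After that it worked normally.

To display and process the images, matplotlib.pyplot and numpy are used as well.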

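Mask settings

According to the README on the wordcloud project home page, a binary image can be used to define the mask.

To make the data more expressive, and also for the sake of learning, I wrote the functions rgb2gray() and gray2bw() by hand to convert a true-color image into a binary image, which produced the final binary mask.

At first I did not know what kind of image was needed. Only after seeing that the example code uses a binary image did I understand, having wasted a lot of time.

Also, as I understand it, converting a color image to a binary image has to pass through a grayscale image along the way.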
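
The two steps can be sketched on a tiny made-up image, assuming 8-bit RGB values. The channel weights and the threshold of 50 are the ones used in the post's listing; the vectorized np.where call is a substitute for a per-pixel loop.

```python
import numpy as np

# Hypothetical 2x2 RGB image with 8-bit channel values
rgb = np.array(
    [[[255, 255, 255], [0, 0, 0]],
     [[200, 30, 30], [10, 10, 60]]],
    dtype=np.uint8,
)

# Step 1: color -> grayscale, as a weighted sum of the R, G, B channels
gray = rgb[..., :3] @ np.array([0.299, 0.587, 0.114])

# Step 2: grayscale -> binary (brighter than 50 becomes 0, everything else 255)
bw = np.where(gray > 50, 0, 255).astype(np.uint8)

print(bw)
```

With this rule the bright regions of the picture become 0 and the dark regions become 255. Since wordcloud treats white entries as masked out, the words end up on what was originally the bright part of the image.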

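People online say that the syntax of matplotlib.pyplot closely resembles MATLAB's, which I had some exposure to before. So when I saw imshow() in the example, imread() came to mind naturally, and that is how the image is read.

While reading the documentation I found something interesting.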

Return value is a numpy.array. For grayscale images, the return array is MxN. For RGB images, the return value is MxNx3. For RGBA images the return value is MxNx4.

matplotlib can only read PNGs natively, but if PIL is installed, it will use it to load the image and return an array (if possible) which can be used with imshow(). Note, URL strings may not be compatible with PIL. Check the PIL documentation for more information.

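This post uses a JPG image, so PIL must have been used to load it.

Getting the binary image involved a wrong turn at first. Most of what turned up in searches uses the open() and convert() methods of PIL's Image module, passing the mode argument 1 to convert to a binary image. With that approach, however, wordcloud later complained about a missing attribute, so it is not suitable here.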

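Word cloud settings

wordcloud supports a custom font, background color, mask, and so on. You can jump to the source file from your IDE, where all of them are documented.

The listing at the end of this post is an excerpt of some of the parameters.

Choosing word frequencies

Using the data aggregated and sorted above, the top 1000 words are selected, combined into a dictionary, and passed to the fit_words() method of the WordCloud instance to generate the cloud.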
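
The shape of that dictionary (word mapped to count) can be illustrated with the standard library alone. This sketch uses collections.Counter instead of the pandas result, with made-up words and a top-N of 2 rather than 1000.

```python
from collections import Counter

# Hypothetical cleaned word list
words = ["好看", "剧情", "好看", "演技", "好看", "剧情", "节奏"]

top_n = 2  # the post uses 1000
# most_common() returns (word, count) pairs sorted by count, highest first
word_frequency = dict(Counter(words).most_common(top_n))

print(word_frequency)  # {'好看': 3, '剧情': 2}
```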

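Displaying the word cloud

A few functions from matplotlib.pyplot are used to save the image, display it, and hide the axes.

I do have one small doubt here: I cannot quite tell imshow() and show() apart, and the documentation did not make it clear to me either. The most obvious difference is that the latter depends on a figure window, while the former does not seem to need one.

If anyone understands this, please leave a comment or send me an email.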
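
On the imshow()/show() question, my understanding is that imshow() only draws an array onto the current axes as an image, while show() is the call that actually displays the open figures in a window. The sketch below is one way to see this, assuming matplotlib and numpy are installed; the random array and the output file name are stand-ins.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: no window is ever opened
import matplotlib.pyplot as plt
import numpy as np

image = np.random.rand(20, 30)  # stand-in for the word-cloud array

fig, ax = plt.subplots()
artist = ax.imshow(image, interpolation="bilinear")  # draws the array onto the axes
ax.axis("off")

out_path = os.path.join(tempfile.gettempdir(), "imshow_demo.png")
fig.savefig(out_path)  # the drawn figure can be saved without any window

# plt.show() would be the step that opens a window under an interactive
# backend; it is left out here because Agg cannot display anything.
plt.close(fig)
```

This would also explain why imshow() appears not to need a window: it only prepares the picture, and some IDE consoles render figures inline without an explicit show().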

Full code

# -*- coding: utf-8 -*-
"""
Created on Thu Aug 17 16:31:35 2017

@note: imports are placed next to the code that uses them, for readability

@author: lart
"""

# Read the previously scraped file. It is small, so it is read in one go;
# a large file would be better read in pieces.
with open('秘密森林的短評.txt', 'r', encoding='utf-8') as file:
    comments = file.readlines()
comment = ''.join(comments)

# Keep only the Chinese characters. This was not done at download time,
# which conveniently preserves the raw data.
import re

pattern = re.compile(r'[\u4e00-\u9fa5]+')
data = pattern.findall(comment)
filted_comment = ''.join(data)

# Word segmentation
import jieba

word = jieba.lcut(filted_comment)

# Organize the words into a DataFrame
import pandas as pd

words_df = pd.DataFrame({'words': word})

# Stop-word settings. quoting=3 (QUOTE_NONE) disables quote handling.
stopwords = pd.read_csv(
    "stopwords.txt",
    index_col=False,
    quoting=3,
    sep="\t",
    names=['stopword'],
    encoding='utf-8'
)
words_df = words_df[~words_df.words.isin(stopwords.stopword)]

# Aggregation: count each word, most frequent first.
# size() replaces the deprecated dict-based agg() call.
words_stat = words_df.groupby('words').size().reset_index(name='size')
words_stat = words_stat.sort_values("size", ascending=False)

# Word cloud settings
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import numpy as np


def rgb2gray(rgb):
    # Weighted sum of the R, G, B channels
    return np.dot(rgb[..., :3], [0.299, 0.587, 0.114])


def gray2bw(gray):
    # Threshold at 50: brighter pixels become 0, the rest become 255
    # (white entries are treated as masked out by wordcloud)
    return np.where(gray > 50, 0, 255).astype(np.uint8)


img = plt.imread('4.jpg')
mask = rgb2gray(img)
bw = gray2bw(mask)

wordcloud = WordCloud(
    font_path="YaHei Consolas Hybrid.ttf",
    background_color="white",
    mask=bw,
    max_font_size=80
)

# word_frequence is a dict and can be passed straight to wordcloud.fit_words()
word_frequence = {
    x[0]: x[1] for x in words_stat.head(1000).values
}
wordcloud = wordcloud.fit_words(word_frequence)

# Save and display
plt.imsave('img.jpg', wordcloud.to_array())

plt.subplot(131)
plt.imshow(img)
plt.axis("off")

plt.subplot(132)
plt.imshow(bw)
plt.axis("off")

plt.subplot(133)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")

# Needed to open the figure window when running as a plain script
plt.show()

Results

Original image used for the mask: a still from Secret Forest (秘密森林)

Output image

IDE output

Stop-word file

WordCloud parameters (excerpt from the library's docstring)

Parameters
----------
font_path : string
    Font path to the font that will be used (OTF or TTF).
    Defaults to DroidSansMono path on a Linux machine. If you are on
    another OS or don't have this font, you need to adjust this path.

width : int (default=400)
    Width of the canvas.

height : int (default=200)
    Height of the canvas.

prefer_horizontal : float (default=0.90)
    The ratio of times to try horizontal fitting as opposed to vertical.
    If prefer_horizontal < 1, the algorithm will try rotating the word
    if it doesn't fit. (There is currently no built-in way to get only
    vertical words.)

mask : nd-array or None (default=None)
    If not None, gives a binary mask on where to draw words. If mask is not
    None, width and height will be ignored and the shape of mask will be
    used instead. All white (#FF or #FFFFFF) entries will be considered
    "masked out" while other entries will be free to draw on. [This
    changed in the most recent version!]

scale : float (default=1)
    Scaling between computation and drawing. For large word-cloud images,
    using scale instead of larger canvas size is significantly faster, but
    might lead to a coarser fit for the words.

min_font_size : int (default=4)
    Smallest font size to use. Will stop when there is no more room in this
    size.

font_step : int (default=1)
    Step size for the font. font_step > 1 might speed up computation but
    give a worse fit.

max_words : number (default=200)
    The maximum number of words.

stopwords : set of strings or None
    The words that will be eliminated. If None, the built-in STOPWORDS
    list will be used.

background_color : color value (default="black")
    Background color for the word cloud image.

max_font_size : int or None (default=None)
    Maximum font size for the largest word. If None, height of the image is
    used.

mode : string (default="RGB")
    Transparent background will be generated when mode is "RGBA" and
    background_color is None.

relative_scaling : float (default=.5)
    Importance of relative word frequencies for font-size. With
    relative_scaling=0, only word-ranks are considered. With
    relative_scaling=1, a word that is twice as frequent will have twice
    the size. If you want to consider the word frequencies and not only
    their rank, relative_scaling around .5 often looks good.

    .. versionchanged: 2.0
        Default is now 0.5.

color_func : callable, default=None
    Callable with parameters word, font_size, position, orientation,
    font_path, random_state that returns a PIL color for each word.
    Overwrites "colormap".
    See colormap for specifying a matplotlib colormap instead.

regexp : string or None (optional)
    Regular expression to split the input text into tokens in process_text.
    If None is specified, ``r"\w[\w']+"`` is used.

collocations : bool, default=True
    Whether to include collocations (bigrams) of two words.

    .. versionadded: 2.0

colormap : string or matplotlib colormap, default="viridis"
    Matplotlib colormap to randomly draw colors from for each word.
    Ignored if "color_func" is specified.

    .. versionadded: 2.0

normalize_plurals : bool, default=True
    Whether to remove trailing 's' from words. If True and a word
    appears with and without a trailing 's', the one with trailing 's'
    is removed and its counts are added to the version without
    trailing 's' -- unless the word ends with 'ss'.
