Python wordcloud: Source Code Analysis and Basic Usage
The Python word cloud module, wordcloud, has gone from v1.0 in 2015 to v1.7 at the time of writing.
Download it from: https://pypi.org/project/wordcloud/
Basic usage of wordcloud:
import jieba
import wordcloud

# font_path must point to a font with Chinese glyphs (msyh.ttc is Microsoft YaHei)
w = wordcloud.WordCloud(width=600, height=600, background_color='white', font_path='msyh.ttc')
# Sample input: a Chinese forum post, kept in Chinese because jieba segments Chinese text
text = '看到此标题,我也是感慨万千 首先弄清楚搞IT和被IT搞,谁是搞IT的?马云就是,马化腾也是,刘强东也是,他们都是叫搞IT的, 但程序员只是被IT搞的人,可以比作盖楼砌砖的泥瓦匠,你想想,四十岁的泥瓦匠能跟二十左右岁的年轻人较劲吗?如果你是老板你会怎么做?程序员只是技术含量高的泥瓦匠,社会是现实的,社会的现实是什么?利益驱动。当你跑的速度不比以前快了时,你就会被挨鞭子赶,这种窘境如果在做程序员当初就预料到的话,你就会知道,到达一定高度时,你需要改变行程。 程序员其实真的不是什么好职业,技术每天都在更新,要不停的学,你以前学的每天都在被淘汰,加班可能是标配了吧。 热点,你知道什么是热点吗?社会上啥热就是热点,我举几个例子:在早淘宝之初,很多人都觉得做淘宝能让自己发展,当初的规则是产品按时间轮候展示,也就是你的商品上架时间一到就会被展示,不论你星级多高。这种一律平等的条件固然好,但淘宝随后调整了显示规则,对产品和店铺,销量进行了加权,一下导致小卖家被弄到了很深的胡同里,没人看到自己的产品,如何卖?做广告费用也非常高,入不敷出,想必做过淘宝的都知道,再后来淘宝弄天猫,显然,天猫是上档次的商城,不同于淘宝的摆地摊,因为摊位费涨价还闹过事,闹也白闹,你有能力就弄,没能力就淘汰掉。前几天淘宝又推出C2M,客户反向定制,客户直接挂钩大厂家,没你小卖家什么事。 后来又出现了微商,在微商出现当天我就知道这东西不行,它比淘宝假货还下三滥.我对TX一直有点偏见,因为骗子都使用QQ 我说这么多只想说一个事,世界是变化的,你只能适应变化,否则就会被淘汰。 还是回到热点这个话题,育儿嫂这个职位有很多人了解吗?前几年放开二胎后,这个职位迅速串红,我的一个亲戚初中毕业,现在已经月入一万五,职务就是照看刚出生的婴儿28天,节假日要双薪。 你说这难到让我一个男的去当育儿嫂吗?扯,我只是说热点问题。你没踩在热点上,你赚钱就会很费劲 这两年的热点是什么?短视频,你可以看到抖音的一些作品根本就不是普通人能实现的,说明专业级人才都开始努力往这上使劲了。 我只会编程,别的不会怎么办?那你就去编程。没人用了怎么办?你看看你自己能不能雇佣你自己 学会适应社会,学会改变自己去适应社会 最后说一句:科大讯飞的刘鹏说的是对的。那我为什么还做程序员?他可以完成一些原始积累,只此而已。'
# Segment the Chinese text with jieba and join the tokens with spaces,
# so that wordcloud's regex-based tokenizer can pick them out as words
new_str = ' '.join(jieba.lcut(text))
w.generate(new_str)
w.to_file('x.png')

Now let's analyze the source code:
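The main steps the wordcloud source takes to produce a word cloud image are:
1. Split the text into words
2. Generate the word cloud
3. Save the image
We start from generate(self, text) and find that it merely calls another method on the same object: self.generate_from_text(text)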
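def generate_from_text(self, text):
    """Generate wordcloud from text."""
    words = self.process_text(text)  # split the text into words
    self.generate_from_frequencies(words)  # the main word cloud method (analyzed in detail below)
    return self

The source of process_text() is shown below. Its logic is fairly simple: split the text into words, strip trailing 's, remove numbers, remove short words, remove stopwords, and so on.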
def process_text(self, text):
    """Splits a long text into words, eliminates the stopwords.

    Parameters
    ----------
    text : string
        The text to be processed.

    Returns
    -------
    words : dict (string, int)
        Word tokens with associated frequency.

    ..versionchanged:: 1.2.2
        Changed return type from list of tuples to dict.

    Notes
    -----
    There are better ways to do word tokenization, but I don't want to
    include all those things.
    """

    flags = (re.UNICODE if sys.version < '3' and type(text) is unicode
             else 0)
    regexp = self.regexp if self.regexp is not None else r"\w[\w']+"

    # tokenize
    words = re.findall(regexp, text, flags)
    # strip trailing 's
    words = [word[:-2] if word.lower().endswith("'s") else word
             for word in words]
    # remove numbers
    if not self.include_numbers:
        words = [word for word in words if not word.isdigit()]
    # remove short words: anything shorter than min_word_length is filtered out
    if self.min_word_length:
        words = [word for word in words if len(word) >= self.min_word_length]

    # remove stopwords
    stopwords = set([i.lower() for i in self.stopwords])
    if self.collocations:
        word_counts = unigrams_and_bigrams(words, stopwords, self.normalize_plurals, self.collocation_threshold)
    else:
        # remove stopwords
        words = [word for word in words if word.lower() not in stopwords]
        word_counts, _ = process_tokens(words, self.normalize_plurals)

    return word_counts
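To make the filtering order concrete, here is a minimal standalone sketch of the same pipeline. It is not the library's code: simple_process_text is a hypothetical helper of my own, it uses only the standard library, and it skips the collocation (bigram) handling and the plural/case merging that process_tokens performs in the real implementation.

```python
import re
from collections import Counter


def simple_process_text(text, stopwords=(), min_word_length=0, include_numbers=False):
    """Hypothetical, simplified stand-in for WordCloud.process_text."""
    # Same default pattern as the quoted source: one word character followed
    # by at least one more word character or apostrophe.
    words = re.findall(r"\w[\w']+", text)
    # Strip a trailing 's ("Python's" -> "Python").
    words = [w[:-2] if w.lower().endswith("'s") else w for w in words]
    # Drop purely numeric tokens unless asked to keep them.
    if not include_numbers:
        words = [w for w in words if not w.isdigit()]
    # Drop tokens shorter than the configured minimum length.
    if min_word_length:
        words = [w for w in words if len(w) >= min_word_length]
    # Drop stopwords, compared case-insensitively.
    stop = {s.lower() for s in stopwords}
    words = [w for w in words if w.lower() not in stop]
    return dict(Counter(words))


counts = simple_process_text(
    "Python's word cloud: the cloud of 2015 words",
    stopwords={"the", "of"},
)
print(counts)
```

One thing the sketch makes easy to see: the default pattern needs at least two characters, so single-character tokens never reach the counting stage. That also applies to single-character tokens produced by jieba.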
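Now for the main event.
The body of generate_from_frequencies(self, frequencies, max_font_size=None) is fairly long. Overall it breaks down into the following steps:
1. Sort
2. Normalize the word frequencies
3. Create the drawing object
4. Determine the initial font size
5. Extend the word set
6. Determine each word's font size, position, rotation, color, and so on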
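Steps 1, 2 and 5 are plain list manipulation, so they can be tried out without any drawing code. The following is a rough sketch under my own naming: normalize_frequencies and extend_with_repeats are hypothetical helpers, not functions of the library, and the real method only pads the list when its repeat option is enabled.

```python
import math
from operator import itemgetter


def normalize_frequencies(frequencies, max_words=200):
    """Steps 1-2: sort descending, truncate, and scale so the top word is 1.0."""
    if not frequencies:
        raise ValueError("need at least one word")
    items = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
    items = items[:max_words]
    max_frequency = float(items[0][1])
    return [(word, freq / max_frequency) for word, freq in items]


def extend_with_repeats(frequencies, max_words):
    """Step 5: pad the list with down-weighted copies of the original words."""
    extended = list(frequencies)
    if len(extended) >= max_words:
        return extended
    times_extend = int(math.ceil(max_words / len(extended))) - 1
    original = list(extended)
    downweight = original[-1][1]  # smallest normalized frequency
    for i in range(times_extend):
        # Each extra round multiplies by the smallest frequency once more,
        # so repeated words always rank below the previous round.
        extended.extend((word, freq * downweight ** (i + 1)) for word, freq in original)
    return extended


normalized = normalize_frequencies({"cloud": 4, "word": 2, "python": 1})
padded = extend_with_repeats(normalized, max_words=6)
print(normalized)
print(padded)
```

With three input words and max_words=6, one extra round is appended, and every repeated word gets its original frequency multiplied by 0.25, the smallest normalized frequency.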
The source is as follows (with comments added based on my own understanding):
def generate_from_frequencies(self, frequencies, max_font_size=None):
    """Create a word_cloud from words and frequencies.

    Parameters
    ----------
    frequencies : dict from string to float
        A contains words and associated frequency.

    max_font_size : int
        Use this font-size instead of self.max_font_size

    Returns
    -------
    self

    """
    # make sure frequencies are sorted and normalized
    # 1. Sort
    # Sort the (word, frequency) pairs by frequency, in descending order
    frequencies = sorted(frequencies.items(), key=itemgetter(1), reverse=True)
    if len(frequencies) <= 0:
        raise ValueError("We need at least 1 word to plot a word cloud, "
                         "got %d." % len(frequencies))
    # Keep at most max_words words; anything beyond that is discarded
    frequencies = frequencies[:self.max_words]

    # largest entry will be 1
    # The first word's frequency is the maximum frequency
    max_frequency = float(frequencies[0][1])

    # 2. Normalize the word frequencies
    # The words are already sorted, so after normalization the list looks like:
    # [('xxx', 1), ('xxx', 0.96), ('xxx', 0.87), ...]
    frequencies = [(word, freq / max_frequency)
                   for word, freq in frequencies]

    # Random object, used to draw a random number that decides whether to rotate by 90 degrees
    if self.random_state is not None:
        random_state = self.random_state
    else:
        random_state = Random()

    if self.mask is not None:
        boolean_mask = self._get_bolean_mask(self.mask)
        width = self.mask.shape[1]
        height = self.mask.shape[0]
    else:
        boolean_mask = None
        height, width = self.height, self.width
    # Used to find where a word can be placed, e.g. blank (non-text) areas inside the valid region of the image
    occupancy = IntegralOccupancyMap(height, width, boolean_mask)

    # 3. Create the drawing object
    # create image
    img_grey = Image.new("L", (width, height))
    draw = ImageDraw.Draw(img_grey)
    img_array = np.asarray(img_grey)
    font_sizes, positions, orientations, colors = [], [], [], []

    last_freq = 1.

    # 4. Determine the initial font size
    # Determine the maximum font size
    if max_font_size is None:
        # if not provided use default font_size
        max_font_size = self.max_font_size

    # If the maximum font size is still None, work out a value to use as the initial font size
    if max_font_size is None:
        # figure out a good font size by trying to draw with
        # just the first two words
        if len(frequencies) == 1:
            # we only have one word. We make it big!
            font_size = self.height
        else:
            # Recurse into this same function to obtain a self.layout_ that
            # contains only the two most frequent words, then compute the
            # initial font size from the sizes at which those two were placed
            self.generate_from_frequencies(dict(frequencies[:2]),
                                           max_font_size=self.height)
            # find font sizes
            sizes = [x[1] for x in self.layout_]
            try:
                font_size = int(2 * sizes[0] * sizes[1]
                                / (sizes[0] + sizes[1]))
            # quick fix for if self.layout_ contains less than 2 values
            # on very small images it can be empty
            except IndexError:
                try:
                    font_size = sizes[0]
                except IndexError:
                    raise ValueError(
                        "Couldn't find space to draw. Either the Canvas size"
                        " is too small or too much of the image is masked "
                        "out.")
    else:
        font_size = max_font_size

    # we set self.words_ here because we called generate_from_frequencies
    # above... hurray for good design?
    self.words_ = dict(frequencies)

    # 5. Extend the word set
    # If there are fewer words than max_words, extend the word set to reach the maximum
    if self.repeat and len(frequencies) < self.max_words:
        # pad frequencies with repeating words.
        times_extend = int(np.ceil(self.max_words / len(frequencies))) - 1
        # get smallest frequency
        frequencies_org = list(frequencies)
        downweight = frequencies[-1][1]
        # Extend the word list; the added frequencies keep following the original decreasing pattern
        for i in range(times_extend):
            frequencies.extend([(word, freq * downweight ** (i + 1))
                                for word, freq in frequencies_org])

    # 6. Determine each word's font size, position, rotation, color, and so on
    # start drawing grey image
    for word, freq in frequencies:
        if freq == 0:
            continue
        # select the font size
        rs = self.relative_scaling
        if rs != 0:
            font_size = int(round((rs * (freq / float(last_freq))
                                   + (1 - rs)) * font_size))
        if random_state.random() < self.prefer_horizontal:
            orientation = None
        else:
            orientation = Image.ROTATE_90
        tried_other_orientation = False
        # Look for a possible position. If one attempt fails, try the other
        # text orientation or a smaller font size, and search again,
        # until a position is found or the font size drops below the minimum
        while True:
            # try to find a position
            font = ImageFont.truetype(self.font_path, font_size)
            # transpose font optionally
            transposed_font = ImageFont.TransposedFont(
                font, orientation=orientation)
            # get size of resulting text
            box_size = draw.textsize(word, font=transposed_font)
            # find possible places using integral image:
            result = occupancy.sample_position(box_size[1] + self.margin,
                                               box_size[0] + self.margin,
                                               random_state)
            if result is not None or font_size < self.min_font_size:
                # either we found a place or font-size went too small
                break
            # if we didn't find a place, make font smaller
            # but first try to rotate!
            if not tried_other_orientation and self.prefer_horizontal < 1:
                orientation = (Image.ROTATE_90 if orientation is None else
                               Image.ROTATE_90)
                tried_other_orientation = True
            else:
                font_size -= self.font_step
                orientation = None

        if font_size < self.min_font_size:
            # we were unable to draw any more
            break

        # Collect this word's information: font size, position, rotation, color
        x, y = np.array(result) + self.margin // 2
        # actually draw the text
        # This drawing is only used to find positions for the remaining words; it is not
        # the final word cloud image, which is produced by a different function: to_image
        draw.text((y, x), word, fill="white", font=transposed_font)
        positions.append((x, y))
        orientations.append(orientation)
        font_sizes.append(font_size)
        colors.append(self.color_func(word, font_size=font_size,
                                      position=(x, y),
                                      orientation=orientation,
                                      random_state=random_state,
                                      font_path=self.font_path))
        # recompute integral image
        if self.mask is None:
            img_array = np.asarray(img_grey)
        else:
            img_array = np.asarray(img_grey) + boolean_mask
        # recompute bottom right
        # the order of the cumsum's is important for speed ?!
        occupancy.update(img_array, x, y)
        last_freq = freq

    # layout_ is the list of per-word information. Each entry holds the word, its frequency,
    # font size, position, rotation, and color, ready for the drawing done in later steps.
    self.layout_ = list(zip(frequencies, font_sizes, positions,
                            orientations, colors))
    return self
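The two font-size formulas in the method are easier to follow with numbers plugged in. The sketch below lifts just the arithmetic out of the method. initial_font_size and next_font_size are names I made up for illustration, and the sketch ignores the extra shrinking by font_step that happens when a word cannot be placed.

```python
def initial_font_size(size_a, size_b):
    """Step 4: harmonic mean of the sizes at which the top two words were placed."""
    return int(2 * size_a * size_b / (size_a + size_b))


def next_font_size(font_size, freq, last_freq, relative_scaling):
    """Step 6: font-size update applied before each word is placed."""
    rs = relative_scaling
    if rs == 0:
        # With relative_scaling == 0 the frequency ratio is ignored entirely.
        return font_size
    return int(round((rs * (freq / float(last_freq)) + (1 - rs)) * font_size))


start = initial_font_size(240, 120)

sizes = []
size, last = start, 1.0
for freq in [1.0, 0.5, 0.25]:
    size = next_font_size(size, freq, last, relative_scaling=0.5)
    sizes.append(size)
    last = freq

print(start, sizes)
```

Note that the update is relative to the previous word's frequency and to the current font size, not to the maximum, so the sizes form a chain: with relative_scaling=0.5, halving the frequency shrinks the font to 75% of its previous value.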
Note: in step 6, when determining positions, the program uses a loop together with random numbers to search for a suitable placement. The relevant source is:
# Look for a possible position. If one attempt fails, try the other
# text orientation or a smaller font size, and search again,
# until a position is found or the font size drops below the minimum
while True:
    # try to find a position
    font = ImageFont.truetype(self.font_path, font_size)
    # transpose font optionally
    transposed_font = ImageFont.TransposedFont(
        font, orientation=orientation)
    # get size of resulting text
    box_size = draw.textsize(word, font=transposed_font)
    # find possible places using integral image:
    result = occupancy.sample_position(box_size[1] + self.margin,
                                       box_size[0] + self.margin,
                                       random_state)
    if result is not None or font_size < self.min_font_size:
        # either we found a place or font-size went too small
        break
    # if we didn't find a place, make font smaller
    # but first try to rotate!
    if not tried_other_orientation and self.prefer_horizontal < 1:
        orientation = (Image.ROTATE_90 if orientation is None else
                       Image.ROTATE_90)
        tried_other_orientation = True
    else:
        font_size -= self.font_step
        orientation = None

Here occupancy.sample_position() is the method that actually searches for a suitable position. When you try to dig further into how it works, though, you find that Ctrl+left-click can no longer jump into the deeper code. The sad thing has happened after all... o(╥﹏╥)o
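Near the top of wordcloud.py there is this line: from .query_integral_image import query_integral_image. Here query_integral_image is a .pyd file (a compiled extension module), so its contents cannot be viewed directly. For more on the .pyd format, please look it up yourself.
Returning to generate_from_frequencies: at the end of the method, the data is gathered into self.layout_, which holds everything needed to draw each word. After that, you can call to_file() to save the image.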
def to_file(self, filename):
    img = self.to_image()
    img.save(filename, optimize=True)
    return self

The core method, to_image(), takes the entries out of self.layout_ one by one and draws each word.
def to_image(self):
    self._check_generated()
    if self.mask is not None:
        width = self.mask.shape[1]
        height = self.mask.shape[0]
    else:
        height, width = self.height, self.width

    img = Image.new(self.mode, (int(width * self.scale),
                                int(height * self.scale)),
                    self.background_color)
    draw = ImageDraw.Draw(img)
    for (word, count), font_size, position, orientation, color in self.layout_:
        font = ImageFont.truetype(self.font_path,
                                  int(font_size * self.scale))
        transposed_font = ImageFont.TransposedFont(
            font, orientation=orientation)
        pos = (int(position[1] * self.scale),
               int(position[0] * self.scale))
        draw.text(pos, word, fill=color, font=transposed_font)

    return self._draw_contour(img=img)
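One detail in to_image() that is easy to miss: the position is stored as (row, column) during layout, but it is swapped to (x, y) and multiplied by scale before drawing. A small sketch of that mapping, using a hand-written layout entry (the entry values and the helper name scaled_draw_params are made up for illustration):

```python
def scaled_draw_params(layout_entry, scale):
    """Map one layout_ entry to the word, pixel position, and font size used for drawing."""
    (word, freq), font_size, position, orientation, color = layout_entry
    row, col = position  # stored as (row, column) by the layout step
    # Drawing takes (x, y) = (column, row), hence the swap seen in to_image().
    pos = (int(col * scale), int(row * scale))
    return word, pos, int(font_size * scale)


# A made-up entry in the same shape as the items of self.layout_.
entry = (("cloud", 1.0), 80, (30, 12), None, "rgb(0, 0, 0)")
params = scaled_draw_params(entry, scale=2)
print(params)
```

This is also why the layout only has to be computed once at the base resolution: scale merely enlarges the final rendering.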
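Food for thought:
How would you implement the search for a suitable place to put a word? (Note: the gaps between the strokes of a glyph can also hold text at a smaller font size.)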
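One possible answer, hinted at by the "find possible places using integral image" comment in the source, is a summed-area table over the occupied pixels. What follows is my own pure-Python sketch of that general technique, not the contents of the compiled query_integral_image module. The function names and the tiny grid are assumptions for illustration, and a real implementation would work on a NumPy array for speed.

```python
import random


def integral_image(grid):
    """Summed-area table with one extra row and column of zeros."""
    height, width = len(grid), len(grid[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for r in range(height):
        row_sum = 0
        for c in range(width):
            row_sum += grid[r][c]
            table[r + 1][c + 1] = table[r][c + 1] + row_sum
    return table


def free_positions(grid, box_h, box_w):
    """All top-left (row, col) corners where a box_h x box_w box covers only zeros."""
    table = integral_image(grid)
    height, width = len(grid), len(grid[0])
    hits = []
    for r in range(height - box_h + 1):
        for c in range(width - box_w + 1):
            # Sum of the box in constant time from four table lookups.
            area = (table[r + box_h][c + box_w] - table[r][c + box_w]
                    - table[r + box_h][c] + table[r][c])
            if area == 0:
                hits.append((r, c))
    return hits


def sample_position(grid, box_h, box_w, rng):
    """Pick one free position at random, or None if the box fits nowhere."""
    hits = free_positions(grid, box_h, box_w)
    return rng.choice(hits) if hits else None


# 1 = occupied pixel. The ring of ones mimics a glyph with a hole in the middle.
grid = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
spots = free_positions(grid, 2, 2)
print(spots)
print(sample_position(grid, 2, 2, random.Random(0)))
```

Because occupancy is tracked per pixel rather than per bounding box, the hole inside the ring counts as free space. That is exactly how smaller words can end up inside the gaps of larger glyphs.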
~ End ~