Social Media and Topic Modeling: How to Analyze Posts in Practice
Practical use of topic modeling
There is a substantial amount of data generated on the internet every second — posts, comments, photos, and videos. These different data types mean that there is a lot of ground to cover, so let’s focus on one — text.
All social conversations are based on written words — tweets, Facebook posts, comments, online reviews, and so on. Being a social media marketer, a Facebook group/profile moderator, or trying to promote your business on social media requires you to know how your audience reacts to the content you are uploading. One way is to read it all, mark hateful comments, divide them into similar topic groups, calculate statistics and… lose a big chunk of your time just to see that there are thousands of new comments to add to your calculations. Fortunately, there is another solution to this problem — machine learning. From this text you will learn:
- Why do you need specialised tools for social media analysis?
- What can you get from topic modeling, and how is it done?
- How can you automatically look for hate speech in comments?
Why are social media texts unique?
Before jumping to the analyses, it is really important to understand why social media texts are so unique:
- Posts and comments are short. They mostly contain one simple sentence, or even a single word or expression. This gives us a limited amount of information to extract from any one post.
- Emojis and smiley faces are used almost exclusively on social media. They give additional details about the author's emotions and context.
- Slang phrases make posts resemble spoken language rather than written, so statements appear more casual.
These features make social media a whole different source of information and demand special attention when running an analysis using machine learning. In contrast, most open-source machine learning solutions are trained on long, formal texts, like Wikipedia articles and other website content. As a result, these models perform badly on social media data, because they don't understand the additional forms of expression involved. This problem is called domain shift and is a typical NLP problem. Different data also require customised data preparation methods, called preprocessing. This step consists of cleaning the text of uninformative tokens, such as URLs or mentions, and converting it to a machine-readable format (more about how we do it at Sotrender). This is why it is crucial to use tools created specifically for your data source to get the best results.
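A preprocessing step like the one described above can be sketched as follows. This is a minimal illustration, not Sotrender's actual pipeline; the regex patterns and the example text are assumptions:

```python
import re

def preprocess(text: str) -> list[str]:
    """Minimal social-media preprocessing: strip uninformative tokens
    (URLs, @mentions), lowercase, and tokenize what remains."""
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"@\w+", " ", text)           # drop @mentions
    text = text.lower()
    # keep words and hashtags; also keep emoji, since they carry signal
    return re.findall(r"[#\w]+|[\U0001F300-\U0001FAFF]", text)

print(preprocess("Loved it!! 😍 check https://t.co/abc @brand #win"))
# → ['loved', 'it', '😍', 'check', '#win']
```

Note that the emoji are deliberately kept as tokens: as argued above, they carry emotional context that formal-text models never see.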
Topic Modeling for social media
Machine learning for text analysis (Natural Language Processing) is a vast field with lots of different model types that can give insight into your data. One of the areas that can answer the question "what are the topics of given pieces of text?" is topic modeling. These models help with understanding what people are talking about in general. Topic modeling does not require a specially prepared data set with predefined topics: it finds topics, which are patterns hidden within the data, on its own, without supervision or help, which makes it an unsupervised machine learning method. This means that it is easy to build a model for each individual problem.
There are lots of different algorithms that can be used for this task, but the most common and widely used is LDA (Latent Dirichlet Allocation). It is based on word frequencies and the distribution of topics in texts. To put it simply, this method counts words in a given data set and groups them into topics based on their co-occurrence. Then the percentage distribution of topics in each document is calculated. This method therefore assumes that each text is a mixture of topics, which works great for long documents where every paragraph relates to a different matter.
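For intuition, a toy collapsed Gibbs sampler for LDA can be written in plain Python. This is a didactic miniature, not a production implementation (in practice you would use a library such as gensim or scikit-learn); the corpus and hyperparameters here are made up:

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, n_iter=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA.
    docs: list of token lists. Returns per-document topic counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    n_tw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    n_dt = [[0] * n_topics for _ in docs]               # doc-topic counts
    n_t = [0] * n_topics                                # topic totals
    z = []  # current topic assignment of every token
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:  # random initial assignment
            t = rng.randrange(n_topics)
            zd.append(t)
            n_tw[t][w] += 1; n_dt[d][t] += 1; n_t[t] += 1
        z.append(zd)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]  # take the token out of its current topic
                n_tw[t][w] -= 1; n_dt[d][t] -= 1; n_t[t] -= 1
                # full conditional: P(topic) ∝ P(word|topic) * P(topic|doc)
                weights = [(n_tw[k][w] + beta) / (n_t[k] + V * beta)
                           * (n_dt[d][k] + alpha) for k in range(n_topics)]
                t = rng.choices(range(n_topics), weights)[0]
                z[d][i] = t
                n_tw[t][w] += 1; n_dt[d][t] += 1; n_t[t] += 1
    return n_dt  # topic mixture (as counts) per document

docs = [["goal", "match", "team"], ["election", "vote", "party"],
        ["goal", "team", "vote"]]
print(lda_gibbs(docs, n_topics=2))
```

The returned per-document counts are exactly the "each text is a mixture of topics" assumption: normalising each row gives the percentage topic distribution of that document.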
Figure 1. LDA algorithm (Credit: Columbia University)

That's why social media texts need a different procedure. One of the newer algorithms is GSDMM (Gibbs Sampling algorithm for a Dirichlet Mixture Model). What makes this one so different? Its intuition is usually explained with the Movie Group Process analogy:
Students are told to write down some movie titles they liked within 2 minutes. Most students are able to list 3-5 movies within this time frame (this corresponds to the limited number of words in social media texts). Then they are randomly assigned to tables. The last step is for every student to pick a different table with two rules in mind:
- pick a table with more students, which favours bigger groups,
- or a table with the most similar movie titles, which makes groups more cohesive.
This last step is repeated multiple times. The first rule, which favours bigger groups, is crucial to ensure that groups are not excessively fragmented. Due to the limited number of movie titles (words) for each student (text), each group (topic) is bound to have members with different movies in their lists, but from the same genre.
As a result of the GSDMM algorithm you obtain an assignment of each text to exactly one topic, as well as a list of the most important words for every topic.
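The student/table analogy above can be sketched as a toy procedure. This is a heavily simplified stand-in for the real GSDMM sampler (the scoring rule below is only a rough proxy for the two rules, not the actual Dirichlet-based conditional), and the documents are invented:

```python
import random
from collections import Counter

def gsdmm_toy(docs, n_tables=4, n_iter=30, seed=0):
    """Toy Movie Group Process: each document sits at exactly one table;
    tables attract documents by size and by word similarity."""
    rng = random.Random(seed)
    table = [rng.randrange(n_tables) for _ in docs]   # random start
    words = [Counter() for _ in range(n_tables)]      # word counts per table
    size = [0] * n_tables
    for d, doc in enumerate(docs):
        words[table[d]].update(doc); size[table[d]] += 1
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            t = table[d]                              # leave own table
            words[t].subtract(doc); size[t] -= 1
            # rule 1: favour bigger tables; rule 2: favour similar titles
            scores = [(size[k] + 1) * (1 + sum(words[k][w] for w in doc))
                      for k in range(n_tables)]
            t = rng.choices(range(n_tables), scores)[0]
            table[d] = t
            words[t].update(doc); size[t] += 1
    return table  # one table (topic) per document

docs = [["alien", "space"], ["space", "rocket"], ["romance", "paris"],
        ["paris", "love"], ["alien", "rocket"]]
print(gsdmm_toy(docs))
```

Unlike the LDA sketch, each text ends up in exactly one topic, which matches how short posts are usually about a single thing.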
Figure 3. Assignment of documents to topics and extraction of topic words

The tricky part is deciding on the number of topics (a problem with every unsupervised method), but once you have done so you can gain quite a lot of insight from the data:
- Distribution of topics in your data.
- Word clouds, which allow us to comprehend a topic and name it. They are a quick and easy solution that can replace reading the whole set of texts and spare you hours of tedious work dividing it into sets.
- Time series analysis of topics. As we can see in the plot below, some topics can gain more attention, like number 7, while some fade away, like number 4. When trying to grasp what is popular now, or could be popular in the future, it is a good idea to look back and see how topics were changing in the past.
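A time series view like the one above can be produced by simply bucketing the per-text topic assignments by week. A minimal sketch, with made-up (date, topic) pairs standing in for real model output:

```python
from collections import Counter
from datetime import date

# (publication date, assigned topic) pairs, as produced by a topic model
assignments = [(date(2020, 3, 2), 7), (date(2020, 3, 3), 7),
               (date(2020, 3, 4), 4), (date(2020, 3, 12), 7),
               (date(2020, 3, 11), 4), (date(2020, 3, 19), 7)]

# bucket by ISO week number to see which topics gain or lose attention
weekly = Counter((d.isocalendar()[1], topic) for d, topic in assignments)
for (week, topic), n in sorted(weekly.items()):
    print(f"week {week}: topic {topic} -> {n} texts")
```

Plotting one line per topic over these weekly counts gives exactly the kind of chart referenced in the bullet above.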
Use case
In one of our recent projects for Collegium Civitas we analyzed 50,000 social media posts and comments and performed topic analysis on them. It allowed our client to answer questions like:
1) What was discussed on social media over a span of 2 months?
In the dataset we were able to distinguish 10 different topics, revolving around Covid-19. Discussions covered statistics and the etiology of Covid-19, everyday life, the government's response to the pandemic, the consequences of travel limitations, the trade market and supplies, health care during the pandemic, church and politics, common knowledge and conspiracy theories about Covid-19, politics and economy, and spam messages and ads.
2) How were the discussions influenced by the pandemic situation?
During the outbreak of the pandemic, the biggest theme was the origin and statistics of Covid-19. People talked about how the situation was changing and exchanged information about the ways the disease spreads. To read more, visit Collegium Civitas' site (Polish version only).
Hate speech recognition
Another question that can be answered with machine learning is "what kind of emotion do people express in their comments or posts?" or "is my content generating hateful comments?". There are only a few solutions for these tasks in the Polish language. That is why, at Sotrender, we built models for sentiment and hate speech recognition based on social media texts. Our solutions were built in two steps.
The first step is to convert text and emojis into numerical vector representations (embeddings) to be used later in neural networks. The main goal of this step is to obtain a language model (LM) that has knowledge of a human language, so that vectors representing similar words are close to each other (for example: queen and king, or paragraph and article), which implies that these words have similar meanings (semantic similarity). This property is shown in the graph below.
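The closeness of such vectors is usually measured with cosine similarity. A minimal sketch, using tiny made-up 3-dimensional embeddings rather than real learned vectors (which typically have hundreds of dimensions):

```python
import math

def cosine(u, v):
    """Cosine similarity: close to 1 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# tiny made-up embeddings for illustration only
emb = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.82, 0.15],
    "pizza": [0.10, 0.20, 0.90],
}
print(cosine(emb["king"], emb["queen"]))  # high: similar meaning
print(cosine(emb["king"], emb["pizza"]))  # low: unrelated words
```

A good embedding model is one where this similarity score mirrors human intuition about word meaning, including for emojis.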
Figure 7. The intuition behind word similarity

Training this model is similar to teaching a child how to speak by talking to them. By listening to their parents talk, children are able to grasp the meaning of words, and the more they hear, the more they understand.
Following this analogy, we have to use a huge set of social media texts to train our model to understand their language. That is why we used a set of 100 million posts and comments to train our model, so it can properly assign vectors to words as well as to emojis. Tokens vectorised with the embeddings model provide the input to the neural network.
The second step is designing a neural network for a specific task: hate speech recognition. The most important thing is the data set, since the model needs examples of hateful and non-hateful texts to learn how to tell them apart. In order to get the best results, you need to experiment with different architectures and model hyperparameters.
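To make the idea of learning from labeled examples concrete, here is a toy bag-of-words perceptron. It merely stands in for the embedding-plus-neural-network pipeline described above, which it does not reproduce; the training examples and tokens are invented:

```python
from collections import Counter

def train_perceptron(data, epochs=10):
    """Tiny bag-of-words perceptron: learns per-token weights from
    labeled examples (label 1 = hateful, 0 = not)."""
    w = Counter()
    for _ in range(epochs):
        for tokens, label in data:
            x = Counter(tokens)
            score = sum(w[t] * c for t, c in x.items())
            pred = 1 if score > 0 else 0
            if pred != label:               # mistake-driven update
                sign = 1 if label == 1 else -1
                for t, c in x.items():
                    w[t] += sign * c
    return w

def predict(w, tokens):
    return 1 if sum(w[t] for t in tokens) > 0 else 0

# entirely made-up training examples
train = [(["you", "are", "awful"], 1), (["awful", "idiot"], 1),
         (["great", "post"], 0), (["love", "this"], 0)]
w = train_perceptron(train)
print(predict(w, ["awful", "comment"]))  # → 1 (flagged as hateful)
print(predict(w, ["great", "photo"]))    # → 0
```

A real model replaces the sparse token counts with the dense embeddings from step one and the linear rule with a trained network, but the supervised loop (predict, compare with the label, adjust) is the same.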
As a result of the hate speech recognition model, we get another grouping of our data set. Now we can see how our audience reacts and how many hateful comments or posts it is creating. What's more, by combining this with the publication time of each comment, we can see whether there was a specific time period when the most hateful comments were generated, as shown in the histogram below.
Figure 8. Hate speech distribution over time

Combining this distribution with recent posts or events can give you insight into the type of content that provokes people. Changes in the share of hate speech over time can also be related to changes in topic distribution. Combining all the information from the analysis can provide an in-depth picture of the dataset.
Figure 9. Weekly text count with hate speech

As the histogram above shows, most hate is connected to topics 3, 6 and 7. Knowing what makes people angry gives you the opportunity to avoid sensitive topics in the future.
The same goes for sentiment analysis. We can produce similar visualizations for positive, negative, or neutral comments and see their distribution over time or across topics. If you would like to read the whole report built on our analysis of the 8 weeks of data, you can find it here (Polish version only).
Conclusion
At Sotrender we have models for hate speech and sentiment recognition that are constantly improved and updated for social media texts. What's more, we have experience building topic models for individual cases. As you can see, there are a lot of benefits to this type of analysis:
- Getting to know your audience
- Taking an in-depth look at the topics of comments
- Discovering trending themes
- Finding the sources of hatred or negativity in your content
To name just a few!
Translated from: https://towardsdatascience.com/social-media-and-topic-modeling-how-to-analyze-posts-in-practice-d84fc0c613cb