Building a Text Classifier for Disaster Response
Background
Following a disaster, typically you will get millions and millions of communications, either direct or via social media, right at the time when disaster response organizations have the least capacity to filter and pull out the messages which are the most important. And often it really is only one in every thousand messages that might be relevant to disaster response professionals.
So the way that disasters are typically responded to is that different organizations will take care of different parts of the problem. One organization will care about water, another one will care about blocked roads, and another will care about medical supplies.
— Robert Munro, former CTO of Figure Eight (acquired by Appen)
Robert Munro summed up the problem quite well. With so many messages being received during disasters, there needs to be a way of directing these messages to the appropriate organization so that they can respond to the problem accordingly.
Using data from Figure Eight (now Appen), we will be building a web application to classify disaster messages so that an emergency professional would know which organization to send the message to.
This walkthrough assumes you have some knowledge of natural language processing and machine learning. We will go over the general process but you can see the full code at my Github.
The Data
The data contains 26,248 labeled messages that were sent during past disasters around the world. Each message is labeled as 1 or more of the following 36 categories:
'related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report'
Note: Messages don’t necessarily fall into only 1 category. A message can be labeled as multiple categories or even none.
Figure 1: Original data format. (A) The message dataset on the left and (B) the categories dataset on the right are connected via the id column.

As seen in figure 1, the original data was split into 2 CSV files: the messages dataset (figure 1A) and the categories dataset (figure 1B), joined on the id column.

The categories dataset (figure 1B) was formatted in a way that is unusable: all 36 categories and their corresponding values (0 for no, 1 for yes) are stuffed into a single column. To use this dataset as labels for our supervised learning model, we'll need to transform that single column into 36 separate columns (1 for each category) with binary numeric values, shown in figure 2 below.
Figure 2: Categories dataset transformed into a usable format. There are 35 columns with binary numeric values.

None of the messages in the dataset were labeled as child_alone, so this category will be removed since it provides no information.
To prepare the data, I wrote an ETL pipeline with the following steps:

1. Transform the categories dataset from 1 string variable (figure 1B) into 36 numeric variables (figure 2)
2. Drop child_alone from the categories dataset, leaving 35 categories to classify
3. Save the cleaned data to a database for the classifier to import
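The transform-and-drop steps above can be sketched with pandas. This is a minimal sketch on toy data; the column names (id, message, categories) and the "name-0/1" string format are assumptions based on figure 1:

```python
import pandas as pd

# Toy stand-ins for the two real CSVs; column names and the
# "category-0/1" string format are assumptions based on figure 1.
messages = pd.DataFrame({"id": [1, 2],
                         "message": ["need water", "roads blocked"]})
categories = pd.DataFrame({
    "id": [1, 2],
    "categories": ["related-1;water-1;child_alone-0",
                   "related-1;water-0;child_alone-0"],
})
df = messages.merge(categories, on="id")

# Split the single string column into one column per category.
cats = df["categories"].str.split(";", expand=True)
# Name the new columns using the first row ("water-1" -> "water").
cats.columns = cats.iloc[0].str.rsplit("-", n=1).str[0]
# Keep only the trailing 0/1 digit and convert to int.
cats = cats.apply(lambda col: col.str.rsplit("-", n=1).str[1].astype(int))

df = pd.concat([df.drop(columns="categories"), cats], axis=1)
df = df.drop(columns="child_alone")  # never labeled 1 in the dataset
print(list(df.columns))
```

The same split-name-cast pattern scales to the full 36-category column regardless of how many categories are packed into the string.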
The Classifier
With the data processed, we can use it to train a classification model. But wait! Machine learning models don’t know how to interpret text data directly, so we need to somehow convert the text into numeric features first. No worries though. This feature extraction can be done in conjunction with the classification model within a single pipeline.
The machine learning pipeline (code below) was built as follows:
1. Tf-idf vectorizer — tokenizes an entire corpus of text data to build a vocabulary and converts individual documents into a numeric vector based on the vocabulary
Tokenizer steps: lowercase all characters > remove all punctuation > tokenize text into individual words > strip any white space surrounding words > remove stopwords (words that add no meaning to a sentence) > stem remaining words
Vectorizer steps: convert a text document into a term-frequency vector (word counts) > normalize the word counts by multiplying by the inverse document frequency
2. Multi-output classifier using a logistic regression model — predicts 35 binary labels (0 or 1 for each of the 35 categories)
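The tokenizer steps can be sketched as follows. This is a simplified, dependency-free stand-in: the real implementation would use NLTK's word_tokenize, stopword list, and Porter stemmer, which the tiny stopword set and crude suffix stripper below only approximate:

```python
import re
import string

# Tiny illustrative stopword set; the real version uses NLTK's list.
STOPWORDS = {"a", "an", "the", "we", "and", "are", "in", "is", "to", "of"}

def stem(word):
    # Crude stand-in for a Porter stemmer: strip a few common suffixes.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(text):
    text = text.lower()                                   # lowercase all characters
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)  # remove punctuation
    tokens = [t.strip() for t in text.split()]            # tokenize + strip whitespace
    tokens = [t for t in tokens if t not in STOPWORDS]    # remove stopwords
    return [stem(t) for t in tokens]                      # stem remaining words

print(tokenize("Please, we need tents and water!"))
```

Each step maps one-to-one onto the tokenizer steps listed above, so swapping in the NLTK equivalents changes the quality of the output but not the shape of the function.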
Figure 3: Code for the machine learning pipeline. The tokenize helper function is passed into the pipeline's first step (the tf-idf vectorizer).

After importing the data from the database we just created, we split it into a training set and a test set and use the training set to train the classifier pipeline outlined above. A grid search was done to optimize the parameters of both steps in the pipeline, and the final classifier was evaluated on the test set with the following results:
Average accuracy: 0.9483
Average precision: 0.9397
Average recall: 0.9483
Average F-score: 0.9380
As this was a multi-output classification problem, these metrics were averaged across all 35 outputs.
由于這是一個(gè)多輸出分類問題,因此對(duì)所有35個(gè)輸出進(jìn)行平均。
I also tried Naive Bayes and random forest models, but they didn’t perform as well as the logistic regression model. The random forest model had slightly better metrics for a lot of the categories, but since it takes significantly longer to train, I opted for logistic regression.
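The pipeline and grid search described above can be sketched end-to-end as follows. The parameter grid, toy messages, and two stand-in categories are all illustrative, not the ones actually used, and the real pipeline also passes the custom tokenize helper into the vectorizer:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Step 1: tf-idf vectorizer; step 2: one logistic regression per output.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultiOutputClassifier(LogisticRegression(max_iter=1000))),
])

# Illustrative grid touching both pipeline steps (not the actual grid).
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "clf__estimator__C": [0.1, 1.0],
}

# Toy messages with two stand-in binary categories.
X = ["need water", "water please", "send food", "food needed",
     "road blocked", "blocked road here", "medical help", "medical team"]
y = np.array([[1, 0], [0, 1]] * 4)

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X, y)
print(search.best_params_)
```

The double-underscore names let a single grid search tune the vectorizer and the classifier wrapped inside the multi-output wrapper at the same time.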
Finally, the trained classifier is saved in pickle format.
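Saving and reloading in pickle format can be sketched as follows; a plain dict stands in here for the fitted pipeline, and the file name is an assumption:

```python
import os
import pickle
import tempfile

# A plain dict stands in for the trained classifier pipeline.
model = {"note": "stand-in for the fitted pipeline"}
path = os.path.join(tempfile.gettempdir(), "classifier.pkl")

# Persist the model so the web app can load it at startup.
with open(path, "wb") as f:
    pickle.dump(model, f)

# The web app later reloads it the same way.
with open(path, "rb") as f:
    loaded = pickle.load(f)
```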
The Application
Now that we have a trained classifier, we can build it into a web application that classifies disaster messages. Personally, I prefer Flask as it is a lightweight framework, perfect for smaller applications. The app’s interface is shown in figure 4 below.
Figure 4: The web application's interface. (A) The home page (left) contains an input form and a data dashboard below it. (B) The result page (right) displays the entered message and the classification results.

As shown in figure 4, the web application has 2 pages:
Home page: This page contains an input field for entering a message to classify and a dashboard of interactive visualizations summarizing the data. The dashboard (created with Plotly) shows (1) the distribution of message genres, (2) the distribution of message word counts, (3) the top message categories, and (4) the most common words in messages.
Result page: This page displays the message that was entered into the input field and the 35 classification results for that message. The categories highlighted in blue are those the message was classified as.
Both pages were written in HTML and Bootstrap (a CSS library) and are rendered by the Flask app. To build the app, we first load in the data and the trained model.
We use the data to set up the home-page visualizations in the back-end with Plotly’s Python library and render these visualizations in the front-end with Plotly’s Javascript library.
When text is entered into the input field and submitted, Flask sends it to the back-end, where the model classifies it; the result page is then rendered with the classification results.
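A minimal sketch of such a Flask route, with a stand-in classifier and an illustrative subset of categories (the route name, inline template, and helper function are assumptions, not the app's actual code):

```python
from flask import Flask, request, render_template_string

app = Flask(__name__)

# Illustrative subset of the 35 categories.
CATEGORIES = ["related", "request", "water"]

def classify(text):
    # Stand-in for the real pipeline's predict() call.
    return [1 if c in text.lower() else 0 for c in CATEGORIES]

@app.route("/go")
def go():
    query = request.args.get("query", "")
    labels = dict(zip(CATEGORIES, classify(query)))
    return render_template_string(
        "<p>{{ query }}</p>"
        "<ul>{% for c, v in labels.items() %}"
        "<li>{{ c }}: {{ v }}</li>{% endfor %}</ul>",
        query=query, labels=labels)

# Exercise the route with Flask's built-in test client.
client = app.test_client()
resp = client.get("/go", query_string={"query": "we need water"})
print(resp.status_code)
```

In the real app the route would call model.predict on the loaded pickle and render a full template instead of an inline string.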
As shown in figure 4B, I tested an example message:
“Please, we need tents and water. We are in Silo, Thank you!”
And it was classified as “related”, “request”, “aid related”, “water” and “shelter”.
Summary
The main components of this project are (1) the data processing pipeline, which transforms the data into a usable format and prepares it for the classifier, (2) the machine learning pipeline, which includes a tf-idf vectorizer and a logistic regression classifier, and (3) the web application, which serves the trained classifier and a data dashboard.
Here are some ideas for improving this project that you may want to try:
- Different or additional text processing steps, like lemmatization instead of stemming
- Extract more features from the text, like message word count
- A different classification algorithm, like convolutional neural networks
The web application is available on my Github. Clone the repository and follow the instructions in the readme to try it yourself!
Translated from: https://medium.com/analytics-vidhya/building-a-text-classifier-for-disaster-response-caf83137e08d