How to Build Your First Spam Classifier in 10 Steps
If you’re just starting out in Machine Learning, chances are you’ll be undertaking a classification project. As a beginner, I built an SMS spam classifier but did a ton of research to figure out where to start. In this article, I’ll walk you through my project in 10 steps to make it easier for you to build your first spam classifier using Tf-IDF Vectorizer and the Naïve Bayes model!
1. Load and simplify the dataset
Our SMS text messages dataset has 5 columns if you read it in pandas: v1 (containing the class labels ham/spam for each text message), v2 (containing the text messages themselves), and three Unnamed columns which have no use. We’ll rename the v1 and v2 columns to class_label and message respectively while getting rid of the rest of the columns.
import pandas as pd

df = pd.read_csv(r'spam.csv', encoding = 'ISO-8859-1')
df.rename(columns = {'v1':'class_label', 'v2':'message'}, inplace = True)
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis = 1, inplace = True)
df
The ‘5572 rows x 2 columns’ in the output means that our dataset has 5572 text messages!
2. Explore the dataset: Bar Chart
It’s a good idea to carry out some Exploratory Data Analysis (EDA) in a classification problem to visualize, get some information out of, or find any issues with your data before you start working with it. We’ll look at how many spam/ham messages we have and create a bar chart for it.
# exploring the dataset
df['class_label'].value_counts()

Our dataset has 4825 ham messages and 747 spam messages. This is an imbalanced dataset; the number of ham messages is much higher than the number of spam messages! This can potentially cause our model to be biased. To fix this, we could resample our data to get an equal number of spam/ham messages.
我們的數(shù)據(jù)集包含4825個火腿郵件和747垃圾郵件。 這是一個不平衡的數(shù)據(jù)集; 火腿郵件的數(shù)量遠(yuǎn)高于垃圾郵件! 這可能會導(dǎo)致我們的模型出現(xiàn)偏差。 為了解決這個問題,我們可以對數(shù)據(jù)重新采樣以獲取相同數(shù)量的垃圾郵件/火腿郵件。
To generate our bar chart, we use NumPy and pyplot from Matplotlib.
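Here’s a minimal sketch of one way to draw it (the colors and figure styling are just illustrative, not the exact settings from my project):

import numpy as np
import matplotlib.pyplot as plt

counts = df['class_label'].value_counts()   # ham: 4825, spam: 747
positions = np.arange(len(counts))          # one x position per class
plt.bar(positions, counts.values, color = ['steelblue', 'crimson'])
plt.xticks(positions, counts.index)         # label the bars ham/spam
plt.ylabel('number of messages')
plt.title('ham vs spam message counts')
plt.show()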
3. Explore the dataset: Word Clouds
For my project, I generated word clouds of the most frequently occurring words in my spam messages.
First, we’ll filter out all the spam messages from our dataset. df_spam is a DataFrame that contains only spam messages.
df_spam = df[df.class_label == 'spam']
df_spam

Next, we’ll convert our DataFrame to a list, where every element of that list will be a spam message. Then, we’ll join each element of our list into one big string of spam messages. The lowercase form of that string is the format needed for our word cloud creation.
spam_list = df_spam['message'].tolist()
filtered_spam = ' '.join(spam_list)    # join all spam messages into one big string
filtered_spam = filtered_spam.lower()

Finally, we’ll import the relevant libraries and pass in our string as a parameter:
import os
import numpy as np
from wordcloud import WordCloud
from PIL import Image

comment_mask = np.array(Image.open("comment.png"))

# create and generate a word cloud image
wordcloud = WordCloud(max_font_size = 160, margin = 0, mask = comment_mask, background_color = "white", colormap = "Reds").generate(filtered_spam)
After displaying it:
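A minimal sketch of the display step with Matplotlib (the figure size is just a suggestion):

import matplotlib.pyplot as plt

plt.figure(figsize = (10, 8))
plt.imshow(wordcloud, interpolation = 'bilinear')   # render the word cloud
plt.axis('off')                                     # hide the axes
plt.show()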
Pretty cool, huh? The most common words in spam messages in our dataset are ‘free,’ ‘call now,’ ‘to claim,’ ‘have won,’ etc.
For this word cloud, we needed the Pillow library only because I used masking to create that nice speech-bubble shape. If you want it in square form, omit the mask parameter.
Similarly, for ham messages:
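A sketch of the analogous steps for ham, mirroring the spam code above (the colormap choice here is just illustrative):

df_ham = df[df.class_label == 'ham']
ham_list = df_ham['message'].tolist()
filtered_ham = ' '.join(ham_list).lower()   # one lowercase string of all ham messages
wordcloud_ham = WordCloud(max_font_size = 160, margin = 0, background_color = "white", colormap = "Blues").generate(filtered_ham)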
4. Handle imbalanced datasets
To handle imbalanced data, you have a variety of options. I got a pretty good f-measure in my project even without resampling, but if you want to resample, see this.
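For illustration, here’s a minimal downsampling sketch with pandas (my own illustration, not code from the project): randomly keep only as many ham messages as there are spam messages, before doing the train/test split:

df_majority = df[df.class_label == 'ham']
df_minority = df[df.class_label == 'spam']

# downsample ham to match the number of spam messages
df_majority_down = df_majority.sample(n = len(df_minority), random_state = 0)
df_balanced = pd.concat([df_majority_down, df_minority])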
5. Split the dataset
First, let’s convert our class labels from string to numeric form:
df['class_label'] = df['class_label'].apply(lambda x: 1 if x == 'spam' else 0)

In Machine Learning, we usually split our data into two subsets: train and test. We feed the train set along with its known output values (in this case, 1 or 0 corresponding to spam or ham) to our model so that it learns the patterns in our data. Then we use the test set to get the model’s predicted labels on that subset. Let’s see how to split our data.
First, we import the relevant module from the sklearn library:
from sklearn.model_selection import train_test_split

And then we make the split:
x_train, x_test, y_train, y_test = train_test_split(df['message'], df['class_label'], test_size = 0.3, random_state = 0)

Let’s now see how many messages we have in our test and train subsets:
現(xiàn)在,讓我們看看我們的測試和訓(xùn)練子集有多少條消息:
print('rows in test set: ' + str(x_test.shape))
print('rows in train set: ' + str(x_train.shape))
So we have 1672 messages for testing, and 3900 messages for training!
6. Apply Tf-IDF Vectorizer for feature extraction
Our Naïve Bayes model requires data to be in the form of either Tf-IDF vectors or word count vectors. The latter is achieved using Count Vectorizer, but we’ll obtain the former using Tf-IDF Vectorizer.
Tf-IDF Vectorizer creates Tf-IDF values for every word in our text messages. Tf-IDF values are computed in a manner that gives a higher value to words appearing less frequently, so that words appearing many times due to English syntax don’t overshadow the less frequent yet more meaningful and interesting terms.
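For reference, this is the formula scikit-learn applies by default (with smooth_idf = True), where n is the total number of messages and df(t) is the number of messages containing term t:

tf-idf(t, d) = tf(t, d) × idf(t), where idf(t) = ln((1 + n) / (1 + df(t))) + 1

Each resulting message vector is then L2-normalized, so longer messages don’t automatically get larger values.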
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    lowercase = True,       # convert to lowercase before tokenizing
    stop_words = 'english'  # remove English stop words
)
features_train_transformed = vectorizer.fit_transform(x_train)  # gives tf-idf vectors for x_train
features_test_transformed = vectorizer.transform(x_test)        # gives tf-idf vectors for x_test
7. Train our Naive Bayes Model
We fit our Naïve Bayes model, aka MultinomialNB, to our Tf-IDF vector version of x_train, and the true output labels stored in y_train.
from sklearn.naive_bayes import MultinomialNB

# train the model
classifier = MultinomialNB()
classifier.fit(features_train_transformed, y_train)
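As a quick sanity check (my own addition, with a made-up example message), we can classify a brand-new text; note that it has to pass through the same fitted vectorizer:

sample = ["Congratulations! You have won a free prize. Call now to claim."]
sample_vec = vectorizer.transform(sample)   # reuse the fitted tf-idf vocabulary
print(classifier.predict(sample_vec))       # 1 = spam, 0 = ham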
8. Check out the accuracy and f-measure
It’s time to pass in our Tf-IDF matrix corresponding to x_test, along with the true output labels (y_test), to find out how well our model did!
現(xiàn)在該傳遞與x_test對應(yīng)的Tf-IDF矩陣以及真實(shí)的輸出標(biāo)簽(y_test),以了解我們的模型的效果如何!
First, let’s see the model accuracy:
print("classifier accuracy {:.2f}%".format(classifier.score(features_test_transformed, y_test) * 100))Our accuracy is great! However, it’s not a great indicator if our model becomes biased. Hence we perform the next step.
9. View the confusion matrix and classification report
Let’s now look at our confusion matrix and f-measure scores to confirm if our model is doing OK or not:
現(xiàn)在,讓我們看一下混淆矩陣和f-measure得分,以確認(rèn)我們的模型是否正常:
labels = classifier.predict(features_test_transformed)

from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report

actual = y_test.tolist()
predicted = labels
results = confusion_matrix(actual, predicted)
print('Confusion Matrix :')
print(results)
print('Accuracy Score :', accuracy_score(actual, predicted))
print('Report : ')
print(classification_report(actual, predicted))
score_2 = f1_score(actual, predicted, average = 'binary')
print('F-Measure: %.3f' % score_2)
We have an f-measure score of 0.853, and our confusion matrix shows that our model is making only 61 incorrect classifications. Looks pretty good to me 😊
10. Heatmap for our Confusion Matrix (Optional)
You can create a heatmap using the seaborn library to visualize your confusion matrix. The code below does just that.
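A minimal sketch, assuming the results confusion matrix from step 9 and seaborn installed (labels and colormap are just my choices):

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize = (6, 4))
sns.heatmap(results, annot = True, fmt = 'd', cmap = 'Blues',
            xticklabels = ['ham', 'spam'], yticklabels = ['ham', 'spam'])
plt.xlabel('predicted label')
plt.ylabel('true label')
plt.show()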
And that’s it! You’ve made your very own spam classifier. To summarize, we imported the dataset and visualized it. Then we split it into train/test subsets and converted the messages into Tf-IDF vectors. Finally, we trained our Naive Bayes model and looked at the results! You could take this a step further and deploy it as a web app if you like.
References/Resources:
[1] D. T, Confusion Matrix Visualization (2019), https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea
[2] C. Vince, Naive Bayes Spam Classifier (2018), https://www.codeproject.com/Articles/1231994/Naive-Bayes-Spam-Classifier
[3] H. Attri, Feature Extraction using TF-IDF algorithm (2019), https://medium.com/@hritikattri10/feature-extraction-using-tf-idf-algorithm-44eedb37305e
[4] A. Bronshtein, Train/Test Split and Cross Validation in Python (2017), https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
Dataset: https://www.kaggle.com/uciml/sms-spam-collection-dataset
數(shù)據(jù)集 : https : //www.kaggle.com/uciml/sms-spam-collection-dataset
Full code: https://github.com/samimakhan/Spam-Classification-Project/tree/master/Naive-Bayes
Translated from: https://towardsdatascience.com/how-to-build-your-first-spam-classifier-in-10-steps-fdbf5b1b3870