Spam Classifier: How to Build a Spam Classifier in 10 Steps
If you’re just starting out in Machine Learning, chances are you’ll be undertaking a classification project. As a beginner, I built an SMS spam classifier, but I had to do a ton of research to know where to start. In this article, I’ll walk you through my project in 10 steps to make it easier for you to build your first spam classifier using the TF-IDF Vectorizer and the Naïve Bayes model!
1. Load and simplify the dataset
Our SMS text messages dataset has 5 columns when read with pandas: v1 (containing the class label, ham or spam, for each text message), v2 (containing the text messages themselves), and three unnamed columns that we have no use for. We’ll rename the v1 and v2 columns to class_label and message respectively and drop the remaining columns.
import pandas as pd
df = pd.read_csv(r'spam.csv', encoding='ISO-8859-1')
df.rename(columns={'v1': 'class_label', 'v2': 'message'}, inplace=True)
df.drop(['Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4'], axis=1, inplace=True)
df
The ‘5572 rows x 2 columns’ summary under the output tells us that our dataset has 5572 text messages!
2. Explore the dataset: Bar Chart
It’s a good idea to carry out some Exploratory Data Analysis (EDA) in a classification problem to visualize, get some information out of, or find any issues with your data before you start working with it. We’ll look at how many spam/ham messages we have and create a bar chart for it.
# exploring the dataset
df['class_label'].value_counts()
Our dataset has 4825 ham messages and 747 spam messages. This is an imbalanced dataset; the number of ham messages is much higher than that of spam! This can potentially bias our model toward the majority class. To fix this, we could resample our data to get an equal number of spam and ham messages.
To generate our bar chart, we use NumPy and pyplot from Matplotlib.
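The plotting code is not shown here, so the following is a minimal sketch of one way to draw the chart. It hard-codes the class counts reported above (4825 ham, 747 spam) instead of reading them from `df`; the figure size, colors, axis labels, and output file name are my own assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Class counts as reported by df['class_label'].value_counts() above.
labels = ["ham", "spam"]
counts = np.array([4825, 747])

fig, ax = plt.subplots(figsize=(5, 4))
positions = np.arange(len(labels))  # x positions of the two bars
bars = ax.bar(positions, counts, color=["steelblue", "indianred"])
ax.set_xticks(positions)
ax.set_xticklabels(labels)
ax.set_ylabel("number of messages")
ax.set_title("Class distribution")
fig.savefig("class_distribution.png")  # assumed output file name
```

In the project itself, `counts` could come straight from `df['class_label'].value_counts()` so the chart always matches the data.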
3. Explore the dataset: Word Clouds
For my project, I generated word clouds of the most frequently occurring words in my spam messages.
First, we’ll select only the spam messages from our dataset. df_spam is a DataFrame that contains only spam messages.
df_spam = df[df.class_label=='spam']
df_spam
Next, we’ll convert the message column of our DataFrame to a list, where every element is a spam message. Then, we’ll join the elements of that list into one big string of spam messages. We then lowercase that string, which is the format we need for creating the word cloud.
spam_list = df_spam['message'].tolist()
filtered_spam = " ".join(spam_list)  # join the messages into one string, separated by spaces
filtered_spam = filtered_spam.lower()
Finally, we’ll import the relevant libraries and pass in our string as a parameter:
import os
import numpy as np
from wordcloud import WordCloud
from PIL import Image
comment_mask = np.array(Image.open("comment.png"))
# create and generate a word cloud image
wordcloud = WordCloud(max_font_size=160, margin=0, mask=comment_mask, background_color="white", colormap="Reds").generate(filtered_spam)
After displaying it:
[Figure: word cloud of the most common words in spam messages, drawn in a speech bubble shape]
Pretty cool, huh? The most common words in spam messages in our dataset are ‘free,’ ‘call now,’ ‘to claim,’ ‘have won,’ etc.
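A word cloud is a visual summary of word frequencies, so the same information can be checked as plain numbers. The sketch below counts words with `collections.Counter` on three made-up messages (they are not taken from the dataset); in the project you would use `filtered_spam` in place of `sample_text`.

```python
import re
from collections import Counter

# Made-up stand-ins for spam messages; not taken from the dataset.
sample_messages = [
    "FREE entry! Call now to claim your prize",
    "You have won a free ticket, call now",
    "Claim your free reward today",
]

# Same preparation as above: join into one string, then lowercase it.
sample_text = " ".join(sample_messages).lower()

# Split into alphabetic words and count how often each one occurs.
words = re.findall(r"[a-z]+", sample_text)
word_counts = Counter(words)

# Show the three most frequent words with their counts.
for word, count in word_counts.most_common(3):
    print(word, count)
```

Unlike the word cloud, this simple count works on single words only, so pairs such as ‘call now’ show up as two separate entries.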
For this word cloud, we needed the Pillow library only because I’ve used masking to create that nice speech bubble shape. If you want it in square form, omit the mask parameter.
Similarly, for ham messages:
[Figure: word cloud of the most common words in ham messages]
4. Handle imbalanced datasets
To handle imbalanced data, you have a variety of options. I got a pretty good F-measure in my project even without resampling the data, but you can resample yours if you prefer.
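One of those options is random undersampling: keep every spam message and draw an equally sized random sample of ham messages. The sketch below does this with pandas on a tiny made-up DataFrame that reuses the column names from step 1. The variable names and `random_state` values are assumptions, and this is only one possible approach, not necessarily the one I had in mind when writing this step.

```python
import pandas as pd

# Tiny made-up dataset with the same column names as in step 1.
toy_df = pd.DataFrame({
    "class_label": ["ham"] * 8 + ["spam"] * 2,
    "message": [f"ham message {i}" for i in range(8)]
               + [f"spam message {i}" for i in range(2)],
})

spam_rows = toy_df[toy_df["class_label"] == "spam"]
ham_rows = toy_df[toy_df["class_label"] == "ham"]

# Randomly keep only as many ham rows as there are spam rows.
ham_downsampled = ham_rows.sample(n=len(spam_rows), random_state=0)

# Combine both classes and shuffle the rows.
balanced_df = (
    pd.concat([ham_downsampled, spam_rows])
    .sample(frac=1, random_state=0)
    .reset_index(drop=True)
)

print(balanced_df["class_label"].value_counts())
```

Undersampling throws data away, so if you use it, apply it to the training subset only; the test subset should keep the real class ratio.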
5. Split the dataset
First, let’s convert our class labels from string to numeric form:
df['class_label'] = df['class_label'].apply(lambda x: 1 if x == 'spam' else 0)
In Machine Learning, we usually split our data into two subsets: train and test. We feed the train set, along with its known output values (in this case, 1 for spam and 0 for ham), to our model so that it learns the patterns in our data. Then we use the test set to get the model’s predicted labels on this subset. Let’s see how to split our data.
First, we import the relevant module from the sklearn library:
from sklearn.model_selection import train_test_split
And then we make the split:
x_train, x_test, y_train, y_test = train_test_split(df['message'], df['class_label'], test_size=0.3, random_state=0)
Let’s now see how many messages we have in our test and train subsets:
print('rows in test set: ' + str(x_test.shape))
print('rows in train set: ' + str(x_train.shape))
So we have 1672 messages for testing, and 3900 messages for training!
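These sizes follow directly from `test_size = 0.3` on 5572 rows. The sketch below reproduces them on placeholder data with the same class counts. It also passes `stratify`, which the split above does not use; treat that as an optional extra that keeps the spam/ham ratio the same in both subsets.

```python
from sklearn.model_selection import train_test_split

# Placeholder data with the same class counts as the dataset:
# 4825 ham (0) and 747 spam (1). The "messages" are just row numbers.
toy_labels = [0] * 4825 + [1] * 747
toy_messages = list(range(len(toy_labels)))

toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_messages,
    toy_labels,
    test_size=0.3,
    random_state=0,
    stratify=toy_labels,  # optional: preserve the class ratio in both subsets
)

print("rows in test set:", len(toy_x_test))
print("rows in train set:", len(toy_x_train))
print("spam share in train:", sum(toy_y_train) / len(toy_y_train))
print("spam share in test:", sum(toy_y_test) / len(toy_y_test))
```

On an imbalanced dataset, a stratified split guards against a test subset that happens to contain unusually few spam messages.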
6. Apply TF-IDF Vectorizer for feature extraction
Our Naïve Bayes model needs the text as numeric feature vectors, either TF-IDF vectors or word count vectors. The latter are produced by Count Vectorizer, but we’ll obtain the former using TF-IDF Vectorizer.
TF-IDF Vectorizer creates a TF-IDF value for every word in our text messages. A word’s value grows with how often it appears in a message and shrinks with the number of messages that contain it. As a result, words that appear everywhere because of English syntax don’t overshadow the less frequent yet more meaningful and interesting terms.
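To make the weighting concrete, here is a small sketch on three made-up messages. It relies on scikit-learn's default settings, under which the inverse document frequency of a term is ln((1 + n) / (1 + df)) + 1, where n is the number of messages and df is the number of messages containing the term. The messages and variable names are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three made-up messages: "call" is in every message, "prize" in only one.
toy_messages = [
    "call me later",
    "call now to win a prize",
    "please call back",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_messages)  # one row per message

# idf_ holds one inverse-document-frequency weight per vocabulary term.
vocabulary = toy_vectorizer.vocabulary_  # maps each term to its column index
idf_call = toy_vectorizer.idf_[vocabulary["call"]]
idf_prize = toy_vectorizer.idf_[vocabulary["prize"]]

print("idf of 'call': ", idf_call)   # term found in all 3 messages
print("idf of 'prize':", idf_prize)  # term found in 1 message
```

The rarer term ends up with the larger weight, which is the behavior described above.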
from sklearn.feature_extraction.text import TfidfVectorizer
lst = x_train.tolist()
vectorizer = TfidfVectorizer(
    lowercase=True,       # convert to lowercase before tokenizing
    stop_words='english'  # remove English stop words
)
features_train_transformed = vectorizer.fit_transform(lst)  # learns the vocabulary from x_train and gives its tf idf vectors
features_test_transformed = vectorizer.transform(x_test)  # gives tf idf vectors for x_test using the same vocabulary
7. Train our Naive Bayes Model
We fit our Naïve Bayes model, scikit-learn’s MultinomialNB, to the TF-IDF vector version of x_train and the true output labels stored in y_train.
from sklearn.naive_bayes import MultinomialNB
# train the model
classifier = MultinomialNB()
classifier.fit(features_train_transformed, y_train)
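Once fitted, the vectorizer and classifier can label messages they have never seen. The sketch below chains the two with scikit-learn's `make_pipeline` and trains on six made-up messages, so the predictions only illustrate the mechanics and say nothing about real accuracy. Using a pipeline is an alternative packaging; the steps above fit the two objects separately.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Six made-up training messages: 1 = spam, 0 = ham.
toy_messages = [
    "win a free prize now",
    "free cash call now",
    "claim your free prize",
    "are we meeting for lunch today",
    "see you at home tonight",
    "can you call me after work",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# The pipeline vectorizes the text and then feeds it to Naive Bayes.
toy_model = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB())
toy_model.fit(toy_messages, toy_labels)

# Raw strings go in; the pipeline applies the fitted vectorizer first.
new_messages = ["claim your free prize now", "lunch at home today"]
predictions = toy_model.predict(new_messages)
print(predictions)
```

With the objects from this project, the equivalent call would be `classifier.predict(vectorizer.transform(new_messages))`.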
8. Check out the accuracy and F-measure
It’s time to pass in our Tf-IDF matrix corresponding to x_test, along with the true output labels (y_test), to find out how well our model did!
First, let’s see the model accuracy:
print("classifier accuracy {:.2f}%".format(classifier.score(features_test_transformed, y_test) * 100))
Our accuracy is great! However, accuracy alone is not a reliable indicator on an imbalanced dataset, because a model biased toward ham can still score highly. Hence we perform the next step.
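To see why accuracy alone can mislead on this dataset, consider a hypothetical model that labels every message as ham. The sketch below applies it to the full dataset's class counts (4825 ham, 747 spam) purely as an illustration; it is not a model from this project.

```python
# Class counts of the full dataset, as reported in step 2.
ham_count = 4825
spam_count = 747
total = ham_count + spam_count

# A hypothetical "classifier" that predicts ham for every message.
true_positives = 0            # spam correctly flagged as spam
false_negatives = spam_count  # spam wrongly labelled as ham
true_negatives = ham_count    # ham correctly labelled as ham

baseline_accuracy = true_negatives / total
spam_recall = true_positives / (true_positives + false_negatives)

print("baseline accuracy: {:.2f}%".format(baseline_accuracy * 100))
print("spam recall: {:.2f}%".format(spam_recall * 100))
```

This baseline is right about 87% of the time while catching no spam at all, which is why the confusion matrix and F-measure are worth checking as well.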
9. View the confusion matrix and classification report
Let’s now look at our confusion matrix and f-measure scores to confirm if our model is doing OK or not:
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report
labels = classifier.predict(features_test_transformed)
actual = y_test.tolist()
predicted = labels
results = confusion_matrix(actual, predicted)
print('Confusion Matrix :')
print(results)
print('Accuracy Score :', accuracy_score(actual, predicted))
print('Report : ')
print(classification_report(actual, predicted))
score_2 = f1_score(actual, predicted, average='binary')
print('F-Measure: %.3f' % score_2)
We have an f-measure score of 0.853, and our confusion matrix shows that our model is making only 61 incorrect classifications. Looks pretty good to me 😊
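For reference, the F-measure reported by `f1_score` can be recomputed by hand from the four cells of the confusion matrix. The counts below are made up for illustration and are not the results of this project.

```python
# Hypothetical confusion-matrix cells for the spam class (label 1).
# These numbers are invented for illustration only.
true_negatives = 90   # ham predicted as ham
false_positives = 2   # ham predicted as spam
false_negatives = 4   # spam predicted as ham
true_positives = 16   # spam predicted as spam

# Precision: share of predicted spam that really is spam.
precision = true_positives / (true_positives + false_positives)
# Recall: share of actual spam that the model caught.
recall = true_positives / (true_positives + false_negatives)

# F-measure (F1) is the harmonic mean of precision and recall.
f_measure = 2 * precision * recall / (precision + recall)

print("precision: %.3f" % precision)
print("recall: %.3f" % recall)
print("F-Measure: %.3f" % f_measure)
```

With labels 0 and 1, scikit-learn's `confusion_matrix` lays these cells out as [[TN, FP], [FN, TP]], so the incorrect classifications are the two off-diagonal entries.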
10. Heatmap for our Confusion Matrix (Optional)
You can create a heatmap using the seaborn library to visualize your confusion matrix.
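The heatmap code is not included in this text, so here is a minimal sketch of how it could look with `seaborn.heatmap`. The matrix values are placeholders (in the project you would pass `results` from step 9), and the color map, tick labels, and output file name are my own assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Placeholder confusion matrix in scikit-learn's layout:
# rows are actual classes, columns are predicted classes.
toy_results = np.array([[90, 2],
                        [4, 16]])

fig, ax = plt.subplots(figsize=(4, 3))
sns.heatmap(
    toy_results,
    annot=True,   # write the count inside each cell
    fmt="d",      # format the counts as integers
    cmap="Blues",
    xticklabels=["ham", "spam"],
    yticklabels=["ham", "spam"],
    ax=ax,
)
ax.set_xlabel("Predicted label")
ax.set_ylabel("Actual label")
fig.tight_layout()
fig.savefig("confusion_matrix_heatmap.png")  # assumed output file name
```

Darker cells on the diagonal and pale cells off it indicate that most messages are classified correctly.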
And that’s all it takes to build your very own spam classifier! To summarize, we imported the dataset and visualized it. Then we split it into train and test sets and converted the messages into TF-IDF vectors. Finally, we trained our Naïve Bayes model and looked at the results! You could take this a step further and deploy it as a web app if you like.
References/Resources:
[1] D. T, Confusion Matrix Visualization (2019), https://medium.com/@dtuk81/confusion-matrix-visualization-fc31e3f30fea
[2] C. Vince, Naive Bayes Spam Classifier (2018), https://www.codeproject.com/Articles/1231994/Naive-Bayes-Spam-Classifier
[3] H. Attri, Feature Extraction using TF-IDF algorithm (2019), https://medium.com/@hritikattri10/feature-extraction-using-tf-idf-algorithm-44eedb37305e
[4] A. Bronshtein, Train/Test Split and Cross Validation in Python (2017), https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6
Dataset: https://www.kaggle.com/uciml/sms-spam-collection-dataset
Full code: https://github.com/samimakhan/Spam-Classification-Project/tree/master/Naive-Bayes
Translated from: https://towardsdatascience.com/how-to-build-your-first-spam-classifier-in-10-steps-fdbf5b1b3870