
Email Classification with SVM


First, let's go over how to use SVM classification.

The main steps are:

  • Loading and Visualizing Data (a minimal loading sketch follows this list)
  • Training Linear SVM
  • Implementing Gaussian Kernel
  • Training SVM with RBF Kernel
  • Selecting the optimal C and sigma parameters
  • Plotting the decision boundary
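For the first step, here is a minimal sketch of loading and visualizing a dataset, assuming the course-style data file ex6data1.mat (containing X and y) and the plotData helper are available:

% Load an example dataset: X holds the 2-D features, y the binary labels
load('ex6data1.mat');

% Scatter-plot the positive and negative examples with different markers
plotData(X, y);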

Linear kernel implementation

C = 1;
model = svmTrain(X, y, C, @linearKernel, 1e-3, 20);
visualizeBoundaryLinear(X, y, model);
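svmTrain is called with a kernel function handle; a minimal sketch of what the linear kernel it refers to typically looks like (simply the inner product of the two examples):

function sim = linearKernel(x1, x2)
% Make sure both inputs are column vectors
x1 = x1(:); x2 = x2(:);

% Linear kernel: the inner product between x1 and x2
sim = x1' * x2;
end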

Gaussian kernel implementation

function sim = gaussianKernel(x1, x2, sigma)
x1 = x1(:); x2 = x2(:);
sim = exp( -(x1 - x2)' * (x1 - x2) / (2 * sigma * sigma) );
end

load('ex6data2.mat');

% SVM Parameters
C = 1; sigma = 0.1;

% We set the tolerance and max_passes lower here so that the code will run
% faster. However, in practice, you will want to run the training to
% convergence.
model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
visualizeBoundary(X, y, model);
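The function computes the RBF similarity sim = exp(-||x1 - x2||^2 / (2*sigma^2)), which approaches 1 for nearby points and 0 for distant ones. A quick hand-computed sanity check (the numbers below are illustrative, not from the original post):

% ||x1 - x2||^2 = 1^2 + (-2)^2 + 2^2 = 9, so sim = exp(-9/8) ≈ 0.324652
sim = gaussianKernel([1; 2; 1], [0; 4; -1], 2);
fprintf('Gaussian kernel value: %f\n', sim);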

Selecting suitable parameters

function [C, sigma] = dataset3Params(X, y, Xval, yval)
C = 1;
sigma = 0.3;

C_vec     = [0.01 0.03 0.1 0.3 1 3 10 30]';
sigma_vec = [0.01 0.03 0.1 0.3 1 3 10 30]';
error_val   = zeros(length(C_vec), length(sigma_vec));
error_train = zeros(length(C_vec), length(sigma_vec));

for i = 1:length(C_vec)
    for j = 1:length(sigma_vec)
        model = svmTrain(X, y, C_vec(i), @(x1, x2) gaussianKernel(x1, x2, sigma_vec(j)));
        predictions = svmPredict(model, Xval);
        error_val(i,j) = mean(double(predictions ~= yval));
    end
end

% figure
% error_val
% surf(C_vec, sigma_vec, error_val)   % plot a 3-D surface to find the lowest point

[minval, ind] = min(error_val(:));   % 0.03
[I, J] = ind2sub([size(error_val,1) size(error_val,2)], ind);
C = C_vec(I)          % 1
sigma = sigma_vec(J)  % 0.100

% [I, J] = find(error_val == min(error_val(:)));   % another way to locate the minimum element
% C = C_vec(I)          % 1
% sigma = sigma_vec(J)  % 0.100
end

[C, sigma] = dataset3Params(X, y, Xval, yval);

% Train the SVM
model = svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma));
visualizeBoundary(X, y, model);
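Besides the commented-out surf plot, one simple way (not in the original code) to inspect how the validation error varies over the 8x8 parameter grid is a heat map, run inside dataset3Params where error_val is in scope:

% Visualize error_val over the (C, sigma) grid; the darkest cell marks the best pair
imagesc(error_val);
colorbar;
xlabel('sigma index');
ylabel('C index');
title('Validation error over the parameter grid');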

Email classification

The main steps are as follows:

  • Normalizing the email data
  • Feature extraction
  • Train Linear SVM for Spam Classification
  • Test Spam Classification
  • Top Predictors of Spam
  • Test your own email

Normalization

In processEmail.m, we have implemented the following email preprocessing and normalization steps (a short code sketch follows the list):

  • Lower-casing: The entire email is converted into lower case, so that capitalization is ignored (e.g., IndIcaTE is treated the same as Indicate).
  • Stripping HTML: All HTML tags are removed from the emails. Many emails often come with HTML formatting; we remove all the HTML tags, so that only the content remains.
  • Normalizing URLs: All URLs are replaced with the text “httpaddr”.
  • Normalizing Email Addresses: All email addresses are replaced with the text “emailaddr”.
  • Normalizing Numbers: All numbers are replaced with the text “number”.
  • Normalizing Dollars: All dollar signs ($) are replaced with the text “dollar”.
  • Word Stemming: Words are reduced to their stemmed form. For example, “discount”, “discounts”, “discounted” and “discounting” are all replaced with “discount”. Sometimes, the stemmer actually strips off additional characters from the end, so “include”, “includes”, “included”, and “including” are all replaced with “includ”.
  • Removal of non-words: Non-words and punctuation have been removed. All white spaces (tabs, newlines, spaces) have been trimmed to a single space character.
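A minimal sketch of a few of these normalizations using Octave/MATLAB regexprep; the exact patterns here are illustrative assumptions, not necessarily the ones used in processEmail.m:

% Lower-casing
email_contents = lower(email_contents);

% Stripping HTML: replace anything that looks like a tag with a space
email_contents = regexprep(email_contents, '<[^<>]+>', ' ');

% Normalizing numbers, URLs, email addresses and dollar signs
email_contents = regexprep(email_contents, '[0-9]+', 'number');
email_contents = regexprep(email_contents, '(http|https)://[^\s]*', 'httpaddr');
email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr');
email_contents = regexprep(email_contents, '[$]+', 'dollar');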

After these steps, the email is reduced to a lowercase sequence of normalized, stemmed words.

Vocabulary List
We build the vocabulary list from the most common words found in the emails.
Our vocabulary list was selected by choosing all words which occur at least 100 times in the spam corpus, resulting in a list of 1899 words. In practice, a vocabulary list with about 10,000 to 50,000 words is often used.
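The list itself is read from a text file; below is a minimal loader sketch, assuming vocab.txt stores one "index word" pair per line:

function vocabList = getVocabList()
fid = fopen('vocab.txt');

n = 1899;                % total number of words in the vocabulary
vocabList = cell(n, 1);
for i = 1:n
    fscanf(fid, '%d', 1);                    % skip the numeric index (it equals i)
    vocabList{i} = fscanf(fid, '%s', 1);     % read the word itself
end
fclose(fid);
end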

The IDs (indices in the vocabulary list) of the words that appear in our email are stored in word_indices:

for i = 1:length(vocabList)
    if( strcmp(vocabList{i}, str) )
        word_indices = [word_indices; i];
    end
end
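This lookup runs inside the tokenization loop of processEmail; a rough sketch of that surrounding loop (the delimiter set and the use of porterStemmer are assumptions based on the course framework):

while ~isempty(email_contents)
    % Tokenize: split off the next word at whitespace or punctuation
    [str, email_contents] = strtok(email_contents, ...
        [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]);

    % Drop any remaining non-alphanumeric characters and stem the word
    str = regexprep(str, '[^a-zA-Z0-9]', '');
    str = porterStemmer(strtrim(str));

    if length(str) < 1
        continue;   % skip tokens that became empty
    end

    % ... look the word up in vocabList and append its index (loop above) ...
end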

Extracting Features from Emails
Then we look up each word of our email in the vocabulary list: if it is present, the corresponding feature is set to 1; if not, the word is skipped.
You should look up the word in the vocabulary list vocabList and find if the word exists in the vocabulary list. If the word exists, you should add the index of the word into the word indices variable. If the word does not exist, and is therefore not in the vocabulary, you can skip the word.

function x = emailFeatures(word_indices)
% Total number of words in the dictionary
n = 1899;

% You need to return the following variables correctly.
x = zeros(n, 1);
x(word_indices) = 1;
end
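For example, running the whole feature-extraction pipeline on a sample message (emailSample1.txt is a file name from the course materials, assumed to be present here):

file_contents = readFile('emailSample1.txt');
word_indices  = processEmail(file_contents);
features      = emailFeatures(word_indices);

fprintf('Length of feature vector: %d\n', length(features));     % 1899
fprintf('Number of non-zero entries: %d\n', sum(features > 0));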

Training SVM for Spam Classification

load('spamTrain.mat');

fprintf('\nTraining Linear SVM (Spam Classification)\n')
fprintf('(this may take 1 to 2 minutes) ...\n')

C = 0.1;
model = svmTrain(X, y, C, @linearKernel);

p = svmPredict(model, X);
fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100);

%% =================== Part 4: Test Spam Classification ================
load('spamTest.mat');

fprintf('\nEvaluating the trained Linear SVM on a test set ...\n')

p = svmPredict(model, Xtest);
fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100);

After loading the dataset, ex6_spam.m will proceed to train an SVM to classify between spam (y = 1) and non-spam (y = 0) emails. Once the training completes, you should see that the classifier gets a training accuracy of about 99.8% and a test accuracy of about 98.5%.

Top Predictors for Spam
Find the words that are the strongest predictors of spam.

[weight, idx] = sort(model.w, 'descend');
vocabList = getVocabList();

fprintf('\nTop predictors of spam: \n');
for i = 1:15
    fprintf(' %-15s (%f) \n', vocabList{idx(i)}, weight(i));
end

Try your own emails

filename = 'spamSample1.txt';

% Read and predict
file_contents = readFile(filename);
word_indices = processEmail(file_contents);
x = emailFeatures(word_indices);
p = svmPredict(model, x);

fprintf('\nProcessed %s\n\nSpam Classification: %d\n', filename, p);
fprintf('(1 indicates spam, 0 indicates not spam)\n\n');

As the results above show, the classifier detects spam with an accuracy of roughly 98%.
