Sentiment Analysis with ML.NET [Beginner's Guide]

Published: 2023/12/4 asp.net 33 豆豆

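After publishing "Playing with Machine Learning on .NET Core" and "Predicting New York Taxi Fares with ML.NET", I trust that readers, even without understanding every detail, were able to follow along, run the code, and get results, gaining a first impression of .NET Core, ML.NET, and what machine learning can do. With that experience in hand, it is time to step back and summarize. This article again works through a case study, this time sentiment analysis, from the perspective of a .NET developer who is new to machine learning, and focuses on the basic concepts and steps for getting started with ML.NET.

When a real-world problem goes beyond what traditional pattern matching can handle, we can instead build a simulation that reproduces the facts already observed as closely as possible (usually called fitting), and then reuse that stable simulation (usually called a model) to estimate, for conditions that have not occurred yet, the probability that the same outcome will or will not occur. That is the best opportunity to apply machine learning. It is also why machine learning usually depends on large amounts of data: with too little historical data, the simulation reproduces the past poorly, and the estimates naturally carry larger errors. So besides caring about the accuracy and completeness of your data, you also need to learn to build up its volume.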

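To solve a problem with machine learning, you generally go through the following steps:

1. Describe the scenario in which the problem arises

2. Collect data for that specific scenario

3. Preprocess the data

4. Choose a model (algorithm) and train it

5. Validate and tune the trained model

6. Use the model for prediction and analysis

Next, I will go through each step using a case study.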

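Describe the scenario in which the problem arises

For sentiment analysis, I assume the simplest scenario, a single sentence: when we read a sentence, particular words let us judge whether it expresses a positive attitude or a negative one. For example, "My program passed its tests!" is positive, while "The performance of this function is really worrying" is negative. So by identifying the words, we can indirectly infer the sentiment of the person who said the sentence. (To keep this case easy to understand, factors such as sentence segmentation, stress, and punctuation are ignored for now.)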

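Collect data for that specific scenario

To verify the idea above, we first need to collect some useful data. This is actually the step where many developers get stuck: apart from web crawlers and historical data from their own systems, they often cannot think of where else to get data at short notice. Many universities and institutions on the Internet, and even governments, publish open datasets. Here are two recommended sources of relatively high-quality datasets:

UC Irvine Machine Learning Repository, from the University of California, Irvine

kaggle.com, a well-known data science and machine learning competition site

This time I found a dataset on UCI in which each line holds exactly one sentence plus one label, and the label already marks each sentence as positive or negative. It can be downloaded from Sentiment Labelled Sentences Data Set. The format looks like this: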

A very, very, very slow-moving, aimless movie about a distressed, drifting young man.	0
Not sure who was more lost - the flat characters or the audience, nearly half of whom walked out.	0
Attempting artiness with black & white and clever camera angles, the movie disappointed - became even more ridiculous - as the acting was poor and the plot and lines almost non-existent.	0
Very little music or anything to speak of.	0
The best scene in the movie was when Gerardo is trying to find a song that keeps running through his head.	1
The rest of the movie lacks art, charm, meaning... If it's about emptiness, it works I guess because it's empty.	0
Wasted two hours.	0
...

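Looking at each line, there are two Tab-separated fields. The first field is the sentence, which we usually call the feature. The second field is a number, 0 for negative and 1 for positive, which we usually call the target or label. Target values are typically annotated by hand; without them, this kind of machine learning, which fits a model to historical data, cannot be used. A high-quality dataset therefore places high demands on manual annotation, which should be as accurate as possible.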
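To make the feature/label split concrete, here is a minimal sketch of reading one line of this format by hand. It does not use ML.NET; the class name LabelledSentence and its Parse method are hypothetical names chosen for illustration, and the sketch assumes the label is always the text after the last Tab.

```csharp
using System;
using System.Globalization;

public static class LabelledSentence
{
    // Splits one "sentence<TAB>label" line into its feature text and numeric label.
    public static (string Text, float Label) Parse(string line)
    {
        // Split at the last tab so a stray tab inside the sentence does not break parsing.
        int tab = line.LastIndexOf('\t');
        if (tab < 0)
            throw new FormatException("Expected a tab between sentence and label.");

        string text = line.Substring(0, tab).Trim();
        float label = float.Parse(line.Substring(tab + 1).Trim(), CultureInfo.InvariantCulture);

        // The dataset only uses 0 (negative) and 1 (positive).
        if (label != 0f && label != 1f)
            throw new FormatException("Label must be 0 or 1.");

        return (text, label);
    }
}
```

Calling LabelledSentence.Parse on a line such as "Wasted two hours." followed by a Tab and 0 yields the sentence as the feature and 0 as the label. In the article itself, this work is done by the loader class introduced in the next section.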

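Preprocess the data

For the steps to create the project, see the two articles mentioned at the beginning; I will not repeat them here. Getting straight to the point: ML.NET's data handling and the subsequent training flow are generic, a design that also leaves room for extending to other third-party machine learning packages later. First, look at the format of the dataset and create a structure that matches it, which makes importing easier. The LearningPipeline class is the object that defines the machine learning process, so we create it next. The code is as follows: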

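// Paths to the training file and the test file from the downloaded dataset.
const string _dataPath = @".\data\sentiment labelled sentences\imdb_labelled.txt";
const string _testDataPath = @".\data\sentiment labelled sentences\yelp_labelled.txt";

// Describes one line of the dataset: column 0 is the sentence, column 1 is the label.
public class SentimentData
{
    [Column(ordinal: "0")]
    public string SentimentText;

    [Column(ordinal: "1", name: "Label")]
    public float Sentiment;
}

var pipeline = new LearningPipeline();
// Load the tab-separated file (no header row) as SentimentData rows.
pipeline.Add(new TextLoader<SentimentData>(_dataPath, useHeader: false, separator: "tab"));
// Build the "Features" column from the SentimentText column.
pipeline.Add(new TextFeaturizer("Features", "SentimentText"));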

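SentimentData is the data structure I use for importing. As you can see, the Column attribute indicates the position of the corresponding column in the dataset. For the last column, the field that says positive or negative, it additionally marks the field as the target value and gives it an identifying name. TextLoader is the class dedicated to importing text data. TextFeaturizer is the class that specifies which field is the feature and turns its text into numeric features. Not every field in a row can serve as a feature, so when there are many fields you can name the relevant ones explicitly here, so that unrelated fields do not interfere.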
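As a rough intuition for what "turning text into features" means, the sketch below counts how often each word appears in a sentence (a bag-of-words). This is a deliberately simplified illustration and not a description of what TextFeaturizer does internally; the class name BagOfWords, the method Count, and the separator list are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

public static class BagOfWords
{
    // Lower-cases the sentence, splits it on spaces and common punctuation,
    // and counts how many times each word occurs.
    public static Dictionary<string, int> Count(string sentence)
    {
        var counts = new Dictionary<string, int>();
        var separators = new[] { ' ', ',', '.', '!', '?', ';', ':', '-', '"', '(', ')' };

        foreach (string word in sentence.ToLowerInvariant()
                     .Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // TryGetValue leaves current at 0 when the word has not been seen yet.
            counts.TryGetValue(word, out int current);
            counts[word] = current + 1;
        }

        return counts;
    }
}
```

For the first sample sentence of the dataset, the word "very" would get a count of 3. A learner can then work with such numbers instead of raw text.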

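Choose a model (algorithm) and train it

The target in this case is a 0/1 value; in other words, this is exactly a binary classification problem, so I chose the FastTreeBinaryClassifier class as the model. Readers with some machine learning background will know the logistic regression algorithm, which serves roughly the same purpose. When defining the model, you also need to specify a structure for predictions, so that the model outputs its results in that structure. This output structure should generally contain at least the target field. The code snippet is as follows: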

// Output structure: PredictedLabel is true for positive and false for negative.
public class SentimentPrediction
{
    [ColumnName("PredictedLabel")]
    public bool Sentiment;
}

// Add the learner to the pipeline, then train on the data loaded earlier.
pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });
PredictionModel<SentimentData, SentimentPrediction> model = pipeline.Train<SentimentData, SentimentPrediction>();
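Since logistic regression is mentioned above as a comparable approach, here is a minimal sketch of how a logistic-style binary classifier turns a raw score into a probability and then into a positive/negative decision. It is not how FastTreeBinaryClassifier works internally; the names ScoreToLabel, Sigmoid, and IsPositive and the default threshold of 0.5 are assumptions made for illustration.

```csharp
using System;

public static class ScoreToLabel
{
    // Logistic (sigmoid) function: maps any real-valued score into the range (0, 1).
    public static double Sigmoid(double score) => 1.0 / (1.0 + Math.Exp(-score));

    // Treats the sigmoid output as the probability of the positive class
    // and predicts "positive" when it reaches the threshold.
    public static bool IsPositive(double score, double threshold = 0.5) =>
        Sigmoid(score) >= threshold;
}
```

A score of 0 sits exactly on the boundary with probability 0.5; larger scores lean positive and smaller ones lean negative. The bool field in SentimentPrediction plays the same role as the result of IsPositive here.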

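Validate and tune the trained model

Once you have a model, you need to validate it with a test dataset to see whether the fit meets expectations. BinaryClassificationEvaluator is the evaluation class that corresponds to FastTreeBinaryClassifier, and the evaluation results are stored in the BinaryClassificationMetrics class. The code snippet is as follows: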

// Load the test file with the same layout as the training file.
var testData = new TextLoader<SentimentData>(_testDataPath, useHeader: false, separator: "tab");
var evaluator = new BinaryClassificationEvaluator();
// Run the trained model over the test data and collect the quality metrics.
BinaryClassificationMetrics metrics = evaluator.Evaluate(model, testData);

Console.WriteLine();
Console.WriteLine("PredictionModel quality metrics evaluation");
Console.WriteLine("------------------------------------------");
Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
Console.WriteLine($"Auc: {metrics.Auc:P2}");
Console.WriteLine($"F1Score: {metrics.F1Score:P2}");

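Accuracy, Auc, and F1Score are common evaluation metrics that score how often the model is right and how it errs. If the scores are low, adjust the parameter values used when defining the model in the previous step. For detailed explanations, see: Machine learning glossary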
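To show what two of these numbers measure, here is a small sketch that computes accuracy and F1 score by hand from true labels and predicted labels, using the standard definitions. It is independent of ML.NET; the class name BinaryMetrics and the method Compute are hypothetical. AUC is left out because it needs the model's raw scores rather than just the predicted labels.

```csharp
using System;
using System.Collections.Generic;

public static class BinaryMetrics
{
    // Returns accuracy and F1 score for parallel lists of true and predicted labels.
    public static (double Accuracy, double F1) Compute(
        IReadOnlyList<bool> actual, IReadOnlyList<bool> predicted)
    {
        if (actual.Count != predicted.Count || actual.Count == 0)
            throw new ArgumentException("Both lists must be non-empty and of equal length.");

        // Tally the four cells of the confusion matrix.
        int tp = 0, tn = 0, fp = 0, fn = 0;
        for (int i = 0; i < actual.Count; i++)
        {
            if (predicted[i] && actual[i]) tp++;
            else if (!predicted[i] && !actual[i]) tn++;
            else if (predicted[i]) fp++;
            else fn++;
        }

        // Accuracy: share of all predictions that are correct.
        double accuracy = (double)(tp + tn) / actual.Count;

        // F1 = 2*TP / (2*TP + FP + FN); reported as 0 when the denominator is 0.
        int denominator = 2 * tp + fp + fn;
        double f1 = denominator == 0 ? 0.0 : 2.0 * tp / denominator;

        return (accuracy, f1);
    }
}
```

Accuracy alone can look good on an unbalanced dataset, which is one reason F1, which balances precision and recall, is usually reported next to it.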

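Use the model for prediction and analysis

Once you have trained a model you are satisfied with, you can put it to real use. In essence, you take some data that has no manually annotated result and let the model analyze it and return a prediction of which target value each item most likely matches. The code snippet is as follows: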

// New sentences to classify; Sentiment = 0 is only a placeholder label.
IEnumerable<SentimentData> sentiments = new[]
{
    new SentimentData
    {
        SentimentText = "Contoso's 11 is a wonderful experience",
        Sentiment = 0
    },
    new SentimentData
    {
        SentimentText = "The acting in this movie is very bad",
        Sentiment = 0
    },
    new SentimentData
    {
        SentimentText = "Joe versus the Volcano Coffee Company is a great film.",
        Sentiment = 0
    }
};

IEnumerable<SentimentPrediction> predictions = model.Predict(sentiments);

Console.WriteLine();
Console.WriteLine("Sentiment Predictions");
Console.WriteLine("---------------------");

// Pair each input sentence with its prediction so they can be printed together.
var sentimentsAndPredictions = sentiments.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));
foreach (var item in sentimentsAndPredictions)
{
    Console.WriteLine($"Sentiment: {item.sentiment.SentimentText} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")}");
}

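The output shows that the classifications match what a human would judge. The scores in the validation stage are not high, but that is normal: without any tuning, neutral and ambiguous sentences interfere with the predictions.

With this, new sentences can be handed to the program for automatic classification. Simple, isn't it? I hope this article makes .NET developers eager to try ML.NET.

By the way, Microsoft Azure also has an online machine learning studio at https://studio.azureml.net/, with a gallery of related AI projects at https://gallery.azure.ai/browse. If you cannot set up a local machine learning environment for now, or cannot find a project to practice on, give it a try.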

Finally, here is the complete code of the project:

using System;
using Microsoft.ML.Models;
using Microsoft.ML.Runtime;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;

namespace SentimentAnalysis
{
    class Program
    {
        const string _dataPath = @".\data\sentiment labelled sentences\imdb_labelled.txt";
        const string _testDataPath = @".\data\sentiment labelled sentences\yelp_labelled.txt";

        public class SentimentData
        {
            [Column(ordinal: "0")]
            public string SentimentText;

            [Column(ordinal: "1", name: "Label")]
            public float Sentiment;
        }

        public class SentimentPrediction
        {
            [ColumnName("PredictedLabel")]
            public bool Sentiment;
        }

        public static PredictionModel<SentimentData, SentimentPrediction> Train()
        {
            var pipeline = new LearningPipeline();
            pipeline.Add(new TextLoader<SentimentData>(_dataPath, useHeader: false, separator: "tab"));
            pipeline.Add(new TextFeaturizer("Features", "SentimentText"));
            pipeline.Add(new FastTreeBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });

            PredictionModel<SentimentData, SentimentPrediction> model = pipeline.Train<SentimentData, SentimentPrediction>();
            return model;
        }

        public static void Evaluate(PredictionModel<SentimentData, SentimentPrediction> model)
        {
            var testData = new TextLoader<SentimentData>(_testDataPath, useHeader: false, separator: "tab");
            var evaluator = new BinaryClassificationEvaluator();
            BinaryClassificationMetrics metrics = evaluator.Evaluate(model, testData);

            Console.WriteLine();
            Console.WriteLine("PredictionModel quality metrics evaluation");
            Console.WriteLine("------------------------------------------");
            Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
            Console.WriteLine($"Auc: {metrics.Auc:P2}");
            Console.WriteLine($"F1Score: {metrics.F1Score:P2}");
        }

        public static void Predict(PredictionModel<SentimentData, SentimentPrediction> model)
        {
            IEnumerable<SentimentData> sentiments = new[]
            {
                new SentimentData
                {
                    SentimentText = "Contoso's 11 is a wonderful experience",
                    Sentiment = 0
                },
                new SentimentData
                {
                    SentimentText = "The acting in this movie is very bad",
                    Sentiment = 0
                },
                new SentimentData
                {
                    SentimentText = "Joe versus the Volcano Coffee Company is a great film.",
                    Sentiment = 0
                }
            };

            IEnumerable<SentimentPrediction> predictions = model.Predict(sentiments);

            Console.WriteLine();
            Console.WriteLine("Sentiment Predictions");
            Console.WriteLine("---------------------");

            var sentimentsAndPredictions = sentiments.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));
            foreach (var item in sentimentsAndPredictions)
            {
                Console.WriteLine($"Sentiment: {item.sentiment.SentimentText} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")}");
            }
            Console.WriteLine();
        }

        static void Main(string[] args)
        {
            var model = Train();
            Evaluate(model);
            Predict(model);
        }
    }
}
