

Building a Suicidal Tweet Classifier Using NLP


Over the years, suicide has been one of the major causes of death worldwide. According to Wikipedia, suicide resulted in 828,000 global deaths in 2015, up from 712,000 deaths in 1990, making it the 10th leading cause of death worldwide. There is also increasing evidence that the Internet and social media can influence suicide-related behaviour. Using Natural Language Processing, a field in Machine Learning, I built a very simple suicidal ideation classifier which predicts whether a text is likely to be suicidal or not.


Data

I used a Twitter crawler which I found on GitHub and made a few changes to the code so that it strips hashtags, links, URLs and symbols whenever it crawls data from Twitter. The data were crawled based on query parameters containing phrases like:


Depressed, hopeless, Promise to take care of, I dont belong here, Nobody deserve me, I want to die etc.

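The crawler itself is not reproduced here, but the clean-up it applies to each crawled tweet (dropping hashtags, mentions, links and stray symbols) might look roughly like the sketch below; the function name and regex patterns are illustrative, not the crawler's actual code.

import re

def clean_crawled_tweet(text):
    text = re.sub(r'http\S+|www\.\S+', '', text)  # drop links and URLs
    text = re.sub(r'[@#]\w+', '', text)           # drop mentions and hashtags
    text = re.sub(r"[^\w\s']", ' ', text)         # drop remaining symbols
    return ' '.join(text.split())                 # collapse extra whitespace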

Although some of the texts were in no way related to suicide at all, I had to manually label the data, which came to about 8,200 rows of tweets. I also sourced more Twitter data and concatenated it with what I already had, which gave me enough to train on.

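Combining the manually labelled crawl with the extra Twitter data is a one-liner in pandas; here is a rough sketch, where the two input file names and the 0/1 label column are assumptions rather than the actual files used.

import pandas as pd

# Hypothetical file names; each CSV has a 'tweet' column and a manually
# assigned 'label' column (1 = suicidal, 0 = not suicidal).
crawled = pd.read_csv('crawled_tweets_labelled.csv')
extra = pd.read_csv('extra_tweets_labelled.csv')

combined = pd.concat([crawled, extra], ignore_index=True).drop_duplicates(subset='tweet')
combined.to_csv('data.csv', index=False)  # this is the file loaded below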

Building the Model

Data Preprocessing

I imported the following libraries:


import pickle
import re
import numpy as np
import pandas as pd
from tqdm import tqdm
import nltk
nltk.download('stopwords')

I then wrote a function to clean the text data: it removes any form of HTML markup, keeps emoticon characters, removes non-word characters and finally converts the text to lowercase.


def preprocess_tweet(text):
    # Strip any HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Find emoticons such as :) :( :-) ;D so they can be kept
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Remove non-word characters, lowercase, then re-append the emoticons
    lowercase_text = re.sub(r'[\W]+', ' ', text.lower())
    text = lowercase_text + ' '.join(emoticons).replace('-', '')
    return text
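A quick check on a toy string (an illustrative example, not a tweet from the dataset) shows what the cleaning does: the markup and punctuation are stripped, the text is lowercased, and the emoticons are appended at the end.

preprocess_tweet('</a>This :) is :( a test :-)!')
# -> 'this is a test :) :( :)'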

After that, I applied the preprocess_tweet function to the tweet dataset to clean the data.


tqdm.pandas()
df = pd.read_csv('data.csv')
df['tweet'] = df['tweet'].progress_apply(preprocess_tweet)

Then I converted the text to tokens using the .split() method and applied word stemming to reduce each word to its root form.


from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

Then I imported the NLTK stopwords corpus to remove stop words from the text.


from nltk.corpus import stopwords
stop = stopwords.words('english')

Testing the function on a single sentence:


[w for w in tokenizer_porter('a runner likes running and runs a lot') if w not in stop]

Output:


['runner', 'like', 'run', 'run', 'lot']

Vectorizer

For this project, I used the Hashing Vectorizer because it is data-independent: it has a very low memory footprint, scales to large datasets, and does not store a vocabulary dictionary in memory. I then created a tokenizer function for the Hashing Vectorizer.


def tokenizer(text):
    # Strip any HTML markup
    text = re.sub(r'<[^>]*>', '', text)
    # Keep emoticons, as in preprocess_tweet
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text.lower())
    # Remove non-word characters and lowercase, then re-append the emoticons
    text = re.sub(r'[\W]+', ' ', text.lower())
    text += ' '.join(emoticons).replace('-', '')
    # Stem each token and drop English stop words
    tokenized = [w for w in tokenizer_porter(text) if w not in stop]
    return tokenized
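As a quick sanity check (the sentence is made up for illustration), the combined tokenizer stems the words, drops English stop words and keeps the emoticon.

tokenizer('I feel so hopeless and depressed :(')
# -> ['feel', 'hopeless', 'depress', ':(']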

Then I created the Hashing Vectorizer object.


from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         tokenizer=tokenizer)
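As a small illustration (the sentence is made up), hashing a single string yields a sparse row with 2**21 columns, and no fitting step is required because the vectorizer keeps no vocabulary.

X_example = vect.transform(['I feel so alone and hopeless'])
print(X_example.shape)  # (1, 2097152)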

Model

For the model, I used the stochastic gradient descent (SGD) classifier algorithm.


from sklearn.linear_model import SGDClassifier

# loss='log' gives a logistic-regression model, so predict_proba is available later
# (newer scikit-learn versions call this loss 'log_loss')
clf = SGDClassifier(loss='log', random_state=1)

Training and Validation

X = df["tweet"].to_list()
y = df['label']

For the split, I used 80% of the data for training and 20% for testing.


from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                     test_size=0.20,
                                                     random_state=0)

Then I transformed the text data to vectors with the Hashing Vectorizer we created earlier:


X_train = vect.transform(X_train)
X_test = vect.transform(X_test)

Finally, I fit the training data to the algorithm:


classes = np.array([0, 1])
clf.partial_fit(X_train, y_train,classes=classes)

Let's test the accuracy on our test data:


print('Accuracy: %.3f' % clf.score(X_test, y_test))

Output:


Accuracy: 0.912
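Accuracy alone can be misleading if the two classes are imbalanced, so an optional extra check is a per-class breakdown on the same held-out set; here is a minimal sketch using scikit-learn's metrics.

from sklearn.metrics import classification_report, confusion_matrix

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['negative', 'positive']))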

I got an accuracy of 91%, which is fair enough. After that, I updated the model with the held-out data as well:


clf = clf.partial_fit(X_test, y_test)

Testing and Making Predictions

I passed the text “I’ll kill myself am tired of living depressed and alone” to the model.


label = {0: 'negative', 1: 'positive'}
example = ["I'll kill myself am tired of living depressed and alone"]
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %
      (label[clf.predict(X)[0]], np.max(clf.predict_proba(X)) * 100))

And I got the output:


Prediction: positive
Probability: 93.76%

And when I used the following text “It’s such a hot day, I’d like to have ice cream and visit the park”, I got the following prediction:


Prediction: negative
Probability: 97.91%

The model predicted both cases correctly. And that's how you build a simple suicidal tweet classifier.

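pickle was imported at the top but not used above; one natural use, sketched here with assumed file names, is to persist the trained classifier and the stop-word list so the model and tokenizer can be reloaded later without retraining (the Hashing Vectorizer itself keeps no fitted state).

# File names are illustrative. Only the classifier and the stop-word list
# need to be saved; the HashingVectorizer has no fitted state.
with open('stopwords.pkl', 'wb') as f:
    pickle.dump(stop, f)
with open('classifier.pkl', 'wb') as f:
    pickle.dump(clf, f)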

You can find the notebook I used for this article here.


Thanks for reading 😊


Translated from: https://towardsdatascience.com/building-a-suicidal-tweet-classifier-using-nlp-ff6ccd77e971

