Datawhale: Getting Started with NLP from Scratch, News Text Classification, Task04
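1 FastText learning path
FastText is a tool for computing word vectors and classifying text that Facebook recently open-sourced. The learning path for FastText is:
The underlying theory is not covered here; for detailed tutorials see: https://fasttext.cc/docs/en/support.html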
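2 Installing FastText
2.1 Installing the command-line tool
Download the source code from GitHub, then build the fasttext executable.
(1) Command: git clone https://github.com/facebookresearch/fastText.git
(2) Commands: cd fastText/ and then ls
(3) Command: make
2.2 Installing the Python module
(1) Install directly with pip: pip install fasttext
(2) Install from source: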
3 Text classification with FastText
3.1 Example
(1) Download the data
# Download the data
wget https://dl.fbaipublicfiles.com/fasttext/data/cooking.stackexchange.tar.gz
# Extract the archive
tar xvzf cooking.stackexchange.tar.gz
# Show the first few lines
head cooking.stackexchange.txt
(2) Split the dataset
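The split in this step keeps the first 12404 lines of the file for training and the last 3000 for validation, using head and tail. For readers working from Python, the same idea can be sketched as below. The function names head_tail_split and split_file are hypothetical, and refusing to split when the two parts would overlap is a choice made for this sketch, not something head and tail do.

```python
from typing import List, Sequence, Tuple


def head_tail_split(lines: Sequence[str], n_train: int, n_valid: int) -> Tuple[List[str], List[str]]:
    """Return the first n_train lines and the last n_valid lines."""
    if n_train < 0 or n_valid < 0:
        raise ValueError("sizes must be non-negative")
    if n_train + n_valid > len(lines):
        raise ValueError("training and validation parts would overlap")
    train = list(lines[:n_train])
    # Slice from an explicit start index: lines[-0:] would return every line.
    valid = list(lines[len(lines) - n_valid:])
    return train, valid


def split_file(src: str, train_path: str, valid_path: str, n_train: int, n_valid: int) -> None:
    """Read src and write the two parts to train_path and valid_path."""
    with open(src, encoding="utf-8") as f:
        lines = f.readlines()
    train, valid = head_tail_split(lines, n_train, n_valid)
    with open(train_path, "w", encoding="utf-8") as f:
        f.writelines(train)
    with open(valid_path, "w", encoding="utf-8") as f:
        f.writelines(valid)
```

With the file from step (1) in the working directory, split_file("cooking.stackexchange.txt", "cooking.train", "cooking.valid", 12404, 3000) would produce the same two files as the shell commands in this step.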
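# Inspect the data (line, word and byte counts)
wc cooking.stackexchange.txt
# Split into training and validation sets
head -n 12404 cooking.stackexchange.txt > cooking.train
tail -n 3000 cooking.stackexchange.txt > cooking.valid
(3) Training and tuning
This walkthrough uses the command line; for the Python equivalent see: https://fasttext.cc/docs/en/supervised-tutorial.html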
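For orientation, a Python counterpart of the command-line training and prediction steps in this subsection might look like the sketch below. It relies on fasttext.train_supervised, model.save_model and model.predict from the fasttext Python module; the helper names strip_label_prefix, top_label and train_and_predict are hypothetical and exist only for illustration.

```python
from typing import Sequence, Tuple


def strip_label_prefix(label: str, prefix: str = "__label__") -> str:
    """Turn '__label__baking' into 'baking'; leave other strings unchanged."""
    return label[len(prefix):] if label.startswith(prefix) else label


def top_label(prediction: Tuple[Sequence[str], Sequence[float]]) -> str:
    """Take the (labels, probabilities) pair from model.predict and return the best label."""
    labels, _probs = prediction
    return strip_label_prefix(labels[0])


def train_and_predict(train_path: str, model_path: str, text: str) -> str:
    """Train a supervised model, save it, and classify one line of text."""
    import fasttext  # imported lazily so the helpers above work without the package

    model = fasttext.train_supervised(input=train_path)
    model.save_model(model_path)
    # predict expects a single line of text, without a newline character
    return top_label(model.predict(text))
```

Assuming cooking.train exists, train_and_predict("cooking.train", "model_cooking.bin", "Which baking dish is best to bake a banana bread ?") would return the most likely tag for that question.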
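fasttext accepts the following parameters:
Training:
./fasttext supervised -input cooking.train -output model_cooking
Prediction (the trailing - makes fasttext read the text to classify from standard input):
./fasttext predict model_cooking.bin -
3.2 FastText analysis of the news text data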
import fasttext
import pandas as pd
from sklearn.metrics import f1_score

# Load the tab-separated training set
train_df = pd.read_csv('data/data45216/train_set.csv', sep='\t')
# Convert the labels to the format FastText expects
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
# Train on everything except the last 5000 rows, which are held out for validation
train_df[['text', 'label_ft']].iloc[:-5000].to_csv('train.csv', index=None, header=None, sep='\t')

model = fasttext.train_supervised('train.csv', lr=1.0, wordNgrams=2, verbose=2, minCount=1, epoch=25, loss='hs')

# Predict on the held-out rows and strip the __label__ prefix
val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))
The output is:
4 Tuning FastText
The parameters of FastText's train_supervised are:
These parameters can be set by hand, or tuned with FastText's automatic hyperparameter tuning (autotune).
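As a reference for manual tuning, the sketch below records the train_supervised defaults as listed in the fastText Python README at the time of writing. Treat the values as assumptions and check them against the installed version; the thread parameter is left out because its default depends on the machine. The names SUPERVISED_DEFAULTS and non_default are hypothetical.

```python
# Defaults as listed in the fastText Python README (an assumption: verify against your version).
SUPERVISED_DEFAULTS = {
    "lr": 0.1,                # learning rate
    "dim": 100,               # size of word vectors
    "ws": 5,                  # size of the context window
    "epoch": 5,               # number of epochs
    "minCount": 1,            # minimal number of word occurrences
    "minCountLabel": 1,       # minimal number of label occurrences
    "minn": 0,                # min length of char n-gram
    "maxn": 0,                # max length of char n-gram
    "neg": 5,                 # number of negatives sampled
    "wordNgrams": 1,          # max length of word n-gram
    "loss": "softmax",        # one of ns, hs, softmax, ova
    "bucket": 2000000,        # number of hash buckets
    "lrUpdateRate": 100,      # rate of updates for the learning rate
    "t": 0.0001,              # sampling threshold
    "label": "__label__",     # label prefix
    "verbose": 2,             # verbosity level
    "pretrainedVectors": "",  # path to a .vec file, empty for none
}


def non_default(**params):
    """Validate parameter names and keep only values that differ from the listed defaults."""
    unknown = set(params) - set(SUPERVISED_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameter(s): {sorted(unknown)}")
    return {k: v for k, v in params.items() if v != SUPERVISED_DEFAULTS[k]}


# The settings used in section 3.2 of this post
news_changes = non_default(lr=1.0, wordNgrams=2, verbose=2, minCount=1, epoch=25, loss="hs")
```

Under these assumed defaults, the section 3.2 call only changes lr, wordNgrams, epoch and loss; verbose=2 and minCount=1 restate the defaults.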
4.1 Command line
(1) Validation set: -autotune-validation
./fasttext supervised -input cooking.train -output model_cooking -autotune-validation cooking.valid
(2) Time budget in seconds: -autotune-duration
./fasttext supervised -input cooking.train -output model_cooking -autotune-validation cooking.valid -autotune-duration 600
(3) Model size: -autotune-modelsize
./fasttext supervised -input cooking.train -output model_cooking -autotune-validation cooking.valid -autotune-modelsize 2M
(4) Metric: -autotune-metric
-autotune-metric f1:__label__baking
-autotune-metric precisionAtRecall:30
-autotune-metric precisionAtRecall:30:__label__baking
-autotune-metric recallAtPrecision:30
-autotune-metric recallAtPrecision:30:__label__baking
4.2 Python module
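The Python module takes the same autotune options as keyword arguments of train_supervised, shown one at a time in the items that follow. As a rough sketch, they can be gathered into one helper that passes only the options actually set, so that FastText's own defaults apply to the rest. The names autotune_kwargs and train_autotuned are hypothetical.

```python
from typing import Any, Dict, Optional


def autotune_kwargs(validation_file: str,
                    duration: Optional[int] = None,
                    model_size: Optional[str] = None,
                    metric: Optional[str] = None) -> Dict[str, Any]:
    """Collect autotune options, leaving out anything that was not set."""
    kwargs: Dict[str, Any] = {"autotuneValidationFile": validation_file}
    if duration is not None:
        kwargs["autotuneDuration"] = duration      # time budget in seconds
    if model_size is not None:
        kwargs["autotuneModelSize"] = model_size   # for example "2M"
    if metric is not None:
        kwargs["autotuneMetric"] = metric          # for example "f1:__label__baking"
    return kwargs


def train_autotuned(train_file: str, validation_file: str, **options: Any):
    """Train with autotune; needs the fasttext package to be installed."""
    import fasttext  # imported lazily so autotune_kwargs works without the package

    return fasttext.train_supervised(input=train_file, **autotune_kwargs(validation_file, **options))
```

For example, train_autotuned("cooking.train", "cooking.valid", duration=600) corresponds to the command-line call with -autotune-duration 600.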
(1) Validation set: autotuneValidationFile
model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid')
(2) Time budget in seconds: autotuneDuration
model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid', autotuneDuration=600)
(3) Model size: autotuneModelSize
model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid', autotuneModelSize="2M")
(4) Metric: autotuneMetric
model = fasttext.train_supervised(input='cooking.train', autotuneValidationFile='cooking.valid', autotuneMetric="f1:__label__baking")
5 Homework
Train using automatic tuning:
import fasttext
import pandas as pd
from sklearn.metrics import f1_score

train_df = pd.read_csv('data/data45216/train_set.csv', sep='\t')
# Convert the labels to the format FastText expects
train_df['label_ft'] = '__label__' + train_df['label'].astype(str)
# Split into training and validation sets
train_df[['text', 'label_ft']].iloc[:10000].to_csv('train.csv', index=None, header=None, sep='\t')
train_df[['text', 'label_ft']].iloc[10000:15000].to_csv('valid.csv', index=None, header=None, sep='\t')
# Build the model. autotuneMetric is left at its default ("f1"): the
# "f1:__label__baking" metric comes from the cooking example, and a per-label
# metric here would need one of the news labels instead.
model = fasttext.train_supervised('train.csv', lr=1.0, wordNgrams=2, verbose=2, minCount=1, epoch=25, loss='hs', autotuneValidationFile='valid.csv')
# Predict on the last 5000 rows and strip the __label__ prefix
val_pred = [model.predict(x)[0][0].split('__')[-1] for x in train_df.iloc[-5000:]['text']]
print(f1_score(train_df['label'].values[-5000:].astype(str), val_pred, average='macro'))
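Both scripts in this post score their predictions with scikit-learn's f1_score(..., average='macro'). To make that number easier to interpret, here is a dependency-free sketch of the same calculation: an F1 score per label, then an unweighted mean over every label that appears in the true or predicted values. The function name macro_f1 is hypothetical, and the sketch is meant as an illustration, not a replacement for scikit-learn.

```python
from typing import Hashable, Sequence


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-label F1 over all labels seen in y_true or y_pred."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = set(y_true) | set(y_pred)
    if not labels:
        raise ValueError("no samples to score")
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty denominator
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because every label counts equally, a rare news category that the model never predicts correctly pulls the macro score down as much as a common one would.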