Installing fasttext: a brief introduction to fasttext

Published: 2024/10/8


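Things to watch out for:

1. The package works on Linux and macOS only.

2. The label prefix has two underscores on each side: `__label__`. Two, not one!

Environment: Python 2.7, Linux

I have to take back an earlier claim: the official package currently works only on Linux and macOS. Sorry for misleading everyone.

This post tests fastText, the text classification model open-sourced by Facebook, which is based on a shallow neural network.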

Install the fasttext Python package:

    pip install fasttext

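Step 1: obtain the text to classify. I used the Tsinghua University news corpus directly; the download link is in the third post of this text series.

Output format: one sample per line, written as sample text + sample label.

Note: this step is optional. You can start from step 2, which provides the preprocessed text files. I wrote this step down mainly to record how the raw text was processed.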
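To make the line format concrete, here is a minimal sketch of writing and parsing one sample line. The helper names `format_sample` and `parse_sample` are hypothetical (they are not part of fasttext), and the tokens are made up for illustration. The code is written to run under both Python 2.7 and Python 3.

```python
LABEL_PREFIX = "__label__"  # two underscores on each side


def format_sample(tokens, label):
    """Build one line: space-joined tokens, a tab, then the prefixed label."""
    # Tabs and newlines inside the text would break the one-line-per-sample
    # layout, so they are replaced with spaces first.
    text = " ".join(tokens).replace("\t", " ").replace("\n", " ")
    return text + "\t" + LABEL_PREFIX + label + "\n"


def parse_sample(line):
    """Split a line back into (text, label), dropping the label prefix."""
    text, tagged = line.rstrip("\n").split("\t")
    if not tagged.startswith(LABEL_PREFIX):
        raise ValueError("label must start with " + LABEL_PREFIX)
    return text, tagged[len(LABEL_PREFIX):]


line = format_sample(["stock", "prices", "rose"], "stock")
print(repr(line))
print(parse_sample(line))
```

Checking the prefix when parsing catches the single-underscore mistake early, before a model is trained on mislabeled lines.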

    import os

    import jieba

    # Path to my corpus; change it to match your own directory layout.
    basedir = "/home/li/corpus/news/"
    dir_list = ['affairs','constellation','economic','edu','ent','fashion','game','home','house','lottery','science','sports','stock']

    # Build the fasttext training and test sets.
    ftrain = open("news_fasttext_train.txt","w")
    ftest = open("news_fasttext_test.txt","w")

    for e in dir_list:
        indir = basedir + e + '/'
        files = os.listdir(indir)
        count = 0
        for fileName in files:
            count += 1
            filepath = indir + fileName
            with open(filepath,'r') as fr:
                text = fr.read()
            # Round-trip through unicode to check that the file is valid UTF-8.
            text = text.decode("utf-8").encode("utf-8")
            # Segment with jieba; tabs and newlines are replaced with spaces
            # so that each sample stays on a single line.
            seg_text = jieba.cut(text.replace("\t"," ").replace("\n"," "))
            outline = " ".join(seg_text)
            # One line per sample: "<tokens>\t__label__<class>"
            outline = outline.encode("utf-8") + "\t__label__" + e + "\n"
            if count < 10000:
                # Files 1 to 9999 of each class go to the training set.
                ftrain.write(outline)
                ftrain.flush()
            elif count < 20000:
                # Files 10000 to 19999 of each class go to the test set.
                ftest.write(outline)
                ftest.flush()
            else:
                break

    ftrain.close()
    ftest.close()

Step 2: classify with fasttext, using the fasttext Python package.

Preprocessed data (download from Baidu Netdisk):

news_fasttext_train.txt

news_fasttext_test.txt

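    # -*- coding: utf-8 -*-
    import logging

    import fasttext

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    # Train the model. fasttext.supervised is the API of the 0.8.x releases
    # of the fasttext pip package; newer releases of the official package
    # expose fasttext.train_supervised instead.
    classifier = fasttext.supervised("news_fasttext_train.txt","news_fasttext.model",label_prefix="__label__")

    # Alternatively, load a previously trained model:
    # classifier = fasttext.load_model('news_fasttext.model.bin', label_prefix='__label__')

    # Test the model
    result = classifier.test("news_fasttext_test.txt")
    print result.precision
    print result.recall

Output:

    0.92240420242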

fasttext appears to report only the overall precision and recall, so getting per-class statistics means writing the code yourself.

    # -*- coding: utf-8 -*-
    """
    Created on Wed Oct 18 14:17:27 2017

    @author: xiaoguangli
    """
    import logging

    import fasttext

    logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

    classifier = fasttext.load_model('news_fasttext.model.bin', label_prefix='__label__')

    labels_right = []  # true labels from the test file
    texts = []
    with open("news_fasttext_test.txt") as fr:
        for line in fr:
            line = line.decode("utf-8").rstrip()
            labels_right.append(line.split("\t")[1].replace("__label__",""))
            texts.append(line.split("\t")[0])

    # predict() returns one list of labels per text; keep the top label.
    labels_predict = [e[0] for e in classifier.predict(texts)]
    # print labels_predict  # very long: one entry per test sample

    text_labels = list(set(labels_right))
    text_predict_labels = list(set(labels_predict))
    print text_predict_labels
    print text_labels

    A = dict.fromkeys(text_labels,0)  # correct predictions per class
    B = dict.fromkeys(text_labels,0)  # test-set samples per class
    C = dict.fromkeys(text_predict_labels,0)  # predictions per class
    for i in range(0,len(labels_right)):
        B[labels_right[i]] += 1
        C[labels_predict[i]] += 1
        if labels_right[i] == labels_predict[i]:
            A[labels_right[i]] += 1

    print A
    print B
    print C

    # Compute precision, recall, and F-score for each class
    for key in B:
        try:
            r = float(A[key]) / float(B[key])
            p = float(A[key]) / float(C[key])
            f = p * r * 2 / (p + r)
            print "%s:\t p:%f\t r:%f\t f:%f" % (key,p,r,f)
        except (KeyError, ZeroDivisionError):
            # The class was never predicted, or p + r is zero.
            print "error:", key, "right:", A.get(key,0), "real:", B.get(key,0), "predict:",C.get(key,0)
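The same bookkeeping can be packed into a small reusable function. The sketch below is an illustrative alternative using `collections.Counter`, not the code that produced the results in this post. The function name `per_class_metrics` and the toy labels are hypothetical, and treating an undefined precision or recall as 0.0 is a convention chosen here, not something fasttext prescribes.

```python
from collections import Counter


def per_class_metrics(true_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} for every label seen."""
    pairs = zip(true_labels, predicted_labels)
    correct = Counter(t for t, p in pairs if t == p)   # plays the role of A
    actual = Counter(true_labels)                      # plays the role of B
    predicted = Counter(predicted_labels)              # plays the role of C
    metrics = {}
    for label in set(actual) | set(predicted):
        # Undefined ratios (zero denominator) are reported as 0.0.
        p = correct[label] / float(predicted[label]) if predicted[label] else 0.0
        r = correct[label] / float(actual[label]) if actual[label] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        metrics[label] = (p, r, f)
    return metrics


# Toy data: "lottery" is predicted but never occurs as a true label.
truth = ["sports", "sports", "stock", "stock", "edu"]
preds = ["sports", "stock", "stock", "stock", "lottery"]
for label, (p, r, f) in sorted(per_class_metrics(truth, preds).items()):
    print("%s: p=%.3f r=%.3f f=%.3f" % (label, p, r, f))
```

Iterating over the union of true and predicted labels means a class that only shows up in the predictions still gets a row instead of raising a `KeyError`.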

Class data from the experiment (in print order: predicted label set, true label set, A, B, C)

[u'affairs', u'fashion', u'lottery', u'house', u'science', u'sports', u'game', u'economic', u'ent', u'edu', u'home', u'constellation', u'stock']

['affairs', 'fashion', 'house', 'sports', 'game', 'economic', 'ent', 'edu', 'home', 'stock', 'science']

{'science': 8415, 'affairs': 8257, 'fashion': 3173, 'house': 9491, 'sports': 9739, 'game': 9506, 'economic': 9235, 'ent': 9665, 'edu': 9491, 'home': 9315, 'stock': 9015}

{'science': 10000, 'affairs': 10000, 'fashion': 3369, 'house': 10000, 'sports': 10000, 'game': 10000, 'economic': 10000, 'ent': 10000, 'edu': 10000, 'home': 10000, 'stock': 10000}

{u'affairs': 8562, u'fashion': 3585, u'lottery': 96, u'science': 9088, u'edu': 10068, u'sports': 10099, u'game': 10151, u'economic': 10131, u'ent': 10798, u'house': 10000, u'home': 10103, u'constellation': 432, u'stock': 10256}

Experimental results

science: p:0.925946 r:0.841500 f:0.881706

affairs: p:0.964377 r:0.825700 f:0.889667

fashion: p:0.885077 r:0.941822 f:0.912568

house: p:0.949100 r:0.949100 f:0.949100

sports: p:0.964353 r:0.973900 f:0.969103

game: p:0.936459 r:0.950600 f:0.943477

economic: p:0.911559 r:0.923500 f:0.917490

ent: p:0.895073 r:0.966500 f:0.929416

edu: p:0.942690 r:0.949100 f:0.945884

home: p:0.922003 r:0.931500 f:0.926727

stock: p:0.878998 r:0.901500 f:0.890107
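As a cross-check, the per-class figures can be recomputed from the three count dictionaries printed earlier, using the script's own definitions (recall = A/B, precision = A/C). The snippet below copies the counts for three classes from that output; it is only a sanity check and was not part of the original experiment.

```python
# Counts copied from the printed dictionaries A, B, and C.
A = {"science": 8415, "affairs": 8257, "house": 9491}     # correct predictions
B = {"science": 10000, "affairs": 10000, "house": 10000}  # test-set samples
C = {"science": 9088, "affairs": 8562, "house": 10000}    # predictions

for key in sorted(A):
    r = A[key] / float(B[key])  # recall: correct / samples of this class
    p = A[key] / float(C[key])  # precision: correct / predictions of this class
    f = 2 * p * r / (p + r)     # harmonic mean of precision and recall
    print("%s: p:%f r:%f f:%f" % (key, p, r, f))
```

For science, for example, 8415 correct out of 10000 test samples is the recall, while 8415 correct out of 9088 predictions is the precision.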

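The results show that fasttext classifies quite well: without any parameter tuning, the per-class F-scores range from about 0.88 to 0.97. In the predictions, however, extra classes such as constellation somehow appear even though the test labels do not contain them. Still looking for the cause.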
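The extra classes can be pinpointed by comparing the keys of dictionary B (true labels in the test set) with those of dictionary C (predicted labels). The sketch below simply copies both dictionaries from the printed output; the variable name `only_predicted` is introduced here for illustration.

```python
# Test-set counts (B) and prediction counts (C), copied from the printed output.
B = {'science': 10000, 'affairs': 10000, 'fashion': 3369, 'house': 10000,
     'sports': 10000, 'game': 10000, 'economic': 10000, 'ent': 10000,
     'edu': 10000, 'home': 10000, 'stock': 10000}
C = {'affairs': 8562, 'fashion': 3585, 'lottery': 96, 'science': 9088,
     'edu': 10068, 'sports': 10099, 'game': 10151, 'economic': 10131,
     'ent': 10798, 'house': 10000, 'home': 10103, 'constellation': 432,
     'stock': 10256}

# Labels the model predicted that never occur as true labels in the test set.
only_predicted = sorted(set(C) - set(B))
print(only_predicted)

# Number of test samples assigned to those labels (all necessarily wrong).
print(sum(C[k] for k in only_predicted))

# Both dictionaries count the same test samples, so the totals should agree.
print("%d %d" % (sum(B.values()), sum(C.values())))
```

Because the model can only predict labels it saw during training, any label in this difference must exist in the training file while being absent from the test file.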

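Correction (2016/11/7): set B shows that the test set contains no lottery or constellation samples at all. During data preparation, about 10,000 articles per class were taken for training first, and these two classes have fewer articles than that, so nothing was left over for their test data. When preparing the data in step 1, the train/test split can therefore be sized according to how many articles the lottery and constellation classes actually have. Or, more bluntly, since these two classes do not meet the required sample count, they can simply be dropped.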
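One way to implement such a size-aware split is sketched below. It is an illustration only: the function name `split_class` and the parameters `test_ratio`, `max_per_split`, and `min_files` (with their default values) are hypothetical and were not used in the experiment above.

```python
def split_class(files, test_ratio=0.5, max_per_split=10000, min_files=100):
    """Split one class's file list into (train, test).

    Classes with fewer than min_files files are dropped (two empty lists).
    Otherwise test_ratio of the files go to the test set, and each split
    is capped at max_per_split files.
    """
    if len(files) < min_files:
        return [], []
    n_test = min(int(len(files) * test_ratio), max_per_split)
    n_train = min(len(files) - n_test, max_per_split)
    return files[:n_train], files[n_train:n_train + n_test]


# Toy file lists standing in for os.listdir() results.
big = ["f%d.txt" % i for i in range(25000)]
small = ["f%d.txt" % i for i in range(3000)]
tiny = ["f%d.txt" % i for i in range(50)]

for name, files in [("big", big), ("small", small), ("tiny", tiny)]:
    train, test = split_class(files)
    print("%s: train=%d test=%d" % (name, len(train), len(test)))
```

With a ratio-based split, a small class contributes to both files, so every label that can be predicted also has test samples to be scored against.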
