Deep Learning Notes: Sentence Pair Matching with Traditional Machine Learning Algorithms (LR, SVM, GBDT, RandomForest)


Sentence pair matching is a very common class of problems in NLP: given two sentences S1 and S2, the task is to decide whether the pair has some particular type of relationship. Formally, the problem can be stated as learning a mapping

    F: (S1, S2) -> y,  y in Y

That is, given two sentences, we need to learn a mapping function whose input is the sentence pair and whose output is one of the labels in the task's label set Y.

A typical example is the paraphrase task, where the goal is to decide whether two sentences are semantically equivalent, so the label set is simply the binary set {equivalent, not equivalent}. Many other tasks also belong to the sentence-pair-matching family, for example similar-question matching and answer selection in question answering systems.

In my previous post I described an unsupervised sentence matching method based on Doc2vec and Word2vec; here I tackle the same task with traditional machine learning algorithms. In the machine learning setting, the mapping F is fitted by training a classification model; once the classifier is trained, any unseen sentence pair can be fed to it and the prediction is output directly.


On the classification algorithms:

Common classification models include logistic regression (LR), naive Bayes, SVM, GBDT, and random forest (RandomForest). The algorithms used in this post are logistic regression (LR), SVM, GBDT, and random forest (RandomForest).

Since scikit-learn (sklearn) bundles the common machine learning algorithms for classification, regression, clustering and so on, it is the library used here; the version is 0.17.1.


On feature selection:

I have been working with doc2vec and Word2vec recently, and the comparison in the previous post showed that sentence vectors obtained with Doc2vec work better than averaging Word2vec word vectors. So here each sentence is represented by a 100-dimensional doc2vec vector, and that vector is fed directly into the classification algorithms as the feature.
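The doc2vec vectors themselves come from the previous post; for reference, below is a minimal sketch of how such 100-dimensional vectors could be obtained with gensim. The hyperparameters and the tokenized_questions variable are illustrative assumptions, not the exact settings used here, and parameter names differ slightly across gensim versions (very old releases use size/iter instead of vector_size/epochs).

# Minimal gensim doc2vec sketch (illustrative only).
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tokenized_questions: assumed list of word lists, one entry per question
documents = [TaggedDocument(words, [i]) for i, words in enumerate(tokenized_questions)]
model = Doc2Vec(documents, vector_size=100, window=5, min_count=1, epochs=20)
question_vector = model.dv[0]   # 100-dim vector of question 0 (model.docvecs[0] in older gensim)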


On the dataset:

The dataset is the Question Pairs semantic-equivalence dataset released by Quora, the same dataset used in the previous post. It contains more than 400,000 labelled question pairs; if the two questions are semantically equivalent the label is 1, otherwise it is 0. Counting all distinct questions gives more than 530,000 questions. Each row of the TSV file contains the fields id, qid1, qid2, question1, question2 and is_duplicate.

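A quick way to double-check these counts is to load the TSV with pandas and count the distinct question ids (the file name below is an assumption; substitute your own local path):

import pandas as pd

data = pd.read_csv('quora_duplicate_questions.tsv', sep='\t', encoding='utf-8')
print(data.shape[0])                          # number of labelled question pairs (~400k)
print(len(set(data.qid1) | set(data.qid2)))   # number of distinct questions (~530k)
print(list(data.columns))                     # id, qid1, qid2, question1, question2, is_duplicate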

Once all the questions have been collected, a doc2vec vector is trained for each question and used as the feature input to the classification algorithms.

After randomly shuffling the corpus, 10,000 pairs are split off as the validation set and the rest are used for training.


The training code is given below.

The code for loading the data and obtaining the doc2vec sentence vectors is shared by all four models, so it is listed first:


# coding:utf-8
import numpy as np
import csv
import datetime
import os
import pandas as pd
from sklearn import metrics, feature_extraction
from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
# classifiers used by the main() variants below
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

cwd = os.getcwd()


def load_data(datapath):
    # read the tab-separated Quora question-pair file
    data_train = pd.read_csv(datapath, sep='\t', encoding='utf-8')
    print data_train.shape

    qid1 = []
    qid2 = []
    question1 = []
    question2 = []
    labels = []
    for idx in range(data_train.id.shape[0]):
        print idx
        q1 = data_train.qid1[idx]
        q2 = data_train.qid2[idx]

        qid1.append(q1)
        qid2.append(q2)
        question1.append(data_train.question1[idx])
        question2.append(data_train.question2[idx])
        labels.append(data_train.is_duplicate[idx])

    return qid1, qid2, question1, question2, labels


def load_doc2vec(word2vecpath):
    # one line per question: "<id>\t<space-separated 100-dim vector>"
    f = open(word2vecpath)
    embeddings_index = {}
    for line in f:
        values = line.split('\t')
        id = values[0]
        print id
        coefs = np.asarray(values[1].split(), dtype='float32')
        embeddings_index[int(id) + 1] = coefs
    f.close()
    print('Total %s word vectors.' % len(embeddings_index))

    return embeddings_index


def sentence_represention(qid, embeddings_index):
    # look up the 100-dimensional doc2vec vector for every question id
    vectors = np.zeros((len(qid), 100))
    for i in range(len(qid)):
        print i
        vectors[i] = embeddings_index.get(qid[i])

    return vectors

After replacing the dataset path and the doc2vec path in the main function with your own, the scripts below can be run directly.

1. Logistic Regression (LR):

def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = "D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector"
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    # concatenate the two 100-dim sentence vectors into a 200-dim feature vector
    vectors = np.hstack((vectors1, vectors2))
    labels = np.array(labels)
    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    # hold out the last 10,000 shuffled pairs as the validation set
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]

    lr = LogisticRegression()
    print '***********************training************************'
    lr.fit(train_vectors, train_labels)

    print '***********************predict*************************'
    prediction = lr.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print accuracy
    end = datetime.datetime.now()
    print end - start


if __name__ == '__main__':
    main()

2. SVM:

def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = "D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector"
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    vectors = np.hstack((vectors1, vectors2))
    labels = np.array(labels)
    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]

    svm = SVC()
    print '***********************training************************'
    svm.fit(train_vectors, train_labels)

    print '***********************predict*************************'
    prediction = svm.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print accuracy

    end = datetime.datetime.now()
    print end - start


if __name__ == '__main__':
    main()


3. GBDT:

def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = "D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector"
    qid1, qid2, question1, question2, labels = load_data(datapath)
    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    vectors = np.hstack((vectors1, vectors2))
    labels = np.array(labels)
    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]

    # default GBDT settings, written out explicitly
    gbdt = GradientBoostingClassifier(init=None, learning_rate=0.1, loss='deviance',
                                      max_depth=3, max_features=None, max_leaf_nodes=None,
                                      min_samples_leaf=1, min_samples_split=2,
                                      min_weight_fraction_leaf=0.0, n_estimators=100,
                                      random_state=None, subsample=1.0, verbose=0,
                                      warm_start=False)
    print '***********************training************************'
    gbdt.fit(train_vectors, train_labels)

    print '***********************predict*************************'
    prediction = gbdt.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    acc = gbdt.score(test_vectors, test_labels)  # same metric, computed directly by the model
    print accuracy
    print acc

    end = datetime.datetime.now()
    print end - start


if __name__ == '__main__':
    main()


4. Random Forest (RandomForest):

def main():
    start = datetime.datetime.now()
    datapath = 'D:/dataset/quora/quora_duplicate_questions_Chinese_seg.tsv'
    doc2vecpath = "D:/dataset/quora/vector2/quora_duplicate_question_doc2vec_100.vector"
    qid1, qid2, question1, question2, labels = load_data(datapath)

    embeddings_index = load_doc2vec(word2vecpath=doc2vecpath)
    vectors1 = sentence_represention(qid1, embeddings_index)
    vectors2 = sentence_represention(qid2, embeddings_index)
    vectors = np.hstack((vectors1, vectors2))
    labels = np.array(labels)
    VALIDATION_SPLIT = 10000
    indices = np.arange(vectors.shape[0])
    np.random.shuffle(indices)
    vectors = vectors[indices]
    labels = labels[indices]
    train_vectors = vectors[:-VALIDATION_SPLIT]
    train_labels = labels[:-VALIDATION_SPLIT]
    test_vectors = vectors[-VALIDATION_SPLIT:]
    test_labels = labels[-VALIDATION_SPLIT:]

    randomforest = RandomForestClassifier()
    print '***********************training************************'
    randomforest.fit(train_vectors, train_labels)

    print '***********************predict*************************'
    prediction = randomforest.predict(test_vectors)
    accuracy = metrics.accuracy_score(test_labels, prediction)
    print accuracy

    end = datetime.datetime.now()
    print end - start


if __name__ == '__main__':
    main()


The final results are as follows:

LR: 68.56%
SVM: 69.77%
GBDT: 71.4%
RandomForest: 78.36% (best of several runs)

In terms of accuracy, random forest performs best; in terms of running time, SVM takes the longest.
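Only accuracy is reported above; since accuracy alone can hide class-specific behaviour, it would also be worth reporting precision, recall and F1. A small sketch using the metrics module already imported in the scripts (prediction and test_labels are the variables from main()):

print(metrics.classification_report(test_labels, prediction))  # per-class precision/recall/F1
print(metrics.confusion_matrix(test_labels, prediction))        # duplicate vs. non-duplicate confusion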


Future work:

There is still plenty of room for improvement in both feature selection and classifier parameter tuning. I believe that mining more useful features and tuning the model parameters would yield better results.
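As a sketch of the kind of parameter tuning meant here, a grid search over the random forest could look as follows. The grid values are illustrative assumptions; note that in sklearn 0.17 GridSearchCV lives in sklearn.grid_search, while newer versions moved it to sklearn.model_selection.

from sklearn.ensemble import RandomForestClassifier
from sklearn.grid_search import GridSearchCV  # sklearn.model_selection in >= 0.18

# illustrative grid, not tuned values from this post
param_grid = {'n_estimators': [100, 300, 500], 'max_depth': [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(), param_grid, scoring='accuracy', cv=3)
search.fit(train_vectors, train_labels)
print(search.best_params_)
print(search.best_score_)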

The full code can be found on my GitHub.
