
NLP: Keyword Search

Published: 2025/3/21 27 豆豆


Keyword Search

"""功能：實現(xiàn)關(guān)鍵詞搜索可以嘗試修改/調(diào)試/升級的部分是：文本預(yù)處理步驟: 你可以使用很多不同的方法來使得文本數(shù)據(jù)變得更加清潔自制的特征: 相處更多的特征值表達(dá)方法（關(guān)鍵詞全段重合數(shù)量，重合比率，等等）更好的回歸模型: 根據(jù)之前的課講的Ensemble方法，把分類器提升到極致版本1.0日期：10.10.2019 """import numpy as np import pandas as pd from sklearn.ensemble import RandomForestRegressor, BaggingRegressor from nltk.stem.snowball import SnowballStemmer from sklearn.model_selection import cross_val_score import matplotlib.pyplot as pltdf_train = pd.read_csv('C:/Users/Administrator/Desktop/七月在線課程下載/word2vec/input/train.csv',encoding="ISO-8859-1") df_test = pd.read_csv('C:/Users/Administrator/Desktop/七月在線課程下載/word2vec/input/test.csv',encoding="ISO-8859-1") df_desc = pd.read_csv('C:/Users/Administrator/Desktop/七月在線課程下載/word2vec/input/product_descriptions.csv') # 描述信息# 上下連接訓(xùn)練集和測試集 df_all = pd.concat((df_train, df_test), axis=0, ignore_index=True) # ignore_index=True忽略原索引，生成新索引# 根據(jù)product_uid左右合并產(chǎn)品介紹 df_all = pd.merge(df_all, df_desc, how='left', on='product_uid')#--------------------------------- 文本預(yù)處理 ----------------------------------- # 可以選用各種你覺得靠譜的預(yù)處理方式：去掉停止詞，糾正拼寫，去掉數(shù)字，去掉各種emoji，等等） # 這里只用SnowballStemmer,詞干抽取 stemmer = SnowballStemmer('english')def str_stemmer(s):""":param s: 字符格式的字符:return: 詞干抽取后的字符"""return " ".join([stemmer.stem(word) for word in s.lower().split()])# 將三組object進行詞干抽取 df_all['search_term'] = df_all['search_term'].map(lambda x: str_stemmer(x)) df_all['product_title'] = df_all['product_title'].map(lambda x: str_stemmer(x)) df_all['product_description'] = df_all['product_description'].map(lambda x: str_stemmer(x))#----------------------------------- 自制文本特征 ------------------------------------- # 關(guān)鍵詞的長度 df_all['len_of_query'] = df_all['search_term'].map(lambda x:len(x.split())).astype(np.int64)# 計算str1有多少個重合在str2里面 def str_common_word(str1, str2):print(str1.head())return sum(int(str2.find(word) >= 0) for word in str1.split())# 標(biāo)題中有多少關(guān)鍵詞重合 df_all['commons_in_title'] = df_all.apply(lambda 
x:str_common_word(x['search_term'], x['product_title']),axis=1) # 描述中有多少關(guān)鍵詞重合 df_all['commons_in_desc'] = df_all.apply(lambda x:str_common_word(x['search_term'], x['product_description']),axis=1) #--------------------------------------------- # 搞完之后，我們把不能被『機器學(xué)習(xí)模型』處理的column給drop掉 df_all = df_all.drop(['search_term', 'product_title', 'product_description'], axis=1)# 重塑訓(xùn)練/測試集 # 搞完一圈預(yù)處理之后，我們讓數(shù)據(jù)重回原本的樣貌 # 分開訓(xùn)練和測試集 df_train = df_all.loc[df_train.index] df_test = df_all.loc[df_train.shape[0]:]# 記錄下測試集的id # 留著上傳的時候能對的上號 test_ids = df_test['id']# 分離出y_train y_train = df_train['relevance'].values # .values 意思是轉(zhuǎn)化numpyX_train = df_train.drop(['id', 'relevance'], axis=1).values X_test = df_test.drop(['id', 'relevance'], axis=1).values#-------------------------- 建立模型 -------------------------------------- params = [2, 6, 7, 9] # 每棵決策樹的最大深度，可以給出更多的參數(shù)值進行篩選 test_scores = [] for param in params:# n_estimators用于指定隨機森林所包含的決策樹個數(shù)# max_depth每棵決策樹的最大深度classfier = RandomForestRegressor(n_estimators=30, max_depth=param)# cv：把數(shù)據(jù)分成5份，5折交叉驗證test_score = np.sqrt(-cross_val_score(classfier, X_train, y_train, cv=5, scoring='neg_mean_squared_error'))test_scores.append(np.mean(test_score))plt.plot(params, test_scores) plt.title("Param vs CV Error") # 大概6~7的時候達(dá)到了最優(yōu)解# 用上面最優(yōu)解深度6這個樹深參數(shù)，跑測試集 rf = RandomForestRegressor(n_estimators=30, max_depth=6) rf.fit(X_train, y_train) y_pred = rf.predict(X_test)# 保存測試集預(yù)測結(jié)果 pd.DataFrame({"id": test_ids, "relevance": y_pred}).to_csv('submission.csv',index=False)
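The script's docstring suggests trying more custom features, such as whole-query matches and an overlap ratio. The following is a minimal sketch of what those could look like, not part of the original script. The function names `common_word_count`, `overlap_ratio`, and `full_query_match` are hypothetical, and the sketch assumes whole-token matching on whitespace-split text, which differs from the substring matching (`str2.find(word)`) used in `str_common_word`.

```python
def common_word_count(query: str, text: str) -> int:
    """Count query tokens that appear as whole tokens in text.

    Assumption: tokens are whitespace-separated. Unlike str.find, a query
    token such as "in" does not match inside a longer token such as "inch".
    """
    text_tokens = set(text.split())
    return sum(1 for word in query.split() if word in text_tokens)


def overlap_ratio(query: str, text: str) -> float:
    """Fraction of query tokens found in text; 0.0 for an empty query."""
    n_words = len(query.split())
    if n_words == 0:
        return 0.0
    return common_word_count(query, text) / n_words


def full_query_match(query: str, text: str) -> int:
    """Return 1 if the whole query occurs in text as a contiguous token run."""
    query_tokens = query.split()
    text_tokens = text.split()
    n = len(query_tokens)
    if n == 0:
        return 0
    return int(any(text_tokens[i:i + n] == query_tokens
                   for i in range(len(text_tokens) - n + 1)))


if __name__ == "__main__":
    # Hypothetical stemmed query and title, for illustration only.
    query = "angl bracket"
    title = "simpson strong-ti 12-gaug angl"
    print(common_word_count(query, title))
    print(overlap_ratio(query, title))
    print(full_query_match(query, title))
```

In the script, these could be applied row by row with `df_all.apply(..., axis=1)` in the same way as `str_common_word`, provided this happens before the text columns are dropped. Whether they lower the cross-validation error would need to be checked by rerunning the depth search.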
