Recommender System Algorithms Summary (3): FM and DNN, DeepFM

Published: 2024/1/17

Source: https://blog.csdn.net/qq_23269761/article/details/81366939

0. A blog worth recommending

The past and present of FM:
https://tracholar.github.io/machine-learning/2017/03/10/factorization-machine.html#%E7%BB%BC%E8%BF%B0

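1. How FM relates to DNNs and embeddings

First, a quick review of FM. A second-order FM predicts

$$\hat{y}(x) = w_0 + \sum_{i=1}^{n} w_i x_i + \sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j$$

where each $v_i$ is a $k$-dimensional latent vector.

Once the FM model is fitted, every feature x_i has a corresponding latent vector v_i. So what exactly is this v_i?

Think of word2vec, proposed by Google. word2vec is one kind of word embedding method. Word embedding means: given a document, which is a sequence of words such as "A B A C B F G", we want a vector representation (usually low-dimensional) for every distinct word in it. For the sequence "A B A C B F G" we might end up with [0.1 0.6 -0.5] for A and [-0.2 0.9 0.7] for B.

So the conclusion is:
FM is a tool for feature crossing and dimensionality reduction. It takes the sparse features produced by one-hot encoding, combines them pairwise, and reduces the dimensionality at the same time. Down to how many dimensions? To k, the number of latent factors in FM.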
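The dimensionality-reduction view can be made concrete with a small sketch. It is only an illustration in plain Python, and the matrix `V`, its size (4 one-hot features, k = 2) and the helper names are assumptions made for the example. It shows that multiplying a one-hot vector by the latent matrix is the same as looking up one row, which is why the latent vectors behave like an embedding table.

```python
# Hypothetical latent matrix: 4 one-hot features, k = 2 latent factors.
V = [
    [0.1, 0.6],
    [-0.2, 0.9],
    [0.5, -0.3],
    [0.7, 0.2],
]


def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def project(x, latent):
    """Compute x^T * latent, mapping an n-dim sparse vector to k dims."""
    k = len(latent[0])
    return [sum(x[i] * latent[i][f] for i in range(len(latent))) for f in range(k)]


x = one_hot(2, len(V))   # sparse 4-dim input with feature 2 active
dense = project(x, V)    # dense 2-dim representation
lookup = V[2]            # the same result via a plain row lookup
```

Here `dense` and `lookup` hold the same two numbers, so the n-dimensional one-hot input has been reduced to k dimensions without ever needing the full matrix product.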

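2. FNN

FNN pre-trains an FM to obtain the embeddings, then trains a DNN on top of them.

Such a model captures high-order feature interactions, but the final sigmoid output ignores the low-order features themselves.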

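3. DeepFM

Given the above, many recent deep-learning CTR models consider both the wide and the deep side (that is, low-order and high-order features) to further improve generalization. DeepFM is one example.
Reference blog: https://blog.csdn.net/zynash2/article/details/79348540

The model splits into two parts, FM and DNN. The flow is roughly as follows. Borrowing the idea of FNN, FM is used for the embedding, and the wide and deep parts share the embedded result. The DNN input is exactly the same as in FNN (there is no pre-training here; the embedding layer is simply treated as one NN layer). Combined in a certain way, the wide part reproduces FM exactly (the paper gives no detailed derivation of why; this post gives one later). Finally, the DNN and FM outputs are combined and passed through the activation to produce the output.

The part that deserves the most attention is the FM component: how exactly is the network built to compute the second-order features?
**Key point:** for the DNN, the embedding layer extracts features; for FM, the same embeddings are the latent vectors behind its second-order term. FM and the DNN simply share the embedding layer.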

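4. DeepFM code walkthrough

Code:
https://github.com/ChenglongChen/tensorflow-DeepFM
Data:
https://www.kaggle.com/c/porto-seguro-safe-driver-prediction

4.0 Project layout

data: training and test data
output/fig: output results and training curves
config: parameters for data loading and feature engineering
DataReader: feature engineering; produces the feature set actually used for training
main: main entry point
metrics: defines the Gini metric used for evaluation
DeepFM: model definition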

4.1 Overall flow

An EDA write-up for this dataset that gives a good overview of the data:
https://blog.csdn.net/qq_37195507/article/details/78553581

1. _load_data()

    # Imports used by this excerpt (config is the project's config module)
    import numpy as np
    import pandas as pd
    import config

    def _load_data():
        dfTrain = pd.read_csv(config.TRAIN_FILE)
        dfTest = pd.read_csv(config.TEST_FILE)

        def preprocess(df):
            cols = [c for c in df.columns if c not in ["id", "target"]]
            # number of missing values (encoded as -1) in each row
            df["missing_feat"] = np.sum((df[cols] == -1).values, axis=1)
            # product of two existing features
            df["ps_car_13_x_ps_reg_03"] = df["ps_car_13"] * df["ps_reg_03"]
            return df

        dfTrain = preprocess(dfTrain)
        dfTest = preprocess(dfTest)

        # drop id, target and the ignored columns from the feature list
        cols = [c for c in dfTrain.columns if c not in ["id", "target"]]
        cols = [c for c in cols if c not in config.IGNORE_COLS]

        X_train = dfTrain[cols].values
        y_train = dfTrain["target"].values
        X_test = dfTest[cols].values
        ids_test = dfTest["id"].values
        cat_features_indices = [i for i, c in enumerate(cols) if c in config.CATEGORICAL_COLS]

        return dfTrain, dfTest, X_train, y_train, X_test, ids_test, cat_features_indices

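This first reads the raw data files TRAIN_FILE and TEST_FILE.
preprocess(df) adds two features: missing_feat (the number of missing features in the row) and ps_car_13_x_ps_reg_03 (the product of two features).
Returns:
dfTrain, dfTest: DataFrames that still contain all features
X_train, X_test: ndarrays with IGNORE_COLS removed (X_test is never used afterwards)
y_train: the labels
ids_test: the test-set ids, as an ndarray
cat_features_indices: the column indices of the categorical features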
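To make the two added features concrete, here is a minimal plain-Python stand-in for what preprocess computes on each row. The toy rows and the helper name `preprocess_row` are assumptions for illustration; the original code works on pandas DataFrames.

```python
# Hypothetical toy rows; -1 marks a missing value, as in preprocess().
rows = [
    {"id": 1, "target": 0, "ps_car_13": 0.8, "ps_reg_03": 0.5, "ps_ind_01": -1},
    {"id": 2, "target": 1, "ps_car_13": 1.2, "ps_reg_03": -1, "ps_ind_01": -1},
]


def preprocess_row(row):
    """Return a copy of `row` with the two engineered features added."""
    feature_cols = [c for c in row if c not in ("id", "target")]
    out = dict(row)
    # count how many feature columns hold the missing marker -1
    out["missing_feat"] = sum(1 for c in feature_cols if row[c] == -1)
    # product of the two raw columns
    out["ps_car_13_x_ps_reg_03"] = row["ps_car_13"] * row["ps_reg_03"]
    return out


processed = [preprocess_row(r) for r in rows]
```

The first row has one missing value and the second has two. Note that the product is computed from the raw values, so a missing ps_reg_03 (stored as -1) flows straight into the product feature.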

Next, X_train and y_train are used to build a stratified K-fold cross-validation split, and the DeepFM parameters are set.
2. _run_base_model_dfm

    # Imports used by this excerpt (project modules from the repository)
    import numpy as np
    import config
    from metrics import gini_norm
    from DataReader import FeatureDictionary, DataParser
    from DeepFM import DeepFM

    def _run_base_model_dfm(dfTrain, dfTest, folds, dfm_params):
        fd = FeatureDictionary(dfTrain=dfTrain, dfTest=dfTest,
                               numeric_cols=config.NUMERIC_COLS,
                               ignore_cols=config.IGNORE_COLS)
        data_parser = DataParser(feat_dict=fd)
        Xi_train, Xv_train, y_train = data_parser.parse(df=dfTrain, has_label=True)
        Xi_test, Xv_test, ids_test = data_parser.parse(df=dfTest)

        dfm_params["feature_size"] = fd.feat_dim
        dfm_params["field_size"] = len(Xi_train[0])

        # out-of-fold predictions on train, averaged predictions on test
        y_train_meta = np.zeros((dfTrain.shape[0], 1), dtype=float)
        y_test_meta = np.zeros((dfTest.shape[0], 1), dtype=float)
        _get = lambda x, l: [x[i] for i in l]
        gini_results_cv = np.zeros(len(folds), dtype=float)
        gini_results_epoch_train = np.zeros((len(folds), dfm_params["epoch"]), dtype=float)
        gini_results_epoch_valid = np.zeros((len(folds), dfm_params["epoch"]), dtype=float)
        for i, (train_idx, valid_idx) in enumerate(folds):
            Xi_train_, Xv_train_, y_train_ = _get(Xi_train, train_idx), _get(Xv_train, train_idx), _get(y_train, train_idx)
            Xi_valid_, Xv_valid_, y_valid_ = _get(Xi_train, valid_idx), _get(Xv_train, valid_idx), _get(y_train, valid_idx)

            dfm = DeepFM(**dfm_params)
            dfm.fit(Xi_train_, Xv_train_, y_train_, Xi_valid_, Xv_valid_, y_valid_)

            y_train_meta[valid_idx, 0] = dfm.predict(Xi_valid_, Xv_valid_)
            y_test_meta[:, 0] += dfm.predict(Xi_test, Xv_test)

            gini_results_cv[i] = gini_norm(y_valid_, y_train_meta[valid_idx])
            gini_results_epoch_train[i] = dfm.train_result
            gini_results_epoch_valid[i] = dfm.valid_result

        # average the test predictions over the folds
        y_test_meta /= float(len(folds))

        # save result
        if dfm_params["use_fm"] and dfm_params["use_deep"]:
            clf_str = "DeepFM"
        elif dfm_params["use_fm"]:
            clf_str = "FM"
        elif dfm_params["use_deep"]:
            clf_str = "DNN"
        print("%s: %.5f (%.5f)" % (clf_str, gini_results_cv.mean(), gini_results_cv.std()))
        filename = "%s_Mean%.5f_Std%.5f.csv" % (clf_str, gini_results_cv.mean(), gini_results_cv.std())
        _make_submission(ids_test, y_test_meta, filename)

        _plot_fig(gini_results_epoch_train, gini_results_epoch_valid, clf_str)

        return y_train_meta, y_test_meta

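FeatureDictionary in DataReader

After the data passes through FeatureDictionary, the object has a self.feat_dict attribute that looks like the following. A numeric column maps to a single index, while a categorical column maps to a dict from each category value to its own index:

    {'missing_feat': 0, 'ps_ind_18_bin': {0: 254, 1: 255}, 'ps_reg_01': 256, 'ps_reg_02': 257, 'ps_reg_03': 258}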
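A dictionary of this shape can be built with a single pass over the columns. The sketch below is a plain-Python approximation, not the repository's code: `build_feat_dict`, the toy `columns` and the resulting index values are assumptions for illustration, whereas the real class works on the concatenated train and test DataFrames.

```python
def build_feat_dict(columns, numeric_cols, ignore_cols=()):
    """Assign feature indices.

    columns: dict mapping column name -> list of values.
    Numeric columns get one index; categorical columns get one
    index per distinct value.
    """
    feat_dict = {}
    next_index = 0
    for name, values in columns.items():
        if name in ignore_cols:
            continue
        if name in numeric_cols:
            feat_dict[name] = next_index
            next_index += 1
        else:
            uniques = list(dict.fromkeys(values))  # distinct values, order kept
            feat_dict[name] = {v: next_index + i for i, v in enumerate(uniques)}
            next_index += len(uniques)
    return feat_dict, next_index  # next_index is the total feature dimension


# Hypothetical toy columns
columns = {
    "missing_feat": [0, 2, 1],
    "ps_ind_18_bin": [0, 1, 0],
    "ps_reg_01": [0.7, 0.8, 0.0],
}
feat_dict, feat_dim = build_feat_dict(
    columns, numeric_cols={"missing_feat", "ps_reg_01"}
)
```

The total count returned alongside the dictionary plays the role of fd.feat_dim, which the training code passes to the model as feature_size.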

DataParser in DataReader

    import pandas as pd

    class DataParser(object):
        def __init__(self, feat_dict):
            self.feat_dict = feat_dict  # this feat_dict is a FeatureDictionary instance

        def parse(self, infile=None, df=None, has_label=False):
            assert not ((infile is None) and (df is None)), "infile or df at least one is set"
            assert not ((infile is not None) and (df is not None)), "only one can be set"
            if infile is None:
                dfi = df.copy()
            else:
                dfi = pd.read_csv(infile)
            if has_label:
                y = dfi["target"].values.tolist()
                dfi.drop(["id", "target"], axis=1, inplace=True)
            else:
                ids = dfi["id"].values.tolist()
                dfi.drop(["id"], axis=1, inplace=True)
            # dfi for feature index
            # dfv for feature value which can be either binary (1/0) or float (e.g., 10.24)
            dfv = dfi.copy()
            for col in dfi.columns:
                if col in self.feat_dict.ignore_cols:
                    dfi.drop(col, axis=1, inplace=True)
                    dfv.drop(col, axis=1, inplace=True)
                    continue
                if col in self.feat_dict.numeric_cols:
                    # numeric column: one shared index, value kept as-is in dfv
                    dfi[col] = self.feat_dict.feat_dict[col]
                else:
                    # categorical column: index depends on the category, value is 1
                    dfi[col] = dfi[col].map(self.feat_dict.feat_dict[col])
                    dfv[col] = 1.
            # dfi.to_csv('dfi.csv')
            # dfv.to_csv('dfv.csv')

            # list of list of feature indices of each sample in the dataset
            Xi = dfi.values.tolist()
            # list of list of feature values of each sample in the dataset
            Xv = dfv.values.tolist()
            if has_label:
                return Xi, Xv, y
            else:
                return Xi, Xv, ids

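Here Xi and Xv are both 2-D lists, with one row per sample and one column per field. Writing dfi and dfv out to CSV files shows what they look like. The layout seems odd at first, but the model needs it later.
dfi: each value is a feature index, that is, the value stored in the feat_dict attribute described above.

dfv: a numeric variable keeps its original value; a categorical variable gets the value 1.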
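The index/value encoding is easier to see on a single row. The following is a minimal sketch under stated assumptions: `parse_row` is a hypothetical helper, and `feat_dict` is a shortened copy of the example dictionary shown earlier. The real DataParser does the same thing column by column on a DataFrame.

```python
# Shortened version of the example feat_dict from the text
feat_dict = {
    "missing_feat": 0,
    "ps_ind_18_bin": {0: 254, 1: 255},
    "ps_reg_01": 256,
}
numeric_cols = {"missing_feat", "ps_reg_01"}


def parse_row(row, feat_dict, numeric_cols):
    """Encode one sample as (feature indices, feature values)."""
    xi, xv = [], []
    for col, value in row.items():
        if col in numeric_cols:
            xi.append(feat_dict[col])         # one fixed index per numeric column
            xv.append(float(value))           # keep the raw value
        else:
            xi.append(feat_dict[col][value])  # index selected by the category
            xv.append(1.0)                    # a categorical feature is "on"
    return xi, xv


row = {"missing_feat": 2, "ps_ind_18_bin": 1, "ps_reg_01": 0.7}
xi, xv = parse_row(row, feat_dict, numeric_cols)
# xi == [0, 255, 256]; xv == [2.0, 1.0, 0.7]
```

Each sample therefore needs only one index and one value per field, regardless of how many one-hot columns the categorical features would expand to.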

4.2 Model architecture

    import tensorflow as tf  # TensorFlow 1.x API

    def _init_graph(self):
        self.graph = tf.Graph()
        with self.graph.as_default():

            tf.set_random_seed(self.random_seed)

            self.feat_index = tf.placeholder(tf.int32, shape=[None, None],
                                             name="feat_index")  # None * F
            self.feat_value = tf.placeholder(tf.float32, shape=[None, None],
                                             name="feat_value")  # None * F
            self.label = tf.placeholder(tf.float32, shape=[None, 1], name="label")  # None * 1
            self.dropout_keep_fm = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_fm")
            self.dropout_keep_deep = tf.placeholder(tf.float32, shape=[None], name="dropout_keep_deep")
            self.train_phase = tf.placeholder(tf.bool, name="train_phase")

            self.weights = self._initialize_weights()

            # model
            self.embeddings = tf.nn.embedding_lookup(self.weights["feature_embeddings"],
                                                     self.feat_index)  # None * F * K

            # self.weights["feature_embeddings"]: shape=[259, 8], i.e. n latent vectors of size k
            # self.embeddings: shape=[?, 39, 8] (F * K); one latent vector is looked up per field.
            # This is not FFM: the per-field lookup only picks the non-zero entries, which saves computation.
            feat_value = tf.reshape(self.feat_value, shape=[-1, self.field_size, 1])
            # feat_value: shape=[?, 39, 1], the 39 feature values of each sample
            self.embeddings = tf.multiply(self.embeddings, feat_value)  # the size-1 last dimension is broadcast over K
            # self.embeddings: shape=[?, 39, 8]
            # After this multiply each row is v_i * x_i, which makes
            # <v_i, v_j> * x_i * x_j = <v_i * x_i, v_j * x_j> easy to compute. The FM second-order
            # term below is simplified to (sum_square part - square_sum part), so this form is convenient.

            # ---------- first order term ----------
            self.y_first_order = tf.nn.embedding_lookup(self.weights["feature_bias"], self.feat_index)  # None * F * 1
            self.y_first_order = tf.reduce_sum(tf.multiply(self.y_first_order, feat_value), 2)  # None * F
            self.y_first_order = tf.nn.dropout(self.y_first_order, self.dropout_keep_fm[0])  # None * F

            # ---------- second order term ---------------
            # sum_square part
            self.summed_features_emb = tf.reduce_sum(self.embeddings, 1)  # None * K
            self.summed_features_emb_square = tf.square(self.summed_features_emb)  # None * K

            # square_sum part
            self.squared_features_emb = tf.square(self.embeddings)
            self.squared_sum_features_emb = tf.reduce_sum(self.squared_features_emb, 1)  # None * K

            # second order
            self.y_second_order = 0.5 * tf.subtract(self.summed_features_emb_square, self.squared_sum_features_emb)  # None * K
            self.y_second_order = tf.nn.dropout(self.y_second_order, self.dropout_keep_fm[1])  # None * K

            # ---------- Deep component ----------
            self.y_deep = tf.reshape(self.embeddings, shape=[-1, self.field_size * self.embedding_size])  # None * (F*K)
            self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[0])
            for i in range(0, len(self.deep_layers)):
                self.y_deep = tf.add(tf.matmul(self.y_deep, self.weights["layer_%d" % i]), self.weights["bias_%d" % i])  # None * layer[i] * 1
                if self.batch_norm:
                    self.y_deep = self.batch_norm_layer(self.y_deep, train_phase=self.train_phase, scope_bn="bn_%d" % i)  # None * layer[i] * 1
                self.y_deep = self.deep_layers_activation(self.y_deep)
                self.y_deep = tf.nn.dropout(self.y_deep, self.dropout_keep_deep[1 + i])  # dropout at each Deep layer

            # ---------- DeepFM ----------
            if self.use_fm and self.use_deep:
                concat_input = tf.concat([self.y_first_order, self.y_second_order, self.y_deep], axis=1)
            elif self.use_fm:
                concat_input = tf.concat([self.y_first_order, self.y_second_order], axis=1)
            elif self.use_deep:
                concat_input = self.y_deep
            self.out = tf.add(tf.matmul(concat_input, self.weights["concat_projection"]), self.weights["concat_bias"])

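At first it is not obvious why this code makes FM look so complicated. There is a good reason: it avoids materializing the huge matrix that one-hot encoding would produce.
In effect, the embedding layer is a latent-vector matrix of shape [feature_size * k] that the Deep and FM parts share.

So the heart of this implementation is the embedding layer, which is fed through two small matrices Xi and Xv of shape [n * field]. Note that field here is not the F of FFM; it is the number of features before one-hot encoding.

From the inner-product identity we obtain

$$\sum_{i=1}^{n} \sum_{j=i+1}^{n} \langle v_i, v_j \rangle x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \Big( \sum_{i=1}^{n} v_{i,f} x_i \Big)^2 - \sum_{i=1}^{n} v_{i,f}^2 x_i^2 \right]$$

which is the sum_square part minus the square_sum part in the code above, computed on the rows v_i * x_i of the embedding.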
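The identity behind the sum_square and square_sum parts can be checked numerically. The sketch below uses plain Python instead of TensorFlow, and the latent matrix `V` and the sample `xi`, `xv` are made-up toy values. It computes the second-order term both ways: once with the per-dimension trick used in the model code and once from the pairwise definition.

```python
from itertools import combinations

# Hypothetical toy values: feature_size = 5, k = 2, one sample with 3 fields.
V = [[0.1, 0.6], [-0.2, 0.9], [0.5, -0.3], [0.7, 0.2], [0.4, 0.4]]
xi = [0, 3, 4]          # feature indices, one per field
xv = [2.0, 1.0, 0.5]    # feature values, one per field

k = len(V[0])
# embedding lookup followed by the multiply: e_i = v_i * x_i
emb = [[V[i][f] * x for f in range(k)] for i, x in zip(xi, xv)]

# sum_square part and square_sum part, one entry per latent dimension
sum_square = [sum(e[f] for e in emb) ** 2 for f in range(k)]
square_sum = [sum(e[f] ** 2 for e in emb) for f in range(k)]
second_order = [0.5 * (a - b) for a, b in zip(sum_square, square_sum)]

# direct definition: sum over all pairs i < j of <e_i, e_j>
pairwise = sum(
    sum(a * b for a, b in zip(e1, e2)) for e1, e2 in combinations(emb, 2)
)

# difference between the two computations (floating-point noise only)
gap = abs(sum(second_order) - pairwise)
```

The trick needs one pass over the fields per latent dimension, while the pairwise definition grows quadratically with the number of fields. In the model code the K-dimensional `second_order` vector is kept as is and passed on to the final projection rather than summed.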
