
A Detailed Explanation of the Embedding Layer

Published: 2025/4/5

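In this article we look at the Keras embedding layer. To do so, I created a sample corpus of just three documents, which is enough to explain how the layer works.

Embeddings are useful in a variety of machine learning applications. Before we start, let's look at a few of them: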

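The first application that caught my attention is collaborative-filtering (CF) recommender systems, where we create user embeddings and movie embeddings by factorizing the utility matrix that holds the user-item ratings.
For a complete tutorial on a CF-based recommender system that uses embeddings in Keras, you can follow this article of mine.
The second use is in natural language processing and related applications, where we have to create word embeddings for all the words present in the documents of a corpus. This is the sense in which I use the term in this kernel.
So whenever we want to map high-dimensional data into a low-dimensional vector space, we can use the Embedding layer in Keras.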
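To make the idea of "high-dimensional to low-dimensional" concrete, here is a minimal pure-Python sketch (not Keras code) of what an embedding lookup amounts to: picking one row of a weight table, which gives the same result as multiplying a one-hot vector by that table. The helper names, the random initialisation range, and the index 31 are illustrative assumptions.

```python
import random


def make_embedding_matrix(vocab_size, dim, seed=0):
    """Return a vocab_size x dim table of small random weights (illustrative init)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
            for _ in range(vocab_size)]


def lookup(matrix, index):
    """Embedding lookup: select the row that belongs to an integer index."""
    return matrix[index]


def one_hot_times_matrix(matrix, index):
    """The same vector computed the slow way: one-hot vector times the table."""
    vocab_size, dim = len(matrix), len(matrix[0])
    one_hot_vec = [1.0 if i == index else 0.0 for i in range(vocab_size)]
    return [sum(one_hot_vec[i] * matrix[i][j] for i in range(vocab_size))
            for j in range(dim)]


table = make_embedding_matrix(vocab_size=50, dim=8)
dense = lookup(table, 31)                # 8 numbers instead of a 50-long one-hot vector
slow = one_hot_times_matrix(table, 31)
print(len(dense))
```

In a trained model the rows of the table are learned weights rather than fixed random numbers; the lookup itself stays this simple.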

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Sat Oct 10 16:33:58 2020

@author: ledi
"""
import warnings
warnings.filterwarnings('always')
warnings.filterwarnings('ignore')

# data visualisation and manipulation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import style
import seaborn as sns

# configure
# sets matplotlib to inline and displays graphs below the corresponding cell.
# %matplotlib inline
style.use('fivethirtyeight')
sns.set(style='whitegrid', color_codes=True)

# nltk (the 'stopwords' and 'punkt' data must be downloaded once via nltk.download)
import nltk

# stop-words
from nltk.corpus import stopwords
stop_words = set(nltk.corpus.stopwords.words('english'))

# tokenizing
from nltk import word_tokenize, sent_tokenize

# keras (keras.preprocessing.text is part of Keras 2.x; it is no longer available in Keras 3)
import keras
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding, Input
from keras.models import Model

# these can be thought of as three documents
sample_text_1 = "bitty bought a bit of butter"
sample_text_2 = "but the bit of butter was a bit bitter"
sample_text_3 = "so she bought some better butter to make the bitter butter better"

corp = [sample_text_1, sample_text_2, sample_text_3]
no_docs = len(corp)

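From here on, every unique word is represented by an integer. For this we use the one_hot function from Keras. Despite its name, one_hot returns a list of integer indices, obtained by hashing each word into the range [1, vocab_size - 1]. vocab_size is therefore set comfortably larger than the number of distinct words, which makes it unlikely that two words share an index, although uniqueness is not strictly guaranteed.

One important thing to note is that the integer encoding of a word stays the same across documents. For example, "butter" is represented by 31 in every document. The exact integer may differ from run to run, because the default hash function is not stable across Python sessions.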
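The behaviour described above can be sketched without Keras. The following is an illustrative stand-in for one_hot, not the library's own code: it hashes each word with MD5 (chosen here as an assumption so that the result is reproducible) and maps it into the range [1, n - 1]. The function name hash_encode is hypothetical.

```python
import hashlib


def hash_encode(text, n):
    """Map each whitespace-separated word to an integer in [1, n - 1] by hashing."""
    codes = []
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        codes.append(int(digest, 16) % (n - 1) + 1)  # 0 is never produced
    return codes


corpus = [
    "bitty bought a bit of butter",
    "but the bit of butter was a bit bitter",
    "so she bought some better butter to make the bitter butter better",
]
encoded = [hash_encode(doc, 50) for doc in corpus]
for i, codes in enumerate(encoded, start=1):
    print("document", i, "->", codes)
```

Because the index depends only on the word itself, "butter" gets the same integer wherever it occurs. Two different words can still land on the same integer, which is why the vocabulary size is chosen generously.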

Setting the vocabulary size and finding the maximum document length

vocab_size = 50
encod_corp = []
for i, doc in enumerate(corp):
    encoded = one_hot(doc, vocab_size)
    encod_corp.append(encoded)
    print("The encoding for document", i + 1, " is : ", encoded)

# length of the longest document; needed whenever we create embeddings for the words
maxlen = -1
for doc in corp:
    tokens = nltk.word_tokenize(doc)
    if maxlen < len(tokens):
        maxlen = len(tokens)
print("The maximum number of words in any document is : ", maxlen)

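The Keras embedding layer requires all documents to have the same length, so we now pad the shorter documents with zeros. The 'input_length' of the embedding layer then equals the length (number of words) of the longest document.

To pad the shorter documents I use the pad_sequences function from the Keras library.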

# to create embeddings, all documents need the same length, so pad the shorter ones with zeros
pad_corp = pad_sequences(encod_corp, maxlen=maxlen, padding='post', value=0.0)
print("No of padded documents: ", len(pad_corp))
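What padding with padding='post' does can be shown in a few lines of plain Python. This is a simplified sketch, not the Keras function: pad_post is a hypothetical helper, the integer sequences are made up, and truncation of over-long sequences is left out because the target length is taken from the longest sequence.

```python
def pad_post(sequences, value=0):
    """Append `value` to every sequence until all match the longest one."""
    maxlen = max(len(seq) for seq in sequences)
    return [list(seq) + [value] * (maxlen - len(seq)) for seq in sequences]


# made-up integer encodings of three documents of different lengths
encoded = [[7, 12, 3], [5, 9], [4, 4, 8, 1, 2]]
padded = pad_post(encoded)
for row in padded:
    print(row)
```

Using 0 as the padding value is safe as long as no real word is encoded as 0.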

All documents now have the same length (after padding), so we can create and use the embeddings. I will embed the words into 8-dimensional vectors.

# specifying the input shape
# input=Input(shape=(no_docs,maxlen),dtype='float64')

# Parameters of the Embedding layer:
# 'input_dim'    = size of the vocabulary we choose; it must be larger than the largest integer index.
# 'output_dim'   = number of dimensions of the embedding; every word is represented by a vector of this size.
# 'input_length' = length of the longest document; in our case it is stored in the maxlen variable.

# Shape of the input: each document has 12 elements (words), which is the value of maxlen.
word_input = Input(shape=(maxlen,), dtype='float64')

# creating the embedding
word_embedding = Embedding(input_dim=vocab_size, output_dim=8, input_length=maxlen)(word_input)
word_vec = Flatten()(word_embedding)  # flattened version (not used by the model below)
embed_model = Model([word_input], word_embedding)  # combining all into a Keras model
embed_model.summary()

# compiling the model; parameters can be tuned as always
embed_model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),
                    loss='binary_crossentropy', metrics=['acc'])

print(type(word_embedding))
print(word_embedding)

embeddings = embed_model.predict(pad_corp)  # finally getting the embeddings

# The result has shape (3, 12, 8):
#   3  -> number of documents
#   12 -> each document consists of 12 words, the maximum length over all documents
#   8  -> each word is 8-dimensional
print("Shape of embeddings : ", embeddings.shape)
print(embeddings)

embeddings = embeddings.reshape(-1, maxlen, 8)
print("Shape of embeddings : ", embeddings.shape)
print(embeddings)

This makes it easier to see that we have 3 documents, each with 12 (the maximum length) words, and each word mapped to an 8-dimensional vector.
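The (3, 12, 8) shape can be reproduced with a plain-Python lookup, which may help in seeing where each dimension comes from. This is a sketch under assumptions: the weight table is random rather than taken from a Keras model, and the integer indices in padded_docs are invented for illustration (only the lengths and the repeated index 31 mirror the example above).

```python
import random

VOCAB_SIZE, EMBED_DIM, MAXLEN = 50, 8, 12

# random stand-in for the embedding layer's weight table (50 rows, 8 columns)
rng = random.Random(42)
weights = [[rng.uniform(-0.05, 0.05) for _ in range(EMBED_DIM)]
           for _ in range(VOCAB_SIZE)]

# invented padded documents: 3 documents, 12 integer indices each, 0 = padding
padded_docs = [
    [11, 23, 5, 17, 9, 31, 0, 0, 0, 0, 0, 0],
    [14, 2, 17, 9, 31, 40, 5, 17, 28, 0, 0, 0],
    [6, 19, 23, 33, 45, 31, 8, 37, 2, 28, 31, 45],
]

# replace every index by its row in the table
embeddings = [[weights[idx] for idx in doc] for doc in padded_docs]

shape = (len(embeddings), len(embeddings[0]), len(embeddings[0][0]))
n_params = VOCAB_SIZE * EMBED_DIM  # number of weights in the table
print(shape, n_params)
```

The same index always yields the same vector, so index 31 and the padding index 0 each map to one fixed row regardless of the document they appear in.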

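How to handle a real piece of text
Just as above, we can now work with any other document. We can split the document into sentences with sent_tokenize.

Each sentence is a list of words, which we integer-encode with the one_hot function as we did above.

Each sentence will now have a different number of words, so we need to pad the sequences to the length of the longest sentence.

At this point we are ready to feed the input to the Keras embedding layer, as shown above.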

'input_dim' = the size of the vocabulary we choose

'output_dim' = the number of dimensions we want the embedding to have

'input_length' = the length of the longest document
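The three preparation steps for real text (split into sentences, integer-encode, pad) can be strung together as follows. This is a library-free sketch under assumptions: split_sentences is a naive regex stand-in for sent_tokenize, encode uses an MD5-based hash in place of one_hot, and pad_post mimics padding='post'; all three names are hypothetical.

```python
import hashlib
import re


def split_sentences(text):
    """Naive stand-in for a sentence tokenizer: split after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def encode(sentence, n):
    """Hash every word of a sentence into an integer in [1, n - 1]."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
            for w in words]


def pad_post(sequences, value=0):
    """Pad every sequence at the end up to the length of the longest one."""
    maxlen = max(len(seq) for seq in sequences)
    return [seq + [value] * (maxlen - len(seq)) for seq in sequences]


text = ("Bitty bought a bit of butter. But the bit of butter was a bit bitter. "
        "So she bought some better butter to make the bitter butter better.")

sentences = split_sentences(text)
encoded = [encode(s, 50) for s in sentences]
padded = pad_post(encoded)

input_length = len(padded[0])  # what would be passed as 'input_length'
print(len(sentences), input_length)
```

The rows of padded are the kind of input the embedding layer expects: equal-length lists of integers smaller than input_dim.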
