Introduction to Word Embedding
Deep Learning, Natural Language Processing
Word embedding is a method of capturing the “meaning” of a word in a low-dimensional vector, and it can be used in a variety of Natural Language Processing (NLP) tasks.
Before beginning the word embedding tutorial, we should have an understanding of vector spaces and similarity metrics.
Vector Space
A sequence of numbers used to identify a point in space is called a vector, and a whole collection of vectors that all belong to the same dataset is, informally, called a vector space.
Words in a text can also be represented as points in a higher-dimensional vector space, where words with similar meanings have similar representations. For example,
Photo by Allison Parrish, from GitHub
The above image shows a vector representation of words on two scales: the cuteness and the size of animals. We can see that there is a semantic relationship between words on the basis of similar properties. Relationships between words in higher dimensions are difficult to draw, but the maths behind them is the same, so everything works similarly in higher dimensions too.
Similarity metrics
A similarity metric measures the similarity or distance between two data points in the vector space. This lets us capture the fact that words used in similar ways end up with similar representations, which naturally captures their meaning. There are many similarity metrics available, but we will discuss Euclidean distance and cosine similarity.
Euclidean distance
One way to calculate how far apart two data points are in a vector space is to calculate the Euclidean distance.
import math

def distance2d(x1, y1, x2, y2):
    # Euclidean distance between the points (x1, y1) and (x2, y2)
    return math.sqrt((x1 - x2)**2 + (y1 - y2)**2)
So the distance between “capybara” (70, 30) and “panda” (74, 40) in the image example above is less than the distance between “tarantula” and “elephant” in the same example.
This shows that “panda” and “capybara” are more similar to each other than “tarantula” and “elephant” are.
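The comparison above can be reproduced with a small sketch. The capybara and panda coordinates are the ones quoted in the text; the tarantula and elephant coordinates are not given in this article, so the values below are placeholders chosen only to illustrate the comparison.

```python
import math

def distance2d(x1, y1, x2, y2):
    """Euclidean distance between the points (x1, y1) and (x2, y2)."""
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

# Coordinates quoted in the text.
capybara = (70, 30)
panda = (74, 40)

# ASSUMPTION: placeholder coordinates, not taken from the article.
tarantula = (8, 3)
elephant = (65, 90)

d_close = distance2d(*capybara, *panda)
d_far = distance2d(*tarantula, *elephant)

print("capybara <-> panda:", round(d_close, 2))
print("tarantula <-> elephant:", round(d_far, 2))
print("capybara/panda are closer:", d_close < d_far)
```

A smaller distance means the two words sit closer together in the space, which is what we read as "more similar".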
Cosine similarity
Cosine similarity measures the similarity between two non-zero vectors of an inner product space as the cosine of the angle between them.
from numpy import dot
from numpy.linalg import norm

# a and b are two non-zero vectors of the same length
cos_sim = dot(a, b)/(norm(a)*norm(b))
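The formula above leaves a and b undefined, so here is a self-contained sketch of the same calculation. It reuses the capybara and panda vectors from the Euclidean example; the helper name cosine_similarity and the two extra test vectors are my own additions for illustration.

```python
import numpy as np
from numpy.linalg import norm

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (norm(a) * norm(b)))

# Vectors quoted earlier in the text.
capybara = [70, 30]
panda = [74, 40]
sim = cosine_similarity(capybara, panda)

# Illustrative extremes: same direction gives 1, a right angle gives 0.
same_direction = cosine_similarity([1, 2], [2, 4])
orthogonal = cosine_similarity([1, 0], [0, 1])

print(sim, same_direction, orthogonal)
```

Unlike Euclidean distance, cosine similarity ignores the length of the vectors and only looks at their direction, which is why [1, 2] and [2, 4] count as identical here.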
What is word embedding, and why do we use it?
In simple words, word embeddings are vector representations of the words in sentences, documents, etc.
Word embedding is a learned representation of words in the form of numeric vectors. It learns a dense, distributed representation for a predefined, fixed-size vocabulary from a corpus of text. The word embedding representation is capable of revealing many hidden relationships between words. For example, vector(“king”) - vector(“lords”) is similar to vector(“queen”) - vector(“princess”).
It is an improvement over traditional ways of representing words, such as the bag-of-words model, which produces large sparse vectors that are computationally impractical when they must cover an entire vocabulary. These representations are sparse because the vocabulary is vast: a given word or document is represented by a large vector made up mostly of zero values.
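To make the sparsity point concrete, here is a minimal bag-of-words sketch. The two sentences are made up for illustration, and the helper bag_of_words is a hypothetical name, not part of any library.

```python
from collections import Counter

# ASSUMPTION: two made-up documents, used only for illustration.
docs = ["the movie was good", "the film was not good at all"]

# The vocabulary is every unique word across all documents, in sorted order.
vocab = sorted({word for doc in docs for word in doc.split()})

def bag_of_words(doc):
    """Count how often each vocabulary word occurs in doc."""
    counts = Counter(doc.split())
    return [counts.get(word, 0) for word in vocab]

vec = bag_of_words(docs[0])
zeros = vec.count(0)

print(vocab)
print(vec)
print(f"{zeros} of {len(vec)} entries are zero")
```

Even with a vocabulary of eight words, half of the entries for the first sentence are zero. With a real vocabulary of tens of thousands of words, almost every entry is zero, whereas a word embedding keeps each word in a short dense vector.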
Two popular methods of learning word embeddings from text are:
1. Word2Vec.
2. GloVe.
There are pre-trained models that were trained on a large corpus of text, and we can use them for our own use case.
In addition to these methods, a word embedding can be learned as part of a deep learning model. This can be a slower approach, but we can design it for our own use case: the model is trained on a specific training dataset, as per our own requirements. Keras provides a very easy and flexible Embedding layer that can be used in neural networks on text data.
Importing Modules
Let’s get started by importing our modules and dataset and checking the dataset's head. I took a dataset from Kaggle, IMDB Movie Review-NLP.
import pandas as pd
import numpy as np
from numpy import array
from keras.preprocessing.text import one_hot, Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
# Embedding is imported from keras.layers together with the other layers
from keras.layers import Dense, Flatten, Embedding
We’ll use Scikit-learn to divide our dataset into a training set and a test set. We’ll train the word embedding on 70% of the data and test it on the remaining 30%.
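The split code itself is not shown here. As a dependency-free illustration of what a 70/30 split does, the sketch below shuffles a list and cuts it at 70%. It is a stand-in written with the standard library, not the Scikit-learn call; the function name split_70_30 and the toy reviews list are hypothetical.

```python
import random

def split_70_30(items, seed=0):
    """Shuffle items reproducibly, then return (first 70%, remaining 30%)."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = len(shuffled) * 7 // 10  # integer arithmetic avoids float rounding
    return shuffled[:cut], shuffled[cut:]

# ASSUMPTION: ten placeholder reviews instead of the real dataset.
reviews = [f"review {i}" for i in range(10)]
train, test = split_70_30(reviews)

print(len(train), "training items,", len(test), "test items")
```

Shuffling before cutting matters: if the file happens to be ordered by label, an unshuffled split would give the two sets very different label distributions.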
Integer Encoding All the Documents
After this, every unique word will be represented by an integer. For this, we use the one_hot function available in Keras. Note that vocab_size is set from the total number of unique words. Because one_hot assigns integers by hashing, two different words can still collide on the same integer, so a vocab_size comfortably larger than the number of unique words makes a unique encoding for each word more likely.
One important thing to note is that the integer encoding of a word remains the same across different texts, e.g. ‘year’ is denoted by 23518 in each and every document.
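The idea of hashing words to integers can be sketched without Keras. The function below is an illustration of the hashing approach under my own assumptions (lowercasing, whitespace splitting, an MD5 hash); it is not the Keras implementation, and hash_encode and the two example sentences are hypothetical.

```python
import hashlib

def hash_encode(text, vocab_size):
    """Map each word to an integer in [1, vocab_size - 1] by hashing it.

    Index 0 is never produced, so it stays free for padding.
    """
    words = text.lower().split()
    return [
        int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % (vocab_size - 1) + 1
        for word in words
    ]

vocab_size = 50  # ASSUMPTION: a small illustrative value

doc_a = hash_encode("a great year for film", vocab_size)
doc_b = hash_encode("the worst year so far", vocab_size)

print(doc_a)
print(doc_b)
# "year" is the third word in both documents and gets the same integer.
print(doc_a[2] == doc_b[2])
```

Because the integer depends only on the word itself, the same word always receives the same integer in every document. The flip side is that two different words can land on the same integer when vocab_size is small.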
Let’s now have a look at one of the reviews. We’ll compare this sentence with its transformed versions as we move through the next steps.
I really didn't like this movie because it didn't really bring across the messages and ideas L'Engle brought out in her novel. We had read the novel in our English class and i absolutely loved it, i'm afraid i can't say the same for the film. There were some serious differences between the novel and the adapted version and it just didn't do any credit to the imaginative genius that is Madeleine L'Engle! This is the reason i gave it such a poor rating. Don't see this movie if you are a big fan of L'Engle's texts because you will be sorely disappointed. However, if you are watching the movie for entertainment purposes (or educational as was my case) then it is an alright movie!
This review will be converted into an integer representation in which each number represents a unique word.
[24608, 32542, 30289, 58025, 50966, 19624, 43296, 35850, 30289, 32542, 31519, 11569, 30465, 7968, 12928, 34105, 8750, 49668, 38039, 40264, 3503, 45016, 63074, 41404, 53275, 30465, 45016, 40264, 28666, 47101, 44909, 12928, 24608, 62202, 46727, 35850, 24425, 5515, 24608, 25601, 35725, 30465, 10577, 55918, 30465, 13875, 62286, 22967, 5067, 9001, 33291, 1247, 30465, 45016, 12928, 30465, 23555, 44142, 12928, 35850, 41976, 30289, 20229, 15687, 7845, 50705, 30465, 58301, 14031, 11556, 1495, 26143, 8750, 50966, 1495, 30465, 63056, 24608, 39847, 35850, 30936, 54227, 33469, 55622, 8193, 3111, 50966, 19624, 9403, 51670, 40033, 54227, 42254, 52367, 44935, 63226, 17625, 43296, 51670, 65642, 30053, 42863, 34757, 32894, 9403, 51670, 40033, 1112, 30465, 19624, 55918, 55169, 57666, 10193, 50176, 59413, 10480, 63135, 56156, 64520, 35850, 1495, 49938, 59074, 19624]
Padding the Text (to make every text the same length)
The Keras Embedding layer requires all individual documents to be of the same length. Hence we will pad the shorter documents with 0 for now. In the Keras Embedding layer, ‘input_length’ will therefore be equal to the length (i.e. the number of words) of the longest document.
To pad the shorter documents, I am using the pad_sequences function from the Keras library.
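What front-padding does can be sketched in plain Python. This is an illustration of the behaviour, with zeros added at the start of shorter sequences, and not the Keras function itself; pad_pre and the three toy documents are hypothetical.

```python
def pad_pre(sequences, max_length=None, value=0):
    """Left-pad every sequence with `value` up to max_length.

    If max_length is not given, the longest sequence sets it.
    Sequences longer than max_length keep only their last max_length items.
    """
    if max_length is None:
        max_length = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        seq = list(seq)
        if len(seq) > max_length:
            seq = seq[len(seq) - max_length:]
        padded.append([value] * (max_length - len(seq)) + seq)
    return padded

# ASSUMPTION: three short integer-encoded documents for illustration.
docs = [[5, 12, 7], [9], [3, 4, 8, 1, 2]]
padded = pad_pre(docs)

for row in padded:
    print(row)
```

After padding, every row has the length of the longest document, which is the property the Embedding layer needs.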
The maximum number of words in any document is : 1719
Here, we found that the longest document holds 1719 words, so we will pad according to that. Padding adds zeros (0) to every sentence shorter than max_length, and the zeros are added at the beginning of the sentence.
For example:
array([ 0, 0, 0, ..., 32875, 18129, 60728])
Creating the Embeddings Using the Keras Embedding Layer
Now all the texts are of the same length (after padding), so we are ready to create and use the embedding layer.
Parameters of the Embedding layer:
‘input_dim’ = the vocab size that we choose. It is the number of unique words in the vocabulary.
‘output_dim’ = the number of dimensions we wish to embed into. Every word is represented by a vector of this same dimension.
‘input_length’ = the length of the longest text, which is stored in the maxlen variable in the example.
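Conceptually, an embedding layer is a lookup table with one row per vocabulary index. The NumPy sketch below illustrates how the three parameters shape that lookup; it is not Keras code, the table is filled with random numbers rather than learned values, and the small sizes are assumptions chosen to keep the output readable.

```python
import numpy as np

input_dim = 10     # ASSUMPTION: tiny vocabulary size for illustration
output_dim = 8     # number of dimensions per word
input_length = 5   # ASSUMPTION: padded document length for illustration

# One row of output_dim numbers per vocabulary index (random here, learned in practice).
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(input_dim, output_dim))

# A padded, integer-encoded document of length input_length.
padded_doc = np.array([0, 0, 3, 7, 3])

# Looking up each integer returns its row: shape (input_length, output_dim).
embedded = embedding_matrix[padded_doc]

# Flattening concatenates the word vectors into one long vector.
flattened = embedded.reshape(-1)

print(embedded.shape)
print(flattened.shape)
```

The word with index 3 appears twice in the document and receives exactly the same vector both times, which is the sense in which each word has one fixed-size representation.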
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_1 (Embedding)      (None, 1719, 8)           527680
_________________________________________________________________
flatten_1 (Flatten)          (None, 13752)             0
_________________________________________________________________
dense_1 (Dense)              (None, 1)                 13753
=================================================================
Total params: 541,433
Trainable params: 541,433
Non-trainable params: 0
_________________________________________________________________
None
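The numbers in the summary can be checked with a little arithmetic. The sketch below uses only values printed in the summary; the vocabulary size is not stated in the text, so it is derived here from the embedding parameter count.

```python
# Values read from the model summary above.
output_dim = 8
input_length = 1719
embedding_params = 527680

# The embedding table holds one output_dim-sized vector per vocabulary index.
vocab_size = embedding_params // output_dim

# Flatten concatenates input_length vectors of size output_dim.
flatten_units = input_length * output_dim

# A Dense layer with one unit has one weight per input plus one bias.
dense_params = flatten_units * 1 + 1

total_params = embedding_params + dense_params

print(vocab_size, flatten_units, dense_params, total_params)
```

Almost all of the parameters sit in the embedding table, so the size of the vocabulary largely determines the size of this model.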
Let’s now check the model accuracy on our training set.
6000/6000 [==============================] - 1s 170us/step
Training Accuracy is 100.0
The next step is to check its accuracy on the test set.
4000/4000 [==============================] - 1s 179us/step
Testing Accuracy is 86.57500147819519
We get a training accuracy of 100% because the embedding was trained on that very data, whereas the test data contains some words the model has not seen, so the accuracy there is a bit lower.
In practice, I would recommend starting from a pre-trained embedding that is kept fixed and performing the learning on top of it. That will often improve performance on test data.
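As a rough sketch of how a fixed pre-trained embedding is usually prepared, the code below builds an embedding matrix whose rows are copied from pre-trained vectors. Everything in it is a toy assumption: the pretrained dictionary stands in for vectors loaded from a Word2Vec or GloVe file, and word_index stands in for the word-to-integer mapping of our own dataset.

```python
import numpy as np

# ASSUMPTION: toy 3-dimensional "pre-trained" vectors for illustration only.
pretrained = {
    "good": np.array([0.9, 0.1, 0.0]),
    "bad": np.array([-0.8, 0.2, 0.1]),
}

# ASSUMPTION: hypothetical word-to-integer mapping; index 0 is kept for padding.
word_index = {"good": 1, "bad": 2, "capybara": 3}

dim = 3
embedding_matrix = np.zeros((len(word_index) + 1, dim))

for word, i in word_index.items():
    vector = pretrained.get(word)
    if vector is not None:
        # Words found in the pre-trained vocabulary get their pre-trained vector.
        embedding_matrix[i] = vector
    # Words that are missing keep an all-zero row.

print(embedding_matrix)
```

A matrix like this would then be loaded into the embedding layer, with the layer marked as non-trainable so that only the layers on top of it are learned.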
What’s Next
Now we have learned how to represent words as vectors of continuous numbers. Compared to other forms of text representation, such as bag-of-words or TF-IDF (term frequency-inverse document frequency), word embedding captures the semantic relationships between words much better. It can significantly improve the performance of natural language processing (NLP) tasks.
Now, I would suggest you try word embedding yourself on your own NLP task, and you will likely find a significant improvement in performance. You can also experiment on the same dataset with pre-trained word embeddings such as Word2Vec, keeping them fixed and performing learning on top of them.
Most often, you will notice that pre-trained models have a higher accuracy on the test set, because they have already been trained on a large variety of NLP datasets. But if you have enough data and want to perform a specific task, then training your own word embedding will be a better choice.
The code for Word Embedding is available on GitHub.
Thanks for reading. I hope this helps you understand word embedding and its importance in natural language processing (NLP).
Follow me on Medium. As always, I welcome feedback and constructive criticism, and I can be reached on LinkedIn.
Translated from: https://medium.com/towards-artificial-intelligence/introduction-to-word-embedding-5ba5cf97d296