TensorFlow 2 Quick Start: Word Embeddings

Published: 2025/4/5

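Representing text as numbers

Machine learning models take vectors (arrays of numbers) as input. When working with text, we must first come up with a strategy to convert strings to numbers (that is, to "vectorize" the text) before feeding it to the model. This section looks at three strategies for doing so.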

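One-hot encoding

As a first idea, we could "one-hot" encode each word in the vocabulary. Consider the sentence "The cat sat on the mat". The vocabulary (the unique words) in this sentence is (cat, mat, on, sat, the). To represent each word, we create a zero vector with length equal to the vocabulary size, then place a 1 at the index that corresponds to the word. The original post illustrates this with a figure; with the vocabulary in the order listed, "cat" is encoded as [1, 0, 0, 0, 0].

To create a vector that contains the encoding of the whole sentence, we can concatenate the one-hot vectors of its words.

Key point: this approach is inefficient. A one-hot encoded vector is sparse (most of its entries are zero). Suppose the vocabulary has 10,000 words. To one-hot encode each word, we would create a vector in which 99.99% of the elements are zero.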
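
The one-hot scheme described above can be sketched in a few lines of plain Python. This is only an illustration: the helper name one_hot and the choice to sort the vocabulary alphabetically are assumptions made for the example, not something the original code does.

```python
# Illustrative sketch: one-hot encode "The cat sat on the mat" in plain Python.
sentence = "The cat sat on the mat"
tokens = sentence.lower().split()

# Vocabulary: the unique words, sorted so that each word gets a stable index.
vocab = sorted(set(tokens))  # ['cat', 'mat', 'on', 'sat', 'the']
word_to_index = {word: i for i, word in enumerate(vocab)}


def one_hot(word):
    """Return a zero vector with a single 1 at the word's vocabulary index."""
    vector = [0] * len(vocab)
    vector[word_to_index[word]] = 1
    return vector


# Encode the sentence by concatenating the one-hot vector of every word.
sentence_encoding = [bit for word in tokens for bit in one_hot(word)]

# Fraction of zeros in a single one-hot vector for a 10,000-word vocabulary.
sparsity = (10_000 - 1) / 10_000

print(one_hot("cat"))          # [1, 0, 0, 0, 0]
print(len(sentence_encoding))  # 30 (6 words x 5 vocabulary entries)
print(f"{sparsity:.2%}")       # 99.99%
```

Even for this six-word sentence the encoding holds 30 numbers, only 6 of which are non-zero, which is the inefficiency the key point refers to.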

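Encoding each word with a unique number

A second approach is to encode each word with a unique number. Continuing the example above, we could assign 1 to "cat", 2 to "mat", and so on. We could then encode the sentence "The cat sat on the mat" as a dense vector such as [5, 1, 4, 3, 5, 2]. This approach is efficient: instead of a sparse vector, we now have a dense one (every element carries information).

However, this approach has two drawbacks:

  • The integer encoding is arbitrary (it does not capture any relationship between words).

  • An integer encoding can be hard for a model to interpret. A linear classifier, for example, learns a single weight for each feature. Because there is no relationship between how similar two words are and how similar their encodings are, this combination of feature and weight is not meaningful.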
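
A minimal sketch of the integer encoding, assuming the ids are assigned in alphabetical order starting from 1 (which reproduces the vector quoted above). The variable names are illustrative only.

```python
# Illustrative sketch: encode each word with a unique integer id.
tokens = "The cat sat on the mat".lower().split()
vocab = sorted(set(tokens))  # ['cat', 'mat', 'on', 'sat', 'the']

# Assign ids starting at 1, matching the example in the text.
word_to_id = {word: i for i, word in enumerate(vocab, start=1)}

encoded = [word_to_id[word] for word in tokens]
print(encoded)  # [5, 1, 4, 3, 5, 2]

# The ids are arbitrary: numeric distance says nothing about meaning.
gap_cat_mat = abs(word_to_id["cat"] - word_to_id["mat"])  # 1
gap_cat_the = abs(word_to_id["cat"] - word_to_id["the"])  # 4
print(gap_cat_mat, gap_cat_the)  # 1 4
```

Here "cat" ends up numerically closer to "mat" than to "the" purely because of alphabetical order, which illustrates the first drawback: the numbers carry no information about how the words relate.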

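Word embeddings

Word embeddings give us an efficient, dense representation in which similar words have similar encodings. Importantly, we do not have to specify this encoding by hand. An embedding is a dense vector of floating-point values (the length of the vector is a parameter you choose). Rather than being set manually, the values are trainable parameters: weights the model learns during training, in the same way it learns the weights of a dense layer. 8-dimensional word embeddings are common for small datasets, and embeddings can go up to 1024 dimensions for large datasets. A higher-dimensional embedding can capture fine-grained relationships between words, but needs more data to learn.

The original post shows a diagram of a word embedding here, in which each word is represented as a 4-dimensional vector of floating-point values. Another way to think of an embedding is as a "lookup table". Once the weights have been learned, we encode each word by looking up its dense vector in the table.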
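
The "lookup table" view can be sketched with NumPy. This is an illustration only: the table here is filled with random numbers standing in for learned weights, and the names embedding_table and embed are assumptions made for the example.

```python
import numpy as np

# Illustrative sketch: an embedding as a lookup table, one row per word.
vocab = ["cat", "mat", "on", "sat", "the"]
word_to_index = {word: i for i, word in enumerate(vocab)}

# Random values stand in for weights a model would learn during training.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4)).astype(np.float32)


def embed(word):
    """Look up the dense 4-dimensional vector for a word."""
    return embedding_table[word_to_index[word]]


vec = embed("cat")
print(vec.shape)  # (4,)

# Encoding a sentence is one lookup per word.
sentence = np.stack([embed(w) for w in "the cat sat on the mat".split()])
print(sentence.shape)  # (6, 4)

# Both occurrences of "the" map to the same row of the table.
print(np.array_equal(sentence[0], sentence[4]))  # True
```

Compared with one-hot encoding, each word costs 4 floats regardless of vocabulary size, and training is free to move the rows of related words closer together.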


Now let's look at the code.

import io
import os
import re
import shutil
import string
import tensorflow as tf

from datetime import datetime
from tensorflow.keras import Model, Sequential
from tensorflow.keras.layers import Activation, Dense, Embedding, GlobalAveragePooling1D
from tensorflow.keras.layers.experimental.preprocessing import TextVectorization

print(tf.__version__)
"""
Output: 2.5.0-dev20201226
"""

Downloading the IMDB dataset

url = "https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz"

# Download the archive into the current directory and extract it.
dataset = tf.keras.utils.get_file("aclImdb_v1.tar.gz", url,
                                  untar=True, cache_dir='.',
                                  cache_subdir='')

dataset_dir = os.path.join(os.path.dirname(dataset), 'aclImdb')
os.listdir(dataset_dir)
"""
Output:
Downloading data from https://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
84131840/84125825 [==============================] - 138s 2us/step
['imdb.vocab', 'imdbEr.txt', 'README', 'test', 'train']
"""

The train folder contains two folders of movie reviews, pos and neg, whose contents are labeled positive and negative respectively. You can use the data in these two folders to train a binary classification model.

train_dir = os.path.join(dataset_dir, 'train')
os.listdir(train_dir)
"""
Output:
['labeledBow.feat', 'neg', 'pos', 'unsup', 'unsupBow.feat', 'urls_neg.txt', 'urls_pos.txt', 'urls_unsup.txt']
"""

Before creating the dataset, remove any extra folders such as unsup, so that only the pos and neg class folders remain.

remove_dir = os.path.join(train_dir, 'unsup')
shutil.rmtree(remove_dir)

Next, create a tf.data.Dataset with the tf.keras.preprocessing.text_dataset_from_directory function. Use the data in the train folder to build both the training and the validation datasets, holding out 20% for validation (validation_split=0.2).

batch_size = 1024
seed = 123

# Use the same seed in both calls so the two subsets do not overlap.
train_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2,
    subset='training', seed=seed)
val_ds = tf.keras.preprocessing.text_dataset_from_directory(
    'aclImdb/train', batch_size=batch_size, validation_split=0.2,
    subset='validation', seed=seed)
"""
Output:
Found 25000 files belonging to 2 classes.
Using 20000 files for training.
Found 25000 files belonging to 2 classes.
Using 5000 files for validation.
"""
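
Why both calls pass the same seed can be sketched in plain Python. This is a conceptual illustration, not Keras's actual implementation: the helper split_indices is hypothetical, and taking the last 20% of the shuffled order as the validation subset is an assumption made for the sketch.

```python
import random


def split_indices(num_files, validation_split, seed):
    """Shuffle file indices with a fixed seed, then split off the tail."""
    indices = list(range(num_files))
    random.Random(seed).shuffle(indices)
    num_val = round(validation_split * num_files)
    return indices[:-num_val], indices[-num_val:]


# Two independent calls, as in the code above: one per subset.
train_idx, _ = split_indices(25_000, 0.2, seed=123)
_, val_idx = split_indices(25_000, 0.2, seed=123)

print(len(train_idx), len(val_idx))        # 20000 5000
print(set(train_idx).isdisjoint(val_idx))  # True
```

Because each subset is produced by a separate call, the shuffle has to be reproducible. With different seeds the two calls would shuffle the files differently, and some reviews could land in both the training and the validation subset.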

Inspect a few reviews from the dataset together with their labels (0 is negative, 1 is positive).

for text_batch, label_batch in train_ds.take(1):
    for i in range(5):
        print(label_batch[i].numpy(), text_batch.numpy()[i])
"""
Output:
0 b"Oh My God! Please, for the love of all that is holy, Do Not Watch This Movie! It it 82 minutes of my life I will never get back. Sure, I could have stopped watching half way through. But I thought it might get better. It Didn't. Anyone who actually enjoyed this movie is one seriously sick and twisted individual. No wonder us Australians/New Zealanders have a terrible reputation when it comes to making movies. Everything about this movie is horrible, from the acting to the editing. I don't even normally write reviews on here, but in this case I'll make an exception. I only wish someone had of warned me before I hired this catastrophe"
1 b'This movie is SOOOO funny!!! The acting is WONDERFUL, the Ramones are sexy, the jokes are subtle, and the plot is just what every high schooler dreams of doing to his/her school. I absolutely loved the soundtrack as well as the carefully placed cynicism. If you like monty python, You will love this film. This movie is a tad bit "grease"esk (without all the annoying songs). The songs that are sung are likable; you might even find yourself singing these songs once the movie is through. This musical ranks number two in musicals to me (second next to the blues brothers). But please, do not think of it as a musical per say; seeing as how the songs are so likable, it is hard to tell a carefully choreographed scene is taking place. I think of this movie as more of a comedy with undertones of romance. You will be reminded of what it was like to be a rebellious teenager; needless to say, you will be reminiscing of your old high school days after seeing this film. Highly recommended for both the family (since it is a very youthful but also for adults since there are many jokes that are funnier with age and experience.'
0 b"Alex D. Linz replaces Macaulay Culkin as the central figure in the third movie in the Home Alone empire. Four industrial spies acquire a missile guidance system computer chip and smuggle it through an airport inside a remote controlled toy car. Because of baggage confusion, grouchy Mrs. Hess (Marian Seldes) gets the car. She gives it to her neighbor, Alex (Linz), just before the spies turn up. The spies rent a house in order to burglarize each house in the neighborhood until they locate the car. Home alone with the chicken pox, Alex calls 911 each time he spots a theft in progress, but the spies always manage to elude the police while Alex is accused of making prank calls. The spies finally turn their attentions toward Alex, unaware that he has rigged devices to cleverly booby-trap his entire house. Home Alone 3 wasn't horrible, but probably shouldn't have been made, you can't just replace Macauley Culkin, Joe Pesci, or Daniel Stern. Home Alone 3 had some funny parts, but I don't like when characters are changed in a movie series, view at own risk."
0 b"There's a good movie lurking here, but this isn't it. The basic idea is good: to explore the moral issues that would face a group of young survivors of the apocalypse. But the logic is so muddled that it's impossible to get involved.<br /><br />For example, our four heroes are (understandably) paranoid about catching the mysterious airborne contagion that's wiped out virtually all of mankind. Yet they wear surgical masks some times, not others. Some times they're fanatical about wiping down with bleach any area touched by an infected person. Other times, they seem completely unconcerned.<br /><br />Worse, after apparently surviving some weeks or months in this new kill-or-be-killed world, these people constantly behave like total newbs. They don't bother accumulating proper equipment, or food. They're forever running out of fuel in the middle of nowhere. They don't take elementary precautions when meeting strangers. And after wading through the rotting corpses of the entire human race, they're as squeamish as sheltered debutantes. You have to constantly wonder how they could have survived this long... and even if they did, why anyone would want to make a movie about them.<br /><br />So when these dweebs stop to agonize over the moral dimensions of their actions, it's impossible to take their soul-searching seriously. Their actions would first have to make some kind of minimal sense.<br /><br />On top of all this, we must contend with the dubious acting abilities of Chris Pine. His portrayal of an arrogant young James T Kirk might have seemed shrewd, when viewed in isolation. But in Carriers he plays on exactly that same note: arrogant and boneheaded. It's impossible not to suspect that this constitutes his entire dramatic range.<br /><br />On the positive side, the film *looks* excellent. It's got an over-sharp, saturated look that really suits the southwestern US locale. But that can't save the truly feeble writing nor the paper-thin (and annoying) characters. Even if you're a fan of the end-of-the-world genre, you should save yourself the agony of watching Carriers."
0 b'I saw this movie at an actual movie theater (probably the $2.00 one) with my cousin and uncle. We were around 11 and 12, I guess, and really into scary movies. I remember being so excited to see it because my cool uncle let us pick the movie (and we probably never got to do that again!) and sooo disappointed afterwards!! Just boring and not scary. The only redeeming thing I can remember was Corky Pigeon from Silver Spoons, and that wasn\'t all that great, just someone I recognized. I\'ve seen bad movies before and this one has always stuck out in my mind as the worst. This was from what I can recall, one of the most boring, non-scary, waste of our collective $6, and a waste of film. I have read some of the reviews that say it is worth a watch and I say, "Too each his own", but I wouldn\'t even bother. Not even so bad it\'s good.'
"""

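Configuring the dataset for performance

These are two important methods you should use when loading data, to make sure that I/O does not become blocking:

  • .cache(): keeps the data in memory after it has been loaded from disk. This ensures the dataset does not become a bottleneck while training the model. If the dataset is too large to fit in memory, you can also use this method to create a performant on-disk cache, which is more efficient to read than many small files.
  • .prefetch(): overlaps data preprocessing with model training, so the next batch is prepared while the current one is being used.

AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)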
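
The idea behind .cache() can be sketched in plain Python. This is a toy model of the concept only, not how tf.data implements it: the class CachedDataset and the function slow_load are hypothetical names invented for the illustration.

```python
class CachedDataset:
    """Toy stand-in for the idea behind Dataset.cache(): load once, reuse."""

    def __init__(self, load_fn, num_items):
        self._load_fn = load_fn
        self._num_items = num_items
        self._cache = None

    def __iter__(self):
        if self._cache is None:
            # First pass: read every item from the slow source and remember it.
            items = []
            for i in range(self._num_items):
                item = self._load_fn(i)
                items.append(item)
                yield item
            self._cache = items
        else:
            # Later passes: serve items from memory, no loading needed.
            yield from self._cache


load_log = []  # records every call that reaches the slow source


def slow_load(i):
    """Pretend to read one review file from disk."""
    load_log.append(i)
    return f"review-{i}"


ds = CachedDataset(slow_load, num_items=3)
for epoch in range(2):
    epoch_items = list(ds)

print(len(load_log))  # 3 (the second epoch did not touch the slow source)
```

In the real pipeline the first epoch pays the cost of reading the review files, and later epochs reuse the cached elements; .prefetch() then prepares upcoming batches while the model is busy with the current one.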

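Using the Embedding layer

The Embedding layer can be understood as a lookup table that maps integer indices (each standing for a specific word) to dense vectors (that word's embedding). The best embedding dimensionality is something you find by experiment, in the same way you would tune the number of neurons in a Dense layer.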

# Embed a 1,000-word vocabulary into 5 dimensions:
# input_dim=1000 (vocabulary size), output_dim=5 (length of each vector).
embedding_layer = tf.keras.layers.Embedding(1000, 5)

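When you create an Embedding layer, its weights are randomly initialized, just like those of any other layer. During training they are gradually adjusted through backpropagation. Once trained, the learned embeddings roughly encode similarities between words, as they were learned for the specific problem your model was trained on.

If you pass integers to an embedding layer, the result replaces each integer with the corresponding vector from the embedding table.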

result = embedding_layer(tf.constant([1, 2, 3]))
result.numpy()
"""
Output:
array([[-0.01827962,  0.033703  ,  0.02065292,  0.00335936, -0.00998179],
       [ 0.00618695, -0.02138543, -0.01288087,  0.03814398, -0.02176479],
       [-0.02900024,  0.03794893, -0.03229412,  0.04951945,  0.03212232]],
      dtype=float32)
"""

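For text or sequence problems, the Embedding layer takes a 2D tensor of integers with shape (samples, sequence_length), where each entry is a sequence of integers. It can embed sequences of variable length: you could feed the layer batches of shape (32, 10) (a batch of 32 sequences of length 10) or (64, 15) (a batch of 64 sequences of length 15).

The returned tensor has one more axis than the input, and the embedding vectors are aligned along the new last axis. Pass it a (2, 3) input batch and the output has shape (2, 3, N), where N is the embedding dimensionality.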

result = embedding_layer(tf.constant([[0, 1, 2], [3, 4, 5]]))
result.shape
"""
Output: TensorShape([2, 3, 5])
"""

Given a batch of sequences as input, the Embedding layer returns a 3D floating-point tensor of shape (samples, sequence_length, embedding_dimensionality).
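
The shape rule can be checked without TensorFlow by treating the embedding table as a NumPy array and indexing it with a batch of integer ids. This is a sketch under stated assumptions: the table is a stand-in for the layer's weights, and the small uniform range used to fill it is an assumption chosen to resemble the values printed earlier.

```python
import numpy as np

vocab_size, embedding_dim = 1000, 5
rng = np.random.default_rng(0)

# Stand-in for the layer's weight matrix: one row of 5 floats per word id.
table = rng.uniform(-0.05, 0.05, size=(vocab_size, embedding_dim)).astype(np.float32)

shapes = {}
for batch_shape in [(2, 3), (32, 10), (64, 15)]:
    # A batch of integer word ids with shape (samples, sequence_length).
    ids = rng.integers(0, vocab_size, size=batch_shape)
    # Indexing with an integer array replaces every id with its row.
    out = table[ids]
    shapes[batch_shape] = out.shape
    print(batch_shape, "->", out.shape)
# (2, 3) -> (2, 3, 5)
# (32, 10) -> (32, 10, 5)
# (64, 15) -> (64, 15, 5)
```

The same table serves every batch shape because the lookup is applied per id. Only the new last axis, of length embedding_dim, is fixed.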


Code: reply "05" to the WeChat official account 明天依旧可好 to get the code.

Note: this article is based on the official tutorial, which has been abridged here, with some comments and modifications added.
