TensorFlow Natural Language Processing

Published: 2023/12/14

Contents

Preface
Basics
- Using the API
- Text to sequences
- Padding
News headlines dataset for sarcasm detection

Preface

Basics

Using the API

from tensorflow.keras.preprocessing.text import Tokenizer

sentenses = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentenses)   # scan the texts and build the vocabulary
word_index = tokenizer.word_index   # dict mapping each word to its integer token
print(word_index)

Output:

{'love': 1, 'my': 2, 'i': 3, 'dog': 4, 'cat': 5, 'you': 6}

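num_words: the maximum number of words to keep, ranked by frequency. The limit is applied when texts are converted to sequences (only the num_words - 1 most frequent words are kept); word_index itself still lists every word seen.
tokenizer.fit_on_texts(): a Tokenizer method that scans the texts and builds the vocabulary (word counts and the word index).
The tokenizer strips punctuation automatically, so the exclamation mark does not appear in word_index. It also converts uppercase letters to lowercase.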
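
The behaviour described above can be imitated in a few lines of plain Python. This is a simplified sketch, not the Keras implementation: `tokenize`, `build_word_index`, and the `FILTERS` string are names made up for illustration, and the filter set is only assumed to resemble the Tokenizer's default punctuation filter.

```python
from collections import Counter

# Assumption: a punctuation set resembling the Tokenizer's default `filters`.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def tokenize(text):
    """Lowercase the text, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()


def build_word_index(texts):
    """Map each word to an integer, most frequent word first, starting at 1."""
    counts = Counter()
    for text in texts:
        counts.update(tokenize(text))
    # sorted() is stable, so words with equal counts keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {word: i for i, (word, _) in enumerate(ranked, start=1)}


sentences = ['I love my dog', 'I love my cat', 'You love my dog!']
word_index = build_word_index(sentences)
print(word_index)
```

On the three example sentences this reproduces the dictionary printed above: 'love' and 'my' occur three times each, so they get the smallest indices, and the '!' never becomes a token.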

Text to sequences

from tensorflow.keras.preprocessing.text import Tokenizer

sentenses = [
    'I love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?',
]

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentenses)   # scan the texts and build the vocabulary
word_index = tokenizer.word_index   # dict mapping each word to its integer token

# replace every word in every sentence with its token
sequences = tokenizer.texts_to_sequences(sentenses)

print(word_index)
print(sequences)

Output:

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

Append the following to the code above:

test_data = [
    'I really love my dog',
    'my dog loves my manatee',
]
test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

Output:

[[4, 2, 1, 3], [1, 3, 1]]

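Conclusion: the tokenizer needs to be fitted on a lot of data. Otherwise, as above, words it has never seen are silently dropped: 'really' disappears from the first sentence, and the second sentence collapses to "my dog my".

What happens if we represent unknown words with a special token instead of ignoring them?

Change the tokenizer to: tokenizer=Tokenizer(num_words=100,oov_token="<OOV>")

Output (word_index, then test_seq):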

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
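
The difference between dropping unknown words and mapping them to an OOV token can be sketched without TensorFlow. The function below is a hypothetical stand-in written for illustration (`simple_split` and `to_sequences` are not Keras names), and it assumes the word index has already been built.

```python
import string


def simple_split(text):
    """Lowercase, split on whitespace, and trim punctuation from each word."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    return [w for w in words if w]


def to_sequences(texts, word_index, oov_token=None):
    """Replace words with their indices; unknown words become the OOV index
    if an OOV token is given, otherwise they are skipped."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequences = []
    for text in texts:
        seq = []
        for word in simple_split(text):
            idx = word_index.get(word)
            if idx is not None:
                seq.append(idx)
            elif oov_index is not None:
                seq.append(oov_index)
            # no OOV token: the unknown word is silently dropped
        sequences.append(seq)
    return sequences


test_data = ['I really love my dog', 'my dog loves my manatee']

# Word index without an OOV entry (copied from the earlier output).
plain_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6,
               'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
# Word index with '<OOV>' reserved at position 1.
oov_index_map = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
                 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

dropped = to_sequences(test_data, plain_index)
marked = to_sequences(test_data, oov_index_map, oov_token='<OOV>')
print(dropped)
print(marked)
```

With the OOV token, each sequence keeps the same length as its sentence, so word positions are preserved even when the word itself is unknown.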

Padding

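from tensorflow.keras.preprocessing.sequence import pad_sequences

# tokenizer and sentenses come from the previous section (with oov_token="<OOV>")
sequences = tokenizer.texts_to_sequences(sentenses)

padded1 = pad_sequences(sequences)                            # pad at the front (default)
padded2 = pad_sequences(sequences, padding='post')            # pad at the end
padded3 = pad_sequences(sequences, padding='post', maxlen=5)  # pad at the end, cap length at 5

print(padded1)
print(padded2)
print(padded3)

Output:

[[ 0  0  0  5  3  2  4]
 [ 0  0  0  5  3  2  7]
 [ 0  0  0  6  3  2  4]
 [ 8  6  9  2  4 10 11]]
[[ 5  3  2  4  0  0  0]
 [ 5  3  2  7  0  0  0]
 [ 6  3  2  4  0  0  0]
 [ 8  6  9  2  4 10 11]]
[[ 5  3  2  4  0]
 [ 5  3  2  7  0]
 [ 6  3  2  4  0]
 [ 9  2  4 10 11]]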

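pad_sequences: truncates or pads multiple sequences to the same length.
padding: string, 'pre' or 'post'; whether to pad at the start or at the end of each sequence.
maxlen: integer, the maximum length of all sequences. Longer sequences are truncated, from the front by default, which is why the last row of padded3 starts at 9.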
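
The padding and truncation rules can be written out in plain Python. This is an illustrative sketch rather than the library's code: `pad` is a made-up name that returns nested lists instead of a NumPy array. The `truncating` argument mirrors the pad_sequences parameter of the same name, whose default is 'pre'.

```python
def pad(sequences, maxlen=None, padding="pre", truncating="pre", value=0):
    """Pad or truncate every sequence to the same length (lists in, lists out)."""
    if maxlen is None:
        # Default: the length of the longest sequence.
        maxlen = max((len(s) for s in sequences), default=0)
    result = []
    for s in sequences:
        s = list(s)
        if len(s) > maxlen:
            # 'pre' drops items from the front, 'post' drops them from the end.
            s = s[len(s) - maxlen:] if truncating == "pre" else s[:maxlen]
        fill = [value] * (maxlen - len(s))
        result.append(fill + s if padding == "pre" else s + fill)
    return result


sequences = [[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]

front = pad(sequences)
back = pad(sequences, padding="post")
capped = pad(sequences, padding="post", maxlen=5)
capped_keep_start = pad(sequences, padding="post", maxlen=5, truncating="post")

for row in capped:
    print(row)
```

Passing truncating="post" keeps the beginning of a long sequence instead of its end, which may be preferable when the first words of a text carry more signal.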

News headlines dataset for sarcasm detection

Dataset: CC0 public domain dataset for sarcasm detection

Name: News Headlines Dataset For Sarcasm Detection

Each record consists of three attributes:

is_sarcastic: 1 if the record is sarcastic otherwise 0
headline: the headline of the news article
article_link: link to the original news article. Useful for collecting supplementary data

Note: Laurence modified the dataset slightly for convenience.

import json

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

with open('sarcasm.json', 'r') as f:
    datastore = json.load(f)   # a list of records, each with the three fields above

sentences = []
labels = []
urls = []
for item in datastore:
    sentences.append(item['headline'])
    labels.append(item['is_sarcastic'])
    urls.append(item['article_link'])

tokenizer = Tokenizer(oov_token="<OOV>")
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(sentences)
padded = pad_sequences(sequences, padding='post')

print(padded[0])
print(padded.shape)

Output:

[  308 15115   679  3337  2298    48   382  2576 15116     6  2577  8434
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0     0     0     0     0     0     0     0     0
     0     0     0     0]
(26709, 40)

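padded.shape is (26709, 40): the dataset contains 26,709 headlines, and the longest headline has 40 words, so every row is padded to length 40. The number of distinct words is a separate quantity, len(word_index). Word indices are assigned in descending order of frequency, so the most frequent word gets the smallest index.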
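
To make the meaning of the two shape values concrete, here is a small sketch on three made-up records that follow the same three-field layout. The headlines and links are invented for illustration and are not taken from the real dataset. Words are split on whitespace only, which is a simplification of what the Tokenizer does.

```python
import json

# Hypothetical records in the same layout as sarcasm.json (not real data).
raw = '''[
  {"headline": "scientists confirm water is wet",
   "is_sarcastic": 1, "article_link": "https://example.com/a"},
  {"headline": "city council approves new budget",
   "is_sarcastic": 0, "article_link": "https://example.com/b"},
  {"headline": "local man confirms he is always right about new budget",
   "is_sarcastic": 1, "article_link": "https://example.com/c"}
]'''

datastore = json.loads(raw)
sentences = [item["headline"] for item in datastore]
labels = [item["is_sarcastic"] for item in datastore]

# First dimension of the padded array: one row per headline.
num_headlines = len(sentences)
# Second dimension: the length of the longest headline, in words.
max_len = max(len(h.split()) for h in sentences)
# Vocabulary size is independent of both dimensions.
vocab_size = len({w for h in sentences for w in h.lower().split()})

shape = (num_headlines, max_len)
print(shape, vocab_size)
```

Here the shape would be three rows by ten columns, while the vocabulary holds a different number of distinct words, which is the same relationship as between (26709, 40) and len(word_index) on the full dataset.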
