TensorFlow 2 Quick Start: Loading and Preprocessing Text

Published: 2025/4/5

Author: 明天依舊可好
Code: reply 04 in the WeChat official account 「明天依舊可好」
Full mind map: reply tf2思維導圖

import tensorflow as tf
import tensorflow_datasets as tfds
import os

print(tf.__version__)
"""
Output: 2.5.0-dev20201226
"""

Downloading the data

import pathlib

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL + name)
    print(text_dir)

parent_dir = os.path.dirname(text_dir)
# Raw string, so the backslashes in the Windows paths are not read as escapes
r"""
Output:
C:\Users\Administrator\.keras\datasets\cowper.txt
C:\Users\Administrator\.keras\datasets\derby.txt
C:\Users\Administrator\.keras\datasets\butler.txt
"""

Loading the text into datasets

Iterate over the files, loading each one into its own dataset.

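Each example needs to be labeled individually, so use tf.data.Dataset.map to assign a label to every example. This iterates over every example in the dataset and returns (example, label) pairs.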

def labeler(example, index):
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)
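To make the (example, label) structure concrete without downloading anything, here is a minimal plain-Python sketch of the same labeling idea. It does not use tf.data; the helper label_lines and the sample lines are hypothetical, made up only for illustration.

```python
from typing import Iterable, Iterator, List, Tuple


def label_lines(lines: Iterable[str], index: int) -> Iterator[Tuple[str, int]]:
    """Yield (example, label) pairs: every line gets the index of its file."""
    for line in lines:
        yield line, index


# Hypothetical stand-ins for the three text files (one list of lines each).
fake_files: List[List[str]] = [
    ["line a1", "line a2"],
    ["line b1"],
    ["line c1", "line c2"],
]

labeled: List[Tuple[str, int]] = []
for i, lines in enumerate(fake_files):
    labeled.extend(label_lines(lines, i))

print(labeled)
```

The label is simply the position of the source file in the list, which is why the labels in the article's output are 0, 1, or 2.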

Merge these labeled datasets into a single dataset, then shuffle it.

BUFFER_SIZE = 50000

# Merge all the labeled datasets into a single dataset
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

# Shuffle the data
all_labeled_data = all_labeled_data.shuffle(BUFFER_SIZE, reshuffle_each_iteration=False)

# Show the first 5 examples
for ex in all_labeled_data.take(5):
    print(ex)
"""
Output:
(<tf.Tensor: shape=(), dtype=string, numpy=b'Instructed duly, and himself, his steps'>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'not forget the threat that he had made Achilles, and called his trusty'>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(), dtype=string, numpy=b"Standing encompass'd by his dauntless troops,">, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'held Oechalia, the city of Oechalian Eurytus, these were commanded by'>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(), dtype=string, numpy=b'him."'>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
"""
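The role of BUFFER_SIZE is easier to see in a small model of buffer-based shuffling. The sketch below is plain Python, not TensorFlow's implementation: it keeps a buffer of at most buffer_size items, emits a random one whenever the buffer is full, and refills the freed slot from the input. The function name buffered_shuffle and its seed parameter are assumptions made for this illustration.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int,
                     seed: Optional[int] = None) -> Iterator[T]:
    """Shuffle a stream using a fixed-size buffer (illustrative sketch)."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            # Still filling the buffer: nothing is emitted yet.
            buffer.append(item)
            continue
        # Buffer is full: emit a random element and put the new item in its slot.
        pos = rng.randrange(len(buffer))
        yield buffer[pos]
        buffer[pos] = item
    # Input exhausted: drain what is left in random order.
    while buffer:
        yield buffer.pop(rng.randrange(len(buffer)))


data = list(range(10))
shuffled = list(buffered_shuffle(data, buffer_size=4, seed=0))
unshuffled = list(buffered_shuffle(data, buffer_size=1))
print(shuffled)
print(unshuffled)
```

With a buffer of size 1 the order never changes, and with a small buffer an item can only move a limited distance toward the front. A buffer at least as large as the dataset is what gives a full shuffle, which is the reason for choosing a large BUFFER_SIZE.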

Encoding text as numbers

Building the vocabulary

# The features.text module has been deprecated upstream; waiting for an update
# tokenizer = tfds.features.text.Tokenizer()
# For now, the tokenizer can be called through the module below
tokenizer = tfds.deprecated.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
vocab_size
"""
Output: 17178
"""
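Since the tokenizer above lives in a deprecated module, it can help to see what the vocabulary-building step amounts to. Below is a minimal plain-Python sketch, assuming a tokenizer that simply extracts runs of word characters with a regular expression; this only approximates the tfds tokenizer, and the names tokenize, build_vocabulary, and sample_lines are hypothetical.

```python
import re
from typing import Iterable, List, Set

# Assumption: a token is any run of word characters (letters, digits, underscore).
_TOKEN_RE = re.compile(r"\w+")


def tokenize(text: str) -> List[str]:
    """Split a line into tokens, dropping punctuation and whitespace."""
    return _TOKEN_RE.findall(text)


def build_vocabulary(lines: Iterable[str]) -> Set[str]:
    """Collect the set of distinct tokens seen across all lines."""
    vocab: Set[str] = set()
    for line in lines:
        vocab.update(tokenize(line))
    return vocab


# Two lines taken from the shuffled output shown earlier in the article.
sample_lines = [
    "Instructed duly, and himself, his steps",
    "Standing encompass'd by his dauntless troops,",
]
vocab = build_vocabulary(sample_lines)
print(sorted(vocab))
print(len(vocab))
```

Under this splitting rule the apostrophe in "encompass'd" separates it into two tokens, and the token "his" is counted once even though it appears in both lines.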

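Build an encoder by passing vocabulary_set to tfds.deprecated.text.TokenTextEncoder (formerly tfds.features.text.TokenTextEncoder). The encoder's encode method takes a line of text and returns a list of integers.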

encoder = tfds.deprecated.text.TokenTextEncoder(vocabulary_set)

def encode(text_tensor, label):
    encoded_text = encoder.encode(text_tensor.numpy())
    return encoded_text, label

def encode_map_fn(text, label):
    # py_func cannot set the shape of the returned tensors,
    # so the shapes are set manually below
    encoded_text, label = tf.py_function(encode, inp=[text, label], Tout=(tf.int64, tf.int64))
    encoded_text.set_shape([None])
    label.set_shape([])
    return encoded_text, label

all_encoded_data = all_labeled_data.map(encode_map_fn)

# Show the first 5 examples
for ex in all_encoded_data.take(5):
    print(ex)
"""
Output:
(<tf.Tensor: shape=(6,), dtype=int64, numpy=array([ 2724, 4813, 14154, 7272, 12376, 16442], dtype=int64)>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(13,), dtype=int64, numpy=
array([12719, 5246, 4778, 6683, 411, 11103, 4013, 14029, 13412,
       14154, 14991, 12376, 4255], dtype=int64)>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(7,), dtype=int64, numpy=array([14472, 14592, 8885, 4068, 12376, 16337, 11432], dtype=int64)>, <tf.Tensor: shape=(), dtype=int64, numpy=0>)
(<tf.Tensor: shape=(11,), dtype=int64, numpy=
array([10492, 5873, 4778, 14421, 15779, 9325, 4625, 15330, 9176,
       2358, 4068], dtype=int64)>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
(<tf.Tensor: shape=(1,), dtype=int64, numpy=array([4992], dtype=int64)>, <tf.Tensor: shape=(), dtype=int64, numpy=2>)
"""
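As a rough picture of what a token-to-integer encoder does, here is a small plain-Python sketch. It is not the tfds implementation: the class SimpleTokenEncoder is hypothetical, and its conventions (id 0 reserved for padding, one extra id for out-of-vocabulary tokens, ids assigned in sorted order) are assumptions chosen for this example.

```python
import re
from typing import Dict, Iterable, List


class SimpleTokenEncoder:
    """Map tokens to integer ids (illustrative sketch, not the tfds class)."""

    def __init__(self, vocabulary: Iterable[str]) -> None:
        # Sort the vocabulary so ids are reproducible; a set has no fixed order.
        # Ids start at 1 because 0 is reserved for padding (assumed convention).
        self._token_to_id: Dict[str, int] = {
            token: i for i, token in enumerate(sorted(vocabulary), start=1)
        }
        # Every unknown token shares a single out-of-vocabulary id.
        self.oov_id = len(self._token_to_id) + 1

    @property
    def vocab_size(self) -> int:
        # Known tokens, plus the padding id and the out-of-vocabulary id.
        return len(self._token_to_id) + 2

    def encode(self, text: str) -> List[int]:
        """Tokenize on runs of word characters and look up each token's id."""
        tokens = re.findall(r"\w+", text)
        return [self._token_to_id.get(token, self.oov_id) for token in tokens]


sketch_encoder = SimpleTokenEncoder({"his", "steps", "and"})
ids = sketch_encoder.encode("his steps, and more")
print(ids)
```

Each line becomes a list whose length equals its number of tokens, which is why the encoded tensors in the article have different shapes, such as (6,) and (13,), and why encode_map_fn declares the text shape as [None].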

Note: this article is based on the official TensorFlow tutorial, which has been trimmed, partly annotated, and modified.
