TensorFlow Study Notes: Saving and Loading Data with TFRecord

Published: 2023/12/20

This article shows how to build your own image dataset as a TFRecord file with TensorFlow, and how to work with it through the newer Dataset API.

TFRecord

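A TFRecord file is a binary file that can store any kind of data. It makes better use of memory and can be copied, moved, read, and stored quickly within TensorFlow. The TFRecord only has to be generated once; after that, reading and preprocessing the data become more efficient.

In general, there are four ways to read data in TensorFlow:

(1) Load all the data into memory in advance
(2) In each training round, read part of the data with native Python code and pass it to the computation graph through feed_dict
(3) Use Threading and Queues to read data from a TFRecord in batches
(4) Use the Dataset API

Option (1) is simple and efficient when the dataset is small. As the data grows, however, it puts heavy pressure on limited memory, requires a long preloading step, and can even end in the all-too-familiar OutOfMemoryError.

Option (2) eases the memory pressure of option (1) to some extent. But IO in a single-threaded program is usually synchronous and blocking, so training time tends to increase, especially when the same data has to be read repeatedly.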
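
To make option (2) concrete, here is a minimal sketch of reading the data one slice at a time in plain Python. The function name iter_batches and the file names are hypothetical and only serve the illustration; in a real training loop each batch would be loaded from disk and passed to the graph through feed_dict.

```python
from typing import Iterator, List, Sequence


def iter_batches(paths: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive slices of `paths`, each at most `batch_size` long."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(paths), batch_size):
        yield list(paths[start:start + batch_size])


# Hypothetical file names, used only for illustration.
paths = ["img_%d.jpg" % i for i in range(7)]
batches = list(iter_batches(paths, batch_size=3))
print([len(b) for b in batches])  # [3, 3, 1]
```

Only one batch is held at a time, which is where the memory saving comes from; the loading itself still happens on the training thread, which is the blocking cost described above.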

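Options (3) and (4) both build on TFRecord. They use multiple threads so that IO no longer blocks model training, with Queues introduced to pass data between threads.

This article mainly uses option (4).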

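Building a TFRecord

Overall, creating a TFRecord file involves the following steps:

(1) In a TFRecord file, every value is stored as a list of bytes, a list of floats, or a list of int64s (note: always as a list). So each piece of data is first converted to list form.
(2) Each list must be wrapped in a Feature object, and each feature is stored as a key-value pair whose key is the feature's name. These keys are used later when extracting data from the TFRecord.
(3) Once the dictionary is complete, it is passed to the Features class.
(4) Finally, the Features object is passed to the Example class, and the resulting Example object is appended to the TFRecord.
(5) Repeat the process above for every record.

Next, we create a TFRecord for some simple data. We define two sample records containing an integer, a float, a string, and a list, as shown below: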

    import tensorflow as tf

    # Sample data
    data_arr = [
        {
            'int_data': 108,                     # integer
            'float_data': 2.45,                  # float
            'str_data': 'string 100'.encode(),   # string; converted to bytes under Python 3
            'float_list_data': [256.78, 13.9]    # list
        },
        {
            'int_data': 2108,
            'float_data': 12.45,
            'str_data': 'string 200'.encode(),
            'float_list_data': [1.34, 256.78, 65.22]
        }
    ]

First, we convert each value of the raw data to list form. Pay attention to the data type of each field.

    # Process one record
    def get_example_object(data_record):
        # Convert the data into int64, float, or bytes lists
        # Note that every value is wrapped in a list
        int_list1 = tf.train.Int64List(value=[data_record['int_data']])
        float_list1 = tf.train.FloatList(value=[data_record['float_data']])
        str_list1 = tf.train.BytesList(value=[data_record['str_data']])
        float_list2 = tf.train.FloatList(value=data_record['float_list_data'])

Then, we wrap each list in a Feature and store the features in a key-value dictionary.

        # Pack the data into a dict
        feature_key_value_pair = {
            'int_list': tf.train.Feature(int64_list=int_list1),
            'float_list': tf.train.Feature(float_list=float_list1),
            'str_list': tf.train.Feature(bytes_list=str_list1),
            'float_list2': tf.train.Feature(float_list=float_list2),
        }

Next, we pass the feature dictionary to the Features class and use the Example class to turn it into an example.

        # Create a Features object
        features = tf.train.Features(feature=feature_key_value_pair)
        # Create an Example
        example = tf.train.Example(features=features)
        return example

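Finally, we iterate over the whole dataset and write each record to the tfrecord file.

    # tf.python_io is the TF 1.x module name; newer releases expose the same
    # class as tf.io.TFRecordWriter.
    with tf.python_io.TFRecordWriter('example.tfrecord') as tfwriter:
        # Iterate over all records
        for data_record in data_arr:
            example = get_example_object(data_record)
            # Write to the tfrecord file
            tfwriter.write(example.SerializeToString())

After running the whole script, an 'example.tfrecord' file appears on disk:

    $ ls | grep tfrecord
    example.tfrecord

This file stores the two records defined above. Next, we save image data to a TFRecord file.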

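Image data - TFRecord

The simple example above showed how to create a TFRecord for text-like data made of dictionaries and lists. Next, we create a TFRecord for image data, using the cats-and-dogs dataset from Kaggle.

The dataset can be downloaded from: Kaggle cats and dogs.

After downloading, we get two folders:

    test  train

The train folder holds the training set and the test folder holds the data to predict on. We mainly work with the train set.

    $ ls | wc -w
    25000

The training set contains 25,000 images, half cats and half dogs. Next, let's look at the data format.

    $ ls
    cat.124.jpg  cat.3750.jpg  cat.6250.jpg  cat.8751.jpg
    dog.11250.jpg  dog.2500.jpg  dog.5000.jpg  dog.7501.jpg
    ...

In the train folder, the image files end in .jpg and each file name contains the label of the image, so we need to extract the label of each image from its file name.

There are two main ways to save image data. Let's start with the common one: read the images, convert the decoded pixel values to a byte string, and store that in the TFRecord.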

    import tensorflow as tf
    import os
    import time
    from glob import glob
    import progressbar
    from PIL import Image


    class GenerateTFRecord():
        def __init__(self, labels):
            self.labels = labels

        def _get_label_with_filename(self, filename):
            # 'train/cat.124.jpg' -> 'cat' -> self.labels['cat']
            basename = os.path.basename(filename).split(".")[0]
            return self.labels[basename]

        def _convert_image(self, img_path, is_train=True):
            # Test files carry no label in their names, so only look it up for train data
            label = self._get_label_with_filename(img_path) if is_train else -1
            image_data = Image.open(img_path)
            image_data = image_data.resize((227, 227))  # resize the image
            image_str = image_data.tobytes()
            filename = os.path.basename(img_path)

First, we create a class that generates the TFRecord, GenerateTFRecord. Its labels argument is usually a dictionary that maps text labels to numeric labels. Here we let 0 stand for cat and 1 for dog, so labels is:

    labels = {"cat": 0, 'dog': 1}

The function _get_label_with_filename extracts the label from the file name.
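
As a standalone illustration of the label lookup, here is a minimal sketch. The helper name label_from_filename, the default of -1 for unknown prefixes, and the path test/17.jpg are assumptions made for this example and are not part of the class above.

```python
import os

LABELS = {"cat": 0, "dog": 1}


def label_from_filename(path, labels=LABELS, default=-1):
    """Map e.g. 'train/cat.124.jpg' to 0 using the text before the first dot."""
    prefix = os.path.basename(path).split(".")[0]
    # Unknown prefixes (for example unlabeled test files) fall back to `default`.
    return labels.get(prefix, default)


print(label_from_filename("train/cat.124.jpg"))    # 0
print(label_from_filename("train/dog.11250.jpg"))  # 1
print(label_from_filename("test/17.jpg"))          # -1
```

Using dict.get with a default avoids a KeyError on file names that do not start with a known label.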

Next, we define a conversion function, _convert_image:

img_path: the path of a single image
is_train: whether the image belongs to the training set. We downloaded two sets of data; the training set has labels and the test set does not, so when saving to a TFRecord the label of test data is set to -1

We first read the image with Image, then resize it to 227x227 with 3 channels (this is only an example; in general, images are resized to one fixed size before building a model), and then convert the image data to binary form.

After processing the raw image, we build an example.

            if is_train:
                feature_key_value_pair = {
                    'filename': tf.train.Feature(bytes_list=tf.train.BytesList(value=[filename.encode()])),
                    'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_str])),
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
                }
            else:
                feature_key_value_pair = {
                    'filename': tf.train.Feature(bytes_list=tf.train.BytesList(value=[filename.encode()])),
                    'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_str])),
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[-1]))
                }
            feature = tf.train.Features(feature=feature_key_value_pair)
            example = tf.train.Example(features=feature)
            return example

Here we save three pieces of information: the file name, the processed image data, and the image label (other data can be saved too, as long as it is defined in the same format).

Once the per-image processing is in place, we iterate over the whole train set and save it to the tfrecord file.

        def convert_image_folder(self, img_folder, tfrecord_file_name):
            img_paths = [img_path for img_path in glob(os.path.join(img_folder, '*'))]
            with tf.python_io.TFRecordWriter(tfrecord_file_name) as tfwriter:
                widgets = ['[INFO] write image to tfrecord: ', progressbar.Percentage(), " ",
                           progressbar.Bar(), " ", progressbar.ETA()]
                pbar = progressbar.ProgressBar(maxval=len(img_paths), widgets=widgets).start()
                for i, img_path in enumerate(img_paths):
                    example = self._convert_image(img_path, is_train=True)
                    tfwriter.write(example.SerializeToString())
                    pbar.update(i)
                pbar.finish()

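Where:

img_folder: the folder that holds the original images
tfrecord_file_name: the path where the tfrecord file is saved

The code above uses the progressbar module, which displays a progress bar and makes it easy to monitor the data processing.

Finally, add the following code and run the whole script to build the tfrecord for the train set.

    if __name__ == "__main__":
        start = time.time()
        labels = {"cat": 0, 'dog': 1}
        t = GenerateTFRecord(labels)
        t.convert_image_folder('train', 'train.tfrecord')
        print("Took %f seconds." % (time.time() - start))

This approach took about 115 s to generate the TFRecord for the whole train set, producing a file named train.tfrecord in the directory.

    $ ls -lht
    11G train.tfrecord

The file is a full 11 GB (note: this file stores the original images directly, not the resized ones, so that it can be compared with the other approach). As noted earlier, the train set has only 25,000 images of roughly 50 KB each, about 1.2 GB in total, yet the generated TFRecord reaches 11 GB. For a dataset the size of ImageNet, the disk might simply not hold it. This may be one reason many people dislike TFRecord.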

Why does the TFRecord become so large?

Let's analyze this briefly by looking at the shape of one image, for example cat.8739.jpg:

    import matplotlib.image as mpimg

    img_path = 'train/cat.8739.jpg'
    img_data = mpimg.imread(img_path)
    print(img_data.shape)
    # output: (324, 319, 3)

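The shape of this cat image is (324, 319, 3). Multiplying the dimensions gives 324 x 319 x 3 = 310,068, so as a NumPy array (assuming dtype uint8) the image is represented by 310,068 integers. Calling .tobytes() writes these values one after another into a binary sequence, one byte per uint8 value and with no separators, so this image needs:

310,068 x 1 = 310,068 bytes, roughly 300 KB, compared with about 50 KB for the JPEG file on disk.

In other words, the decoded pixels are several times larger than the compressed file, because decoding discards the JPEG compression. Repeated over 25,000 images, that is enough to push the TFRecord into the multi-gigabyte range.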
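
As a rough cross-check of the numbers in this section, the sketch below computes the size of a decoded image under the assumption of one byte per uint8 value (which is what tobytes() produces for a uint8 array). Extrapolating from a single image to all 25,000 is a deliberate simplification, since the real images differ in size, so only the order of magnitude is meaningful.

```python
def raw_size_bytes(height, width, channels, bytes_per_value=1):
    """Size of an uncompressed image buffer; uint8 uses 1 byte per value."""
    return height * width * channels * bytes_per_value


one_image = raw_size_bytes(324, 319, 3)
print(one_image)                 # 310068
print(round(one_image / 1024))   # 303 (KB)

# Simplifying assumption: every image has the same shape as cat.8739.jpg.
estimate = one_image * 25000
print(round(estimate / 1024 ** 3, 1))  # 7.2 (GiB)
```

The estimate lands in the same range as the 11 GB file, which suggests that storing decoded pixels rather than compressed JPEG bytes accounts for most of the growth.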

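How to fix it?

Let's look at it from another angle: the size of the image files on disk, which as noted above is only about 50 KB each. In practice, images in many training sets range from a few KB to a few hundred KB. So we can store the encoded bytes of each image file directly in the tfrecord. TensorFlow provides the tf.gfile.FastGFile class, which can read an image file directly as bytes. Let's see what tf.gfile.FastGFile actually reads.

    path_jpg = img_path = 'train/cat.8739.jpg'
    # No tf.Session is needed here: read() returns a plain Python bytes object.
    with tf.gfile.FastGFile(path_jpg, 'rb') as fid:
        image_raw_data = fid.read()
    print(image_raw_data)

You will see a long run of bytes on screen, for example: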

b'\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01\x01\x00\x00\x01\x00\x01\x00\x00\xff\xdb\x00C\x00\n\x07\x07\x08\x07\x06\n\x08\x08\x08\x0b\n\n\x0b\x0e\x18\x10\x0e\r\r\x0e\x1d\x15\x16\x11\x18#\x1f%$"\x1f"!&+7/&)4)! ....

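As we can see, what tf.gfile.FastGFile reads is neither the decoded pixel content of the image nor a NumPy array; it is the encoded JPEG file itself (note the JFIF marker near the start).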
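
One way to convince yourself that these bytes are the encoded file is to check for the JPEG start-of-image marker, the two bytes 0xFF 0xD8 that every JPEG file begins with. The helper looks_like_jpeg below is a hypothetical name for this sketch, and the header is just the first few bytes of the output shown above.

```python
JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker


def looks_like_jpeg(data: bytes) -> bool:
    """Return True if `data` starts with the JPEG start-of-image marker."""
    return data[:2] == JPEG_SOI


# First bytes of the output printed above.
header = b"\xff\xd8\xff\xe0\x00\x10JFIF\x00\x01"
print(looks_like_jpeg(header))                # True
print(b"JFIF" in header[:16])                 # True
print(looks_like_jpeg(b"\x89PNG\r\n\x1a\n"))  # False (PNG signature)
```

Decoded pixel data from tobytes() would carry no such marker, since it is only the raw channel values.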

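So we replace the image-reading part of the code with:

    with tf.gfile.FastGFile(img_path, 'rb') as fid:
        image_str = fid.read()

Everything else stays the same, and the result is saved as train2.tfrecord:

    if __name__ == "__main__":
        start = time.time()
        labels = {"cat": 0, 'dog': 1}
        t = GenerateTFRecord(labels)
        t.convert_image_folder('train', 'train2.tfrecord')
        print("Took %f seconds." % (time.time() - start))

This approach took only about 8 s to generate the TFRecord for the whole train set, producing a new file named train2.tfrecord in the directory.

    $ ls -lht
    548M train2.tfrecord

The new TFRecord file is only 548 MB, far smaller than the original 11 GB. Reading image data with tf.gfile.FastGFile therefore has two clear benefits:

(1) It shortens the time spent reading the data
(2) It reduces disk usage

There are other ways to shrink the file further, but they may alter the image content, so they are not covered here. This reduction already meets the needs of my current project.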

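Extracting data from a TFRecord

We have now generated TFRecord files for our data. Next, we read the data back out, as follows:

(1) First, initialize a TFRecordDataset for the generated TFRecord.
(2) Next, extract the data from the TFRecord using the keys defined earlier. If the length of each value list is known (that is, it is the same in every record), use FixedLenFeature; otherwise, use VarLenFeature.
(3) Finally, use the parse_single_example API to extract the dictionary we defined from each data record.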
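
The difference between FixedLenFeature and VarLenFeature can be seen in a small round trip. The sketch below assumes TensorFlow 2.x with eager execution, so it uses the tf.io names (tf.io.TFRecordWriter, tf.io.parse_single_example) rather than the TF 1.x names used elsewhere in this article. The helper names make_example and parse, the temporary file, and the reduced two-key schema are choices made for this illustration only.

```python
import os
import tempfile

import tensorflow as tf  # assumption: TensorFlow 2.x, eager execution on


def make_example(int_value, floats):
    """Build an Example with one scalar feature and one variable-length feature."""
    feature = {
        "int_list": tf.train.Feature(int64_list=tf.train.Int64List(value=[int_value])),
        "float_list2": tf.train.Feature(float_list=tf.train.FloatList(value=floats)),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))


def parse(record):
    spec = {
        "int_list": tf.io.FixedLenFeature([], tf.int64),  # same length in every record
        "float_list2": tf.io.VarLenFeature(tf.float32),   # length differs per record
    }
    sample = tf.io.parse_single_example(record, spec)
    # VarLenFeature yields a SparseTensor; convert it to a dense 1-D tensor.
    sample["float_list2"] = tf.sparse.to_dense(sample["float_list2"])
    return sample


path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
with tf.io.TFRecordWriter(path) as writer:
    writer.write(make_example(108, [256.78, 13.9]).SerializeToString())
    writer.write(make_example(2108, [1.34, 256.78, 65.22]).SerializeToString())

parsed = [parse(record) for record in tf.data.TFRecordDataset([path])]
int_values = [int(p["int_list"].numpy()) for p in parsed]
lengths = [int(p["float_list2"].shape[0]) for p in parsed]
print(int_values, lengths)
```

The second list has two entries for the first record and three for the second, which is exactly the case a FixedLenFeature cannot describe.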

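Below, a short piece of extraction code illustrates the whole process.

    import tensorflow as tf

    def extract_fn(data_record):
        features = {
            'int_list': tf.FixedLenFeature([], tf.int64),
            'float_list': tf.FixedLenFeature([], tf.float32),
            'str_list': tf.FixedLenFeature([], tf.string),
            # Use VarLenFeature when the length differs between records
            'float_list2': tf.VarLenFeature(tf.float32)
        }
        sample = tf.parse_single_example(data_record, features)
        return sample

The extract_fn function above covers the parsing step. Next, we use the Dataset module to process the data.

    # Read the data with the dataset module
    dataset = tf.data.TFRecordDataset(filenames=['example.tfrecord'])
    # Parse every record
    dataset = dataset.map(extract_fn)
    iterator = dataset.make_one_shot_iterator()
    next_example = iterator.get_next()

First we initialize a TFRecordDataset for the TFRecord, then use map to extract the data from each record, and finally return the records one by one through an iterator.

    # Option A: eager mode. tf.enable_eager_execution() has to be called at
    # program startup, before the dataset and iterator above are created.
    tf.enable_eager_execution()
    try:
        while True:
            next_example = iterator.get_next()
            print(next_example)
    except tf.errors.OutOfRangeError:
        # Raised once the dataset is exhausted
        pass

    # Option B: non-eager (graph) mode
    with tf.Session() as sess:
        try:
            while True:
                data_record = sess.run(next_example)
                print(data_record)
        except tf.errors.OutOfRangeError:
            pass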

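Extracting images from a TFRecord

When extracting data from an image TFRecord, an image-decoding API such as tf.image.decode_image is needed to decode the image bytes (the code below uses tf.image.decode_jpeg, since the files are JPEGs). Straight to the code:

    import tensorflow as tf
    import os


    class TFRecordExtractor():
        def __init__(self, tfrecord_file, epochs, batch_size):
            self.tfrecord_file = os.path.abspath(tfrecord_file)
            self.epochs = epochs
            self.batch_size = batch_size

Where:

tfrecord_file: path of the tfrecord file
epochs: number of epochs for model training
batch_size: size of a batch, that is, how many records are returned each time

We define an extraction function that is later applied to every data record through map. It mirrors the feature format used when generating the TFRecord: each record is parsed into a dictionary and the values are looked up by key.

        def _extract_fn(self, tfrecord):
            # Decoder
            # Parses a single record; to parse several at once, use parse_example
            # TF offers two kinds of feature specs:
            ## 1. tf.FixedLenFeature: yields a Tensor
            ## 2. tf.VarLenFeature: yields a SparseTensor, used for sparse data
            features = {
                'filename': tf.FixedLenFeature([], tf.string),
                'image': tf.FixedLenFeature([], tf.string),
                'label': tf.FixedLenFeature([], tf.int64)
            }

Next, we decode the image data with tf.image.decode_jpeg and resize the image (reading an image with tf.gfile.FastGFile gives no opportunity to resize it, so the resize happens at decoding time). Finally we return the image data, the label, and the file name.

            sample = tf.parse_single_example(tfrecord, features)
            # channels=3 forces RGB output so that all images batch together
            image = tf.image.decode_jpeg(sample['image'], channels=3)
            # method=1 selects nearest-neighbor interpolation
            image = tf.image.resize_images(image, (227, 227), method=1)
            label = sample['label']
            filename = sample['filename']
            return [image, label, filename]

Operating on the TFRecord file with Dataset:

        def extract_image(self):
            dataset = tf.data.TFRecordDataset([self.tfrecord_file])
            dataset = dataset.map(self._extract_fn)
            dataset = dataset.repeat(count=self.epochs).batch(batch_size=self.batch_size)
            return dataset

First we initialize a tf.data.TFRecordDataset for the TFRecord file, then use map to parse each data record with _extract_fn. The epochs and batch_size values relate to model training. The function returns a Dataset that yields batch_size records at a time when iterated.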
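
To see what chaining repeat and batch in that order means for the batches, here is a pure-Python simulation using itertools. It is only a model of the behavior, not TensorFlow code, and repeat_then_batch is a hypothetical name.

```python
import math
from itertools import chain, islice, repeat


def repeat_then_batch(records, epochs, batch_size):
    """Mimic dataset.repeat(epochs).batch(batch_size) on a plain Python list."""
    stream = chain.from_iterable(repeat(list(records), epochs))
    while True:
        batch = list(islice(stream, batch_size))
        if not batch:
            return
        yield batch


batches = list(repeat_then_batch(range(5), epochs=2, batch_size=3))
print(batches)  # [[0, 1, 2], [3, 4, 0], [1, 2, 3], [4]]

# Number of batches for the train set used in this article.
print(math.ceil(25000 * 1 / 10))  # 2500
```

Because the data is repeated before it is batched, a batch can straddle an epoch boundary (the second batch above mixes the end of epoch 1 with the start of epoch 2), and only the very last batch can be short.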

    if __name__ == "__main__":
        # Iterating over a Dataset in a Python for loop requires eager mode in TF 1.x
        tf.enable_eager_execution()
        t = TFRecordExtractor('train2.tfrecord', epochs=1, batch_size=10)
        dataset = t.extract_image()
        for (batch, batch_data) in enumerate(dataset):
            pass

Complete code

I merged both functions into a single TFRecord class, mainly to make later use easier.

    # encoding:utf-8
    import tensorflow as tf
    import os
    from glob import glob
    import progressbar


    class TFRecord():
        def __init__(self, labels, tfrecord_file):
            self.labels = labels
            self.tfrecord_file = tfrecord_file

        def _get_label_with_filename(self, filename):
            basename = os.path.basename(filename).split(".")[0]
            return self.labels[basename]

        def _convert_image(self, img_path, is_train=True):
            # Test files carry no label in their names, so only look it up for train data
            label = self._get_label_with_filename(img_path) if is_train else -1
            filename = os.path.basename(img_path)
            # Store the encoded JPEG bytes, not the decoded pixels
            with tf.gfile.FastGFile(img_path, 'rb') as fid:
                image_str = fid.read()
            if is_train:
                feature_key_value_pair = {
                    'filename': tf.train.Feature(bytes_list=tf.train.BytesList(value=[filename.encode()])),
                    'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_str])),
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
                }
            else:
                feature_key_value_pair = {
                    'filename': tf.train.Feature(bytes_list=tf.train.BytesList(value=[filename.encode()])),
                    'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_str])),
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[-1]))
                }
            feature = tf.train.Features(feature=feature_key_value_pair)
            example = tf.train.Example(features=feature)
            return example

        def convert_image_folder(self, img_folder):
            img_paths = [img_path for img_path in glob(os.path.join(img_folder, '*'))]
            with tf.python_io.TFRecordWriter(self.tfrecord_file) as tfwriter:
                widgets = ['[INFO] write image to tfrecord: ', progressbar.Percentage(), " ",
                           progressbar.Bar(), " ", progressbar.ETA()]
                pbar = progressbar.ProgressBar(maxval=len(img_paths), widgets=widgets).start()
                for i, img_path in enumerate(img_paths):
                    example = self._convert_image(img_path, is_train=True)
                    tfwriter.write(example.SerializeToString())
                    pbar.update(i)
                pbar.finish()

        def _extract_fn(self, tfrecord):
            # Decoder
            # Parses a single record; to parse several at once, use parse_example
            # TF offers two kinds of feature specs:
            ## 1. tf.FixedLenFeature: yields a Tensor
            ## 2. tf.VarLenFeature: yields a SparseTensor, used for sparse data
            features = {
                'filename': tf.FixedLenFeature([], tf.string),
                'image': tf.FixedLenFeature([], tf.string),
                'label': tf.FixedLenFeature([], tf.int64)
            }
            sample = tf.parse_single_example(tfrecord, features)
            # channels=3 forces RGB output so that all images batch together
            image = tf.image.decode_jpeg(sample['image'], channels=3)
            image = tf.image.resize_images(image, (227, 227), method=1)
            label = sample['label']
            filename = sample['filename']
            return [image, label, filename]

        def extract_image(self, shuffle_size, batch_size):
            dataset = tf.data.TFRecordDataset([self.tfrecord_file])
            dataset = dataset.map(self._extract_fn)
            dataset = dataset.shuffle(shuffle_size).batch(batch_size)
            return dataset
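
The extract_image method in the complete code calls dataset.shuffle(shuffle_size), which this article does not otherwise explain. The sketch below is a simplified pure-Python model of buffer-based shuffling: keep a buffer of shuffle_size items, emit a random one, and refill the slot from the input. It is meant to convey why the buffer size matters and is not TensorFlow's actual implementation; buffered_shuffle is a hypothetical name.

```python
import random


def buffered_shuffle(items, buffer_size, seed=None):
    """Yield `items` in an order randomized within a sliding buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)       # still filling the buffer
            continue
        idx = rng.randrange(len(buffer))
        yield buffer[idx]             # emit a random buffered item
        buffer[idx] = item            # and replace it with the new one
    rng.shuffle(buffer)               # drain what is left in random order
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=3, seed=0))
unshuffled = list(buffered_shuffle(range(10), buffer_size=1, seed=0))
print(sorted(shuffled) == list(range(10)))  # True: nothing lost or duplicated
print(unshuffled == list(range(10)))        # True: a buffer of 1 cannot reorder
```

With a small buffer the first emitted item can only come from the first few records, so for files written in label order (all cats, then all dogs) a buffer much smaller than the dataset mixes the classes poorly.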
