
Storing higher-rank (multi-dimensional) data in TFRecord and creating a Dataset from it

Published: 2023/12/20


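I first met TensorFlow's Dataset and Estimator APIs while reading BERT's task-specific code. What used to take long stretches of low-level API code becomes much shorter in Estimator mode.
In the original code, every feature of the input data (that is, every Feature inside an Example of the Dataset) has rank 1: a vector such as v=[1,2,3], with rank 1 and shape (3,). Later, to bring in new features, for example a charCNN or charRNN that captures the morphology of words, each time step needs one extra dimension to hold the characters of that time step. For ['Are', 'you', 'OK'] the input becomes [['A','r','e'],['y','o','u'],['O','K']]; the Feature now has rank 2 and shape (3,3) (here 'OK' is simply padded to a sequence of length 3).
So how should a multi-dimensional Feature with rank >= 2 be stored, and how should it later be read into a Dataset and parsed?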
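As a small illustration of the padding step mentioned above, the following plain-Python sketch splits each word into characters and pads every row to the length of the longest word. The helper name `pad_chars` and the `"<pad>"` token are assumptions made for this example; they do not come from the original code.

```python
def pad_chars(words, pad="<pad>"):
    """Split each word into characters and right-pad every row to the longest word.

    `pad` is a hypothetical padding token chosen for this sketch.
    """
    rows = [list(word) for word in words]
    width = max(len(row) for row in rows)
    return [row + [pad] * (width - len(row)) for row in rows]


chars = pad_chars(["Are", "you", "OK"])
# Every row now has the same length, so the feature is a rank-2 array.
shape = (len(chars), len(chars[0]))
print(chars)
print(shape)
```

In a real pipeline the characters would normally be mapped to integer ids before being written, with one id reserved for padding.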

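Keep the Feature's shape information, then flatten the Feature

I will borrow an example from YJango first, and then show my own.

YJango's example

There are three examples here. Each example has four kinds of feature: a scalar, a vector, a matrix and a tensor, with shapes (), (3,), (2,3) and (806,806,3) respectively.

Writing to a tfrecord

How should features of such different shapes be written? There are two ways.

1. Flatten the feature into a list, that is, a rank-1 vector, and write it as a list: int64_list = tf.train.Int64List(value=input) or float_list = tf.train.FloatList(value=input).
2. Convert it to a string: turn the tensor into bytes with .tostring() (newer NumPy versions name this .tobytes()), then store it with tf.train.Feature(bytes_list=tf.train.BytesList(value=[input.tostring()])).

Both ways lose the dimensions of the data, so the shape has to be stored alongside it for later use, or fixed in advance as a preset parameter.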

import tensorflow as tf

# TensorFlow 1.x API. `scalars`, `vectors`, `matrices` and `tensors` are the NumPy
# arrays holding the three examples described above; they must be defined beforehand.

# Open a tfrecord file for writing
writer = tf.python_io.TFRecordWriter('%s.tfrecord' % 'test')

# Write 3 examples, each with 4 features: scalar, vector, matrix, tensor
for i in range(3):
    # Create the feature dictionary
    features = {}
    # Scalar, type int64. It is a scalar, so wrap it in a list: "value=[scalars[i]]"
    features['scalar'] = tf.train.Feature(int64_list=tf.train.Int64List(value=[scalars[i]]))
    # Vector, type float. It is already list-like, so "value=vectors[i]" has no brackets
    features['vector'] = tf.train.Feature(float_list=tf.train.FloatList(value=vectors[i]))
    # Matrix, type float. One option is to flatten the matrix into a list
    features['matrix'] = tf.train.Feature(float_list=tf.train.FloatList(value=matrices[i].reshape(-1)))
    # The shape (2,3) is lost by flattening, so store it to restore the matrix later
    features['matrix_shape'] = tf.train.Feature(int64_list=tf.train.Int64List(value=matrices[i].shape))
    # Tensor, rank 3. The other option is to store it as bytes and convert back later
    features['tensor'] = tf.train.Feature(bytes_list=tf.train.BytesList(value=[tensors[i].tostring()]))
    # Store the lost shape (806,806,3)
    features['tensor_shape'] = tf.train.Feature(int64_list=tf.train.Int64List(value=tensors[i].shape))
    # Wrap the dictionary of features in tf.train.Features
    tf_features = tf.train.Features(feature=features)
    # Turn it into one Example
    tf_example = tf.train.Example(features=tf_features)
    # Serialize the Example
    tf_serialized = tf_example.SerializeToString()
    # Write the serialized Example
    writer.write(tf_serialized)

# The loop ran 3 times, so 3 examples have been written. Close the file.
writer.close()
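The bookkeeping behind both storage methods can be tried without TensorFlow. The sketch below is only an illustration under stated assumptions: it uses nested Python lists instead of NumPy arrays, the standard-library `array` module stands in for the bytes conversion, and the helper names `flatten` and `unflatten` are made up for this example.

```python
from array import array


def flatten(nested):
    """Flatten a rectangular 2-D list row by row and return (flat_values, shape)."""
    shape = (len(nested), len(nested[0]))
    flat = [value for row in nested for value in row]
    return flat, shape


def unflatten(flat, shape):
    """Rebuild the 2-D list from the flat values and the stored shape."""
    rows, cols = shape
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

# Method 1: store a flat list of numbers plus the shape.
flat, shape = flatten(matrix)
restored_from_list = unflatten(flat, shape)

# Method 2: store raw bytes plus the shape. The reader must know the element type.
raw = array("f", flat).tobytes()
decoded = array("f")
decoded.frombytes(raw)
restored_from_bytes = unflatten(decoded.tolist(), shape)

print(restored_from_list == matrix, restored_from_bytes == matrix)
```

In both cases the flat payload alone cannot say whether it was a (2,3) or a (3,2) matrix, which is why the shape travels with it.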

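Building the Dataset

The samples loaded from the tfrecord file are the serialized tf_serialized samples written above, so every sample has to be parsed.
This is done with dataset.map(parse_function), which applies the same parsing to each sample in the dataset. The parsing inside parse_function is almost exactly the reverse of the writing process above. Many other operations can also go into parse_function, such as converting the dtype of the data or adding noise to each item. In short, the object handled inside parse_function is one serialized serialized_example: we decode it to obtain the example and then return that example.
The parse function is written as follows: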

def parse_function(example_proto):
    # Takes a single input: example_proto, the serialized sample tf_serialized
    dics = {
        # default_value is not used here; it is None for all the entries that follow
        'scalar': tf.FixedLenFeature(shape=(), dtype=tf.int64, default_value=None),
        # The vector's shape is deliberately given as (1,3) instead of the original (3,)
        'vector': tf.FixedLenFeature(shape=(1, 3), dtype=tf.float32),
        # The matrix's shape is not known yet at this point, so parse it with VarLenFeature
        'matrix': tf.VarLenFeature(dtype=tf.float32),
        'matrix_shape': tf.FixedLenFeature(shape=(2,), dtype=tf.int64),
        # The tensor was written with tostring(), so its shape here is ().
        # The dtype is not the tensor's original dtype but tf.string, the type used
        # for the bytes; it is converted back to tf.uint8 below.
        'tensor': tf.FixedLenFeature(shape=(), dtype=tf.string),
        'tensor_shape': tf.FixedLenFeature(shape=(3,), dtype=tf.int64)}
    # Pass the serialized sample and the parsing dictionary to get the parsed sample
    parsed_example = tf.parse_single_example(example_proto, dics)
    # Decode the bytes. The dtype must match the dtype the array had when it was written.
    parsed_example['tensor'] = tf.decode_raw(parsed_example['tensor'], tf.uint8)
    # Convert the sparse representation to a dense one
    parsed_example['matrix'] = tf.sparse_tensor_to_dense(parsed_example['matrix'])
    # Restore the matrix shape
    parsed_example['matrix'] = tf.reshape(parsed_example['matrix'], parsed_example['matrix_shape'])
    # Restore the tensor shape
    parsed_example['tensor'] = tf.reshape(parsed_example['tensor'], parsed_example['tensor_shape'])
    # Return all features
    return parsed_example
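The idea of mapping one parse function over every serialized record can be mimicked in plain Python. In this sketch JSON strings stand in for serialized Example protos and the built-in `map` stands in for `dataset.map`; the `serialize` helper and the sample values are assumptions for illustration only.

```python
import json


def serialize(scalar, matrix):
    """Stand-in for writing: flatten the matrix and keep its shape next to it."""
    flat = [value for row in matrix for value in row]
    return json.dumps({
        "scalar": scalar,
        "matrix": flat,
        "matrix_shape": [len(matrix), len(matrix[0])],
    })


def parse_function(serialized):
    """Stand-in for parsing: decode one record and undo the flattening."""
    record = json.loads(serialized)
    rows, cols = record["matrix_shape"]
    flat = record["matrix"]
    record["matrix"] = [flat[r * cols:(r + 1) * cols] for r in range(rows)]
    # Extra processing can live here too, for example a dtype conversion.
    record["scalar"] = float(record["scalar"])
    return record


records = [serialize(i, [[i, i + 1, i + 2], [i + 3, i + 4, i + 5]]) for i in range(3)]
parsed = list(map(parse_function, records))
print(parsed[1])
```

The parse function sees exactly one serialized record at a time, which is the same contract `dataset.map` imposes.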

If the matrix's shape is already known, VarLenFeature is not needed for parsing. Multiplying the entries of the shape gives the length of the flattened list, so the entry can be written as 'matrix': tf.FixedLenFeature(shape=[matrix.shape[0]*matrix.shape[1]], dtype=tf.float32).
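The arithmetic behind the fixed-length option is just the product of the shape entries. A minimal plain-Python sketch, assuming the shape is known ahead of time (the names `flat_length`, `reshape_fixed` and `MATRIX_SHAPE` are hypothetical):

```python
from functools import reduce
from operator import mul


def flat_length(shape):
    """Number of values in a flattened array of the given shape (1 for a scalar)."""
    return reduce(mul, shape, 1)


def reshape_fixed(flat, shape):
    """Reshape a flat list to a known 2-D shape, rejecting records of the wrong length."""
    if len(flat) != flat_length(shape):
        raise ValueError("expected %d values, got %d" % (flat_length(shape), len(flat)))
    rows, cols = shape
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


MATRIX_SHAPE = (2, 3)  # assumed to be known in advance
expected = flat_length(MATRIX_SHAPE)
print(expected)
print(reshape_fixed([1, 2, 3, 4, 5, 6], MATRIX_SHAPE))
```

The trade-off is that a fixed length only fits data where every example has the same shape; otherwise the shape still has to be stored per example.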
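Once the parse function is written, pass it to the dataset's map method.
The remaining operations such as batch and shuffle are not covered again here. Creating the iterator is already explained well in another blog post.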

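My own rough example

The second function returns a closure and is mainly used to feed data in Estimator mode. This is the charCNN-BERT-CRF model, my extension of BERT for NER; have a look at my GitHub if you are interested.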
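The closure pattern mentioned here can be shown without TensorFlow. In the sketch below a builder captures its configuration and returns an `input_fn(params)` that only receives the batch size at call time, which mirrors how an Estimator-style input function is typically handed over. The in-memory `records` list and the slicing logic are assumptions for illustration, not the author's pipeline.

```python
def input_fn_builder(records, seq_length, drop_remainder):
    """Return an input_fn that remembers its configuration through a closure."""

    def input_fn(params):
        # Only the batch size arrives at call time; everything else is captured.
        batch_size = params["batch_size"]
        trimmed = [record[:seq_length] for record in records]
        batches = [trimmed[i:i + batch_size] for i in range(0, len(trimmed), batch_size)]
        # Drop the last batch if it is smaller than batch_size.
        if drop_remainder and batches and len(batches[-1]) < batch_size:
            batches.pop()
        return batches

    return input_fn


train_input_fn = input_fn_builder(
    records=[[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]],
    seq_length=2,
    drop_remainder=True)
batches = train_input_fn({"batch_size": 2})
print(batches)
```

Because the builder returns a function instead of data, the caller can decide later when and with which `params` the data is actually produced.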

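Why I wrote this post

Why write this post? Because I took a detour while solving this problem: I used FeatureList. That is, I made the characters of each word a Feature and added it as an element of a FeatureList. Decoding a FeatureList, however, is comparatively complicated and hard to write. The program raised no error, yet at run time it reported that 0 samples had been read: no sample could be read and none reached the network. Given the method described above, how useful is FeatureList really, how widely is it used, and is there anything it can do that Feature cannot (I think I have seen object detection use it in a data pipeline)? If I have time over the next few days I will try it with the method described in this post and update then. Corrections are welcome!
Tonight I implemented the FeatureList approach and found that it can achieve the same thing. The code is as follows: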

import collections
import os

import tensorflow as tf

# TensorFlow 1.x code. `convert_single_example`, `load_examples`, `tokenization_ner`,
# `FLAGS`, `root_path` and `bert_path` belong to the author's project and are not shown here.


def filed_based_convert_examples_to_features(examples, tokenizer, output_file):
    """Write the examples to a TFRecord file as SequenceExamples.

    :param examples: input examples
    :param tokenizer: tokenizer used to convert each example
    :param output_file: path of the TFRecord file to write
    :return: number of small examples
    """

    def create_int_feature(values):
        return tf.train.Feature(int64_list=tf.train.Int64List(value=list(values)))

    # New helper that builds a FeatureList
    def create_feature_list(values_list):
        return tf.train.FeatureList(feature=[
            tf.train.Feature(int64_list=tf.train.Int64List(value=values))
            for values in values_list])

    # Flattens a list of lists (not used in this FeatureList version)
    def flatten(tensor):
        return sum(tensor, [])

    num_examples = 0
    writer = tf.python_io.TFRecordWriter(output_file)
    # Iterate over the training data
    for (ex_index, example) in enumerate(examples):
        # Convert each training example (the result is a list of examples)
        example_list = convert_single_example(example, tokenizer)
        num_examples += len(example_list)
        for f in example_list:
            if num_examples % 5000 == 0:
                tf.logging.info("Writing example %d of %d" % (num_examples, len(examples)))
            features = collections.OrderedDict()
            # The FeatureLists get their own dictionary
            features_list = collections.OrderedDict()
            features["input_ids"] = create_int_feature(f.input_ids)
            features["input_mask"] = create_int_feature(f.input_mask)
            features["segment_ids"] = create_int_feature(f.segment_ids)
            features["tag_ids"] = create_int_feature(f.tag_ids)
            # Convert to a FeatureList here, although written like this it is not much more convenient.
            # I suspect FeatureList is not meant to be used only like this; otherwise it handles
            # two dimensions at most, so what would be the point? Corrections are welcome.
            features_list["char_ids"] = create_feature_list(f.char_ids)
            # SequenceExample is needed here: features go into context,
            # and the FeatureLists go into feature_lists
            tf_example = tf.train.SequenceExample(
                context=tf.train.Features(feature=features),
                feature_lists=tf.train.FeatureLists(feature_list=features_list))
            writer.write(tf_example.SerializeToString())
    writer.close()
    return num_examples


def file_based_input_fn_builder(input_file, seq_length, char_length, is_training, drop_remainder):
    name_to_features = {
        "input_ids": tf.FixedLenFeature([seq_length], tf.int64),
        "input_mask": tf.FixedLenFeature([seq_length], tf.int64),
        "segment_ids": tf.FixedLenFeature([seq_length], tf.int64),
        "tag_ids": tf.FixedLenFeature([seq_length], tf.int64),
    }
    # Parsing spec for the FeatureList
    name_to_features_list = {
        "char_ids": tf.FixedLenSequenceFeature([char_length], tf.int64),
    }

    def _decode_record(record, name_to_features, name_to_features_list):
        # Two return values: the Features (the context) and the FeatureLists (the sequence)
        context_example, sequence_example = tf.parse_single_sequence_example(
            record,
            context_features=name_to_features,
            sequence_features=name_to_features_list)
        for name in list(context_example.keys()):
            t = context_example[name]
            if t.dtype == tf.int64:
                t = tf.to_int32(t)
            context_example[name] = t
        for name in list(sequence_example.keys()):
            tl = sequence_example[name]
            if tl.dtype == tf.int64:
                tl = tf.to_int32(tl)
            sequence_example[name] = tl
        return context_example, sequence_example

    def input_fn(params):
        batch_size = params["batch_size"]
        d = tf.data.TFRecordDataset(input_file)
        if is_training:
            d = d.repeat()
            d = d.shuffle(buffer_size=100)
        d = d.apply(tf.contrib.data.map_and_batch(
            lambda record: _decode_record(record, name_to_features, name_to_features_list),
            batch_size=batch_size,
            drop_remainder=drop_remainder))
        return d

    return input_fn


def main(_):
    tf.logging.set_verbosity(tf.logging.INFO)
    train_data_dir = ['training-PHI-Gold-Set2']
    wordpiece_vocab = tokenization_ner.build_wordpiece_vocab(root_path, bert_path, 'vocab.txt')
    wptokenizer = tokenization_ner.WPTokenizer(wordpiece_vocab, FLAGS.max_seq_length, FLAGS.max_char_length)
    train_file = os.path.join(FLAGS.output_dir, "train.tf_record")
    if not os.path.exists(train_file):
        train_examples = load_examples(train_data_dir)
        num_train_examples = filed_based_convert_examples_to_features(train_examples, wptokenizer, train_file)
    train_input_fn = file_based_input_fn_builder(
        input_file=train_file,
        seq_length=FLAGS.max_seq_length,
        char_length=FLAGS.max_char_length,
        is_training=True,
        drop_remainder=True)
    params = {}
    params["batch_size"] = FLAGS.train_batch_size
    dataset = train_input_fn(params)
    iterator = dataset.make_one_shot_iterator()
    with tf.Session() as sess:
        for _ in range(1):
            try:
                context, sequence = sess.run(iterator.get_next())
                print(sequence['char_ids'])
            except tf.errors.OutOfRangeError:
                break

In the end this also produced the correct output.
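To see why the two storage layouts carry the same information, the comparison can be sketched in plain Python. The nested dict below is only a stand-in for the context / feature_lists split of a SequenceExample, and the id values are made up for illustration.

```python
# Hypothetical character ids with shape (3 words, 3 characters); 0 marks padding.
char_ids = [[1, 2, 3], [4, 5, 6], [7, 8, 0]]
seq_len, char_len = len(char_ids), len(char_ids[0])

# Layout 1: a single flat feature; seq_len and char_len must be known when reading.
flat_feature = [c for word in char_ids for c in word]

# Layout 2: per-example features in "context", one entry per word in "feature_lists".
sequence_example = {
    "context": {"input_ids": [101, 102, 103]},
    "feature_lists": {"char_ids": [list(word) for word in char_ids]},
}

# Reading either layout yields the same rank-2 structure.
from_flat = [flat_feature[i * char_len:(i + 1) * char_len] for i in range(seq_len)]
from_steps = sequence_example["feature_lists"]["char_ids"]
print(from_flat == from_steps)
```

The practical difference is where the row boundaries live: in layout 1 they come from a shape kept outside the data, in layout 2 they are part of the record itself.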
