OFRecord Data Format

Published: 2023/11/28

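Deep learning applications need complex, multi-stage data preprocessing pipelines, and data loading is the first stage of such a pipeline. OneFlow supports loading data in several formats; among them, OFRecord is OneFlow's native data format.
The OFRecord format definition is modeled on TensorFlow's TFRecord, so users familiar with TFRecord can pick up OneFlow's OFRecord quickly.
This article covers:
- the data types used by OFRecord
- how to convert data into an OFRecord object and serialize it
- the OFRecord file format
Together these are useful background for learning how to load and prepare OFRecord datasets.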
OFRecord-related data types
Internally, OneFlow uses Protocol Buffers to describe the serialization format of OFRecord. The relevant .proto file is oneflow/core/record/record.proto, and it is defined as follows:
syntax = "proto2";
package oneflow;

message BytesList {
  repeated bytes value = 1;
}

message FloatList {
  repeated float value = 1 [packed = true];
}

message DoubleList {
  repeated double value = 1 [packed = true];
}

message Int32List {
  repeated int32 value = 1 [packed = true];
}

message Int64List {
  repeated int64 value = 1 [packed = true];
}

message Feature {
  oneof kind {
    BytesList bytes_list = 1;
    FloatList float_list = 2;
    DoubleList double_list = 3;
    Int32List int32_list = 4;
    Int64List int64_list = 5;
  }
}

message OFRecord {
  map<string, Feature> feature = 1;
}
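The important data types above are explained first:
- OFRecord: an OFRecord instance can hold all the data that needs to be serialized. It consists of any number of string->Feature key-value pairs.
- Feature: a Feature stores exactly one of the BytesList, FloatList, DoubleList, Int32List, or Int64List types.
- For OFRecord, Feature, and the XXXList types, Protocol Buffers generates interfaces with the same names, so the corresponding objects can be constructed from Python.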
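Converting data to the Feature format
Data can be converted to the Feature format by calling ofrecord.xxxList and ofrecord.Feature. For convenience, it helps to wrap the interfaces generated by Protocol Buffers in a few simple helpers: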
import six
import oneflow.core.record.record_pb2 as ofrecord

# Each helper wraps a scalar in a list, then builds the matching XXXList
# message and stores it in the corresponding field of a Feature.

def int32_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(int32_list=ofrecord.Int32List(value=value))

def int64_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(int64_list=ofrecord.Int64List(value=value))

def float_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(float_list=ofrecord.FloatList(value=value))

def double_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    return ofrecord.Feature(double_list=ofrecord.DoubleList(value=value))

def bytes_feature(value):
    if not isinstance(value, (list, tuple)):
        value = [value]
    # On Python 3, text strings must be encoded to bytes first.
    if not six.PY2:
        if isinstance(value[0], str):
            value = [x.encode() for x in value]
    return ofrecord.Feature(bytes_list=ofrecord.BytesList(value=value))
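The helpers above all share the same input handling: a scalar is wrapped in a list, and for bytes_feature text strings are encoded to bytes. As a minimal sketch that runs without OneFlow installed, the hypothetical function normalize_values below (an illustration, not part of OneFlow) isolates that step. Unlike the original bytes_feature, which only inspects the first element, this sketch checks every element.

```python
def normalize_values(value, encode_text=False):
    """Hypothetical helper (not part of OneFlow): mirror the input handling
    of the *_feature wrappers without building a protobuf message."""
    # A scalar becomes a one-element list; lists and tuples pass through.
    if not isinstance(value, (list, tuple)):
        value = [value]
    # bytes_feature additionally encodes text strings to UTF-8 bytes.
    if encode_text:
        value = [x.encode() if isinstance(x, str) else x for x in value]
    return list(value)


print(normalize_values(3))                        # [3]
print(normalize_values((1.0, 2.0)))               # [1.0, 2.0]
print(normalize_values("cat", encode_text=True))  # [b'cat']
```

The returned list is what would be passed as the value argument of the matching XXXList constructor.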
Creating an OFRecord object and serializing it
The following example creates an OFRecord object with 2 features and calls its SerializeToString method to serialize it.
import random

obserations = 28 * 28  # number of pixels in one simulated image

f = open("./dataset/part-0", "wb")

for loop in range(0, 3):
    image = [random.random() for x in range(0, obserations)]
    label = [random.randint(0, 9)]

    # Map feature names to Feature objects, then build and serialize the record.
    topack = {
        "images": float_feature(image),
        "labels": int64_feature(label),
    }
    ofrecord_features = ofrecord.OFRecord(feature=topack)
    serilizedBytes = ofrecord_features.SerializeToString()

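From the example above, the steps for serializing data can be summarized as:
- Convert the data to be serialized into Feature objects by calling ofrecord.Feature and ofrecord.XXXList.
- Store the resulting Feature objects in a Python dictionary as string->Feature key-value pairs.
- Call ofrecord.OFRecord to create an OFRecord object.
- Call the OFRecord object's SerializeToString method to obtain the serialized result.
The serialized result can then be saved as a file in the ofrecord format.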
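OFRecord files
Serializing OFRecord objects and writing them to a file in the layout OneFlow specifies produces an OFRecord file.
One OFRecord file can store multiple OFRecord objects. OFRecord files can be used in OneFlow data pipelines; see "Loading and Preparing OFRecord Datasets" for details.
By OneFlow's convention, each OFRecord object is stored in the following layout: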
uint64 length
byte data[length]
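That is, the first 8 bytes store the length of the data, followed by the serialized data itself.
# Inside the loop above (requires `import struct`):
length = ofrecord_features.ByteSize()

f.write(struct.pack("q", length))  # 8-byte length header
f.write(serilizedBytes)            # serialized OFRecord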
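To make the on-disk layout concrete, here is a minimal sketch of the same length-prefixed framing that runs without OneFlow. The function write_record and the placeholder payloads are assumptions for illustration; in real use each payload would be the bytes returned by SerializeToString, and len(payload) stands in for ByteSize().

```python
import io
import struct


def write_record(stream, payload):
    """Write one length-prefixed record: 8-byte length, then the payload."""
    # "q" packs a signed 64-bit integer in native byte order,
    # matching the format string used in the snippet above.
    stream.write(struct.pack("q", len(payload)))
    stream.write(payload)


# Placeholder payloads standing in for serialized OFRecord objects.
payloads = [b"first", b"second record"]

buf = io.BytesIO()
for p in payloads:
    write_record(buf, p)

data = buf.getvalue()
# Each record costs 8 header bytes plus its payload length.
print(len(data))  # 8 + 5 + 8 + 13 = 34
```

Because the header is written in native byte order, a file produced this way is only guaranteed to be readable with the same format string on a machine of the same endianness.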
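Code
The complete code below shows how to generate an OFRecord file and how to read the data in an OFRecord file by hand, using the OFRecord interface generated by protobuf.
In practice, OneFlow provides interfaces such as flow.data.decode_ofrecord that make it easier to extract the contents of an OFRecord file (dataset). See "Loading and Preparing OFRecord Datasets" for details.
Writing OFRecord objects to a file
The following script simulates 3 samples, each a 28*28 image with a corresponding label. After the three samples are converted to OFRecord objects, they are written to a file in the layout OneFlow specifies.
Code: ofrecord_to_string.py
Reading data from an OFRecord file
The following script reads the OFRecord file generated in the previous example, calls the FromString method to deserialize each record into an OFRecord object, and finally displays the data:
Code: ofrecord_from_string.py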
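The linked script is not reproduced here. As a rough sketch of the reading side, assuming the layout described earlier (an 8-byte length followed by that many bytes of data), the hypothetical generator read_records below yields the raw bytes of each record. It uses an in-memory stream with placeholder payloads so that it runs without OneFlow.

```python
import io
import struct


def read_records(stream):
    """Yield the raw payload of each length-prefixed record in the stream."""
    while True:
        header = stream.read(8)
        if not header:
            break  # clean end of file
        if len(header) < 8:
            raise ValueError("truncated length header")
        (length,) = struct.unpack("q", header)
        if length < 0:
            raise ValueError("negative record length")
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated record payload")
        yield payload


# Build a small in-memory file with two placeholder payloads.
sample = b"".join(struct.pack("q", len(p)) + p for p in (b"abc", b"hello"))

for payload in read_records(io.BytesIO(sample)):
    # With OneFlow installed, the payload would be deserialized here,
    # for example with ofrecord.OFRecord.FromString(payload).
    print(len(payload), payload)
```

Reading the header first and then exactly length bytes means the reader never needs to parse the protobuf data to find record boundaries.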
