Data Loading and Processing in Megvii MegEngine

Published: 2023/11/28


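Loading and preprocessing data often takes a great deal of effort when training and testing networks. MegEngine provides a set of interfaces that standardize this work.
Wrapping a dataset with Dataset
A dataset is a collection of data, for example image datasets such as MNIST and CIFAR-10. Dataset is the abstract class that represents a dataset in MegEngine. A custom dataset class should inherit from Dataset and override the following methods:
- __init__(): usually reads the data source files; any other necessary setup can also go here;
- __getitem__(): fetches one sample of the dataset by index, which also makes it possible to traverse the whole dataset with a for loop;
- __len__(): returns the size of the dataset.
Below is a simple example. A Dataset is created from the binary classification data shown in the figure below. Each sample is a point on a 2D plane whose horizontal and vertical coordinates both lie in [-1, 1]. There are two class labels (the blue * and the red + in Figure 1): points labeled 0 lie in the first and third quadrants, and points labeled 1 lie in the second and fourth quadrants.

Figure 1
The dataset is created as follows:
- in __init__(), use NumPy to randomly generate an ndarray as the data;
- in __getitem__(), return one sample from the ndarray;
- in __len__(), return the number of samples in the whole dataset.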
import numpy as np
from typing import Tuple

# Import the Dataset class that will be inherited
from megengine.data.dataset import Dataset

class XORDataset(Dataset):
    def __init__(self, num_points):
        """
        Generate the binary classification dataset shown in Figure 1,
        containing num_points samples.
        """
        super().__init__()

        # Initialize a NumPy array of shape (num_points, 2).
        # Each row is a data point (x, y) whose coordinates both fall in [-1, 1].
        self.data = np.random.rand(num_points, 2).astype(np.float32) * 2 - 1
        # Build the labels: a row (x, y) gets label 1 if x*y < 0, otherwise label 0.
        self.label = np.zeros(num_points, dtype=np.int32)
        for i in range(num_points):
            self.label[i] = 1 if np.prod(self.data[i]) < 0 else 0

    # Define how to fetch a single sample of the dataset
    def __getitem__(self, index: int) -> Tuple:
        return self.data[index], self.label[index]

    # Define how to return the length of the dataset
    def __len__(self) -> int:
        return len(self.data)

np.random.seed(2020)

# Build a training dataset containing 30000 points
xor_train_dataset = XORDataset(30000)
print("The length of train dataset is: {}".format(len(xor_train_dataset)))

# Traverse the samples of the dataset with a for loop
for cor, tag in xor_train_dataset:
    print("The first data point is: {}, {}".format(cor, tag))
    break

print("The second data point is: {}".format(xor_train_dataset[1]))
Output:
The length of train dataset is: 30000
The first data point is: [0.97255366 0.74678389], 0
The second data point is: (array([ 0.01949105, -0.45632857]), 1)
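For convenience, MegEngine also provides some dataset classes that already inherit from Dataset, such as ArrayDataset. An ArrayDataset is initialized by passing in one or more NumPy arrays. Internally it works as follows:
- __init__(): checks that the NumPy arrays passed in all have the same length; if they do not, the dataset cannot be created;
- __getitem__(): builds a tuple from the elements at the same index of each NumPy array and returns it;
- __len__(): returns the size of the dataset.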
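To make the three methods above concrete, here is a minimal sketch of a class with the same behavior. It is not MegEngine's actual implementation: the name SimpleArrayDataset and the toy arrays are assumptions made for illustration only.

```python
import numpy as np


class SimpleArrayDataset:
    """Hypothetical stand-in that mimics the three methods described above."""

    def __init__(self, *arrays):
        # Refuse to build the dataset when the arrays disagree in length.
        if len(arrays) == 0:
            raise ValueError("at least one array is required")
        if any(len(a) != len(arrays[0]) for a in arrays):
            raise ValueError("all arrays must have the same length")
        self.arrays = arrays

    def __getitem__(self, index):
        # Gather the element at the same index from every array into a tuple.
        return tuple(a[index] for a in self.arrays)

    def __len__(self):
        return len(self.arrays[0])


# Toy data: three points and their labels (1 when x*y < 0, otherwise 0).
points = np.array([[0.5, 0.5], [-0.5, 0.5], [-0.5, -0.5]], dtype=np.float32)
labels = np.array([0, 1, 0], dtype=np.int32)
toy_dataset = SimpleArrayDataset(points, labels)
sample = toy_dataset[1]  # a (point, label) tuple
```

Because each sample is returned as a tuple, it can be unpacked directly into coordinates and a label.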
Taking the dataset shown in Figure 1 as an example, an ArrayDataset can be constructed directly from the coordinate array and the label array, so users do not need to define a dataset class themselves.
from megengine.data.dataset import ArrayDataset

# Prepare the data and label as NumPy arrays
np.random.seed(2020)
num_points = 30000
data = np.random.rand(num_points, 2).astype(np.float32) * 2 - 1
label = np.zeros(num_points, dtype=np.int32)
for i in range(num_points):
    label[i] = 1 if np.prod(data[i]) < 0 else 0

# Create a dataset with ArrayDataset
xor_dataset = ArrayDataset(data, label)
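As a side note, the labeling rule (label 1 when x*y < 0, otherwise 0) can also be written without a Python loop. The sketch below is an alternative formulation rather than part of the original tutorial. It uses a smaller, arbitrarily chosen num_points and checks that the loop and the vectorized version agree.

```python
import numpy as np

np.random.seed(2020)
num_points = 1000  # smaller than in the text, chosen only to keep the check fast
data = np.random.rand(num_points, 2).astype(np.float32) * 2 - 1

# Loop version, same rule as in the text.
label_loop = np.zeros(num_points, dtype=np.int32)
for i in range(num_points):
    label_loop[i] = 1 if np.prod(data[i]) < 0 else 0

# Vectorized version: compare the product of the two columns against zero.
label_vec = (data[:, 0] * data[:, 1] < 0).astype(np.int32)

labels_match = bool(np.array_equal(label_loop, label_vec))
```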
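Sampling from a Dataset with a Sampler
A Dataset can only access samples in one fixed order (through its __getitem__ implementation), whereas a Sampler makes it possible to draw samples from a Dataset in whatever way is desired, producing the minibatches used for training and testing. A Sampler is essentially an iterator over the indices of the data in a dataset; it is initialized with a Dataset instance and a batch size (batch_size).
MegEngine provides the common samplers, such as RandomSampler (typically used for training) and SequentialSampler (typically used for testing).
The following example shows the basic usage of a Sampler: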

# Import a sampler from MegEngine
from megengine.data import RandomSampler

# Create a random sampler
random_sampler = RandomSampler(dataset=xor_dataset, batch_size=4)

# Inspect the dataset indices returned on each iteration of the sampler
for indices in random_sampler:
    print(indices)
    break
Output:
[19827, 2614, 8788, 8641]
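With batch_size set to 4, each iteration of the sampler returns a list of length 4, and each element of the list is a randomly sampled data index.
If a sequential sampler, SequentialSampler, is created instead, each iteration returns consecutive indices.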
from megengine.data import SequentialSampler

sequential_sampler = SequentialSampler(dataset=xor_dataset, batch_size=4)

# Inspect the dataset indices returned when iterating over the sampler
for indices in sequential_sampler:
    print(indices)
    break
Output:
[0, 1, 2, 3]
Users can also inherit from Sampler to define a custom sampler; this is not covered in detail here.
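The idea that a sampler is just an iterator over batches of indices can be sketched in plain Python. The two generator functions below are hypothetical helpers, not MegEngine classes. How the last incomplete batch is handled here (it is simply yielded shorter) is an assumption of this sketch.

```python
import random


def sequential_batches(num_samples, batch_size):
    """Yield lists of consecutive indices, batch_size at a time."""
    for start in range(0, num_samples, batch_size):
        yield list(range(start, min(start + batch_size, num_samples)))


def random_batches(num_samples, batch_size, seed=None):
    """Yield lists of indices taken from a shuffled permutation."""
    order = list(range(num_samples))
    random.Random(seed).shuffle(order)
    for start in range(0, num_samples, batch_size):
        yield order[start:start + batch_size]


first_sequential = next(sequential_batches(30000, 4))
first_random = next(random_batches(30000, 4, seed=0))
```

Shuffling once and then slicing guarantees that every index is visited exactly once per pass, which is the usual behavior wanted for a training epoch.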
Generating batches with DataLoader
In MegEngine, a DataLoader is essentially an iterator that produces minibatch data from a Dataset and a Sampler.
The following code fetches the data of each minibatch with a for loop.
from megengine.data import DataLoader

# Create a DataLoader, specifying the dataset and the sequential sampler
xor_dataloader = DataLoader(
    dataset=xor_dataset,
    sampler=sequential_sampler,
)
print("The length of the xor_dataloader is: {}".format(len(xor_dataloader)))

# Iterate over the DataLoader to fetch each batch of data
for idx, (cor, tag) in enumerate(xor_dataloader):
    print("iter %d : " % (idx), cor, tag)
    break
Output:
The length of the xor_dataloader is: 7500
iter 0 : [[ 0.97255366 0.74678389]
[ 0.01949105 -0.45632857]
[-0.32616254 -0.56609147]
[-0.44704571 -0.31336881]] [0 1 0 0]
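The reported length follows from the sizes used above: 30000 samples split into batches of 4 gives 30000 / 4 = 7500 batches. The sketch below illustrates, under simplifying assumptions, how a loader could combine a dataset with batches of indices. The function iterate_minibatches and the toy data are hypothetical and much simpler than the real DataLoader.

```python
import numpy as np


def iterate_minibatches(dataset, batch_indices):
    """For each list of indices, fetch the samples and stack them per field."""
    for indices in batch_indices:
        samples = [dataset[i] for i in indices]
        # zip(*samples) regroups [(x0, y0), (x1, y1), ...] into (xs, ys).
        yield tuple(np.stack(field) for field in zip(*samples))


coords = np.array(
    [[0.9, 0.7], [0.1, -0.4], [-0.3, -0.5], [-0.4, -0.3]], dtype=np.float32
)
tags = np.array([0, 1, 0, 0], dtype=np.int32)
toy_dataset = list(zip(coords, tags))  # any indexable of (point, label) works
batches = list(iterate_minibatches(toy_dataset, [[0, 1], [2, 3]]))
num_batches_in_text = 30000 // 4
```

Each yielded item is a tuple with one stacked array per field, which is why the loop in the text can unpack a batch into cor and tag.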
Data transformation (Transform) in DataLoader
Training a deep learning model often requires transforming the data in various ways, for example normalization and different forms of data augmentation. Transform is the base class for data transformations, and its derived classes provide the common ones. The DataLoader constructor accepts a transform argument, and the corresponding transformation is applied to the data when each minibatch is built.
Next, the MNIST dataset (MegEngine provides an MNIST Dataset) is used to show how Transform works. First, build an MNIST DataLoader without any Transform and visualize the first minibatch.

# Import the MNIST dataset from MegEngine
from megengine.data.dataset import MNIST

# If MNIST is being downloaded for the first time, set download to True.
# If MNIST has already been downloaded, use root to point to the raw MNIST files.
# Set train=True/False to get the training set or the test set.
mnist_train_dataset = MNIST(root="./dataset/MNIST", train=True, download=True)
mnist_test_dataset = MNIST(root="./dataset/MNIST", train=False, download=True)

sequential_sampler = SequentialSampler(dataset=mnist_train_dataset, batch_size=4)

mnist_train_dataloader = DataLoader(
    dataset=mnist_train_dataset,
    sampler=sequential_sampler,
)

for i, batch_sample in enumerate(mnist_train_dataloader):
    batch_image, batch_label = batch_sample[0], batch_sample[1]
    # batch_image and batch_label could now be passed to a network for training (omitted here)
    # training code ...
    # Stop after the first batch
    break

print("The shape of minibatch is: {}".format(batch_image.shape))

# Import the Python visualization library; install it first if it is missing
import matplotlib.pyplot as plt

def show(batch_image, batch_label):
    for i in range(4):
        plt.subplot(1, 4, i+1)
        plt.imshow(batch_image[i][:,:,-1], cmap='gray')
        plt.xticks([])
        plt.yticks([])
        plt.title("label: {}".format(batch_label[i]))
    plt.show()

# Visualize the data
show(batch_image, batch_label)
Output:
The shape of minibatch is: (4, 28, 28, 1)
Visualization of the first batch of MNIST data:

Figure 2
Then, build an MNIST DataLoader that applies the RandomResizedCrop transform, and look at the images of the first minibatch.

# Import one of the data augmentation operations that MegEngine supports
from megengine.data.transform import RandomResizedCrop

dataloader = DataLoader(
    mnist_train_dataset,
    sampler=sequential_sampler,
    # Specify the output size of the randomly cropped image
    transform=RandomResizedCrop(output_size=28),
)

for i, batch_sample in enumerate(dataloader):
    batch_image, batch_label = batch_sample[0], batch_sample[1]
    break

show(batch_image, batch_label)
Visualization of the first batch:

Figure 3
The images have now been randomly cropped and resized back to the original size.
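To give a feel for what "crop randomly, then resize back" means, here is a simplified NumPy sketch. It is not how RandomResizedCrop is implemented: it only crops a square window and resizes with nearest-neighbor sampling. The function name, the min_scale parameter, and the synthetic image are all assumptions made for illustration.

```python
import numpy as np


def random_crop_and_resize(image, output_size, rng, min_scale=0.5):
    """Crop a random square window from an HWC image, then resize it back
    to (output_size, output_size) with nearest-neighbor sampling."""
    height, width = image.shape[:2]
    short_side = min(height, width)
    min_side = max(1, int(short_side * min_scale))
    # Pick the window size and its top-left corner at random.
    side = int(rng.integers(min_side, short_side + 1))
    top = int(rng.integers(0, height - side + 1))
    left = int(rng.integers(0, width - side + 1))
    crop = image[top:top + side, left:left + side]
    # Map each output pixel back to the nearest source pixel of the crop.
    source_index = np.arange(output_size) * side // output_size
    return crop[source_index][:, source_index]


rng = np.random.default_rng(0)
image = np.arange(28 * 28, dtype=np.float32).reshape(28, 28, 1)  # synthetic HWC image
cropped = random_crop_and_resize(image, output_size=28, rng=rng)
```

The output keeps the (28, 28, 1) shape of the input, which matches the observation that the batch shape does not change after this transform.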
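Combining transforms (Compose Transform)
It is often necessary to apply a series of data transformations, for example:
- data normalization: implemented by the Normalize class provided in Transform;
- Pad: pads each side of the image with zeros to enlarge it, implemented by the Pad class;
- dimension conversion: converts a minibatch of shape (Batch-size, Height, Width, Channel) into (Batch-size, Channel, Height, Width), because this is the data format that MegEngine supports; implemented by the ToMode class;
- other transformations.
For convenience, the Compose class in MegEngine combines several Transforms so that they can be passed together to the transform argument of DataLoader.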
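Conceptually, composing transforms means applying them one after another, each receiving the output of the previous one. The following sketch shows that idea with a hypothetical SimpleCompose helper and toy lambda transforms; it is not MegEngine's Compose class.

```python
class SimpleCompose:
    """Hypothetical helper that applies a list of callables in order."""

    def __init__(self, transforms):
        self.transforms = list(transforms)

    def __call__(self, sample):
        # Feed the output of each transform into the next one.
        for transform in self.transforms:
            sample = transform(sample)
        return sample


add_then_scale = SimpleCompose([lambda x: x + 1, lambda x: x * 10])
scale_then_add = SimpleCompose([lambda x: x * 10, lambda x: x + 1])
result_a = add_then_scale(2)  # (2 + 1) * 10
result_b = scale_then_add(2)  # 2 * 10 + 1
```

The two results differ, which is a reminder that the order of the transforms in the list matters.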
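Next, the Compose class is used to combine the earlier RandomResizedCrop operation with the Normalize, Pad and ToMode operations, so that several data transformations are applied together. Run the following code to see the shape of the transformed minibatch.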
from megengine.data.transform import RandomResizedCrop, Normalize, ToMode, Pad, Compose

# Combine several Transform operations with Compose
dataloader = DataLoader(
    mnist_train_dataset,
    sampler=sequential_sampler,
    transform=Compose([
        RandomResizedCrop(output_size=28),
        # mean and std are the mean and standard deviation of the MNIST data; pixel values range from 0 to 255
        Normalize(mean=0.1307*255, std=0.3081*255),
        Pad(2),
        # 'CHW' converts the image from (height, width, channel) format to (channel, height, width) format
        ToMode('CHW'),
    ])
)

for i, batch_sample in enumerate(dataloader):
    batch_image, batch_label = batch_sample[0], batch_sample[1]
    break

print("The shape of the batch is now: {}".format(batch_image.shape))
Output:
The shape of the batch is now: (4, 1, 32, 32)
The channel dimension of the minibatch has now moved, and the image size has become 32.
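The resulting shape can be reproduced by hand with NumPy, which may help to see where each number comes from: padding 2 pixels on every side turns 28 into 28 + 2 + 2 = 32, and the transpose moves the channel axis to the front. This is only a sketch of the arithmetic on a synthetic all-white image, not the code MegEngine runs internally.

```python
import numpy as np

# Synthetic stand-in for one MNIST image in (height, width, channel) layout.
image = np.full((28, 28, 1), 255.0, dtype=np.float32)

# Normalize with the mean and std used in the text (pixel values in 0..255).
mean, std = 0.1307 * 255, 0.3081 * 255
normalized = (image - mean) / std

# Pad 2 zero pixels on each side of the height and width axes only.
padded = np.pad(normalized, ((2, 2), (2, 2), (0, 0)), mode="constant")

# Move the channel axis to the front: (H, W, C) -> (C, H, W).
chw = np.transpose(padded, (2, 0, 1))

# Stack four such images to mimic a minibatch with batch_size=4.
batch = np.stack([chw] * 4)
```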
For the other parameters of DataLoader, refer to the DataLoader documentation.
