

Downloading the MNIST Dataset and Converting It to NumPy Arrays with Python (Source Code Walkthrough)

Published: 2025/3/12 | python | 豆豆

Downloading the MNIST dataset and converting it to NumPy arrays with Python

    • Analyzing the init_mnist function
    • Analyzing the load_mnist function
    • The full Python script for converting the dataset
    • Displaying an MNIST image to verify the data

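The most important part of the Python script that downloads MNIST and converts it to NumPy arrays is the load_mnist function. Any other project that needs the dataset can simply call load_mnist. Internally the data is held in a dictionary whose values are NumPy arrays, and load_mnist hands those arrays back as (training images, training labels), (test images, test labels).

Let's walk through the source code line by line to see how this works: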

The first statement in load_mnist is

    if not os.path.exists(save_file):
        init_mnist()

If the pickle file save_file does not exist yet, meaning the data has not been downloaded and converted, init_mnist() is called.
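The check above is a simple "build once, then reuse" cache. The following is a minimal sketch of that pattern in isolation. The names load_or_build and fake_build are hypothetical and only stand in for load_mnist and init_mnist; a temporary directory is used so nothing is downloaded.

```python
import os
import pickle
import tempfile


def load_or_build(cache_path, build):
    """Return the cached object, building and caching it on first use."""
    if not os.path.exists(cache_path):
        # First call: run the expensive step and store its result.
        with open(cache_path, "wb") as f:
            pickle.dump(build(), f, -1)
    with open(cache_path, "rb") as f:
        return pickle.load(f)


calls = []


def fake_build():
    # Hypothetical stand-in for "download + convert"; records each call.
    calls.append(1)
    return {"train_label": [5, 0, 4]}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.pkl")
    first = load_or_build(path, fake_build)
    second = load_or_build(path, fake_build)  # served from the cache file

print(first == second, len(calls))  # True 1
```

The second call never reaches fake_build, which is why running the real script a second time skips the download and conversion messages.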

Analyzing the init_mnist function

init_mnist() begins by calling download_mnist().

    def init_mnist():
        download_mnist()
        dataset = _convert_numpy()
        print("Creating pickle file ...")
        with open(save_file, 'wb') as f:
            pickle.dump(dataset, f, -1)
        print("Done!")

download_mnist() in turn calls _download(v) once for each file name in key_file.

    def download_mnist():
        for v in key_file.values():
            _download(v)

The key statement in _download(v) is urllib.request.urlretrieve, which downloads the dataset file and saves it at file_path. If the file already exists there, the function returns early and nothing is downloaded.

    def _download(file_name):
        file_path = dataset_dir + "/" + file_name

        if os.path.exists(file_path):
            return

        print("Downloading " + file_name + " ... ")
        urllib.request.urlretrieve(url_base + file_name, file_path)
        print("Done")

    url_base = 'http://yann.lecun.com/exdb/mnist/'
    key_file = {
        'train_img':'train-images-idx3-ubyte.gz',
        'train_label':'train-labels-idx1-ubyte.gz',
        'test_img':'t10k-images-idx3-ubyte.gz',
        'test_label':'t10k-labels-idx1-ubyte.gz'
    }

Now back to init_mnist(): once download_mnist() returns, the next statement calls _convert_numpy.

    # inside init_mnist(), after download_mnist()
    dataset = _convert_numpy()
    print("Creating pickle file ...")
    with open(save_file, 'wb') as f:
        pickle.dump(dataset, f, -1)
    print("Done!")

Consider _convert_numpy: it returns a dictionary, that is, a set of key-value pairs. It calls _load_img for the two image files and _load_label for the two label files.

    def _convert_numpy():
        dataset = {}
        dataset['train_img'] = _load_img(key_file['train_img'])
        dataset['train_label'] = _load_label(key_file['train_label'])
        dataset['test_img'] = _load_img(key_file['test_img'])
        dataset['test_label'] = _load_label(key_file['test_label'])

        return dataset

Now look at _load_img. The statement print("Converting " + file_name + " to NumPy Array ...") tells us that this function converts a dataset file into a NumPy array.

Inside _load_img, gzip.open(file_path, 'rb') opens the file for reading: the dataset files have a .gz suffix, so this statement decompresses the data as it is read.

    def _load_img(file_name):
        file_path = dataset_dir + "/" + file_name

        print("Converting " + file_name + " to NumPy Array ...")
        with gzip.open(file_path, 'rb') as f:
            data = np.frombuffer(f.read(), np.uint8, offset=16)
        data = data.reshape(-1, img_size)
        print("Done")

        return data

The statement data = np.frombuffer(f.read(), np.uint8, offset=16) in _load_img converts the bytes returned by f.read() into a NumPy array whose element type is uint8, starting at byte offset 16. Why 16? Look at the layout of the TRAINING SET IMAGE FILE (train-images-idx3-ubyte):

    [offset] [type]          [value]           [description]
    0000     32 bit integer  0x00000803(2051)  magic number
    0004     32 bit integer  60000             number of images
    0008     32 bit integer  28                number of rows
    0012     32 bit integer  28                number of columns
    0016     unsigned byte   ??                pixel
    0017     unsigned byte   ??                pixel
    ........
    xxxx     unsigned byte   ??                pixel

This is the image file of the training set, in which each pixel is stored as a grayscale value. The first 16 bytes hold metadata about the dataset (magic number, image count, rows, columns), and every byte after that is pixel data. To read only the pixels, start at byte 16.
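To make the offset concrete, here is a small sketch that builds a synthetic file in memory with the same layout and reads it back. The two-image payload and the variable names are made up for illustration; the header integers are stored big-endian, which is why the format string starts with ">". The label files have only two header integers (magic number 2049 and item count), which is why _load_label in the full script uses offset=8.

```python
import struct

import numpy as np

# Synthetic stand-in for an IDX image file: 2 images of 28x28 pixels.
num_images, rows, cols = 2, 28, 28
pixels = (np.arange(num_images * rows * cols) % 256).astype(np.uint8)
image_file = struct.pack(">IIII", 2051, num_images, rows, cols) + pixels.tobytes()

# Synthetic stand-in for an IDX label file: 8-byte header, then one byte per label.
label_file = struct.pack(">II", 2049, num_images) + bytes([5, 0])

# The four header fields occupy bytes 0..15 of the image file.
magic, count, n_rows, n_cols = struct.unpack(">IIII", image_file[:16])

# Skipping the header leaves only pixel (or label) bytes.
data = np.frombuffer(image_file, np.uint8, offset=16)
labels = np.frombuffer(label_file, np.uint8, offset=8)

print(magic, count, n_rows, n_cols)  # 2051 2 28 28
print(data.size, labels.tolist())    # 1568 [5, 0]
```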

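The next statement, data = data.reshape(-1, img_size), reshapes the flat array into a two-dimensional array with img_size (784) columns. The -1 tells NumPy to infer the number of rows, so each row holds one image and the training set becomes a (60000, 784) array. _load_img then returns this NumPy array, which completes the analysis of _load_img.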
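A quick sketch of how the -1 is resolved, using a zero-filled array as a stand-in for three images rather than real MNIST data:

```python
import numpy as np

img_size = 784  # 28 * 28, the same constant the script defines

# Stand-in for the pixel bytes of three images laid end to end.
flat = np.zeros(3 * img_size, dtype=np.uint8)

# -1 lets NumPy compute the row count: 2352 / 784 = 3.
data = flat.reshape(-1, img_size)

print(flat.shape, data.shape)  # (2352,) (3, 784)
```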

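Back in _convert_numpy, the returned dataset is a dictionary whose keys are strings and whose values are NumPy arrays.

Returning to init_mnist(), the statement print("Creating pickle file ...") shows that once dataset is available, the function creates a pickle file. with open(save_file, 'wb') as f opens the file named save_file in binary mode for writing only. Since save_file = dataset_dir + "/mnist.pkl", this creates a .pkl file. What gets written? The statement pickle.dump(dataset, f, -1) saves the dataset object into that file; the -1 is the pickle protocol version, and a negative value selects the highest protocol available. That completes the analysis of init_mnist: it returns nothing, and its result is the pickle file on disk.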
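The following sketch shows that a dictionary of NumPy arrays survives the dump and load cycle with its dtype and shape intact. The tiny arrays and the file name demo.pkl are placeholders, not the real dataset.

```python
import os
import pickle
import tempfile

import numpy as np

# Small placeholder with the same structure as the real dataset dictionary.
dataset = {
    "train_img": np.arange(6, dtype=np.uint8).reshape(2, 3),
    "train_label": np.array([5, 0], dtype=np.uint8),
}

with tempfile.TemporaryDirectory() as tmp:
    save_file = os.path.join(tmp, "demo.pkl")
    with open(save_file, "wb") as f:
        pickle.dump(dataset, f, -1)  # -1 selects pickle.HIGHEST_PROTOCOL
    with open(save_file, "rb") as f:
        restored = pickle.load(f)

same = all(np.array_equal(dataset[k], restored[k]) for k in dataset)
print(same, restored["train_img"].dtype, restored["train_img"].shape)
```

One caveat: pickle.load can execute arbitrary code, so it should only be used on files you created yourself, as is the case here.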

    def init_mnist():
        download_mnist()
        dataset = _convert_numpy()
        print("Creating pickle file ...")
        with open(save_file, 'wb') as f:
            pickle.dump(dataset, f, -1)
        print("Done!")

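Analyzing the load_mnist function

The next statement is with open(save_file, 'rb') as f: dataset = pickle.load(f), which reconstructs the original Python object from the pickle file created earlier and assigns it to dataset.

The normalize=True parameter of load_mnist scales the input images to values between 0 and 1. Each pixel starts out in the range 0 to 255, so after converting the array to float32, dataset[key] /= 255.0 brings every value into the range 0 to 1.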
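A short sketch of the normalization step on three sample pixel values. It also shows why the script converts to float32 first: dividing a uint8 array in place by a float is rejected by NumPy, because the float result cannot be stored back into uint8.

```python
import numpy as np

raw = np.array([[0, 128, 255]], dtype=np.uint8)

# In-place division on the uint8 array fails with a casting error.
try:
    raw /= 255.0
    in_place_failed = False
except TypeError:
    in_place_failed = True

# The approach used by load_mnist: convert first, then divide in place.
scaled = raw.astype(np.float32)
scaled /= 255.0

print(in_place_failed)                           # True
print(scaled.dtype, scaled.min(), scaled.max())  # float32 0.0 1.0
```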

If the one_hot_label parameter of load_mnist is True, the labels are returned in one-hot form. A one-hot array has a 1 at the position of the correct label and 0 everywhere else. This is implemented by the _change_one_hot_label function.

    def _change_one_hot_label(X):
        T = np.zeros((X.size, 10))
        for idx, row in enumerate(T):
            row[X[idx]] = 1

        return T
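For comparison, the same encoding can be written without a Python loop by using integer array indexing. This is an alternative sketch, not part of the original script; one_hot_indexed and its num_classes parameter are names introduced here for illustration.

```python
import numpy as np


def one_hot_loop(X):
    # Same logic as _change_one_hot_label in the script.
    T = np.zeros((X.size, 10))
    for idx, row in enumerate(T):
        row[X[idx]] = 1
    return T


def one_hot_indexed(X, num_classes=10):
    # Hypothetical vectorized variant: set T[i, X[i]] = 1 for every i at once.
    T = np.zeros((X.size, num_classes))
    T[np.arange(X.size), X] = 1
    return T


labels = np.array([5, 0, 4], dtype=np.uint8)
print(one_hot_loop(labels)[0])  # [0. 0. 0. 0. 0. 1. 0. 0. 0. 0.]
print(np.array_equal(one_hot_loop(labels), one_hot_indexed(labels)))  # True
```

Both versions produce a float64 array of shape (number of labels, 10); the indexed version mainly matters for speed on the 60000 training labels.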

If the flatten parameter of load_mnist is True, each input image is stored as a one-dimensional array of 784 elements. If it is False, each input image is a three-dimensional 1 × 28 × 28 array.
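The sketch below shows the two layouts on a synthetic two-image array and confirms that the reshape only changes how the same pixels are indexed:

```python
import numpy as np

# Two synthetic "images" in the flattened layout (flatten=True).
flat = (np.arange(2 * 784) % 256).astype(np.uint8).reshape(-1, 784)

# The layout produced when flatten=False: (N, channels, height, width).
images = flat.reshape(-1, 1, 28, 28)

print(flat.shape, images.shape)  # (2, 784) (2, 1, 28, 28)

# Pixel 28 of the flat image is row 1, column 0 of the 28x28 image.
print(flat[0, 28] == images[0, 0, 1, 0])  # True

# Reshaping back recovers the flattened array exactly.
print(np.array_equal(images.reshape(-1, 784), flat))  # True
```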

Finally, load_mnist returns the data as (training images, training labels), (test images, test labels). The underlying dataset dictionary has the keys train_img, train_label, test_img, and test_label, and its values are the NumPy arrays converted from the .gz dataset files.

    def load_mnist(normalize=True, flatten=True, one_hot_label=False):
        if not os.path.exists(save_file):
            init_mnist()

        with open(save_file, 'rb') as f:
            dataset = pickle.load(f)

        if normalize:
            for key in ('train_img', 'test_img'):
                dataset[key] = dataset[key].astype(np.float32)
                dataset[key] /= 255.0

        if one_hot_label:
            dataset['train_label'] = _change_one_hot_label(dataset['train_label'])
            dataset['test_label'] = _change_one_hot_label(dataset['test_label'])

        if not flatten:
            for key in ('train_img', 'test_img'):
                dataset[key] = dataset[key].reshape(-1, 1, 28, 28)

        return (dataset['train_img'], dataset['train_label']), (dataset['test_img'], dataset['test_label'])

That completes the analysis of load_mnist. The full code for downloading the MNIST dataset and converting it to NumPy arrays with Python follows.

The full Python script for converting the dataset

    # coding: utf-8
    try:
        import urllib.request
    except ImportError:
        raise ImportError('You should use Python 3.x')
    import os.path
    import gzip
    import pickle
    import os
    import numpy as np


    url_base = 'http://yann.lecun.com/exdb/mnist/'
    key_file = {
        'train_img':'train-images-idx3-ubyte.gz',
        'train_label':'train-labels-idx1-ubyte.gz',
        'test_img':'t10k-images-idx3-ubyte.gz',
        'test_label':'t10k-labels-idx1-ubyte.gz'
    }

    dataset_dir = os.path.dirname(os.path.abspath(__file__))
    save_file = dataset_dir + "/mnist.pkl"

    train_num = 60000
    test_num = 10000
    img_dim = (1, 28, 28)
    img_size = 784


    def _download(file_name):
        file_path = dataset_dir + "/" + file_name

        if os.path.exists(file_path):
            return

        print("Downloading " + file_name + " ... ")
        urllib.request.urlretrieve(url_base + file_name, file_path)
        print("Done")

    def download_mnist():
        for v in key_file.values():
            _download(v)

    def _load_label(file_name):
        file_path = dataset_dir + "/" + file_name

        print("Converting " + file_name + " to NumPy Array ...")
        with gzip.open(file_path, 'rb') as f:
            labels = np.frombuffer(f.read(), np.uint8, offset=8)
        print("Done")

        return labels

    def _load_img(file_name):
        file_path = dataset_dir + "/" + file_name

        print("Converting " + file_name + " to NumPy Array ...")
        with gzip.open(file_path, 'rb') as f:
            data = np.frombuffer(f.read(), np.uint8, offset=16)
        data = data.reshape(-1, img_size)
        print("Done")

        return data

    def _convert_numpy():
        dataset = {}
        dataset['train_img'] = _load_img(key_file['train_img'])
        dataset['train_label'] = _load_label(key_file['train_label'])
        dataset['test_img'] = _load_img(key_file['test_img'])
        dataset['test_label'] = _load_label(key_file['test_label'])

        return dataset

    def init_mnist():
        download_mnist()
        dataset = _convert_numpy()
        print("Creating pickle file ...")
        with open(save_file, 'wb') as f:
            pickle.dump(dataset, f, -1)
        print("Done!")

    def _change_one_hot_label(X):
        T = np.zeros((X.size, 10))
        for idx, row in enumerate(T):
            row[X[idx]] = 1

        return T


    def load_mnist(normalize=True, flatten=True, one_hot_label=False):
        """Load the MNIST dataset

        Parameters
        ----------
        normalize : normalize the image pixel values to 0.0~1.0
        one_hot_label :
            if one_hot_label is True, labels are returned as one-hot arrays
            a one-hot array is an array such as [0,0,1,0,0,0,0,0,0,0]
        flatten : whether to flatten each image into a one-dimensional array

        Returns
        -------
        (training images, training labels), (test images, test labels)
        """
        if not os.path.exists(save_file):
            init_mnist()

        with open(save_file, 'rb') as f:
            dataset = pickle.load(f)

        if normalize:
            for key in ('train_img', 'test_img'):
                dataset[key] = dataset[key].astype(np.float32)
                dataset[key] /= 255.0

        if one_hot_label:
            dataset['train_label'] = _change_one_hot_label(dataset['train_label'])
            dataset['test_label'] = _change_one_hot_label(dataset['test_label'])

        if not flatten:
            for key in ('train_img', 'test_img'):
                dataset[key] = dataset[key].reshape(-1, 1, 28, 28)

        return (dataset['train_img'], dataset['train_label']), (dataset['test_img'], dataset['test_label'])


    if __name__ == '__main__':
        init_mnist()

Displaying an MNIST image to verify the data

First, call the load_mnist function described above: (x_train, t_train), (x_test, t_test) = load_mnist(flatten=True, normalize=False). This yields four NumPy arrays: x_train, t_train, x_test, and t_test.

To inspect the first sample of the training set, read the first image with img = x_train[0] and the first label with label = t_train[0]. Printing the label shows that the first image in the dataset is a 5.

The image is displayed by the img_show function. Inside it, Image.fromarray converts the array into an image object that PIL can work with, so that the picture can be shown.

    import sys, os
    sys.path.append(os.pardir)  # allow importing files from the parent directory
    import numpy as np
    from dataset.mnist import load_mnist
    from PIL import Image


    def img_show(img):
        pil_img = Image.fromarray(np.uint8(img))
        pil_img.show()

    (x_train, t_train), (x_test, t_test) = load_mnist(flatten=True, normalize=False)

    img = x_train[0]
    label = t_train[0]
    print(label)  # 5

    print(img.shape)  # (784,)
    img = img.reshape(28, 28)  # restore the original image dimensions
    print(img.shape)  # (28, 28)

    img_show(img)

Output:

    Downloading train-images-idx3-ubyte.gz ...
    Done
    Downloading train-labels-idx1-ubyte.gz ...
    Done
    Downloading t10k-images-idx3-ubyte.gz ...
    Done
    Downloading t10k-labels-idx1-ubyte.gz ...
    Done
    Converting train-images-idx3-ubyte.gz to NumPy Array ...
    Done
    Converting train-labels-idx1-ubyte.gz to NumPy Array ...
    Done
    Converting t10k-images-idx3-ubyte.gz to NumPy Array ...
    Done
    Converting t10k-labels-idx1-ubyte.gz to NumPy Array ...
    Done
    Creating pickle file ...
    Done!
    5
    (784,)
    (28, 28)

    Process finished with exit code 0
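On a machine without a display, pil_img.show() may not open a window. As a rough alternative, a 28 × 28 array can be printed as text. The helper ascii_preview and its threshold parameter are hypothetical names introduced for this sketch, and the image is a synthetic vertical bar rather than a real MNIST digit.

```python
import numpy as np


def ascii_preview(img, threshold=128):
    """Render a 2-D uint8 image as text: '#' for bright pixels, '.' otherwise."""
    return "\n".join(
        "".join("#" if pixel >= threshold else "." for pixel in row)
        for row in img
    )


# Synthetic 28x28 image: a bright vertical bar, two pixels wide.
img = np.zeros((28, 28), dtype=np.uint8)
img[4:24, 13:15] = 255

text = ascii_preview(img)
print(text)
```

With real data, the same call would be ascii_preview(x_train[0].reshape(28, 28)), assuming load_mnist was called with normalize=False so that the pixel values are still in the range 0 to 255.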
