Python Machine Learning Pipeline: Building a Machine Learning Pipeline, Part 1
Below are the usual steps involved in building an ML pipeline.
Problem Statement and Getting the Data
I’m using a relatively large and complex data set to demonstrate the process. Refer to the Kaggle competition IEEE-CIS Fraud Detection.
Navigate to the Data Explorer, where you will see the competition's data files.
Select train_transaction.csv and it will show you a glimpse of what the data looks like. Click the download icon to get the data.
Other than the usual library import statements, you will need two additional libraries:
pip install pyarrow
pip install fast_ml
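As a quick, optional sanity check (assuming both installs succeeded), you can confirm the two libraries import cleanly; reduce_memory_usage from fast_ml.utilities is the helper we will call later in this article:

import pyarrow
from fast_ml.utilities import reduce_memory_usage

print(pyarrow.__version__)  # any recent version is fine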
Key Highlights
This is the first article in a series on building a machine learning pipeline. In this article, we will focus on optimizations for importing data into a Jupyter notebook and executing things faster.
There are 3 key things to note in this article:
Python zipfile
Reducing the memory usage of the dataset
A faster way of saving/loading working datasets
1: Import Data
After you have downloaded the zipped file, it is much better to use Python to unzip it.
Tip 1: Use the zipfile library from the Python standard library to unzip the file.
import zipfile

with zipfile.ZipFile('train_transaction.csv.zip', mode='r') as zip_ref:
    zip_ref.extractall('data/')
This will create a folder data/ and unzip the CSV file train_transaction.csv into that folder.
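To confirm the extraction worked, a quick directory listing (a minimal check, assuming the paths used above) should show the CSV file:

import os

print(os.listdir('data'))  # expect ['train_transaction.csv']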
We will use the pandas read_csv method to load the data set into the Jupyter notebook.
import pandas as pd

%time trans = pd.read_csv('train_transaction.csv')

df_size = trans.memory_usage().sum() / 1024**2
print(f'Memory usage of dataframe is {df_size} MB')
print(f'Shape of dataframe is {trans.shape}')

---- Output ----
CPU times: user 23.2 s, sys: 7.87 s, total: 31 s
Wall time: 32.5 s
Memory usage of dataframe is 1775.1524047851562 MB
Shape of dataframe is (590540, 394)
This dataframe takes up roughly 1.7 GB of memory and has more than half a million rows.
Tip 2: We will use a function from fast_ml to reduce this memory usage.
from fast_ml.utilities import reduce_memory_usage

%time trans = reduce_memory_usage(trans, convert_to_category=False)

---- Output ----
Memory usage of dataframe is 1775.15 MB
Memory usage after optimization is: 542.35 MB
Decreased by 69.4%
CPU times: user 2min 25s, sys: 2min 57s, total: 5min 23s
Wall time: 5min 56s
This step took nearly 6 minutes, but it reduced the memory footprint by almost 70%, which is quite a significant reduction.
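For context, helpers like this generally work by downcasting each numeric column to the smallest dtype that can represent its values (for example, int64 to int8, or float64 to float32). Below is a minimal sketch of that idea; it is not the fast_ml implementation, and the function name shrink_numeric_dtypes is made up for illustration:

import pandas as pd

def shrink_numeric_dtypes(df: pd.DataFrame) -> pd.DataFrame:
    # Downcast every numeric column to the smallest dtype that fits its values.
    for col in df.select_dtypes(include='number').columns:
        if pd.api.types.is_integer_dtype(df[col]):
            df[col] = pd.to_numeric(df[col], downcast='integer')
        else:
            # Note: float64 -> float32 trades some precision for memory.
            df[col] = pd.to_numeric(df[col], downcast='float')
    return df

One design note: downcasting floats loses a little precision, which is usually acceptable for tree-based models but worth keeping in mind for precision-sensitive work.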
For further analysis, we will create a sample dataset of 200k records just so that our data processing steps won’t take a long time to run.
# Take a sample of 200k records
%time trans = trans.sample(n=200000)

# Reset the index, because sampling leaves a shuffled index behind
trans.reset_index(inplace=True, drop=True)

df_size = trans.memory_usage().sum() / 1024**2
print(f'Memory usage of sample dataframe is {df_size} MB')

---- Output ----
CPU times: user 1.39 s, sys: 776 ms, total: 2.16 s
Wall time: 2.43 s
Memory usage of sample dataframe is 185.20355224609375 MB
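One caveat on random sampling: fraud data is heavily imbalanced, so a plain random sample can shift the fraud/non-fraud ratio slightly. If you want the sample to preserve the class balance of the target (the isFraud column in this competition), a stratified sample is a small change. This is an optional variation, not what we do above, and it would replace the trans.sample(n=200000) call on the full dataframe (requires pandas >= 1.1 for GroupBy.sample):

# Sample ~200k rows while preserving the isFraud class proportions
frac = 200000 / len(trans)
trans = (
    trans.groupby('isFraud')
         .sample(frac=frac, random_state=42)
         .reset_index(drop=True)
)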
Now, we will save this to our local drive in CSV format.
import os

os.makedirs('data', exist_ok=True)
trans.to_csv('data/train_transaction_sample.csv', index=False)

Tip 3: Use the feather format instead of CSV. Feather is a binary columnar format backed by the pyarrow library installed earlier.

trans.to_feather('data/train_transaction_sample')
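The write side shows a similar gap, if you are curious; in a notebook you can time both saves with the %time magic (your numbers will vary by machine, so none are quoted here):

%time trans.to_csv('data/train_transaction_sample.csv', index=False)
%time trans.to_feather('data/train_transaction_sample')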
Once you load the data from these two sources, you will observe the significant performance difference.
Load the saved sample data in CSV format:
%time trans = pd.read_csv('data/train_transaction_sample.csv')

df_size = trans.memory_usage().sum() / 1024**2
print(f'Memory usage of dataframe is {df_size} MB')
print(f'Shape of dataframe is {trans.shape}')

---- Output ----
CPU times: user 7.37 s, sys: 1.06 s, total: 8.42 s
Wall time: 8.5 s
Memory usage of dataframe is 601.1964111328125 MB
Shape of dataframe is (200000, 394)
Load the saved sample data in feather format:
%time trans = pd.read_feather('data/train_transaction_sample')

df_size = trans.memory_usage().sum() / 1024**2
print(f'Memory usage of dataframe is {df_size} MB')
print(f'Shape of dataframe is {trans.shape}')

---- Output ----
CPU times: user 1.32 s, sys: 930 ms, total: 2.25 s
Wall time: 892 ms
Memory usage of dataframe is 183.67779541015625 MB
Shape of dataframe is (200000, 394)
Notice 2 things here:
i. The time it takes to load the CSV file is almost 10 times the time it takes to load the feather-format data.
ii. The feather format retained the reduced memory footprint of the data set, whereas the CSV version again consumes high memory, and we would have to run the reduce_memory_usage function again.
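The second point comes down to dtypes: feather stores each column's dtype, while CSV stores only text, so read_csv re-infers everything back to default 64-bit types and discards our earlier downcasting. You can verify this on any numeric column; TransactionAmt is one such column in this data set:

csv_df = pd.read_csv('data/train_transaction_sample.csv')
feather_df = pd.read_feather('data/train_transaction_sample')

print(csv_df['TransactionAmt'].dtype)      # float64: re-inferred from text
print(feather_df['TransactionAmt'].dtype)  # smaller float dtype preserved from the downcast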
Closing Notes:
- Please feel free to write your thoughts/suggestions/feedback.
- We will use the new sample data set we created for our further analysis.
- We will talk about Exploratory Data Analysis in our next article.
GitHub link
Translated from: https://towardsdatascience.com/building-a-machine-learning-pipeline-part-1-b19f8c8317ae