Train, test, and use models without writing a single line of code: this 1.5k-star project does it for you
Reposted from | 机器之心 (Synced)
igel is a popular tool on GitHub. Built on scikit-learn, it supports all of sklearn's machine learning functionality, such as regression, classification, and clustering. You can use machine learning models without writing a single line of code: all you need is a yaml or json file describing what you want to do.

Train, test, and use a model without writing a single line of code: can it really be that easy?

Recently, software engineer Nidhal Baccouri open-sourced exactly such a machine learning tool on GitHub, called igel, and it made GitHub's trending list. The project currently has 1.5k stars.

Project address: https://github.com/nidhaloff/igel

The project aims to give everyone, technical or not, an easy way to use machine learning.

The author describes the motivation for creating igel as follows: "Sometimes I need a tool to quickly create machine learning prototypes, whether for a proof of concept or a quick draft model. I often found myself stuck writing boilerplate code or thinking about how to start, so I decided to create igel."

igel is built on scikit-learn and supports all of sklearn's machine learning functionality, such as regression, classification, and clustering. You can use machine learning models without writing a single line of code; a yaml or json file describing what you want to do is all it takes.

The basic idea is to group all configuration, including the model definition and data preprocessing methods, in a human-readable yaml or json file, and then let igel automate everything. You describe what you need in the yaml or json file; igel then builds a model from that configuration, trains it, and outputs the results along with metadata.
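To make that flow concrete, here is a minimal sketch (not igel's actual implementation) of how a config-to-model mapping can work with a json config and scikit-learn; the `REGISTRY` dict and `build_model` helper are hypothetical names introduced for illustration:

```python
import json

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical mini-dispatcher mirroring igel's core idea: a human-readable
# config selects and parameterizes a scikit-learn estimator. The algorithm
# names echo igel's PascalCase convention; everything else is illustrative.
REGISTRY = {
    ("classification", "RandomForest"): RandomForestClassifier,
    ("classification", "LogisticRegression"): LogisticRegression,
}

def build_model(config_text):
    """Parse a json config and return an unfitted estimator."""
    cfg = json.loads(config_text)["model"]
    estimator_cls = REGISTRY[(cfg["type"], cfg["algorithm"])]
    return estimator_cls(**cfg.get("arguments", {}))

config = """
{"model": {"type": "classification",
           "algorithm": "RandomForest",
           "arguments": {"n_estimators": 100, "max_depth": 30}}}
"""
model = build_model(config)
```

In igel itself this dispatch (plus data loading, preprocessing, training, and saving) is driven entirely by the config file, so the user never touches code like the above.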
All configuration options igel currently supports are shown below:
```yaml
# dataset operations
dataset:
  type: csv  # [str] -> type of your dataset
  read_data_options:  # options you want to supply for reading your data (see the detailed overview about this in the next section)
    sep:  # [str] -> Delimiter to use.
    delimiter:  # [str] -> Alias for sep.
    header:  # [int, list of int] -> Row number(s) to use as the column names, and the start of the data.
    names:  # [list] -> List of column names to use
    index_col:  # [int, str, list of int, list of str, False] -> Column(s) to use as the row labels of the DataFrame
    usecols:  # [list, callable] -> Return a subset of the columns
    squeeze:  # [bool] -> If the parsed data only contains one column then return a Series.
    prefix:  # [str] -> Prefix to add to column numbers when no header, e.g. 'X' for X0, X1, ...
    mangle_dupe_cols:  # [bool] -> Duplicate columns will be specified as 'X', 'X.1', ... 'X.N', rather than 'X'...'X'. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
    dtype:  # [Type name, dict mapping column name to type] -> Data type for data or columns
    engine:  # [str] -> Parser engine to use. The C engine is faster while the python engine is currently more feature-complete.
    converters:  # [dict] -> Dict of functions for converting values in certain columns. Keys can either be integers or column labels.
    true_values:  # [list] -> Values to consider as True.
    false_values:  # [list] -> Values to consider as False.
    skipinitialspace:  # [bool] -> Skip spaces after delimiter.
    skiprows:  # [list-like] -> Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.
    skipfooter:  # [int] -> Number of lines at bottom of file to skip
    nrows:  # [int] -> Number of rows of file to read. Useful for reading pieces of large files.
    na_values:  # [scalar, str, list, dict] -> Additional strings to recognize as NA/NaN.
    keep_default_na:  # [bool] -> Whether or not to include the default NaN values when parsing the data.
    na_filter:  # [bool] -> Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.
    verbose:  # [bool] -> Indicate number of NA values placed in non-numeric columns.
    skip_blank_lines:  # [bool] -> If True, skip over blank lines rather than interpreting as NaN values.
    parse_dates:  # [bool, list of int, list of str, list of lists, dict] -> try parsing the dates
    infer_datetime_format:  # [bool] -> If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them.
    keep_date_col:  # [bool] -> If True and parse_dates specifies combining multiple columns then keep the original columns.
    dayfirst:  # [bool] -> DD/MM format dates, international and European format.
    cache_dates:  # [bool] -> If True, use a cache of unique, converted dates to apply the datetime conversion.
    thousands:  # [str] -> the thousands separator
    decimal:  # [str] -> Character to recognize as decimal point (e.g. use ',' for European data).
    lineterminator:  # [str] -> Character to break file into lines.
    escapechar:  # [str] -> One-character string used to escape other characters.
    comment:  # [str] -> Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character.
    encoding:  # [str] -> Encoding to use for UTF when reading/writing (ex. 'utf-8').
    dialect:  # [str, csv.Dialect] -> If provided, this parameter will override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting
    delim_whitespace:  # [bool] -> Specifies whether or not whitespace will be used as the sep
    low_memory:  # [bool] -> Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference.
    memory_map:  # [bool] -> If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

  split:  # split options
    test_size: 0.2  # [float] -> 0.2 means 20% for the test data, so 80% are automatically for training
    shuffle: true  # [bool] -> whether to shuffle the data before/while splitting
    stratify: None  # [list, None] -> If not None, data is split in a stratified fashion, using this as the class labels.

  preprocess:  # preprocessing options
    missing_values: mean  # [str] -> other possible values: [drop, median, most_frequent, constant] check the docs for more
    encoding:
      type: oneHotEncoding  # [str] -> other possible values: [labelEncoding]
    scale:  # scaling options
      method: standard  # [str] -> standardization will scale values to have a 0 mean and 1 standard deviation | you can also try minmax
      target: inputs  # [str] -> scale inputs. | other possible values: [outputs, all]  # if you choose all then all values in the dataset will be scaled

# model definition
model:
  type: classification  # [str] -> type of the problem you want to solve. | possible values: [regression, classification, clustering]
  algorithm: NeuralNetwork  # [str (notice the pascal case)] -> which algorithm you want to use. | type igel algorithms in the Terminal to know more
  arguments:  # model arguments: you can check the available arguments for each model by running igel help in your terminal
  use_cv_estimator: false  # [bool] -> if this is true, the CV class of the specific model will be used if it is supported
  cross_validate:
    cv:  # [int] -> number of kfold (default 5)
    n_jobs:  # [signed int] -> The number of CPUs to use to do the computation (default None)
    verbose:  # [int] -> The verbosity level. (default 0)

# target you want to predict
target:  # list of strings: put here the column(s) you want to predict that exist in your csv dataset
  - put the target you want to predict here
  - you can assign many targets if you are making a multioutput prediction
```

The tool has the following features:
Supports all state-of-the-art machine learning models (even preview models);
Supports different data preprocessing methods;
Keeps everything in a configuration file while still offering flexibility and control over the data;
Supports cross-validation;
Supports both yaml and json formats;
Supports different sklearn metrics for regression, classification, and clustering;
Supports multi-output / multi-target regression and classification;
Supports multiprocessing for parallel model construction.
As shown above, igel supports regression, classification, and clustering models, including familiar ones such as linear regression, Bayesian regression, support vector machines, AdaBoost, and gradient boosting.

The regression, classification, and clustering models supported by igel.

Quick start

To help you get started with igel quickly, the author provides a detailed guide in the README.

Run the following command to get igel's help information:
```shell
$ igel --help

# or just
$ igel -h
"""
Take some time and read the output of the help command. You'll save time later if you understand how to use igel.
"""
```

The first step is to provide a yaml file (you can also use json). You can create a .yaml file manually and edit it yourself. But if you are feeling lazy, you can use the igel init command for a quick start:
```shell
"""
igel init <args>
possible optional args are: (notice that these args are optional, so you can also just run igel init if you want)
-type: regression, classification or clustering
-model: model you want to use
-target: target you want to predict

Example:
If I want to use neural networks to classify whether someone is sick or not using the indian-diabetes dataset,
then I would use this command to initialize a yaml file:
$ igel init -type "classification" -model "NeuralNetwork" -target "sick"
"""

$ igel init
```

After running this command, an igel.yaml file appears in the current working directory. You can inspect and modify it, or start over from scratch.
In the following example, the author uses a random forest to determine whether a person has diabetes, using the well-known Pima Indians Diabetes Database.
```yaml
# model definition
model:
  # in the type field, you can write the type of problem you want to solve. Whether regression, classification or clustering
  # Then, provide the algorithm you want to use on the data. Here I'm using the random forest algorithm
  type: classification
  algorithm: RandomForest  # make sure you write the name of the algorithm in pascal case
  arguments:
    n_estimators: 100  # here, I set the number of estimators (or trees) to 100
    max_depth: 30  # set the max_depth of the tree

# target you want to predict
# Here, as an example, I'm using the famous indians-diabetes dataset, where I want to predict whether someone has diabetes or not.
# Depending on your data, you need to provide the target(s) you want to predict here
target:
  - sick
```

Note that n_estimators and max_depth are passed to the model as additional arguments. If you don't provide arguments, the model uses its defaults. You don't need to memorize the arguments of every model. Instead, you can run igel models in the terminal to enter an interactive mode, where you are prompted for the model you want to use and the type of problem you want to solve. igel then displays information about the model and a link where you can find the list of available arguments and how to use them.
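For intuition, the random forest configuration above corresponds roughly to the following plain scikit-learn call. This is a sketch on synthetic stand-in data, not the actual Pima dataset:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Pima dataset: 8 numeric features and a
# binary "sick" label (the real dataset also has 8 features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# The constructor arguments mirror the yaml: n_estimators=100, max_depth=30.
clf = RandomForestClassifier(n_estimators=100, max_depth=30, random_state=0)
clf.fit(X, y)
train_acc = clf.score(X, y)
```

With igel, the yaml file replaces all of this boilerplate; the config's `arguments` block maps one-to-one onto the estimator's constructor keywords.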
igel is meant to be used from the terminal (the igel CLI):

Run the following command in the terminal to fit / train a model; you need to provide the paths to your dataset and your yaml file.
```shell
$ igel fit --data_path 'path_to_your_csv_dataset.csv' --yaml_file 'path_to_your_yaml_file.yaml'

# or shorter
$ igel fit -dp 'path_to_your_csv_dataset.csv' -yml 'path_to_your_yaml_file.yaml'
"""
That's it. Your "trained" model can now be found in the model_results folder
(automatically created for you in your current working directory).
Furthermore, a description can be found in the description.json file inside the model_results folder.
"""
```

Next, you can evaluate the trained / pre-trained model:
```shell
$ igel evaluate -dp 'path_to_your_evaluation_dataset.csv'
"""
This will automatically generate an evaluation.json file in the current directory, where all evaluation results are stored
"""
```

If you are satisfied with the evaluation results, you can use the trained / pre-trained model to make predictions.
```shell
$ igel predict -dp 'path_to_your_test_dataset.csv'
"""
This will generate a predictions.csv file in your current directory, where all predictions are stored
"""
```

You can combine training, evaluation, and prediction in one go with the experiment command:
```shell
$ igel experiment -DP "path_to_train_data path_to_eval_data path_to_test_data" -yml "path_to_yaml_file"
"""
This will run fit using train_data, evaluate using eval_data and further generate predictions using the test_data
"""
```

Of course, you can also use igel from code if you prefer:
Interactive mode

Interactive mode was added in v0.2.6; it lets you supply arguments the way you prefer.

That is, you can use the fit, evaluate, predict, experiment, and other commands without specifying any extra arguments, for example:

```shell
igel fit
```

If you just type this and hit enter, you will be prompted for the additional mandatory arguments. Versions 0.2.5 and earlier will throw an error here, so you need 0.2.6 or later.

As the demo shows, you don't need to memorize the arguments; igel prompts you for them. Specifically, igel displays a message explaining which argument you need to enter, and the value shown in brackets is the default.
End-to-end training example

The author gives a complete example of end-to-end training with igel: predicting whether someone has diabetes using the decision tree algorithm. You need to create a yaml configuration file; the dataset can be found in the examples folder.

Fit / train the model:
```yaml
model:
  type: classification
  algorithm: DecisionTree

target:
  - sick
```

```shell
$ igel fit -dp path_to_the_dataset -yml path_to_the_yaml_file
```

igel now fits your model and saves it in the model_results folder in the current directory.
Evaluate the model:

Now evaluate the pre-trained model. igel loads the pre-trained model from the model_results folder and evaluates it. You only need to run the evaluate command and provide the path to the evaluation data.
```shell
$ igel evaluate -dp path_to_the_evaluation_dataset
```

igel evaluates the model and stores the statistics/results in an evaluation.json file inside the model_results folder.
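Conceptually, the evaluate step boils down to scoring the saved model on held-out data and writing the metrics out to evaluation.json. A sketch of that step with toy labels; the metric choice here is illustrative, only the file name comes from igel's documentation:

```python
import json

from sklearn.metrics import accuracy_score, f1_score

# Toy labels standing in for the held-out evaluation set and the
# model's predictions on it.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

results = {
    "accuracy": accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}

# igel stores its evaluation results in evaluation.json; we do the same here.
with open("evaluation.json", "w") as f:
    json.dump(results, f, indent=2)
```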
Predict:

This step uses the pre-trained model to predict on new data. igel handles this automatically; you only need to provide the path to the data you want predictions for.
```shell
$ igel predict -dp path_to_the_new_dataset
```

igel runs predictions with the pre-trained model and saves them in a predictions.csv file inside the model_results folder.
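Conceptually, the predict step applies the saved model to the new data and writes the results to predictions.csv. A rough sketch with a stand-in model; in igel's actual flow the model would be loaded from model_results rather than trained inline:

```python
import csv

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Stand-in for the pre-trained model (made-up synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Predict on "new" data and store the results in predictions.csv,
# matching the file name igel's predict command produces.
X_new = rng.normal(size=(5, 4))
preds = clf.predict(X_new)
with open("predictions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["prediction"])
    writer.writerows([[int(p)] for p in preds])
```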
Advanced usage

You can also apply certain preprocessing methods or other operations by adding them to the yaml file. See the GitHub README for the details of the yaml configuration. In the example below, the data is split into 80% training and 20% validation/test, and it is shuffled during the split.

In addition, the data is preprocessed by replacing missing values with the mean:
```yaml
# dataset operations
dataset:
  split:
    test_size: 0.2
    shuffle: True
    stratify: default

  preprocess:  # preprocessing options
    missing_values: mean  # other possible values: [drop, median, most_frequent, constant] check the docs for more
    encoding:
      type: oneHotEncoding  # other possible values: [labelEncoding]
    scale:  # scaling options
      method: standard  # standardization will scale values to have a 0 mean and 1 standard deviation | you can also try minmax
      target: inputs  # scale inputs. | other possible values: [outputs, all]  # if you choose all then all values in the dataset will be scaled

# model definition
model:
  type: classification
  algorithm: RandomForest
  arguments:
    # notice that these are the available args for the random forest model. check the available args for all supported models by running igel help
    n_estimators: 100
    max_depth: 20

# target you want to predict
target:
  - sick
```

Then fit the model by running the igel command:
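In plain scikit-learn terms, the split and imputation settings in the yaml above amount to something like the following sketch (the data values are made up):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up feature matrix with some missing values, plus a binary target.
X = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0],
              [7.0, np.nan], [9.0, 10.0]] * 4)
y = np.array([0, 1, 0, 1, 0] * 4)

# test_size: 0.2, shuffle: True  ->  an 80/20 shuffled split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

# missing_values: mean  ->  replace NaNs with the column mean,
# fitted on the training split only to avoid data leakage
imputer = SimpleImputer(strategy="mean")
X_train = imputer.fit_transform(X_train)
X_test = imputer.transform(X_test)
```

igel performs these steps for you based on the config, so the yaml file is the only thing you write.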
```shell
$ igel fit -dp path_to_the_dataset -yml path_to_the_yaml_file
```

Evaluate:
```shell
$ igel evaluate -dp path_to_the_evaluation_dataset
```

Predict:
```shell
$ igel predict -dp path_to_the_new_dataset
```

Reference: https://medium.com/@nidhalbacc/machine-learning-without-writing-code-984b238dd890