Machine Learning with Scikit-learn and TensorFlow (Part 2)

Published: 2025/3/19 | Programming Q&A | 豆豆


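Contents

An End-to-End Machine Learning Project
- I. Real Data
- II. Project Overview
  - 1. Framing the Problem
  - 2. Selecting a Performance Measure (Loss Function)
    - (1) Regression tasks
    - (2) Mean Absolute Error (MAE)
    - (3) Norms
  - 3. Checking the Assumptions
- III. Getting the Data
  - 1. The os module
  - 2. urllib.request.urlretrieve
- IV. Looking at the Data Structure
  - 1. Inspecting the data
  - 2. Visual description: a histogram of each attribute
- V. Preparing the Data
  - 1. Test set
    - (1) Implementation (from scratch)
    - (2) Background
    - (3) Problems
    - (4) Implementation (sklearn)
    - (5) Stratified sampling
- VI. Exploring and Visualizing the Data to Gain Insights
  - 1. Visualizing geographical data
    - (1) Geographical locations
    - (2) House prices by geographical location
  - 2. Looking for correlations
    - (1) corr()
    - (2) scatter_matrix()
    - (3) This project
  - 3. Experimenting with attribute combinations ==> new attributes
- VII. Preparing the Data for Machine Learning Algorithms
  - 1. Data cleaning
    - (1) DataFrame methods
    - (2) Handling missing values with Scikit-Learn's Imputer class
    - (3) Scikit-Learn design principles
  - 2. Handling text and categorical attributes
    - (1) Converting text labels to numbers
    - (2) One-Hot Encoding
    - (3) LabelBinarizer (text categories => one-hot)
    - (4) The CategoricalEncoder class (text categories => one-hot)
  - 3. Custom transformers
  - 4. Feature scaling (important)
    - (1) Min-max scaling
    - (2) Standardization
  - 5. Transformation pipelines
    - (1) Pipeline for numerical attributes
    - (2) Multiple pipelines: FeatureUnion
- VIII. Selecting and Training a Model
  - 1. Linear regression model
  - 2. Decision tree regression
  - 3. Evaluating with cross-validation
  - 4. Saving the model
- IX. Fine-Tuning the Model
  - 1. Grid search: GridSearchCV
  - 2. Randomized search: RandomizedSearchCV
  - 3. Ensemble methods
  - 4. Analyzing the best models and their errors
- X. Evaluating the System on the Test Set
- XI. Launching, Monitoring, and Maintaining the System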

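An End-to-End Machine Learning Project

Main steps:

Look at the big picture.

Get the data.

Explore and visualize the data to gain insights.

Prepare the data for machine learning algorithms.

Select a model and train it.

Fine-tune the model.

Present the solution.

Launch, monitor, and maintain the system.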

I. Real Data

Popular open data repositories:

UC Irvine Machine Learning Repository
Kaggle datasets
Amazon’s AWS datasets

Meta portals (they list open data repositories):

http://dataportals.org/
http://opendatamonitor.eu/
http://quandl.com/

Other pages that list popular open data repositories:

Wikipedia’s list of Machine Learning datasets
Quora.com question
Datasets subreddit

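II. Project Overview

The project uses the California Housing Prices dataset from the StatLib repository, which is based on the 1990 California census, to build a model of housing prices in California. The data includes metrics such as the population, the median income, and the median house value for each block group (district). The model should learn from this data and predict the median house value of any district from the other metrics.

1. Framing the Problem

Questions to ask
(1) What is the business objective? How will the model be used, and how will the business benefit from it?
==> The answer determines how the problem is framed, which algorithms are chosen, and which performance measure is used to evaluate the model.

(2) How well does the current solution work?
==> It gives a reference performance and hints on how to solve the problem.

This project: a regression task in supervised learning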

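2. Selecting a Performance Measure (Loss Function)

(1) Regression tasks

RMSE (Root Mean Square Error)
$RMSE(X,h)=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)})-y^{(i)}\right)^2}$
Here $m$ is the number of instances in the dataset on which the RMSE is measured; $h$ is the system's prediction function, also called the hypothesis; $x^{(i)}$ is the vector of all feature values (excluding the label) of the $i$-th instance, and $y^{(i)}$ is its label; $X$ is the matrix of all feature values (excluding labels) of all instances, with one row per instance, where row $i$ is the transpose of $x^{(i)}$, written $x^{(i)T}$.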

(2) Mean Absolute Error (MAE)

Preferable when the data contains many outliers
$MAE(X,h)=\frac{1}{m}\sum_{i=1}^{m}\left|h(x^{(i)})-y^{(i)}\right|$

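(3) Norms

L2 norm (Euclidean norm): the root of a sum of squares, as in the RMSE, written $\|\cdot\|_2$ or simply $\|\cdot\|$
L1 norm (Manhattan norm): the sum of absolute values, as in the MAE, written $\|\cdot\|_1$
More generally, the $L_k$ norm (Minkowski norm of order $k$) of a vector $v$ with $n$ elements is
$\|v\|_k=\left(|v_1|^k+|v_2|^k+\dots+|v_n|^k\right)^{\frac{1}{k}}$
L0 norm: the number of non-zero elements;
$L_\infty$ norm (Chebyshev norm): the largest absolute value in the vector.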

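The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE. But when outliers are exponentially rare (as in a bell-shaped curve), the RMSE performs very well.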
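To make the difference between the two measures concrete, here is a minimal sketch in plain Python. The function names and the toy numbers are illustrative assumptions, not part of the original project; the formulas follow the RMSE, MAE and $L_k$ definitions given above.

```python
import math

def rmse(predictions, labels):
    """Root mean square error: based on the L2 norm of the error vector."""
    m = len(labels)
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(predictions, labels)) / m)

def mae(predictions, labels):
    """Mean absolute error: based on the L1 norm of the error vector."""
    m = len(labels)
    return sum(abs(p - y) for p, y in zip(predictions, labels)) / m

def lk_norm(v, k):
    """L_k norm of a vector given as a list of numbers."""
    return sum(abs(x) ** k for x in v) ** (1.0 / k)

# Toy data (made up): identical predictions except for one large error
labels = [100.0, 200.0, 300.0, 400.0]
steady = [110.0, 190.0, 310.0, 390.0]     # every error is 10
with_outlier = [110.0, 190.0, 310.0, 500.0]  # the last error is 100

print("steady:      ", rmse(steady, labels), mae(steady, labels))
print("with outlier:", rmse(with_outlier, labels), mae(with_outlier, labels))
print("L1 and L2 norms of [3, -4]:", lk_norm([3, -4], 1), lk_norm([3, -4], 2))
```

With equal errors the two measures agree; a single large error pulls the RMSE up much more than the MAE, which is the sensitivity to outliers described above.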

3. Checking the Assumptions

III. Getting the Data

import os
import tarfile
from six.moves import urllib
import pandas as pd

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

## Download and extract the data
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    ## os.path.isdir() checks whether a path is an existing directory
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    ## Join the path components
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    ## Extract the archive: open, extract, close
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

## Load the data and return a DataFrame
def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

fetch_housing_data()
housing = load_housing_data()

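Notes on the functions used

1. The os module

os.path.isdir(path): returns True if the path is an existing directory
os.path.join(path1[, path2[, …]]): joins one or more path components correctly
os.makedirs(path, mode=0o777): creates directories recursively

2. urllib.request.urlretrieve

urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)
Copies the file at the URL to the local path given by filename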

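IV. Looking at the Data Structure

1. Inspecting the data

On a DataFrame object:

head(): shows the first 5 rows;
info(): gives a quick description of the data, in particular the total number of rows and each attribute's type and number of non-null values
housing["ocean_proximity"].value_counts(): counts the instances in each category of ocean_proximity
describe(): summarizes the numerical attributes

## The DataFrame head() method shows the first 5 rows of the dataset
print(housing.head())
## info(): a quick description of the data,
## in particular the total number of rows and each attribute's type and non-null count
print(housing.info())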

Output

   longitude  latitude  ...  median_house_value  ocean_proximity
0    -122.23     37.88  ...            452600.0         NEAR BAY
1    -122.22     37.86  ...            358500.0         NEAR BAY
2    -122.24     37.85  ...            352100.0         NEAR BAY
3    -122.25     37.85  ...            341300.0         NEAR BAY
4    -122.25     37.85  ...            342200.0         NEAR BAY

[5 rows x 10 columns]
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
longitude             20640 non-null float64
latitude              20640 non-null float64
housing_median_age    20640 non-null float64
total_rooms           20640 non-null float64
total_bedrooms        20433 non-null float64
population            20640 non-null float64
households            20640 non-null float64
median_income         20640 non-null float64
median_house_value    20640 non-null float64
ocean_proximity       20640 non-null object
dtypes: float64(9), object(1)
memory usage: 1.6+ MB
None

==> total_bedrooms 20433 non-null float64
==> 207 missing values need to be handled

print(housing["ocean_proximity"].value_counts())  ## counts per category
print('----------'*5)
print(housing.describe())  ## summary of the numerical attributes

Output

<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
--------------------------------------------------
          longitude  ...  median_house_value
count  20640.000000  ...        20640.000000
mean    -119.569704  ...       206855.816909
std        2.003532  ...       115395.615874
min     -124.350000  ...        14999.000000
25%     -121.800000  ...       119600.000000
50%     -118.490000  ...       179700.000000
75%     -118.010000  ...       264725.000000
max     -114.310000  ...       500001.000000

[8 rows x 9 columns]

Note: describe() ignores missing values; for example, the count for total_bedrooms is 20433 rather than 20640.

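2. Visual description: a histogram of each attribute

A histogram shows, on the vertical axis, the number of instances whose value falls within a given range.

hist() method: called on the whole dataset, it plots a histogram for every numerical attribute

import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(20,15))
plt.show()

What the histograms show:

The median income does not appear to be expressed in US dollars (USD). The data was preprocessed: median incomes that were too high are capped at 15 (actually 15.0001), and those that were too low are capped at 0.5 (actually 0.4999).

The housing median age and the median house value were also capped. Since the median house value is the label, the model may learn that prices never exceed this limit. ==> Check whether this is acceptable

The attributes have very different scales. ==> Feature scaling

Many histograms are tail heavy and far from symmetric. ==> Transform them toward a more bell-shaped distribution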

V. Preparing the Data

Create a test set to avoid data snooping bias

1. Test set

(1) Implementation (from scratch)

import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")

Output

16512 train + 4128 test

(2) Background

1. The difference between shuffle and permutation in numpy.random

Both shuffle and permutation randomly reorder the elements of an array.

The difference: shuffle works in place, changing the order of the original array and returning nothing. permutation leaves the original array untouched and returns a new, shuffled array.

import numpy as np

a = np.arange(12)
print(a)
np.random.shuffle(a)
print(a)
print()
a = np.arange(12)
print(a)
b = np.random.permutation(a)
print(b)
print(a)

Output

[ 0  1  2  3  4  5  6  7  8  9 10 11]
[11  6  4 10  3  0  7  1  9  2  5  8]

[ 0  1  2  3  4  5  6  7  8  9 10 11]
[10  4  8 11  1  7  6  2  0  9  5  3]
[ 0  1  2  3  4  5  6  7  8  9 10 11]

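(3) Problems

Running the program again produces a different test set.

Solutions:

Save the test set on the first run and load it in later runs.

Set the random number generator's seed, e.g. np.random.seed(2019), so that the same shuffled indices are generated every time.

Both solutions break as soon as the dataset is updated.
==>
Solution: use each instance's identifier to decide whether it goes into the test set (assuming every instance has a unique and immutable identifier).
For example, compute a hash of each instance's identifier, keep only the last byte of the hash, and put the instance in the test set if that value is lower than or equal to 51 (about 20% of 256). The test set then stays consistent across runs, even after the dataset is updated. The new test set contains 20% of the new instances, but none of the instances that were previously in the training set.

If the row index is used as the unique identifier, new data must always be appended to the end of the dataset, and no row may ever be deleted.

Alternatively, build the unique identifier from the most stable features, for example a district's latitude and longitude.

import hashlib
import numpy as np

def test_set_check(identifier, test_ratio, hash):
    ## Keep the last byte of the hash and compare it with 256 * test_ratio
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

## Option 1: use the row index as the ID
housing_with_id = housing.reset_index()  # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

## Option 2: use the longitude and latitude as the unique identifier
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")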
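The claim that an identifier-based split survives dataset updates can be checked with a small standalone sketch. It swaps in `zlib.crc32` from the standard library instead of the MD5 last-byte check above, and the integer IDs are made up for illustration; it is one possible way to implement the idea, not the code used in this project.

```python
import zlib

def in_test_set(identifier, test_ratio):
    """True if the CRC32 checksum of the ID falls in the lowest test_ratio share of 32-bit values."""
    checksum = zlib.crc32(str(identifier).encode("utf-8")) & 0xFFFFFFFF
    return checksum < test_ratio * 2 ** 32

def split_ids(ids, test_ratio=0.2):
    """Split a list of IDs into (train, test) based only on each ID's checksum."""
    train = [i for i in ids if not in_test_set(i, test_ratio)]
    test = [i for i in ids if in_test_set(i, test_ratio)]
    return train, test

old_ids = list(range(10000))   # hypothetical original dataset
new_ids = list(range(12000))   # the same dataset after 2000 rows were appended

old_train, old_test = split_ids(old_ids)
new_train, new_test = split_ids(new_ids)

# Instances that were in the training set and ended up in the new test set
moved = set(old_train) & set(new_test)
print("test share:", len(old_test) / len(old_ids))
print("instances that changed sides:", len(moved))
```

Because membership depends only on the identifier, an instance can never move from the training set to the test set when rows are appended.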

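(4) Implementation (sklearn)

train_test_split()

random_state parameter: sets the random generator seed;
Several datasets with the same number of rows can be passed in one call, and they are split on the same indices

Purely random sampling like this is fine when the dataset is large (especially relative to the number of attributes);

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=2019)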
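As a small illustration of splitting several datasets on the same indices, the sketch below passes a feature list and a label list to one `train_test_split` call. The toy data is made up so that each label can be recomputed from its feature row.

```python
from sklearn.model_selection import train_test_split

# Toy data (made up): the label of row i is always 100 * i
features = [[i, i * 10] for i in range(10)]
labels = [i * 100 for i in range(10)]

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=2019)

# Rows and labels stay aligned because both lists are split on the same indices
for row, label in zip(X_test, y_test):
    print(row, label)
```

Calling the function again with the same `random_state` reproduces the same split.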

If the dataset is not large, purely random sampling risks introducing a sampling bias
==> Stratified sampling
==> Every stratum must contain enough instances

(5) Stratified sampling

Differences between loc, iloc, and ix: https://blog.csdn.net/u012736685/article/details/86610946

import numpy as np

## Divide the median income by 1.5 (to limit the number of income categories);
## np.ceil returns the smallest integer not less than x
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5)
## Merge every category above 5 into category 5
housing["income_cat"] = housing["income_cat"].where(housing["income_cat"] < 5, 5.0)

## Stratified sampling with StratifiedShuffleSplit
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=2019)
print(split)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

## Income category proportions in the full dataset
print(housing["income_cat"].value_counts() / len(housing))

## Drop income_cat so the data is back to its original state
for set_ in (strat_train_set, strat_test_set):
    set_.drop(["income_cat"], axis=1, inplace=True)

Output

StratifiedShuffleSplit(n_splits=1, random_state=2019, test_size=0.2,
            train_size=None)
3.0    0.350581
2.0    0.318847
4.0    0.176308
5.0    0.114438
1.0    0.039826
Name: income_cat, dtype: float64
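To show what stratification buys, here is a hand-rolled sketch in plain Python. It is not how StratifiedShuffleSplit is implemented; the helper `stratified_split` and the category counts (chosen to roughly match the proportions printed above) are assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

def stratified_split(categories, test_ratio, seed):
    """Sample test_ratio of the indices from each category separately."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for index, category in enumerate(categories):
        by_category[category].append(index)
    test_indices = []
    for indices in by_category.values():
        n_test = round(len(indices) * test_ratio)
        test_indices.extend(rng.sample(indices, n_test))
    test_set = set(test_indices)
    train_indices = [i for i in range(len(categories)) if i not in test_set]
    return train_indices, sorted(test_indices)

# Hypothetical income categories for 1000 districts
categories = [3] * 350 + [2] * 320 + [4] * 175 + [5] * 115 + [1] * 40
train_idx, test_idx = stratified_split(categories, test_ratio=0.2, seed=2019)

overall = Counter(categories)
in_test = Counter(categories[i] for i in test_idx)
for category in sorted(overall):
    print(category,
          overall[category] / len(categories),   # share in the full dataset
          in_test[category] / len(test_idx))     # share in the test set
```

Each category keeps the same share in the test set as in the full dataset, which a purely random split only achieves approximately.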

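VI. Exploring and Visualizing the Data to Gain Insights

1. Visualizing geographical data

(1) Geographical locations

The data contains geographical information ==> scatter plot

## Geographical information ==> scatter plot
housing.plot(kind="scatter", x="longitude", y="latitude")
plt.show()

## Scatter plot that makes the high-density areas stand out
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)
plt.show()

(2) House prices by geographical location

In the house price scatter plot, the radius of each circle represents the district's population (option s), and the color represents the price (option c). The plot uses a predefined color map called jet (option cmap), which ranges from blue (low prices) to red (high prices).

==> House prices are closely related to the location (for example, proximity to the ocean) and to the population density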
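The plotting call for this figure is not shown in the post. A sketch of how the options s, c and cmap could be combined is given below; the five-row DataFrame is a made-up stand-in for housing, and the population divisor of 100 is an assumption chosen only to keep the circles a reasonable size.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the housing DataFrame (five made-up districts)
demo = pd.DataFrame({
    "longitude": [-122.2, -121.9, -118.4, -117.2, -119.8],
    "latitude": [37.8, 37.3, 34.0, 32.7, 36.7],
    "population": [3200, 1500, 5400, 2100, 800],
    "median_house_value": [452600.0, 358500.0, 500001.0, 264700.0, 119600.0],
})

ax = demo.plot(
    kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=demo["population"] / 100,   # circle size: district population
    c="median_house_value",       # circle color: median house value
    cmap="jet", colorbar=True,    # blue (low prices) to red (high prices)
)
plt.close(ax.get_figure())
```

On the real data, the same call would be made on housing, followed by plt.show() in an interactive session.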

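2. Looking for correlations

(1) corr()

The corr() method computes the standard correlation coefficient (also called Pearson's correlation coefficient) between every pair of attributes.

The coefficient ranges from -1 to 1. A value close to 1 indicates a strong positive correlation; a value close to -1 indicates a strong negative correlation; a value close to 0 indicates no linear correlation.

## With pandas >= 2.0, pass numeric_only=True so the text column ocean_proximity is skipped
corr_matrix = housing.corr()
# How much each attribute correlates with the median house value
corr_matrix_house_value = corr_matrix["median_house_value"].sort_values(ascending=False)
print(corr_matrix_house_value)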

Output

median_house_value    1.000000
median_income         0.687894
total_rooms           0.135763
housing_median_age    0.108102
households            0.067783
total_bedrooms        0.050826
population           -0.024467
longitude            -0.049271
latitude             -0.139948
Name: median_house_value, dtype: float64
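For readers who want to see what the coefficient measures, here is a minimal plain-Python sketch of Pearson's r. The helper name `pearson_r` and the toy vectors are illustrative assumptions; the last case shows that a perfectly deterministic but nonlinear relationship can still have a coefficient of zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
increasing = pearson_r(x, [2 * v + 1 for v in x])   # linear, increasing
decreasing = pearson_r(x, [-0.5 * v for v in x])    # linear, decreasing
quadratic = pearson_r(x, [v ** 2 for v in x])       # nonlinear (y = x squared)
print(increasing, decreasing, quadratic)
```

The coefficient only captures linear relationships, so a low value in the table above does not prove that an attribute is useless.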

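(2) scatter_matrix()

pandas' scatter_matrix() plots every numerical attribute against every other numerical attribute. For example, with $d$ attributes there are $d^2$ plots.

So focus on the few attributes that seem most correlated with the median house value

from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()

(3) This project

The most promising attribute for predicting the median house value is the median income.

housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1)
plt.show()

==>

The correlation is very strong: the trend is clearly upward and the points are not too dispersed
The price cap is visible as a horizontal line at $500,000
There are other, less obvious horizontal lines around $450,000, $350,000, $280,000...

3. Experimenting with attribute combinations ==> new attributes

Think about how the target relates to the existing attributes

housing["room_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"] = housing["population"]/housing["households"]
# Look at the correlation matrix again
corr_matrix = housing.corr()
corr_matrix_house_value = corr_matrix["median_house_value"].sort_values(ascending=False)
print(corr_matrix_house_value)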

Output

median_house_value          1.000000
median_income               0.687894
room_per_household          0.146690
total_rooms                 0.135763
housing_median_age          0.108102
households                  0.067783
total_bedrooms              0.050826
population                 -0.024467
population_per_household   -0.025585
longitude                  -0.049271
latitude                   -0.139948
bedrooms_per_room          -0.253689
Name: median_house_value, dtype: float64

==> The new bedrooms_per_room attribute is more strongly correlated with the median house value than total_rooms or total_bedrooms. The lower the ratio of bedrooms to rooms, the higher the house price.

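VII. Preparing the Data for Machine Learning Algorithms

Write the data transformations as functions so they can be reused on any dataset

Note: drop() creates a copy of the data and does not modify the original.

Separate the (clean) training set into predictors and labels

housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

1. Data cleaning

Missing features

(1) DataFrame methods

dropna(): drop the instances with missing values
drop(): drop the whole attribute that has missing values
fillna(): fill the missing values with some value

housing.dropna(subset=["total_bedrooms"])    # option 1
housing.drop("total_bedrooms", axis=1)       # option 2
median = housing["total_bedrooms"].median()  # option 3
housing["total_bedrooms"].fillna(median)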

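(2) Handling missing values with Scikit-Learn's Imputer class

Create an Imputer instance, specifying that each attribute's missing values are replaced with that attribute's median;

Prepare the data: numerical attributes only

fit() fits the imputer to the training data;

transform() transforms the data

Type conversion (optional): ndarray -> DataFrame

## Imputer was removed in scikit-learn 0.22; later versions provide sklearn.impute.SimpleImputer
from sklearn.preprocessing import Imputer

## 1. Create the Imputer instance
imputer = Imputer(strategy="median")

## 2. Prepare the data
## The median only exists for numerical attributes ==> copy the data without ocean_proximity
housing_num = housing.drop("ocean_proximity", axis=1)

## 3. fit() the imputer to the data
imputer.fit(housing_num)
## The medians are stored in the instance variable statistics_
print(imputer.statistics_)
print(housing_num.median().values)  ## equivalent

## 4. transform() the data; the result is a NumPy array
X = imputer.transform(housing_num)
# print(type(X))  # <class 'numpy.ndarray'>

## 5. Convert the NumPy array back to a DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns)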
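On scikit-learn 0.20 and later, the same steps can be written with `SimpleImputer` from `sklearn.impute`. The sketch below uses a tiny made-up array instead of housing_num, so the medians are easy to verify by eye.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric data (made up) with one missing value per column
X = np.array([[1.0, 10.0],
              [2.0, np.nan],
              [3.0, 30.0],
              [np.nan, 50.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(X)                    # learns one median per column
print(imputer.statistics_)        # the learned medians
X_filled = imputer.transform(X)   # missing entries replaced by the medians
print(X_filled)
```

As with Imputer, the learned medians live in `statistics_` and the result of `transform()` is a NumPy array.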

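(3) Scikit-Learn design principles

Consistency: all objects share a simple and consistent interface.

Estimators: objects that estimate parameters from a dataset, through the fit() method.
Transformers: estimators that can also transform a dataset, through the transform() method.
Predictors: estimators that can make predictions from a dataset. The predict() method returns predictions for a dataset of new instances, and the score() method measures the quality of the predictions.

Inspection: hyperparameters and learned parameters are directly accessible
(1) hyperparameters through public instance variables (e.g. imputer.strategy);
(2) learned parameters through public instance variables with an underscore suffix (e.g. imputer.statistics_)
Nonproliferation of classes: datasets are NumPy arrays or SciPy sparse matrices rather than custom classes
Composition: existing building blocks are reused as much as possible
Sensible defaults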

2. Handling text and categorical attributes

(1) Converting text labels to numbers

==>
Scikit-Learn: LabelEncoder, which numbers the categories in sorted order
pandas: factorize(), which numbers the categories in order of first appearance

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
print(housing_cat_encoded[:20])

## The pandas factorize() method
housing_cat_encoded, housing_categories = housing_cat.factorize()
print(housing_cat_encoded[:20])

## Inspect the LabelEncoder mapping
print(encoder.classes_)

Output

[0 0 0 1 4 0 0 1 1 0 0 0 0 4 1 4 0 4 0 0]
[0 0 0 1 2 0 0 1 1 0 0 0 0 2 1 2 0 2 0 0]
['<1H OCEAN' 'INLAND' 'ISLAND' 'NEAR BAY' 'NEAR OCEAN']

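Problem: ML algorithms will assume that two nearby values are more similar than two distant values.
==> One-Hot Encoding

(2) One-Hot Encoding

sklearn provides the OneHotEncoder class, which converts integer categorical values into one-hot vectors.

Note: fit_transform() expects a 2D array.

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
print(type(housing_cat_1hot))
## Convert to a NumPy array with toarray()
print(housing_cat_1hot.toarray())

<class 'scipy.sparse.csr.csr_matrix'>
The output is a SciPy sparse matrix, which only stores the positions of the non-zero elements; it can mostly be used like a normal 2D array.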

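(3) LabelBinarizer (text categories => one-hot)

Intended for transforming label columns; the output is an ndarray by default

Parameter: sparse_output=True returns a sparse matrix instead

## One step: from text categories to one-hot encoding
from sklearn.preprocessing import LabelBinarizer

# encoder = LabelBinarizer(sparse_output=True)  # the result is a sparse matrix
encoder = LabelBinarizer()  # the result is an ndarray
housing_cat_1hot = encoder.fit_transform(housing_cat)
print(housing_cat_1hot)  # ndarray

(4) The CategoricalEncoder class (text categories => one-hot)

## CategoricalEncoder only existed in a development version of scikit-learn and was never released,
## so the import below stays commented out and the class has to be defined separately
# from sklearn.preprocessing import CategoricalEncoder

cat_encoder = CategoricalEncoder()
housing_cat_reshaped = housing_cat.values.reshape(-1, 1)
housing_cat_1hot = cat_encoder.fit_transform(housing_cat_reshaped)
print(housing_cat_1hot)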
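In released versions of scikit-learn (0.20 and later), `OneHotEncoder` accepts string categories directly, so the text-to-one-hot conversion needs neither LabelEncoder nor CategoricalEncoder. A minimal sketch on a made-up column:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (made up), shaped as a 2D array with a single feature
categories = np.array(
    ["NEAR BAY", "INLAND", "<1H OCEAN", "INLAND", "ISLAND"], dtype=object
).reshape(-1, 1)

encoder = OneHotEncoder()
one_hot = encoder.fit_transform(categories)  # SciPy sparse matrix by default
print(encoder.categories_)                   # learned categories, in sorted order
print(one_hot.toarray())                     # dense view, one column per category
```

The learned categories are exposed through `categories_`, one array per input column.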

3. Custom transformers

sklearn relies on duck typing (not inheritance), so a custom transformer is simply a class that implements three methods: fit() (returning self), transform(), and fit_transform().

Adding TransformerMixin as a base class provides the last one for free;
Adding BaseEstimator as a base class (and avoiding *args and **kwargs in the constructor) provides two extra methods, get_params() and set_params(), which are useful for automatic hyperparameter tuning.

import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, household_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    # Hyperparameter: add_bedrooms_per_room
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, household_ix]
        population_per_household = X[:, population_ix] / X[:, household_ix]
        if self.add_bedrooms_per_room:
            # Bedrooms per room: divide by the number of rooms, not households
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)

The hyperparameter add_bedrooms_per_room defaults to True (providing a sensible default helps). It makes it easy to find out whether adding this attribute helps the machine learning algorithm. More generally, a hyperparameter can be added to gate any data preparation step you are not fully sure about.
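What the two base classes contribute can be seen on a deliberately tiny transformer. `RatioAdder` below is a hypothetical class written for this illustration; it is not part of the project.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

class RatioAdder(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: optionally appends column 0 divided by column 1."""
    def __init__(self, add_ratio=True):
        self.add_ratio = add_ratio

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        if self.add_ratio:
            return np.c_[X, X[:, 0] / X[:, 1]]
        return X

X = np.array([[6.0, 2.0], [9.0, 3.0]])
adder = RatioAdder()

print(adder.get_params())          # from BaseEstimator: the constructor arguments
with_ratio = adder.fit_transform(X)  # from TransformerMixin: fit() then transform()
print(with_ratio)

adder.set_params(add_ratio=False)  # from BaseEstimator: change a hyperparameter in place
without_ratio = adder.fit_transform(X)
print(without_ratio.shape)
```

`get_params()` and `set_params()` are what tools such as GridSearchCV rely on to try different hyperparameter values.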

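4. Feature scaling (important)

In general, ML algorithms do not perform well when the input numerical attributes have very different scales. ==> Feature scaling

Min-max scaling
Standardization

(1) Min-max scaling

Also called normalization: values are shifted and rescaled so that they end up ranging from 0 to 1.

By hand: subtract the minimum, then divide by the difference between the maximum and the minimum.

In sklearn: MinMaxScaler. Its feature_range parameter changes the target range

(2) Standardization

First subtract the mean (so standardized values always have a zero mean), then divide by the standard deviation, so that the resulting distribution has unit variance.

Standardization does not bound values to a specific range, but it is much less affected by outliers than min-max scaling.

In sklearn: StandardScaler

Note: fit the scalers on the training set only, not on the full dataset. ==> Then use the fitted scaler to transform both the training set and the test set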
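Both scalings are simple enough to write by hand, which also makes the "fit on the training set only" rule explicit. The sketch below uses plain Python and made-up numbers; it mirrors the formulas above and is not the code of MinMaxScaler or StandardScaler.

```python
import statistics

train = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up values; 100.0 is an outlier
test = [2.5, 5.0]

# "Fit": every scaling parameter is computed from the training set only
lo, hi = min(train), max(train)
mean = statistics.mean(train)
std = statistics.pstdev(train)  # population standard deviation

def min_max(values):
    """Min-max scaling with the training minimum and maximum."""
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Standardization with the training mean and standard deviation."""
    return [(v - mean) / std for v in values]

train_min_max = min_max(train)
train_standardized = standardize(train)
print(train_min_max)
print(train_standardized)
print(min_max(test), standardize(test))  # the test set reuses the training parameters
```

With min-max scaling the single outlier squeezes every other training value into a narrow band near 0, which illustrates why standardization is described as less affected by outliers.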

5. Transformation pipelines

sklearn's Pipeline class runs a sequence of transformations. Its constructor takes a list of name/estimator pairs that defines the order of the steps

(1) Pipeline for numerical attributes

Example: a small pipeline for the numerical attributes

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
housing_num_tr = num_pipeline.fit_transform(housing_num)

(2) Multiple pipelines: FeatureUnion

A complete pipeline that handles both numerical and categorical attributes

from sklearn.pipeline import FeatureUnion
from sklearn_features.transformers import DataFrameSelector

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ('selector', DataFrameSelector(num_attribs)),
    ('imputer', Imputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])
cat_pipeline = Pipeline([
    ('selector', DataFrameSelector(cat_attribs)),
    ('label_binarizer', CategoricalEncoder()),
])
full_pipeline = FeatureUnion(transformer_list=[
    ("num_pipeline", num_pipeline),
    ("cat_pipeline", cat_pipeline),
])

housing_prepared = full_pipeline.fit_transform(housing)
print(housing_prepared.toarray())
print(housing_prepared.shape)

Output

[[ 0.82875658 -0.77511404 -0.45095287 ...  0.          0.          0.        ]
 [-1.23341542  0.81679116 -1.00737005 ...  0.          0.          0.        ]
 [ 0.71890722 -0.76572227 -0.21248836 ...  0.          0.          0.        ]
 ...
 [ 0.95857854 -0.81737701 -1.24583456 ...  0.          0.          0.        ]
 [ 1.25317454 -1.16956843 -0.92788188 ...  0.          0.          0.        ]
 [-1.57794295  1.26290029 -0.21248836 ...  0.          0.          0.        ]]
(16512, 16)

The selector can also be written as a custom transformer

## Custom transformer
from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values
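On scikit-learn 0.20 and later, `ColumnTransformer` from `sklearn.compose` can route columns to different transformers, which covers the role of FeatureUnion plus DataFrameSelector. The sketch below is one possible modern equivalent under that assumption; the four-row DataFrame is made up, and the attribute-adding step is left out to keep it short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stand-in for the housing DataFrame
demo = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY", "<1H OCEAN"],
})
num_attribs = ["total_rooms", "total_bedrooms"]
cat_attribs = ["ocean_proximity"]

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("std_scaler", StandardScaler()),
])
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),    # numerical columns: impute, then scale
    ("cat", OneHotEncoder(), cat_attribs), # categorical column: one-hot encode
])

demo_prepared = full_pipeline.fit_transform(demo)
# The result may be sparse or dense depending on its density; normalize to dense here
if hasattr(demo_prepared, "toarray"):
    demo_prepared = demo_prepared.toarray()
print(demo_prepared.shape)
```

The columns are selected by name inside ColumnTransformer, so no separate selector class is needed.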

VIII. Selecting and Training a Model

Train and evaluate on the training set

1. Linear regression model

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

## Try the model on a few instances
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
some_predict = lin_reg.predict(some_data_prepared)
print("Predictions:\t", some_predict)
print("Labels:\t", list(some_labels))

## Compute the RMSE on the whole training set
housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print(lin_rmse)

Output: 68669.95539695179.
==> The model underfits
==> Cause: the features do not provide enough information to make good predictions, or the model is not powerful enough.
==> Ways to improve:

A more powerful model;
Better features;
Fewer constraints on the model (when it is regularized too much)

2. Decision tree regression

Capable of finding complex nonlinear relationships in the data.

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
print(tree_rmse)

Output: 0.0
==> The model badly overfits

3. Evaluating with cross-validation

Common approaches:

Use train_test_split to split the training set into a smaller training set and a validation set, keeping the test set aside;
Cross-validation: K-fold cross-validation

Cross-validation of three models: linear regression, decision tree regression, and random forest regression

from sklearn.model_selection import cross_val_score

def display_scores(scores):
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard deviation:", scores.std())

## Decision tree
tree_scores = cross_val_score(tree_reg, housing_prepared, housing_labels,
                              scoring="neg_mean_squared_error", cv=10)
tree_rmse_scores = np.sqrt(-tree_scores)
display_scores(tree_rmse_scores)
print("---------"*4)

## Linear regression
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels,
                             scoring="neg_mean_squared_error", cv=10)
lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores)
print("----------"*4)

## Random forest
from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels,
                                scoring="neg_mean_squared_error", cv=10)
forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores)

Output

Scores: [70822.23047418 71152.99791399 70767.60492457 69174.95049637
 72622.10092238 69728.83829471 66654.37564791 70054.61150426
 69280.92370212 74907.80020052]
Mean: 70516.64340810133
Standard deviation: 2082.183642340021
------------------------------------
Scores: [ 70442.28429562  69617.76028683  64863.46929222  66655.75946003
  69140.8730363   69983.30339185 168909.38005488  69421.92167885
  69133.39326617  72247.69581812]
Mean: 79041.58405808883
Standard deviation: 30017.242297265897
----------------------------------------
Scores: [51250.15421462 51550.55413458 50450.47743545 49847.26652631
 52580.05326516 53701.83169532 53254.54063586 53543.98321435
 51547.57591096 54118.87113271]
Mean: 52184.53081653092
Standard deviation: 1390.819447961666

Overfitting can be addressed by simplifying the model, constraining it (i.e. regularizing it), or getting more training data.
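The mechanics of K-fold cross-validation can be sketched in plain Python. This is an illustration of the idea, not how cross_val_score is implemented: the helper `k_fold_indices`, the labels, and the "model" that predicts the mean of its training labels are all assumptions made for the example.

```python
import math

def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size

# Made-up labels with one extreme value
labels = [10.0, 12.0, 11.0, 13.0, 50.0, 12.0, 11.0, 10.0, 13.0, 12.0]

rmse_scores = []
for train, validation in k_fold_indices(len(labels), k=5):
    prediction = sum(labels[i] for i in train) / len(train)  # "train" the model
    mse = sum((labels[i] - prediction) ** 2 for i in validation) / len(validation)
    rmse_scores.append(math.sqrt(mse))                       # evaluate on the held-out fold

mean_rmse = sum(rmse_scores) / len(rmse_scores)
print(rmse_scores)
print(mean_rmse)
```

Every instance is used for validation exactly once, and the fold that contains the extreme label scores much worse than the others, which is why the standard deviation of the scores is reported alongside the mean. In the Scikit-Learn code above the scores are negated MSE values because cross-validation expects a utility function (greater is better), hence the `-scores` before taking the square root.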

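4. Saving the model

Save every model you experiment with so it can be reused later. Make sure to save the hyperparameters and the trained parameters, as well as the cross-validation scores and the actual predictions.

Python: the pickle module
sklearn: sklearn.externals.joblib

## sklearn.externals.joblib was removed in scikit-learn 0.23; on later versions use `import joblib`
from sklearn.externals import joblib

## dump
joblib.dump(forest_reg, "my_model.pkl")
## load
my_model_loaded = joblib.load("my_model.pkl")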

IX. Fine-Tuning the Model

1. Grid search: GridSearchCV

Tell GridSearchCV which hyperparameters to experiment with and which values to try, and it evaluates every possible combination of hyperparameter values using cross-validation.

from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]
forest_reg = RandomForestRegressor()
grid_search = GridSearchCV(forest_reg, param_grid, cv=5, scoring='neg_mean_squared_error')
grid_search.fit(housing_prepared, housing_labels)

## Best combination of hyperparameters
print(grid_search.best_params_)
## Best estimator
print(grid_search.best_estimator_)
## Evaluation scores
cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)

Output

{'max_features': 6, 'n_estimators': 30}
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
           max_features=6, max_leaf_nodes=None, min_impurity_decrease=0.0,
           min_impurity_split=None, min_samples_leaf=1,
           min_samples_split=2, min_weight_fraction_leaf=0.0,
           n_estimators=30, n_jobs=1, oob_score=False, random_state=None,
           verbose=0, warm_start=False)
64779.22756782305 {'max_features': 2, 'n_estimators': 3}
55261.50069764705 {'max_features': 2, 'n_estimators': 10}
52361.133957894344 {'max_features': 2, 'n_estimators': 30}
59781.94102696423 {'max_features': 4, 'n_estimators': 3}
51630.24533131685 {'max_features': 4, 'n_estimators': 10}
49858.27456556619 {'max_features': 4, 'n_estimators': 30}
58919.396444692095 {'max_features': 6, 'n_estimators': 3}
51688.869762217924 {'max_features': 6, 'n_estimators': 10}
49706.749116241685 {'max_features': 6, 'n_estimators': 30}
58580.04583044209 {'max_features': 8, 'n_estimators': 3}
51316.919104777364 {'max_features': 8, 'n_estimators': 10}
49836.46832731868 {'max_features': 8, 'n_estimators': 30}
61793.95302711806 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54158.3503067861 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59230.45284179936 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
51852.484216931596 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
57991.28909825388 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51045.46342488829 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

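Note: if GridSearchCV is initialized with refit=True (the default), then once it has found the best estimator using cross-validation, it retrains it on the whole training set. This is usually a good idea, since feeding the model more data is likely to improve its performance.

Extension: data preparation steps can be treated like hyperparameters.
E.g. the grid search can automatically find out whether to add a feature you are not sure about (for instance, through the add_bedrooms_per_room hyperparameter of the CombinedAttributesAdder transformer). In the same way it can automatically find the best way to handle outliers, missing features, feature selection, and so on.

2. Randomized search: RandomizedSearchCV

Suited to large hyperparameter search spaces: instead of trying every combination, it evaluates a given number of random combinations, picking a random value for each hyperparameter at every iteration.

Advantages:

The number of iterations is configurable, which gives control over the computing budget of the search;
For example, with 1,000 iterations the search explores 1,000 different values for each hyperparameter.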
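The post shows no code for the randomized search. A minimal sketch could look like the following; the synthetic data stands in for housing_prepared and housing_labels, and the distributions, `n_iter=5` and `cv=3` are arbitrary choices made so the example runs quickly.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical synthetic regression data (200 instances, 8 features)
rng = np.random.RandomState(2019)
X = rng.rand(200, 8)
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Each hyperparameter gets a distribution to sample from instead of a fixed list
param_distribs = {
    "n_estimators": randint(low=5, high=30),  # integers from 5 to 29
    "max_features": randint(low=1, high=8),   # integers from 1 to 7
}
rnd_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=2019),
    param_distributions=param_distribs,
    n_iter=5,                                 # the computing budget: 5 combinations
    cv=3, scoring="neg_mean_squared_error", random_state=2019)
rnd_search.fit(X, y)

print(rnd_search.best_params_)
print(np.sqrt(-rnd_search.best_score_))       # best cross-validated RMSE
```

The results are read exactly as with GridSearchCV, through `best_params_`, `best_estimator_` and `cv_results_`.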

3. Ensemble methods

Combine the models that perform best.

4. Analyzing the best models and their errors

feature_importances = grid_search.best_estimator_.feature_importances_
print(feature_importances)
print("------------"*4)
# Pair the importance scores with the attribute names
extra_attribs = ["rooms_per_hhold", "pop_per_hhold", "bedrooms_per_room"]
cat_one_hot_attribs = list(encoder.classes_)
attributes = num_attribs + extra_attribs + cat_one_hot_attribs
print(sorted(zip(feature_importances, attributes), reverse=True))

Output

[9.16165799e-02 7.26401545e-02 3.98792143e-02 1.86271235e-02
 1.60430050e-02 1.73210114e-02 1.56763513e-02 3.50405341e-01
 6.66148402e-02 1.06807615e-01 2.44534680e-02 1.61218489e-02
 1.50921731e-01 2.34067365e-04 4.78731917e-03 7.85032928e-03]
------------------------------------------------
[(0.350405341367853, 'median_income'),
 (0.15092173111904114, 'INLAND'),
 (0.10680761466868184, 'pop_per_hhold'),
 (0.09161657987744719, 'longitude'),
 (0.07264015445556546, 'latitude'),
 (0.06661484017044574, 'rooms_per_hhold'),
 (0.0398792143039908, 'housing_median_age'),
 (0.024453467989156156, 'bedrooms_per_room'),
 (0.01862712349143468, 'total_rooms'),
 (0.01732101143541747, 'population'),
 (0.0161218489227077, '<1H OCEAN'),
 (0.01604300503679087, 'total_bedrooms'),
 (0.015676351349524945, 'households'),
 (0.007850329279641088, 'NEAR OCEAN'),
 (0.00478731916701119, 'NEAR BAY'),
 (0.00023406736529078728, 'ISLAND')]

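X. Evaluating the System on the Test Set

Steps:

Get the predictors and the labels from the test set;
Run full_pipeline to transform the data (call transform(), not fit_transform()!);
Evaluate the final model on the test set:

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)
final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

The result is usually slightly worse than what was measured with cross-validation

XI. Launching, Monitoring, and Maintaining the System

Preparation: plug in the production input data sources, write tests, write monitoring code, and retrain the model regularly on fresh data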
