Python Data Analysis in Practice (1): Income Prediction Analysis

Published: 2023/12/10 · python · 豆豆

Income Analysis and Prediction

- Notes
- 1. Preview the dataset and define the goal of the analysis
- 2. Load the dataset and preprocess the data
- 3. Explore the characteristics of the data
  - 3.1 Descriptive statistics for numeric variables
  - 3.2 Descriptive statistics for discrete variables
  - 3.3 Understand the shape of the distributions
- 4. Modeling
  - 4.1 Re-encode the discrete variables
  - 4.2 Split the dataset
  - 4.3 Build the models
  - 4.4 Grid search for the best model parameters
  - 4.5 Prediction and evaluation
    - 4.5.1 K-nearest-neighbors model on the test set
    - 4.5.2 Grid-searched K-nearest-neighbors model on the test set
    - 4.5.3 GBDT model on the test set
    - 4.5.4 Grid-searched GBDT model on the test set
- 5. Summary of the experiment

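Notes

This article is a study record only.

Reference book: Liu Shunxiang, 从零开始学Python数据分析与挖掘 (Learning Python Data Analysis and Mining from Scratch). Beijing: Tsinghua University Press, 2018.
Data download: https://pan.baidu.com/s/1VhnNfUNgNLICIFRyrlteOg (extraction code: m1dl)

To begin, here is how Liu distinguishes data analysis from data mining:

[Figure: data analysis versus data mining, from the reference book]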

1. Preview the dataset and define the goal of the analysis

Opening the income file in Excel shows that the dataset contains 32,561 records and 15 variables, of which 9 are discrete and 6 are numeric. The fields include age, work class, education level, income and so on:

[Figures: two screenshots of the income dataset]

Looking over the dataset, the income column stands out as the variable worth analysing, since differences in the other variables can be expected to influence it.

Goal of the experiment: based on this dataset, predict whether a resident's annual income exceeds 50,000 US dollars.

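2. Load the dataset and preprocess the data

In a Jupyter notebook, import the required packages, read the data and preprocess it. Many variables in this dataset are discrete, such as education level, marital status, occupation and sex. Freshly obtained data usually needs cleaning first: checking for duplicate observations, missing values, outliers and so on. For modeling, string-valued discrete variables also have to be re-encoded.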

    import numpy as np
    import pandas as pd
    import seaborn as sns

    # Path where the downloaded dataset is stored
    income = pd.read_excel(r'E:\Data\1\income.xlsx')

    # Count the missing values in each column
    income.apply(lambda x: np.sum(x.isnull()))

Output:

    age                  0
    workclass         1836
    fnlwgt               0
    education            0
    education-num        0
    marital-status       0
    occupation        1843
    relationship         0
    race                 0
    sex                  0
    capital-gain         0
    capital-loss         0
    hours-per-week       0
    native-country     583
    income               0
    dtype: int64

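The output shows that three variables in the income dataset have missing values: work class (discrete, 1836 missing), occupation (discrete, 1843 missing) and native country (discrete, 583 missing). Missing values generally affect the results of analysis or modeling, so they need to be handled.
Three approaches are common:

1. Deletion: suitable when only a small share of the data is missing.
2. Replacement: fill the gaps with a constant, using the mode for discrete variables and the mean or median for numeric variables.
3. Imputation: predict the missing values from the variables that are not missing.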
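To make the first two strategies concrete, here is a minimal sketch on a made-up five-row table. The column values are hypothetical and not taken from the income file; imputation by prediction is left out because it needs a fitted model.

```python
import pandas as pd

# Hypothetical toy data; None marks a missing entry.
toy = pd.DataFrame({
    "workclass": ["Private", None, "Private", "State-gov", "Private"],
    "hours": [40.0, 38.0, None, 45.0, 40.0],
})

# 1. Deletion: drop every row that has at least one missing value.
dropped = toy.dropna()

# 2. Replacement: mode for the discrete column, median for the numeric one.
replaced = toy.fillna({
    "workclass": toy["workclass"].mode()[0],
    "hours": toy["hours"].median(),
})

print(len(toy), len(dropped))
print(replaced)
```

Deletion keeps only the complete rows, so it costs data; replacement keeps every row but pulls the filled cells toward the most typical value.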

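All three variables with missing values are discrete, so the mode can be used as the replacement. The pandas fillna() method fills NA/NaN values with a given value or by a given method.
Signature (as documented for the pandas version used in this article; newer releases deprecate method, in favour of ffill() and bfill(), as well as downcast):
fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
Parameters:

value: the value used to fill the gaps (here, a dictionary mapping column names to fill values).
method: {'backfill', 'bfill', 'pad', 'ffill', None}, default None. The filling method: pad/ffill fills a gap with the preceding row/column value, backfill/bfill fills it with the following row/column value.
axis: the axis along which missing values are filled. 0 or 'index' fills down each column; 1 or 'columns' fills across each row.
inplace: whether to modify the object in place. Boolean, default False. If True, the original DataFrame is modified and the return value is None.
limit: int, default None. If method is given, at most the first limit values of every run of consecutive gaps are filled. If method is not given, at most limit gaps are filled along the axis in total, whether or not they are contiguous.
downcast: dict, default None, mapping items to a downcasting rule; or the string 'infer', which downcasts to an equivalent smaller type where possible (for example float64 to int64).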

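    # Handle missing values by replacing them with the mode (mode() returns the most frequent value).
    # The keys must match the column names exactly: a misspelled key such as
    # 'ouccupation' is skipped and leaves that column unfilled.
    income.fillna(value={'workclass': income['workclass'].mode()[0],
                         'occupation': income['occupation'].mode()[0],
                         'native-country': income['native-country'].mode()[0]},
                  inplace=True)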
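One pitfall of the dictionary form is worth checking after the call. As far as I can tell, pandas skips dictionary keys that match no column without raising an error, so a typo in a key leaves that column untouched. A small sketch on hypothetical data:

```python
import pandas as pd

# Hypothetical toy data with one gap in each column.
toy = pd.DataFrame({
    "workclass": ["Private", None, "Private"],
    "occupation": ["Sales", "Sales", None],
})

# "ouccupation" matches no column, so that entry is ignored.
wrong = toy.fillna({"workclass": "Private", "ouccupation": "Sales"})
right = toy.fillna({"workclass": "Private", "occupation": "Sales"})

print(wrong.isnull().sum().to_dict())  # occupation still has a missing value
print(right.isnull().sum().to_dict())  # no missing values left
```

Re-running the missing-value count after filling is a cheap way to confirm that every intended column was actually filled.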

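3. Explore the characteristics of the data

With the missing values replaced, the next step is a simple exploratory analysis of the income dataset. The aim is to understand the characteristics behind the data: central tendency, dispersion, the shape of the distributions and the relationships between variables. The first thing to know is the basic statistics of each variable, such as the mean, median and mode; only with a grasp of the data's characteristics can the later steps be carried out with confidence.

3.1 Descriptive statistics for numeric variables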

    # 3.1 Descriptive statistics for numeric variables
    income.describe()

Output:

                    age        fnlwgt  education-num  capital-gain  capital-loss  hours-per-week
    count  32561.000000  3.256100e+04   32561.000000  32561.000000  32561.000000    32561.000000
    mean      38.581647  1.897784e+05      10.080679   1077.648844     87.303830       40.437456
    std       13.640433  1.055500e+05       2.572720   7385.292085    402.960219       12.347429
    min       17.000000  1.228500e+04       1.000000      0.000000      0.000000        1.000000
    25%       28.000000  1.178270e+05       9.000000      0.000000      0.000000       40.000000
    50%       37.000000  1.783560e+05      10.000000      0.000000      0.000000       40.000000
    75%       48.000000  2.370510e+05      12.000000      0.000000      0.000000       45.000000
    max       90.000000  1.484705e+06      16.000000  99999.000000   4356.000000       99.000000

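The output gives simple statistics for the numeric variables: the number of non-missing observations (count), the mean, the standard deviation (std), the minimum (min), the lower quartile (25%), the median (50%), the upper quartile (75%) and the maximum (max).

3.2 Descriptive statistics for discrete variables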

    # 3.2 Descriptive statistics for discrete variables
    income.describe(include=['object'])

Output:

            workclass  education      marital-status      occupation  relationship   race    sex  native-country  income
    count       32561      32561               32561           30718         32561  32561  32561           32561   32561
    unique          8         16                   7              14             6      5      2              41       2
    top       Private    HS-grad  Married-civ-spouse  Prof-specialty       Husband  White   Male   United-States   <=50K
    freq        24532      10501               14976            4140         13193  27816  21790           29753   24720

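These are the statistics for the discrete variables: the number of non-missing observations (count), the number of distinct values (unique), the most frequent value (top) and its frequency (freq). Taking education as an example, there are 16 different education levels; among the more than 30,000 residents, high-school graduate (HS-grad) is the most common level, with 10,501 residents. Note that occupation has a count of 30718 (32561 minus 1843), which indicates that its missing values had not been filled when this output was produced.

3.3 Understand the shape of the distributions

The shape of a distribution (skewness, kurtosis and so on) is easiest to inspect visually.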

    # Distribution plots for age, weekly working hours and years of education
    import matplotlib.pyplot as plt

    # Plot style
    plt.style.use('ggplot')
    # Three stacked panels
    fig, axes = plt.subplots(3, 1)

    # Kernel density of age at each income level
    income.age[income.income == ' <=50K'].plot(kind='kde', label='<=50K', ax=axes[0], legend=True, linestyle='-')
    income.age[income.income == ' >50K'].plot(kind='kde', label='>50K', ax=axes[0], legend=True, linestyle='--')

    # Kernel density of weekly working hours at each income level
    income['hours-per-week'][income.income == ' <=50K'].plot(kind='kde', label='<=50K', ax=axes[1], legend=True, linestyle='-')
    income['hours-per-week'][income.income == ' >50K'].plot(kind='kde', label='>50K', ax=axes[1], legend=True, linestyle='--')

    # Kernel density of years of education at each income level
    income['education-num'][income.income == ' <=50K'].plot(kind='kde', label='<=50K', ax=axes[2], legend=True, linestyle='-')
    income['education-num'][income.income == ' >50K'].plot(kind='kde', label='>50K', ax=axes[2], legend=True, linestyle='--')

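The first panel shows the kernel density of age at each income level. For residents earning more than 50,000 dollars a year, age is close to normally distributed. For those earning less, age is right-skewed: most of the mass sits at younger ages with a long tail toward older ages, so younger residents outnumber older ones.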
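As a quick numeric check of what right skew means, the sketch below computes the population skewness (the mean cubed z-score) for a hypothetical list of ages. The numbers are invented for illustration; on the real data, pandas offers Series.skew() for the same purpose.

```python
from statistics import mean, median, pstdev

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    m, s = mean(values), pstdev(values)
    return sum(((v - m) / s) ** 3 for v in values) / len(values)

# Hypothetical ages: many young people and a thin tail of older ones.
ages = [22, 23, 24, 25, 25, 26, 27, 28, 30, 33, 38, 45, 60, 75]

print(mean(ages) > median(ages))  # the long right tail pulls the mean up
print(skewness(ages) > 0)         # positive skewness = right-skewed
```

A right-skewed sample has a positive skewness and a mean above its median, because the few large values in the tail drag the mean upward.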

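The second panel shows the kernel density of weekly working hours at each income level. The two distributions are clearly very similar, and both show local peaks.

The third panel shows the kernel density of years of education at each income level. Again the two distributions are very similar, with several local peaks.
For the discrete variables, comparing the income levels across sex, race and family relationship shows whether these variables are related to income: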

    # Number of residents per race and income level
    race = income.groupby(['race', 'income']).size().reset_index(name='counts')
    race.sort_values(by=['race', 'counts'], ascending=False, inplace=True)

    # Number of residents per family relationship and income level
    relationship = income.groupby(['relationship', 'income']).size().reset_index(name='counts')
    relationship.sort_values(by=['relationship', 'counts'], ascending=False, inplace=True)

    # Number of residents per sex and income level
    sex = income.groupby(['sex', 'income']).size().reset_index(name='counts')
    sex.sort_values(by=['sex', 'counts'], ascending=False, inplace=True)

    # Set the figure size and draw one bar chart per variable
    plt.figure(figsize=(9, 5))
    sns.barplot(x='race', y='counts', hue='income', data=race)
    plt.show()

    plt.figure(figsize=(9, 5))
    sns.barplot(x='relationship', y='counts', hue='income', data=relationship)
    plt.show()

    plt.figure(figsize=(9, 5))
    sns.barplot(x='sex', y='counts', hue='income', data=sex)
    plt.show()

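Figure 1 shows, within each race, how many residents fall into the low and high income levels; Figure 2 shows the same within each family relationship. Whatever the comparison, one pattern holds: within any given group (for example White residents, or the unmarried), those earning less than 50,000 dollars outnumber those earning more. This most likely reflects the composition of the sample, in which residents earning below and above 50,000 dollars are split roughly 75% : 25%. Figure 3 shows the same comparison by sex. The imbalance is stronger among women, at roughly 90% : 10%, than among men, at roughly 70% : 30%.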
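Shares like these can be read off directly once the counts are normalised within each group. The sketch below uses pd.crosstab with normalize="index" on a hypothetical 20-row sample built to reproduce the 90:10 and 70:30 splits; it is not the real data.

```python
import pandas as pd

# Hypothetical toy sample: 10 women (9 low, 1 high) and 10 men (7 low, 3 high).
toy = pd.DataFrame({
    "sex": ["Female"] * 10 + ["Male"] * 10,
    "income": ["<=50K"] * 9 + [">50K"] * 1 + ["<=50K"] * 7 + [">50K"] * 3,
})

# normalize="index" divides every row by its row total,
# giving the income split within each sex.
shares = pd.crosstab(toy["sex"], toy["income"], normalize="index")
print(shares)
```

On the real DataFrame the same call with income['sex'] and income['income'] would give the within-group proportions that the bar charts only show as raw counts.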

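4. Modeling

4.1 Re-encode the discrete variables

Many variables in the dataset are discrete and hold string values, which models cannot use directly, so they must be re-encoded first. Several encodings are available:

Converting string values to integer values
Dummy variables (0-1 variables)
One-hot encoding (similar to dummy variables)

This case study uses the string-to-integer approach to re-encode the discrete variables.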
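The difference between the encodings is easiest to see on a tiny example. The sketch below, on a hypothetical four-value column, contrasts integer codes with dummy (one-hot) columns. It also shows a detail worth knowing: pd.Categorical assigns the code -1 to missing values.

```python
import pandas as pd

# Hypothetical column with one missing entry.
sex = pd.Series(["Male", "Female", None, "Male"])

# Integer codes: categories are sorted, and a missing value gets the code -1.
codes = pd.Categorical(sex).codes
print(list(codes))

# Dummy / one-hot variables: one 0-1 column per category.
dummies = pd.get_dummies(sex)
print(dummies.astype(int))
```

Integer codes keep the data compact but impose an arbitrary order on the categories, which distance-based models such as K-nearest neighbors will treat as meaningful; dummy columns avoid that at the cost of more columns.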

    # Re-encode the discrete variables
    for feature in income.columns:
        if income[feature].dtype == 'object':
            income[feature] = pd.Categorical(income[feature]).codes
    income.head(10)

Output:

       age  workclass  fnlwgt  education  education-num  marital-status  occupation  relationship
    0   39          6   77516          9             13               4           0             1
    1   50          5   83311          9             13               2           3             0
    2   38          3  215646         11              9               0           5             1
    3   53          3  234721          1              7               2           5             0
    4   28          3  338409          9             13               2           9             5
    5   37          3  284582         12             14               2           3             5
    6   49          3  160187          6              5               3           7             1
    7   52          5  209642         11              9               2           3             0
    8   31          3   45781         12             14               4           9             1
    9   42          3  159449          9             13               2           3             0

       race  sex  capital-gain  capital-loss  hours-per-week  native-country  income
    0     4    1          2174             0              40              38       0
    1     4    1             0             0              13              38       0
    2     4    1             0             0              40              38       0
    3     2    1             0             0              40              38       0
    4     2    0             0             0              40               4       0
    5     4    0             0             0              40              38       0
    6     2    0             0             0              16              22       0
    7     4    1             0             0              45              38       1
    8     4    0         14084             0              50              38       1
    9     4    1          5178             0              40              38       1

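After re-encoding, all string variables have become integer variables, and this processed dataset is the basis for predicting the income level.
The original dataset contains two variables about education: education (the education level) and education-num (the years of education). Their values correspond one to one; one is simply a string and the other the matching number, so including both in a model would add redundant information. The fnlwgt variable is a census sampling weight (the "final weight") rather than a characteristic of the individual, so it carries no practical meaning for the income level. To keep redundant and meaningless variables out of the model, education and fnlwgt are dropped from the dataset.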

    income.drop(['education', 'fnlwgt'], axis=1, inplace=True)
    income.head(10)

Output:

       age  workclass  education-num  marital-status  occupation  relationship  race  sex
    0   39          6             13               4           0             1     4    1
    1   50          5             13               2           3             0     4    1
    2   38          3              9               0           5             1     4    1
    3   53          3              7               2           5             0     2    1
    4   28          3             13               2           9             5     2    0
    5   37          3             14               2           3             5     4    0
    6   49          3              5               3           7             1     2    0
    7   52          5              9               2           3             0     4    1
    8   31          3             14               4           9             1     4    0
    9   42          3             13               2           3             0     4    1

       capital-gain  capital-loss  hours-per-week  native-country  income
    0          2174             0              40              38       0
    1             0             0              13              38       0
    2             0             0              40              38       0
    3             0             0              40              38       0
    4             0             0              40               4       0
    5             0             0              40              38       0
    6             0             0              16              22       0
    7             0             0              45              38       1
    8         14084             0              50              38       1
    9          5178             0              40              38       1

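The table above is the cleaned dataset. The variable to predict is income, a binary variable, so predicting it amounts to classifying the annual income level: given a new sample, the classifier assigns it to one of the two income levels.

Many classification models are available:

Logistic regression
Decision tree
K-nearest neighbors
Naive Bayes
Support vector machine
Random forest
Gradient boosted decision trees (GBDT), and others.

This case study compares two classifiers, K-nearest neighbors and GBDT. Several candidate models are normally tried, because only a comparison shows which one fits the data better.

4.2 Split the dataset

The following sections walk through the modeling steps for a classification problem from scratch. The cleaned dataset is first split into two parts: one for building the classifier and one for evaluating it. Evaluating on data the model has not seen makes it possible to detect overfitting and underfitting. If the model performs well on the training set but poorly on the test set, it is overfitting; if it cannot fit the data well even during training, it is underfitting. The training and test sets are commonly given 75% and 25% of the data.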

    # Import the splitting function from sklearn
    from sklearn.model_selection import train_test_split

    # Split the data
    X_train, X_test, y_train, y_test = train_test_split(
        income.loc[:, 'age':'native-country'], income['income'],
        train_size=0.75, test_size=0.25, random_state=1234)
    # print(X_train)
    # print(y_train)
    print("The training set contains %d observations" % X_train.shape[0])
    print("The test set contains %d observations" % X_test.shape[0])

Output:

    The training set contains 24420 observations
    The test set contains 8141 observations

Random sampling has split the dataset into two parts: a training set of 24,420 samples and a test set of 8,141 samples. The training set is now used to build the two classifiers, K-nearest neighbors and GBDT.
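The two sizes follow from simple rounding. The sketch below is a plain-Python stand-in for a shuffled split, not scikit-learn's implementation: the function name split_indices is made up, and rounding the test share up is my assumption about how the 8141 arises. The rows it selects will also differ from those chosen by train_test_split.

```python
import math
import random

def split_indices(n_samples, test_size=0.25, seed=1234):
    """Shuffle the row indices and cut them into a train part and a test part."""
    n_test = math.ceil(n_samples * test_size)   # 0.25 * 32561 = 8140.25, rounded up
    n_train = n_samples - n_test
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # seeded, so the split is repeatable
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(32561)
print(len(train_idx), len(test_idx))  # 24420 8141
```

Fixing the seed plays the same role as random_state: the same rows land in the test set on every run, so results stay comparable between experiments.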

4.3 Build the models

    # Import the K-nearest-neighbors and GBDT classes
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import GradientBoostingClassifier

    # Build the K-nearest-neighbors model
    kn = KNeighborsClassifier()
    kn.fit(X_train, y_train)
    print(kn)

    # Build the GBDT model
    gbdt = GradientBoostingClassifier()
    gbdt.fit(X_train, y_train)
    print(gbdt)

Output:

    KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                         metric_params=None, n_jobs=1, n_neighbors=5, p=2,
                         weights='uniform')
    GradientBoostingClassifier(criterion='friedman_mse', init=None,
                               learning_rate=0.1, loss='deviance', max_depth=3,
                               max_features=None, max_leaf_nodes=None,
                               min_impurity_decrease=0.0, min_impurity_split=None,
                               min_samples_leaf=1, min_samples_split=2,
                               min_weight_fraction_leaf=0.0, n_estimators=100,
                               presort='auto', random_state=None, subsample=1.0, verbose=0,
                               warm_start=False)

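For the K-nearest-neighbors model, the KNeighborsClassifier class from the sklearn submodule neighbors is used with its default parameters: the model chooses the neighbor search algorithm automatically (algorithm='auto'), measures the distance between samples with the Euclidean distance (p=2), uses 5 neighbors for each sample to be classified (n_neighbors=5) and gives all neighbors equal weight (weights='uniform'). For the GBDT model, the GradientBoostingClassifier class from the sklearn submodule ensemble is used, again with default parameters first: a learning rate (step size) of 0.1 (learning_rate=0.1), the log loss as loss function (loss='deviance', renamed 'log_loss' in newer scikit-learn releases), 100 base decision trees (n_estimators=100), a maximum depth of 3 for each base tree (max_depth=3), at least 2 samples required to split an internal (non-leaf) node (min_samples_split=2) and at least 1 sample per leaf (min_samples_leaf=1). The setting warm_start=False means that each call to fit starts from scratch instead of adding trees to a previously fitted ensemble; within a single fit the trees are still built one after another, each correcting the errors of the trees before it.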
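What those K-nearest-neighbors defaults amount to can be sketched in a few lines of plain Python. This is an illustrative brute-force version with made-up points, not scikit-learn's implementation (which can also use tree-based neighbor search); the function name knn_predict is hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance (Minkowski distance with p=2) to every training point.
    distances = [(math.dist(x, query), label) for x, label in zip(train_X, train_y)]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Uniform weights: each neighbor contributes exactly one vote.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical two-feature points forming two well-separated groups.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (3.0, 3.2), (3.1, 2.9), (2.8, 3.0)]
train_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.1, 1.0), k=3))  # 0
print(knn_predict(train_X, train_y, (2.9, 3.1), k=3))  # 1
```

Because the vote depends on raw distances, features on larger scales (capital-gain versus sex, say) dominate the result unless the features are rescaled first.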

4.4 Grid search for the best model parameters

    # Grid search for the K-nearest-neighbors model.
    # The run recorded below used sklearn.grid_search and the grid_scores_ attribute,
    # both removed in scikit-learn 0.20; sklearn.model_selection and cv_results_
    # are the current equivalents.
    from sklearn.model_selection import GridSearchCV

    # Candidate parameter values
    k_options = list(range(1, 12))
    parameters = {'n_neighbors': k_options}

    # Search over the different values of K
    grid_kn = GridSearchCV(estimator=KNeighborsClassifier(), param_grid=parameters, cv=10, scoring='accuracy')
    grid_kn.fit(X_train, y_train)

    # Results
    grid_kn.cv_results_['mean_test_score'], grid_kn.cv_results_['std_test_score'], grid_kn.best_params_, grid_kn.best_score_

Output (as recorded with the older grid_scores_ attribute):

    n_neighbors     mean      std
              1  0.81507  0.00711
              2  0.83882  0.00696
              3  0.83722  0.00843
              4  0.84586  0.01039
              5  0.84222  0.00916
              6  0.84713  0.00900
              7  0.84316  0.00719
              8  0.84525  0.00629
              9  0.84394  0.00678
             10  0.84570  0.00534
             11  0.84464  0.00444

    best_params_: {'n_neighbors': 6}
    best_score_:  0.8471334971334972

A few words on the GridSearchCV arguments: estimator takes the model to tune, here the K-nearest-neighbors class; param_grid specifies the parameter values to search, here 11 candidate values of n_neighbors; cv=10 requests 10-fold cross-validation; scoring sets the evaluation metric, here prediction accuracy. The search returns three results. The first is the mean accuracy for each of the 11 values of K (a mean, because of the 10-fold cross-validation). The second is the best value of K, which is 6. The third is the best mean accuracy, reached at K = 6: 84.71%.
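The mechanics behind those three results can be sketched without scikit-learn. The code below is a simplified stand-in under stated assumptions: the data are hypothetical one-feature points, the folds are taken by index modulo the fold count, and the helper names are made up. It shows the loop structure only: score every candidate K by cross-validation, then keep the best.

```python
from collections import Counter
from statistics import mean

def knn_predict(train, query, k):
    """Majority vote among the k training rows whose feature is closest to query."""
    nearest = sorted(train, key=lambda row: abs(row[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def cross_val_accuracy(data, k, n_folds):
    """Average validation accuracy over n_folds disjoint folds."""
    scores = []
    for fold in range(n_folds):
        valid = [row for i, row in enumerate(data) if i % n_folds == fold]
        train = [row for i, row in enumerate(data) if i % n_folds != fold]
        hits = sum(knn_predict(train, x, k) == y for x, y in valid)
        scores.append(hits / len(valid))
    return mean(scores)

# Hypothetical 1-D data: the label is 1 above 10, with two deliberately noisy rows.
data = [(float(x), int(x > 10)) for x in range(20)]
data[3] = (3.0, 1)
data[15] = (15.0, 0)

param_grid = [1, 3, 5]  # candidate values of k
results = {k: cross_val_accuracy(data, k, n_folds=5) for k in param_grid}
best_k = max(results, key=results.get)

print(results)
print(best_k, results[best_k])
```

Each candidate is judged only on rows held out from its own training folds, which is why the reported best score is a fairer estimate than accuracy on the training data itself.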

    # Grid search for the GBDT model (see the note on sklearn.grid_search in the
    # K-nearest-neighbors search: model_selection and cv_results_ are the current API)
    from sklearn.model_selection import GridSearchCV

    # Candidate parameter values
    learning_rate_options = [0.01, 0.05, 0.1]
    max_depth_options = [3, 5, 7, 9]
    n_estimators_options = [100, 300, 500]
    parameters = {'learning_rate': learning_rate_options,
                  'max_depth': max_depth_options,
                  'n_estimators': n_estimators_options}

    grid_gbdt = GridSearchCV(estimator=GradientBoostingClassifier(), param_grid=parameters, cv=10, scoring='accuracy')
    grid_gbdt.fit(X_train, y_train)

    # Results
    grid_gbdt.cv_results_['mean_test_score'], grid_gbdt.cv_results_['std_test_score'], grid_gbdt.best_params_, grid_gbdt.best_score_

Output (as recorded with the older grid_scores_ attribute):

    learning_rate  max_depth  n_estimators     mean      std
             0.01          3           100  0.84267  0.00727
             0.01          3           300  0.85360  0.00837
             0.01          3           500  0.85930  0.00746
             0.01          5           100  0.85213  0.00821
             0.01          5           300  0.86327  0.00751
             0.01          5           500  0.86929  0.00767
             0.01          7           100  0.85459  0.00767
             0.01          7           300  0.87076  0.00920
             0.01          7           500  0.87342  0.00983
             0.01          9           100  0.85438  0.00801
             0.01          9           300  0.86855  0.00907
             0.01          9           500  0.87248  0.00974
             0.05          3           100  0.85962  0.00778
             0.05          3           300  0.86974  0.00689
             0.05          3           500  0.87326  0.00697
             0.05          5           100  0.86978  0.00790
             0.05          5           300  0.87543  0.00897
             0.05          5           500  0.87445  0.00962
             0.05          7           100  0.87338  0.00927
             0.05          7           300  0.87391  0.00964
             0.05          7           500  0.87072  0.01012
             0.05          9           100  0.87211  0.00989
             0.05          9           300  0.86851  0.01048
             0.05          9           500  0.86229  0.00857
             0.1           3           100  0.86626  0.00660
             0.1           3           300  0.87355  0.00802
             0.1           3           500  0.87449  0.00842
             0.1           5           100  0.87383  0.00878
             0.1           5           300  0.87310  0.01001
             0.1           5           500  0.87236  0.00939
             0.1           7           100  0.87502  0.01037
             0.1           7           300  0.86953  0.00873
             0.1           7           500  0.86192  0.00823
             0.1           9           100  0.87154  0.01075
             0.1           9           300  0.85995  0.00848
             0.1           9           500  0.85328  0.00828

    best_params_: {'learning_rate': 0.05, 'max_depth': 5, 'n_estimators': 300}
    best_score_:  0.8754299754299755
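The size of this search explains why it takes so long to run. The short sketch below enumerates the grid with itertools.product, using the same parameter values as the search; the variable names are my own.

```python
from itertools import product

param_grid = {
    "learning_rate": [0.01, 0.05, 0.1],
    "max_depth": [3, 5, 7, 9],
    "n_estimators": [100, 300, 500],
}

# Every combination of one value per parameter.
combinations = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
n_folds = 10

print(len(combinations))            # 36 candidate settings
print(len(combinations) * n_folds)  # 360 model fits during cross-validation
print(combinations[0])
```

With 3 x 4 x 3 = 36 candidates and 10 folds, the search fits 360 models, plus one final refit of the best setting on the whole training set, and the settings with 500 deep trees are the slowest of all.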

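4.5 Prediction and evaluation

Once the models are built, the next step is to use the classifiers to predict the test set and so check how well they perform out of sample. There are several ways to judge a model.

For a continuous target, common metrics are the mean squared error (MSE) and the root mean squared error (RMSE);
For a categorical target, common metrics are the accuracy from the confusion matrix, the area under the ROC curve (AUC), the K-S statistic and so on.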
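The two regression metrics are not used further in this classification study, so a small worked example may help fix the definitions. The values below are hypothetical.

```python
import math

def mse(actual, predicted):
    """Mean of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Square root of the MSE, expressed in the units of the target."""
    return math.sqrt(mse(actual, predicted))

actual = [3.0, 5.0, 2.0, 7.0]
predicted = [2.0, 5.0, 4.0, 8.0]

print(mse(actual, predicted))   # errors 1, 0, -2, -1 -> (1 + 0 + 4 + 1) / 4 = 1.5
print(rmse(actual, predicted))
```

RMSE is often easier to interpret than MSE because it is on the same scale as the quantity being predicted.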

Next, the four models built above (K-nearest neighbors, grid-searched K-nearest neighbors, GBDT and grid-searched GBDT) are used for prediction and evaluated in turn.

4.5.1 K-nearest-neighbors model on the test set

    kn_pred = kn.predict(X_test)
    print(pd.crosstab(kn_pred, y_test))

    # Model scores
    print("Accuracy on the training set: %f" % kn.score(X_train, y_train))
    print("Accuracy on the test set: %f" % kn.score(X_test, y_test))

Output:

    income     0     1
    row_0
    0       5644   725
    1        582  1190
    Accuracy on the training set: 0.889844
    Accuracy on the test set: 0.839455

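The first part of the output is the confusion matrix: its rows are the model's predictions and its columns are the actual values in the test set. The main diagonal holds the correctly predicted counts (5644 residents with income <=50K and 1190 with income >50K), while 582 and 725 are the misclassified counts. The second part shows an accuracy of 88.9% on the training set but an error rate above 16% on the test set (1 - 0.839), a gap that suggests the default KNN model may be overfitting.
Accuracy is computed from the confusion matrix, and it has a weakness: when the data are imbalanced (the positive and negative classes differ greatly in size), accuracy can be high even for a model that is far from ideal. For that reason it is worth also plotting the ROC curve and computing the area under it, the AUC.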
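The weakness of accuracy shows up when a few more numbers are derived from the same confusion matrix. The sketch below uses the four counts reported for the default K-nearest-neighbors model; the variable names are my own.

```python
# Counts copied from the reported confusion matrix
# (rows = predicted class, columns = actual class; class 1 means income >50K).
pred0_actual0, pred0_actual1 = 5644, 725
pred1_actual0, pred1_actual1 = 582, 1190

total = pred0_actual0 + pred0_actual1 + pred1_actual0 + pred1_actual1
accuracy = (pred0_actual0 + pred1_actual1) / total
precision = pred1_actual1 / (pred1_actual0 + pred1_actual1)  # share of predicted >50K that is right
recall = pred1_actual1 / (pred0_actual1 + pred1_actual1)     # share of actual >50K that is found
baseline = (pred0_actual0 + pred1_actual0) / total           # accuracy of always predicting <=50K

print(f"accuracy={accuracy:.4f} precision={precision:.4f} "
      f"recall={recall:.4f} baseline={baseline:.4f}")
```

Always predicting <=50K would already be right for about 76% of the test set, so the model's 84% is a smaller gain than it first appears, and it finds only about 62% of the residents who actually earn more than 50,000 dollars.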

    from sklearn import metrics

    # Compute the x-axis and y-axis data of the ROC curve
    fpr, tpr, _ = metrics.roc_curve(y_test, kn.predict_proba(X_test)[:, 1])
    # Draw the ROC curve
    plt.plot(fpr, tpr, linestyle='solid', color='red')
    # Shade the area under the curve
    plt.stackplot(fpr, tpr, color='steelblue')
    # Draw the reference line
    plt.plot([0, 1], [0, 1], linestyle='dashed', color='black')
    # Add the AUC as text
    plt.text(0.6, 0.4, 'AUC=%.3f' % metrics.auc(fpr, tpr), fontdict=dict(size=16))
    plt.show()

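The figure shows the model's ROC curve; the area under it, the AUC, is 0.864. When AUC is used to judge a model, larger is better, and as a rule of thumb a model with an AUC above 0.8 can be considered reasonable. So the K-nearest-neighbors model with default parameters performs acceptably on the income dataset.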
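For readers who want to see where the curve and the area come from, here is a from-scratch sketch on eight hypothetical scores. It is a simplified illustration of the idea behind metrics.roc_curve and metrics.auc, not their implementation, and the function names are my own.

```python
def roc_points(labels, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold score by score."""
    positives = sum(labels)
    negatives = len(labels) - positives
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return sum((fpr[i] - fpr[i - 1]) * (tpr[i] + tpr[i - 1]) / 2
               for i in range(1, len(fpr)))

# Hypothetical labels and predicted probabilities of the positive class.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2]

fpr, tpr = roc_points(labels, scores)
print(auc(fpr, tpr))  # 0.75
```

The result also has a direct reading: of the 16 possible (positive, negative) pairs in this toy sample, the positive sample gets the higher score in 12, and 12/16 = 0.75.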

4.5.2 Grid-searched K-nearest-neighbors model on the test set

    from sklearn import metrics

    grid_kn_pred = grid_kn.predict(X_test)
    print(pd.crosstab(grid_kn_pred, y_test))

    # Model scores
    print("Accuracy on the training set: %f" % grid_kn.score(X_train, y_train))
    print("Accuracy on the test set: %f" % grid_kn.score(X_test, y_test))

    # ROC curve and AUC
    fpr, tpr, _ = metrics.roc_curve(y_test, grid_kn.predict_proba(X_test)[:, 1])
    plt.plot(fpr, tpr, linestyle='solid', color='red')
    plt.stackplot(fpr, tpr, color='steelblue')
    plt.plot([0, 1], [0, 1], linestyle='dashed', color='black')
    plt.text(0.6, 0.4, 'AUC=%.3f' % metrics.auc(fpr, tpr), fontdict=dict(size=16))
    plt.show()

Output:

    income     0     1
    row_0
    0       5838   861
    1        388  1054
    Accuracy on the training set: 0.882924
    Accuracy on the test set: 0.846579

Compared with the default K-nearest-neighbors model, the grid-searched model is less accurate on the training set but more accurate on the test set. That is the desired outcome: the tuned model predicts better, and the smaller gap between the two accuracies lowers the risk of overfitting. The area under the ROC curve is 0.87 for the grid-searched model, slightly higher than for the original KNN model. In terms of stability, then, the grid-searched K-nearest-neighbors model is better than the original one.

4.5.3 GBDT model on the test set

    from sklearn import metrics

    gbdt_pred = gbdt.predict(X_test)
    print(pd.crosstab(gbdt_pred, y_test))

    # Model scores
    print("Accuracy on the training set: %f" % gbdt.score(X_train, y_train))
    print("Accuracy on the test set: %f" % gbdt.score(X_test, y_test))

    # ROC curve and AUC
    fpr, tpr, _ = metrics.roc_curve(y_test, gbdt.predict_proba(X_test)[:, 1])
    plt.plot(fpr, tpr, linestyle='solid', color='red')
    plt.stackplot(fpr, tpr, color='steelblue')
    plt.plot([0, 1], [0, 1], linestyle='dashed', color='black')
    plt.text(0.6, 0.4, 'AUC=%.3f' % metrics.auc(fpr, tpr), fontdict=dict(size=16))
    plt.show()

Output:

    income     0     1
    row_0
    0       5869   780
    1        357  1135
    Accuracy on the training set: 0.869861
    Accuracy on the test set: 0.860337

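The ensemble algorithm GBDT clearly performs better on the test set than K-nearest neighbors, which is the benefit of combining many decision trees into one model. It does well on both the training set and the test set, with accuracy above 85% on each, and its AUC of 0.914 is higher than that of either K-nearest-neighbors model.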
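How boosting combines its trees can be shown on a toy problem. The sketch below is a deliberate simplification and not what GradientBoostingClassifier does internally: it uses squared-error regression with depth-1 trees (stumps) on one hypothetical feature, whereas the classifier optimises the log loss with deeper trees. What it keeps is the core idea: each new tree is fitted to the errors the ensemble so far still makes, and its output is added with a small learning rate.

```python
def fit_stump(xs, residuals):
    """Best single-split tree on a 1-D feature, minimising squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, n_estimators=100, learning_rate=0.1):
    """Additive model: every new stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(xs)
    stumps, history = [], []
    for _ in range(n_estimators):
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump(x) for x, p in zip(xs, predictions)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)) / len(ys))
    model = lambda x: base + learning_rate * sum(s(x) for s in stumps)
    return model, history

# Hypothetical data with a jump between x = 4 and x = 5.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model, history = boost(xs, ys)
print(history[0], history[-1])  # training error before and after 100 rounds
print(model(2.0), model(6.0))
```

The training error never increases from one round to the next here, and the small learning rate means no single tree dominates, which is one reason a grid search over learning_rate and n_estimators together makes sense.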

4.5.4 Grid-searched GBDT model on the test set

    from sklearn import metrics

    grid_gbdt_pred = grid_gbdt.predict(X_test)
    print(pd.crosstab(grid_gbdt_pred, y_test))

    # Model scores
    print("Accuracy on the training set: %f" % grid_gbdt.score(X_train, y_train))
    print("Accuracy on the test set: %f" % grid_gbdt.score(X_test, y_test))

    # ROC curve and AUC
    fpr, tpr, _ = metrics.roc_curve(y_test, grid_gbdt.predict_proba(X_test)[:, 1])
    plt.plot(fpr, tpr, linestyle='solid', color='red')
    plt.stackplot(fpr, tpr, color='steelblue')
    plt.plot([0, 1], [0, 1], linestyle='dashed', color='black')
    plt.text(0.6, 0.4, 'AUC=%.3f' % metrics.auc(fpr, tpr), fontdict=dict(size=16))
    plt.show()

Output:

    income     0     1
    row_0
    0       5835   669
    1        391  1246
    Accuracy on the training set: 0.889271
    Accuracy on the test set: 0.869795

Judged by accuracy, the grid-searched GBDT model is the best of the four: its accuracy is close to 89% on the training set and above 86% on the test set. Its ROC curve also gives the highest AUC, above 0.92.
For both K-nearest neighbors and GBDT, grid search can find a good set of parameters, and the resulting model is usually stronger and more robust. Comparing each default model with its grid-searched counterpart, the latter is generally the better choice, although it takes longer to run. Comparing a single model with an ensemble model, the ensemble usually performs better.
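Putting the four reported results side by side makes the comparison easier to check. The sketch below only rearranges the accuracies printed in sections 4.5.1 to 4.5.4; the dictionary and variable names are my own.

```python
# (train accuracy, test accuracy) as reported in sections 4.5.1 to 4.5.4.
reported = {
    "KNN (default)": (0.889844, 0.839455),
    "KNN (grid search)": (0.882924, 0.846579),
    "GBDT (default)": (0.869861, 0.860337),
    "GBDT (grid search)": (0.889271, 0.869795),
}

for name, (train_acc, test_acc) in reported.items():
    gap = train_acc - test_acc  # a large gap hints at overfitting
    print(f"{name:<20} train={train_acc:.3f} test={test_acc:.3f} gap={gap:.3f}")

best = max(reported, key=lambda name: reported[name][1])
print(best)
```

The grid-searched GBDT has the highest test accuracy, while the default GBDT has the smallest train-test gap (about one percentage point), so the tuned model buys its extra accuracy with a slightly larger gap.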

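5. Summary of the experiment

This income prediction task is a classification problem in machine learning: the income variable divides the data into two classes (annual income above 50K, and annual income of 50K or less), and the models learn to predict that class.

Main takeaways from this experiment:

1. Became familiar with the main steps of a data mining workflow: preview the dataset and define the goal of the analysis; load and preprocess the data; explore the characteristics of the data; clean the data and build models; predict and evaluate.
2. Got a first understanding of two machine learning models in Python's sklearn, K-nearest neighbors and GBDT (gradient boosted decision trees), and of using grid search to tune model parameters.
3. Also practiced using pandas and matplotlib.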
