scikit-learn: feature analysis (Data Mining: Introduction and Practice, Lab 7)
# Load the data
import os
import pandas as pd

adult_filename = "adult.data"
adult = pd.read_csv(adult_filename, header=None, names=["Age", "Work-Class", "fnlwgt", "Education", "Education-Num", "Marital-Status", "Occupation", "Relationship", "Race", "Sex", "Capital-gain", "Capital-loss", "Hours-per-week", "Native-Country", "Earnings-Raw"])
# Drop rows in which every field is missing
adult.dropna(how='all', inplace=True)
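Fields in adult.data are separated by a comma followed by a space, so string columns loaded as above keep a leading space (for example " >50K"). The sketch below shows the effect and one possible remedy, the skipinitialspace option of read_csv. The two sample rows and the three-column layout are made up for illustration and are not taken from the real file.

```python
import io
import pandas as pd

# Two made-up rows in a shortened, hypothetical three-column layout
raw = "39, State-gov, <=50K\n52, Private, >50K\n"
names = ["Age", "Work-Class", "Earnings-Raw"]

# Default parsing keeps the space that follows each comma
as_is = pd.read_csv(io.StringIO(raw), header=None, names=names)

# skipinitialspace=True drops whitespace right after the delimiter
stripped = pd.read_csv(io.StringIO(raw), header=None, names=names,
                       skipinitialspace=True)

labels_as_is = as_is["Earnings-Raw"].tolist()
labels_stripped = stripped["Earnings-Raw"].tolist()
print(labels_as_is)
print(labels_stripped)
```

Whichever form is used, later string comparisons have to match it exactly, including the capital K.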
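# adult.columns
# adult.loc[:5]

# Summary statistics (mean, min, max, ...) of the weekly working hours feature
# adult["Hours-per-week"].describe()

# Median of the years-of-education feature
# adult["Education-Num"].median()

# Distinct values of the work-class feature
# adult["Work-Class"].unique()

# Derive a feature: does the person work more than 40 hours per week?
adult["LongHours"] = adult["Hours-per-week"] > 40

#################### Remove features whose variance is below the threshold (the lower the variance, the less a feature helps tell individuals apart)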
import numpy as np

# Build a 10x3 matrix holding the values 0 to 29
X = np.arange(30).reshape((10, 3))
# Set the second column to 1, so its variance becomes 0
X[:, 1] = 1

# Xt keeps only the first and third columns of X; the second is dropped because its variance is 0
from sklearn.feature_selection import VarianceThreshold
vt = VarianceThreshold()
Xt = vt.fit_transform(X)

# Print the variance of each column
# print(vt.variances_)

#################### Univariate feature tests
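X = adult[["Age", "Education-Num", "Capital-gain", "Capital-loss", "Hours-per-week"]].values

# Build the target class: does pre-tax income exceed 50,000 dollars?
# In the UCI adult.data file the label is written " >50K" (leading space, capital K),
# so strip whitespace before comparing; comparing against '>50k' marks every row False
Y = (adult["Earnings-Raw"].str.strip() == '>50K').values

# Build a transformer that scores features with the chi-squared test
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
# Score with chi2 and keep the 3 best-scoring features
transformer = SelectKBest(score_func=chi2, k=3)

# Run the scoring
Xt_chi2 = transformer.fit_transform(X, Y)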
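fit_transform returns only the selected columns, so it is not obvious which features survived. A minimal sketch of mapping the selection back to column names with get_support follows. The toy data and the names informative_a, noise and informative_b are invented for illustration; the features are kept non-negative because scikit-learn's chi2 expects non-negative input.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.RandomState(14)
n = 200
y = rng.randint(0, 2, size=n)

# Three non-negative toy features: two depend on y, one is pure noise
informative_a = y * 5 + rng.randint(0, 3, size=n)
noise = rng.randint(0, 10, size=n)
informative_b = y * 10 + rng.randint(0, 3, size=n)
X_toy = np.column_stack([informative_a, noise, informative_b])
names = np.array(["informative_a", "noise", "informative_b"])

selector = SelectKBest(score_func=chi2, k=2)
X_sel = selector.fit_transform(X_toy, y)

# get_support() returns a boolean mask over the input columns
selected = names[selector.get_support()].tolist()
print(selected)
print(selector.scores_)
```

On the adult data the same mask could be applied to the list of five column names passed to X.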
# print(transformer.scores_)

# Build a transformer that scores features with the Pearson correlation coefficient
from scipy.stats import pearsonr

def multivariate_pearsonr(X, y):
    # pearsonr handles one feature at a time, so score each column separately
    scores, pvalues = [], []
    for column in range(X.shape[1]):
        cur_score, cur_p = pearsonr(X[:, column], y)
        # Use the absolute value: a strong negative correlation is as useful as a positive one
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
    return (np.array(scores), np.array(pvalues))

# Run the scoring
transformer = SelectKBest(score_func=multivariate_pearsonr, k=3)
Xt_pearson = transformer.fit_transform(X, Y)
print(transformer.scores_)

#################### Compare the two feature selection methods
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score
clf = DecisionTreeClassifier(random_state=14)
scores_chi2 = cross_val_score(clf, Xt_chi2, Y, scoring='accuracy')
scores_pearson = cross_val_score(clf, Xt_pearson, Y, scoring='accuracy')
# print("The chi2_method accuracy is {0:.1f}%".format(100*np.mean(scores_chi2)))
# print("The pearson_method accuracy is {0:.1f}%".format(100*np.mean(scores_pearson)))
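The comparison above selects features on the full dataset and only then cross-validates the classifier. One possible variation, sketched below, puts the selector and the tree into a Pipeline so that selection is refit inside each fold. This is an assumption about how the experiment could be extended, not part of the original lab: the data comes from make_classification, and f_classif is swapped in as the scoring function because that synthetic data contains negative values, which chi2 does not accept.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the adult features (not the real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=14)

# Selection and classification are fitted together on each training fold
pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=3)),
    ("tree", DecisionTreeClassifier(random_state=14)),
])

scores = cross_val_score(pipeline, X_demo, y_demo, scoring="accuracy", cv=5)
mean_accuracy = float(np.mean(scores))
print("Mean accuracy: {0:.1f}%".format(100 * mean_accuracy))
```

The same pipeline shape should work with chi2 or multivariate_pearsonr as score_func on the adult features.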