Python modeling, step by step
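Table of contents
- Import packages
- Variable type conversion
- Extract object variables
- Convert percent-sign variables
- Standardization
- Missing value handling
- Inspect
- numeric
- Feature engineering
- Downsampling
- Regular expressions
- map function
- object
- Encoding object variables
- One-hot encoding
- One-hot encoding (not sparse)
- label_encoder
- Comparison
- Splitting features and target
- Merging data
- Modeling
- Decision tree visualization
- Feature importance
- Balancing samples
- Model evaluation
- train test split
- Model evaluation
- ROC
- oob
- Confusion matrix
- Normalized confusion matrix
- Cost matrix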
Import packages
import csv
import os.path
import re
import warnings
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px  # bundled with plotly since version 4
import plotly.graph_objects as go
import pydotplus
import seaborn as sns
from IPython.display import Image
from sklearn import preprocessing, tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer  # replaces Imputer, removed in scikit-learn 0.22
from sklearn.metrics import auc, classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
warnings.simplefilter("ignore")
Variable type conversion
Extract object variables
# Columns with dtype object ('O') are treated as categorical
cat_cols = [col for col in X.columns if X[col].dtype == 'O']
Convert percent-sign variables
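A vectorized variant of the conversion in this section, as a minimal sketch. It checks every non-missing value of a column rather than only the first row. The frame `toy`, its columns, and the helper name `percent_to_float` are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names are illustrative only
toy = pd.DataFrame({
    "rate": ["12.5%", "3%", None],
    "city": ["a", "b", "c"],
    "n": [1, 2, 3],
})

def percent_to_float(df):
    """Convert text columns whose non-missing values all end in '%' to floats."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            continue  # already numeric, nothing to strip
        values = out[col].dropna().astype(str)
        if len(values) > 0 and values.str.endswith("%").all():
            # Drop the trailing '%' and parse what remains as a number
            out[col] = pd.to_numeric(out[col].str.rstrip("%"))
    return out

converted = percent_to_float(toy)
print(converted.dtypes)
```

Missing values stay missing after the conversion, so this step can run before the imputation step.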
for col in data.columns:
    try:
        # Detect strings such as '12.5%' from the first row of the column
        if data[col].dtype == 'O' and data[col][0][-1] == '%':
            print(col)
            data[col] = data[col].apply(lambda x: float(x[:-1]))
    except Exception:
        continue
Standardization
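The two transformations used in this section can be written out directly with NumPy, which makes the formulas explicit. This is a minimal sketch on made-up numbers, not a replacement for the scikit-learn calls.

```python
import numpy as np

values = np.array([2.0, 4.0, 6.0, 10.0])

# Min-max scaling: (x - min) / (max - min), so the result lies in [0, 1]
minmax = (values - values.min()) / (values.max() - values.min())

# Z-score standardization: (x - mean) / std, using the population std (ddof=0)
zscore = (values - values.mean()) / values.std()

print(minmax)
print(zscore)
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range; the z-score keeps the shape of the distribution but does not bound the values.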
from sklearn import preprocessing
# Min-max scale every numeric column to [0, 1]
for col in data.columns:
    if data[col].dtype != 'O':
        data[col] = preprocessing.minmax_scale(data[col])
Or, to standardize to zero mean and unit variance:
x = preprocessing.scale(x)
Missing value handling
Inspect
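The helper `missing_values_table` called in this section is not defined in the article. A minimal sketch of what such a helper might look like follows, assuming it reports the count and percentage of missing values per column; the output column names `missing` and `percent` are hypothetical.

```python
import pandas as pd

def missing_values_table(df):
    """Hypothetical helper: count and percentage of missing values per column."""
    n_missing = df.isnull().sum()
    pct_missing = 100 * n_missing / len(df)
    table = pd.DataFrame({"missing": n_missing, "percent": pct_missing})
    # Keep only columns that have missing values, most affected first
    table = table[table["missing"] > 0].sort_values("percent", ascending=False)
    return table

toy = pd.DataFrame({
    "a": [1.0, None, 3.0, None],
    "b": ["x", "y", None, "z"],
    "c": [1, 2, 3, 4],
})
print(missing_values_table(toy))
```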
# missing_values_table is a user-defined helper that is not shown in this article
missing_values_table(X)
numeric
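The three fill strategies in this subsection (mean, median, mode) can be compared on a toy frame. This is a small sketch with made-up values; note that `DataFrame.mode()` returns a frame rather than a row, so its first row has to be selected before it is passed to `fillna`.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "age": [20.0, 30.0, np.nan, 30.0, 100.0],
    "score": [1.0, np.nan, 3.0, 3.0, 5.0],
})

filled_mean = toy.fillna(toy.mean())
filled_median = toy.fillna(toy.median())
# mode() returns a DataFrame (there can be ties), so take its first row
filled_mode = toy.fillna(toy.mode().iloc[0])

print(filled_mean.loc[2, "age"], filled_median.loc[2, "age"], filled_mode.loc[2, "age"])
```

With the outlier 100 in `age`, the mean fill is pulled upward while the median and mode fills are not, which is the usual reason to prefer the median for skewed columns.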
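num_raw = X.drop(cat_cols, axis=1)
# Use exactly one of the three strategies below
num = num_raw.fillna(num_raw.mean())          # mean
num = num_raw.fillna(num_raw.median())        # median
num = num_raw.fillna(num_raw.mode().iloc[0])  # mode; mode() returns a DataFrame, so take its first row
Feature engineering
Downsampling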
sub_sample
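The function in this section draws row positions with `np.random.randint`, which samples with replacement, so a majority row can appear more than once. A minimal alternative sketch using `DataFrame.sample`, which samples without replacement by default, is given here. The function name `downsample_majority` and the toy frame are hypothetical, and the sketch assumes the minority class is no larger than the rest.

```python
import pandas as pd

def downsample_majority(df, label_col, minority_value, random_state=0):
    """Keep all minority rows and an equally sized random subset of the rest."""
    minority = df[df[label_col] == minority_value]
    majority = df[df[label_col] != minority_value]
    # sample() draws without replacement by default, so no row is duplicated
    majority_down = majority.sample(n=len(minority), random_state=random_state)
    return pd.concat([majority_down, minority])

toy = pd.DataFrame({
    "id": range(10),
    "retention": ["lost"] * 3 + ["active"] * 7,
})
balanced = downsample_majority(toy, "retention", "lost")
print(balanced["retention"].value_counts())
```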
def lower_sample_data(df, class_):
    '''Downsample the majority class to the size of the minority class.'''
    data0 = df[df['retention'] == class_]  # minority-class samples go in data0
    data1 = df[df['retention'] != class_]  # majority-class samples
    # Randomly draw len(data0) row positions from the majority class (with replacement)
    index = np.random.randint(len(data1), size=len(df) - len(data1))
    lower_data1 = data1.iloc[list(index)]  # downsampled majority class
    return pd.concat([lower_data1, data0])
data = lower_sample_data(data, 'lost')
data['retention'].value_counts()
Regular expressions
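The extraction patterns in this section can be tried on a few sample strings with the standard library alone. This is a minimal sketch; the brand strings and the helper names `first_digit` and `letters_and_digits` are made up. It also shows why `str.find` is a poor truth test: it returns -1 (truthy) when the substring is absent and 0 (falsy) when the substring is at the start.

```python
import re

brands = ["小米8 SE", "Redmi Note 7", "honor"]

def first_digit(text):
    """Return the first digit in text as an int, or None when there is none."""
    match = re.search(r"\d", text)
    return int(match.group()) if match else None

def letters_and_digits(text):
    """Strip everything except ASCII letters and digits."""
    return re.sub(r"[^a-zA-Z0-9]+", "", text)

for b in brands:
    # `in` is the reliable membership test; find() is shown only for contrast
    print(b, first_digit(b), letters_and_digits(b), "小米" in b, b.find("小米"))
```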
# Digits only: the first digit in the brand string, or 'null' if there is none
data['brand_version'] = data['brand'].apply(lambda x: re.findall(r'\d', x)[0] if re.findall(r'\d', x) else 'null')
data['brand_version'] = data['brand_version'].apply(lambda x: int(x) if x != 'null' else 'null')
# Classify by brand; str.find returns -1 when absent, so test membership with `in`
data['brand_class'] = data['brand'].apply(lambda x: '小米' if '小米' in x else ('紅米' if '紅米' in x else 'others'))
# Lowercase English letters only (the pattern also keeps commas and spaces)
uncn = re.compile(r'[\u0061-\u007a,\u0020]')
data['brand_series'] = data['brand'].apply(lambda x: "".join(uncn.findall(x.lower())))
# English letters and digits only
data['brand_detail'] = data['brand'].apply(lambda x: re.sub('[^a-zA-Z0-9]+', '', x))
map function
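The if/elif chain in `price_map` can also be expressed as a dictionary lookup with `Series.map`. This is a minimal sketch assuming the same seven bands; the constant name `PRICE_BANDS` and the sample values are illustrative.

```python
import pandas as pd

# Same bands as price_map in this section; anything else falls back to 7
PRICE_BANDS = {
    "0-600": 1, "600-1000": 2, "1000-1500": 3,
    "1500-2000": 4, "2000-3000": 5, "3000-4000": 6,
}

prices = pd.Series(["0-600", "1500-2000", "4000+", None])
# map() yields NaN for keys missing from the dict, which fillna turns into 7
price_band = prices.map(PRICE_BANDS).fillna(7).astype(int)
print(price_band.tolist())
```

A dictionary keeps the band definitions in one place, which helps when the bands change or need to be reused elsewhere.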
def price_map(x):
    if x == '0-600':
        y = 1
    elif x == '600-1000':
        y = 2
    elif x == '1000-1500':
        y = 3
    elif x == '1500-2000':
        y = 4
    elif x == '2000-3000':
        y = 5
    elif x == '3000-4000':
        y = 6
    else:
        y = 7
    return y
data['price_band'] = data['price'].apply(price_map)
object
X = X.fillna('missing')
Encoding object variables
One-hot encoding
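One common way to one-hot encode object columns is `pandas.get_dummies`. This is a minimal sketch on a made-up frame; the name `cat_onehot` is hypothetical, and the `x_onehot` frame used in the modeling sections presumably combines such indicator columns with the numeric ones, which is an assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["S", "M", "L"],
})

# One indicator column per (column, category) pair
cat_onehot = pd.get_dummies(toy, columns=["color", "size"], dtype=int)
print(cat_onehot)
```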
One-hot encoding (not sparse)
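scikit-learn's `OneHotEncoder` returns a SciPy sparse matrix by default. A minimal sketch of getting a dense array instead is shown here, assuming the conversion is done with `toarray()` so that it does not depend on the constructor keyword for dense output, which was renamed between scikit-learn versions. The toy frame is made up.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({
    "color": ["red", "blue", "red"],
    "size": ["S", "M", "L"],
})

encoder = OneHotEncoder(handle_unknown="ignore")
# fit_transform returns a sparse matrix by default; toarray() makes it dense
dense = encoder.fit_transform(toy).toarray()
print(dense.shape)
print(encoder.categories_)
```

Unlike `get_dummies`, a fitted encoder remembers its categories, so the same columns come out when it is applied to new data.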
label_encoder
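le = preprocessing.LabelEncoder()
cat_labelcoder = X[cat_cols].copy()  # assumed starting point: cat_labelcoder is never defined in this article
for col in cat_cols:
    cat_labelcoder[col] = le.fit_transform(cat_labelcoder[col].astype('str'))
Comparison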
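In theory, applying label_encoder or a one-hot encoder to an object variable should be equivalent. In practice, a label encoder gives the categories an order (integer codes), and a tree splits on the encoded feature as if it were a numeric variable. If the code assignment differs from run to run, the values that end up in the left and right subtrees of each decision tree differ too, and the results become inconsistent.
So, as a rule, use label_encoder only for object variables whose values carry an inherent order, such as "poor, fair, good, very good". Use one-hot encoding in all other cases.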
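For an ordered variable like the rating example, the order can be stated explicitly instead of left to the encoder. This is a minimal sketch using `pandas.Categorical`; the sample values are made up. scikit-learn's `LabelEncoder` sorts its classes, so for strings it would assign codes alphabetically rather than by meaning.

```python
import pandas as pd

ratings = pd.Series(["good", "poor", "very good", "fair", "good"])

# Explicit order, lowest to highest
order = ["poor", "fair", "good", "very good"]
codes = pd.Categorical(ratings, categories=order, ordered=True).codes
print(list(codes))
```

Because the mapping is fixed by `order`, it is identical on every run and on every data split.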
Splitting features and target
x = data.drop(['id', 'retention'], axis=1)
# Target as a 1-D Series: 1 = lost, 0 = retained
y = data['retention'].apply(lambda v: 1 if v == 'lost' else 0)
Merging data
x_labelcoder = pd.concat([num, cat_labelcoder], axis=1)
Modeling
clf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=10,
                             bootstrap=True, random_state=0)
# Fit the model
clf.fit(x_onehot, y)
Decision tree visualization
clf = tree.DecisionTreeClassifier(min_samples_split=0.1,
                                  max_depth=int(np.log2(x_onehot.shape[1])),
                                  random_state=0, class_weight='balanced')
# Fit the model
clf.fit(x_onehot, y)
# Export the fitted tree in Graphviz DOT format
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=x_onehot.columns,
                                # Key point: class_names must be listed in ascending order of the class labels
                                class_names=data['tag'].unique(),
                                filled=True, rounded=True, special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
# Display inline in a Jupyter notebook
Image(graph.create_png())
Feature importance
clf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=10,
                             bootstrap=True, random_state=0)
# Fit the model
clf.fit(x_onehot, y)
y_importances = clf.feature_importances_
x_importances = x_onehot.columns
df = pd.DataFrame({'x': x_importances, 'y': y_importances}).sort_values(by='y', ascending=False)
# Polar bar chart of the ten most important features
px.bar_polar(df[:10], r="y", theta="x", color="x", template='plotly_white',
             color_discrete_sequence=px.colors.sequential.Plotly3[-2::-1])
Balancing samples
class_weight='balanced'
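The weights behind `class_weight='balanced'` follow the formula given in the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`. This is a minimal sketch that reproduces the formula by hand on a made-up label vector.

```python
import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 8 negatives, 2 positives

n_samples = len(y)
n_classes = len(np.unique(y))
# Each class weight is inversely proportional to the class frequency
weights = n_samples / (n_classes * np.bincount(y))
print(weights)
```

The rarer class gets the larger weight, so errors on it cost more during fitting. Unlike downsampling, no rows are discarded.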
clf = RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=10,
                             bootstrap=True, random_state=0, class_weight='balanced')
# Fit the model
clf.fit(x_onehot, y)
Model evaluation
train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x_onehot, y, test_size=0.33, random_state=42)
Model evaluation
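A held-out evaluation is only meaningful if the model never saw the test rows during fitting. A minimal end-to-end sketch of that flow follows, on synthetic data standing in for `x_onehot` and `y`. The `stratify=y` argument, which keeps the class ratio equal in both splits, is an addition not used in the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for x_onehot / y
X, y = make_classification(n_samples=300, n_features=8,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y)

model = RandomForestClassifier(n_estimators=10, max_depth=10, random_state=0)
model.fit(X_train, y_train)  # the test rows are never seen during fitting
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)
```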
from sklearn.metrics import classification_report
# Refit on the training split so that the test rows stay unseen
clf.fit(X_train, y_train)
y_predict = clf.predict(X_test)
print(classification_report(y_test, y_predict))
ROC
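An ROC curve is traced by sweeping a threshold over continuous scores, so it should be fed predicted probabilities rather than hard 0/1 predictions. This is a small sketch on made-up labels and scores showing that the two inputs give different AUC values.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 1])
y_score = np.array([0.1, 0.6, 0.35, 0.7, 0.8])  # predicted probability of class 1
y_label = (y_score >= 0.5).astype(int)          # hard predictions at threshold 0.5

auc_from_scores = roc_auc_score(y_true, y_score)
auc_from_labels = roc_auc_score(y_true, y_label)
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print(auc_from_scores, auc_from_labels)
```

With a fitted classifier, the scores would typically come from `clf.predict_proba(X_test)[:, 1]`.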
from sklearn.metrics import auc, roc_auc_score, roc_curve
# y_test: true labels; y_score: predicted probability of the positive class
y_score = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_score)
roc_auc = auc(fpr, tpr)
# The plot only needs plt.plot(fpr, tpr); roc_auc just stores the AUC value computed by auc()
plt.plot(fpr, tpr, lw=1, label='ROC(area = %0.2f)' % (roc_auc))
plt.xlabel("FPR (False Positive Rate)")
plt.ylabel("TPR (True Positive Rate)")
plt.title("Receiver Operating Characteristic, ROC(AUC = %0.2f)" % (roc_auc))
plt.show()
roc_auc_score(y_test, y_score)
oob
clf = RandomForestClassifier(n_estimators=100, criterion='gini', max_depth=x_onehot.shape[1],
                             bootstrap=True, random_state=0, class_weight='balanced',
                             oob_score=True)
# Fit the model
clf.fit(x_onehot, y)
clf.oob_score_  # accuracy estimated on the out-of-bag samples
Confusion matrix
from sklearn.metrics import confusion_matrix
ax = sns.heatmap(confusion_matrix(y_test, y_predict), cmap='Blues', annot=True, fmt='g')
plt.title('confusion matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
Normalized confusion matrix
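There is more than one way to normalize a confusion matrix. This section divides by the grand total, so each cell is a share of all samples. Dividing each row by its own sum instead puts per-class recall on the diagonal. This is a minimal NumPy sketch on a made-up matrix; the names `share_of_all` and `share_of_row` are illustrative.

```python
import numpy as np

# Rows are true classes, columns are predicted classes (scikit-learn's layout)
cm = np.array([[50, 10],
               [5, 35]])

share_of_all = cm / cm.sum()                       # each cell as a share of all samples
share_of_row = cm / cm.sum(axis=1, keepdims=True)  # per true class; diagonal = recall

print(np.around(share_of_all, 2))
print(np.around(share_of_row, 2))
```

Row normalization is usually easier to read for imbalanced data, because the large class no longer dominates every cell.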
cm = confusion_matrix(y_test, y_predict)
# Each cell as a share of all samples, rounded to two decimals
cm_norm = np.around(cm / np.sum(cm), decimals=2)
ax = sns.heatmap(cm_norm, cmap='Blues', annot=True, fmt='g')
plt.title('normalized confusion matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
Cost matrix
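The idea in this section is to weight the two error types by a business cost before computing the metrics. This is a minimal sketch of that calculation as a function, assuming the costs 5 (false positive) and 2 (false negative) used in the section and the standard F1 definition, 2 * precision * recall / (precision + recall). The function name and the sample matrix are hypothetical.

```python
import numpy as np

def cost_weighted_metrics(cm, fp_cost=5, fn_cost=2):
    """Precision, recall and F1 after weighting the two error types by a cost."""
    # scikit-learn's binary layout is [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = np.asarray(cm).ravel()
    fp_w, fn_w = fp * fp_cost, fn * fn_cost
    precision = tp / (tp + fp_w)
    recall = tp / (tp + fn_w)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

cm = [[50, 10],
      [5, 35]]
precision, recall, f1 = cost_weighted_metrics(cm)
print(round(precision, 2), round(recall, 2), round(f1, 2))
```

The weighted values are no longer rates in the usual sense; they are best read as relative scores for comparing models under the same cost assumptions.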
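cm = confusion_matrix(y_test, y_predict)
# 0 = lost, 1 = active (y was built earlier with 1 = lost; adjust the weights to match your encoding)
TP = cm[1][1]
TN = cm[0][0]
FP = cm[0][1] * 5  # false positives weighted by a cost of 5
FN = cm[1][0] * 2  # false negatives weighted by a cost of 2
accuracy = round((TP + TN) / (TP + TN + FP + FN), 2)
precision = round(TP / (TP + FP), 2)
recall = round(TP / (TP + FN), 2)
# F-score as the harmonic mean of precision and recall
fscore = round(2 * precision * recall / (precision + recall), 2)
cm_biz = pd.DataFrame(np.vstack(([TN, FP], [FN, TP])))
ax = sns.heatmap(cm_biz, cmap='Blues', annot=True, fmt='g')
plt.title('cost matrix' + '\n' + 'accuracy= ' + str(accuracy) + '\n' + 'recall= ' + str(recall) + '\n' + 'f_score= ' + str(fscore) + '\n')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')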