Machine Learning (8): Regression and Outlier Handling (the Enron Dataset)
Contents
- Regression on age and net worth
- Removing outliers from the age and net-worth data
- Processing the dataset by salary and bonus
- Visualizing the salary and bonus data
- Finding the salary and bonus outlier
- Re-visualizing salary and bonus after removing the outlier
The code below was run under Python 3.6.

The Enron scandal led to the largest corporate bankruptcy in history at the time. In 2000 Enron was one of the largest energy companies in the United States, yet within a year of its accounting fraud being exposed it was bankrupt.

We use the Enron data for this machine learning project because the Enron email corpus is already available: roughly 500,000 emails exchanged between about 150 former Enron employees, mostly senior management. It is also the only large public corpus of real corporate email.
If you are interested, the Enron documentary is a classic and sobering watch: [Documentary] Enron: The Smartest Guys in the Room. You can also read an article about the Enron scandal.

For an analysis of the Enron dataset itself, see the previous post: Enron Dataset Analysis.
Regression on age and net worth
```python
#!/usr/bin/python

import random
import numpy
import matplotlib.pyplot as plt
import pickle

from outlier_cleaner import outlierCleaner


# python2-to-python3 shim: the .pkl files were written by Python 2, so wrap the
# text-mode file object and re-encode each read as bytes for pickle.load
class StrToBytes:
    def __init__(self, fileobj):
        self.fileobj = fileobj

    def read(self, size):
        return self.fileobj.read(size).encode()

    def readline(self, size=-1):
        return self.fileobj.readline(size).encode()


### load up some practice data with outliers in it
ages = pickle.load(StrToBytes(open("practice_outliers_ages.pkl", "r")))
net_worths = pickle.load(StrToBytes(open("practice_outliers_net_worths.pkl", "r")))

### ages and net_worths need to be reshaped into 2D numpy arrays
### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
### by convention, n_rows is the number of data points
### and n_columns is the number of features
ages = numpy.reshape(numpy.array(ages), (len(ages), 1))
net_worths = numpy.reshape(numpy.array(net_worths), (len(net_worths), 1))

# from sklearn.cross_validation import train_test_split   # removed in newer scikit-learn
from sklearn.model_selection import train_test_split
ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(
    ages, net_worths, test_size=0.1, random_state=42)

### fill in a regression here!  Name the regression object reg so that
### the plotting code below works, and you can see what your regression looks like
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(ages_train, net_worths_train)

print("slope: ", reg.coef_)
print("intercept: ", reg.intercept_)
print("train score: ", reg.score(ages_train, net_worths_train))
print("test score: ", reg.score(ages_test, net_worths_test))

try:
    plt.plot(ages, reg.predict(ages), color="blue")
except NameError:
    pass
plt.scatter(ages, net_worths)
plt.show()
```

```
slope:  [[5.07793064]]
intercept:  [25.21002155]
train score:  0.4898725961751499
test score:  0.8782624703664671
```
Here the slope is about 5.08, the R-squared on the training set is 0.490, and the R-squared on the test set is 0.878.
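As a quick sanity check, the same R-squared values can be recomputed from the predictions with scikit-learn's `r2_score`. This is a minimal sketch, not part of the original exercise, assuming `reg`, `ages_train`, `ages_test`, `net_worths_train`, and `net_worths_test` from the code above.

```python
# Recompute R-squared directly from the predictions; the numbers should
# match the reg.score() output above.
from sklearn.metrics import r2_score

print("train R^2: ", r2_score(net_worths_train, reg.predict(ages_train)))
print("test R^2: ", r2_score(net_worths_test, reg.predict(ages_test)))
```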
Removing outliers from the age and net-worth data

Define the outlier-cleaning function outlierCleaner() in outlier_cleaner.py; it removes the 10% of points with the largest residual errors.
```python
#!/usr/bin/python

import math


def outlierCleaner(predictions, ages, net_worths):
    """Clean away the 10% of points that have the largest
       residual errors (difference between the prediction
       and the actual net worth).

       Return a list of tuples named cleaned_data where
       each tuple is of the form (age, net_worth, error).
    """
    cleaned_data = []

    ### your code goes here
    errors = abs(predictions - net_worths)
    cleaned_data = zip(ages, net_worths, errors)
    # sort by residual error (ascending) and keep the 90% with the smallest errors
    cleaned_data = sorted(cleaned_data, key=lambda clean: clean[2])
    clean_num = int(math.ceil(len(cleaned_data) * 0.9))
    cleaned_data = cleaned_data[:clean_num]

    print('data length: ', len(ages))
    print('cleaned_data length: ', len(cleaned_data))

    return cleaned_data
```

Now call outlierCleaner() and refit the regression on the cleaned data:
```python
from outlier_cleaner import outlierCleaner

cleaned_data = outlierCleaner(reg.predict(ages_train), ages_train, net_worths_train)
ages_train_new, net_worths_train_new, e = zip(*cleaned_data)
ages_train_new = numpy.reshape(numpy.array(ages_train_new), (len(ages_train_new), 1))
net_worths_train_new = numpy.reshape(numpy.array(net_worths_train_new), (len(net_worths_train_new), 1))

reg.fit(ages_train_new, net_worths_train_new)
print("slope_removal: ", reg.coef_)
print("intercept_removal: ", reg.intercept_)
print("train score_removal: ", reg.score(ages_train_new, net_worths_train_new))
print("test score_removal: ", reg.score(ages_test, net_worths_test))

try:
    plt.plot(ages_train_new, reg.predict(ages_train_new), color="blue")
except NameError:
    pass
plt.scatter(ages_train_new, net_worths_train_new)
plt.show()
```

```
slope_removal:  [[6.36859481]]
intercept_removal:  [-6.91861069]
train score_removal:  0.9513734907601892
test score_removal:  0.9831894553955322
```
After removing the outliers, the slope is about 6.37, the R-squared on the training set is 0.95, and the R-squared on the test set is 0.98: a clearly better fit.
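As an aside that is not part of the original exercise: scikit-learn also offers `RANSACRegressor`, which fits a line on randomly sampled inlier subsets and so copes with outliers without a manual cleaning step. A minimal sketch, assuming the uncleaned `ages_train`/`net_worths_train` split from above; the exact numbers will differ from the manual approach.

```python
# RANSAC repeatedly fits on random subsets and keeps the model with the
# most inliers, so extreme points get ignored automatically.
from sklearn.linear_model import RANSACRegressor

ransac = RANSACRegressor(random_state=42)  # default base estimator is LinearRegression
ransac.fit(ages_train, net_worths_train)

print("RANSAC slope: ", ransac.estimator_.coef_)
print("RANSAC test score: ", ransac.score(ages_test, net_worths_test))
```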
Processing the dataset by salary and bonus

Next we work with the salary and bonus features of the Enron dataset to check whether it contains any outliers. featureFormat (from the course's tools/ directory) converts the data dictionary into a NumPy array containing only the requested features.
```python
import pickle
import sys
import matplotlib.pyplot

sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit


# python2-to-python3 shim, same as above
class StrToBytes:
    def __init__(self, fileobj):
        self.fileobj = fileobj

    def read(self, size):
        return self.fileobj.read(size).encode()

    def readline(self, size=-1):
        return self.fileobj.readline(size).encode()


### read in data dictionary, convert to numpy array
data_dict = pickle.load(StrToBytes(open("../final_project/final_project_dataset.pkl", "r")))
features = ["salary", "bonus"]
data = featureFormat(data_dict, features)
```

Visualizing the salary and bonus data
```python
for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter(salary, bonus)

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()
```
Clearly, the point in the upper-right corner is far away from all the others: it is an outlier.

Finding the salary and bonus outlier
```python
max_value = sorted(data, reverse=True, key=lambda sal: sal[0])[0]
print('the max_value is: ', max_value)
for i in data_dict:
    if data_dict[i]['salary'] == max_value[0]:
        print('Who is the max_value is: ', i)
```

```
the max_value is:  [26704229. 97343619.]
Who is the max_value is:  TOTAL
```
Re-visualizing salary and bonus after removing the outlier
The TOTAL entry is a spreadsheet aggregation row rather than a real person, so we drop it and re-plot:

```python
data_dict.pop('TOTAL', 0)
data = featureFormat(data_dict, features)

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter(salary, bonus)

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()
```

We think there are four more outliers worth investigating; let's take an example. Two people received bonuses of at least $5 million and salaries of more than $1 million; in other words, they made out like bandits.

What are the names associated with these points?
```python
for i in data_dict:
    if data_dict[i]['salary'] != 'NaN' and data_dict[i]['bonus'] != 'NaN':
        if data_dict[i]['salary'] > 1e6 and data_dict[i]['bonus'] > 5e6:
            print(i)
```

```
LAY KENNETH L
SKILLING JEFFREY K
```
Do you think these two outliers should be removed, or kept as data points?
- Keep them: they are valid data points
- Remove them: they are a spreadsheet quirk
- Remove them: they are an error

These two outliers should of course not be removed. The record shows that both men took enormous amounts of money illegally, which made them central figures in the subsequent legal investigation.
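To see why these points are legitimate data rather than errors, you can print their financial records directly. A minimal sketch, assuming `data_dict` from above and the standard field names used in the Udacity version of the dataset (`salary`, `bonus`, `total_payments`, `exercised_stock_options`).

```python
# Print a few financial fields for the two executives flagged above.
for name in ["LAY KENNETH L", "SKILLING JEFFREY K"]:
    record = data_dict[name]
    print(name)
    for field in ["salary", "bonus", "total_payments", "exercised_stock_options"]:
        print("  ", field, ":", record.get(field, "NaN"))
```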
Summary

Fitting a linear regression to the age/net-worth practice data and then dropping the 10% of points with the largest residuals raised the test R-squared from 0.878 to 0.983. In the Enron salary/bonus data, the TOTAL spreadsheet row is a genuine artifact and was removed, while LAY KENNETH L and SKILLING JEFFREY K are valid data points that stay in the dataset.