Machine Learning (8): Regression and Outlier Handling (the Enron Dataset)
Contents
- Regression on age and net worth
- Removing outliers from the age and net-worth data
- Processing the dataset by salary and bonus
- Visualizing the salary and bonus data
- Finding the salary and bonus outlier
- Re-visualizing salary and bonus after removing the outlier
The code below was run under Python 3.6.

The Enron scandal led to the largest corporate bankruptcy in history at the time. In 2000 Enron was one of the largest energy companies in the United States, yet within a year of its accounting fraud being exposed it was bankrupt.

We use the Enron data for this machine learning project because the Enron email corpus is already available: roughly 500,000 emails exchanged between about 150 former Enron employees, mostly senior management. It is also the only large public corpus of real corporate email.
If you are interested, the Enron documentary is a classic and sobering watch: [Documentary] Enron: The Smartest Guys in the Room. You can also read an article about the Enron scandal.

For an analysis of the Enron dataset itself, see the previous post: Enron Dataset Analysis.
Regression on age and net worth
```python
#!/usr/bin/python

import random
import numpy
import matplotlib.pyplot as plt
import pickle

from outlier_cleaner import outlierCleaner


# python2-to-python3 shim: the .pkl files were written by Python 2, so wrap the
# text-mode file object and re-encode each read as bytes for pickle.load
class StrToBytes:
    def __init__(self, fileobj):
        self.fileobj = fileobj

    def read(self, size):
        return self.fileobj.read(size).encode()

    def readline(self, size=-1):
        return self.fileobj.readline(size).encode()


### load up some practice data with outliers in it
ages = pickle.load(StrToBytes(open("practice_outliers_ages.pkl", "r")))
net_worths = pickle.load(StrToBytes(open("practice_outliers_net_worths.pkl", "r")))

### ages and net_worths need to be reshaped into 2D numpy arrays
### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
### by convention, n_rows is the number of data points
### and n_columns is the number of features
ages = numpy.reshape(numpy.array(ages), (len(ages), 1))
net_worths = numpy.reshape(numpy.array(net_worths), (len(net_worths), 1))

# from sklearn.cross_validation import train_test_split   # removed in newer scikit-learn
from sklearn.model_selection import train_test_split
ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(
    ages, net_worths, test_size=0.1, random_state=42)

### fill in a regression here!  Name the regression object reg so that
### the plotting code below works, and you can see what your regression looks like
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(ages_train, net_worths_train)

print("slope: ", reg.coef_)
print("intercept: ", reg.intercept_)
print("train score: ", reg.score(ages_train, net_worths_train))
print("test score: ", reg.score(ages_test, net_worths_test))

try:
    plt.plot(ages, reg.predict(ages), color="blue")
except NameError:
    pass
plt.scatter(ages, net_worths)
plt.show()
```

```
slope:  [[5.07793064]]
intercept:  [25.21002155]
train score:  0.4898725961751499
test score:  0.8782624703664671
```
Here the slope is about 5.08, the R-squared on the training set is 0.490, and the R-squared on the test set is 0.878.
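As a quick sanity check, the same R-squared values can be recomputed from the predictions with scikit-learn's `r2_score`. This is a minimal sketch, not part of the original exercise, assuming `reg`, `ages_train`, `ages_test`, `net_worths_train`, and `net_worths_test` from the code above.

```python
# Recompute R-squared directly from the predictions; the numbers should
# match the reg.score() output above.
from sklearn.metrics import r2_score

print("train R^2: ", r2_score(net_worths_train, reg.predict(ages_train)))
print("test R^2: ", r2_score(net_worths_test, reg.predict(ages_test)))
```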
Removing outliers from the age and net-worth data

Define the outlier-cleaning function outlierCleaner() in outlier_cleaner.py; it removes the 10% of points with the largest residual errors.
```python
#!/usr/bin/python

import math


def outlierCleaner(predictions, ages, net_worths):
    """Clean away the 10% of points that have the largest
       residual errors (difference between the prediction
       and the actual net worth).

       Return a list of tuples named cleaned_data where
       each tuple is of the form (age, net_worth, error).
    """
    cleaned_data = []

    ### your code goes here
    errors = abs(predictions - net_worths)
    cleaned_data = zip(ages, net_worths, errors)
    # sort by residual error (ascending) and keep the 90% with the smallest errors
    cleaned_data = sorted(cleaned_data, key=lambda clean: clean[2])
    clean_num = int(math.ceil(len(cleaned_data) * 0.9))
    cleaned_data = cleaned_data[:clean_num]

    print('data length: ', len(ages))
    print('cleaned_data length: ', len(cleaned_data))

    return cleaned_data
```

Now call outlierCleaner() and refit the regression on the cleaned data:
```python
from outlier_cleaner import outlierCleaner

cleaned_data = outlierCleaner(reg.predict(ages_train), ages_train, net_worths_train)
ages_train_new, net_worths_train_new, e = zip(*cleaned_data)
ages_train_new = numpy.reshape(numpy.array(ages_train_new), (len(ages_train_new), 1))
net_worths_train_new = numpy.reshape(numpy.array(net_worths_train_new), (len(net_worths_train_new), 1))

reg.fit(ages_train_new, net_worths_train_new)
print("slope_removal: ", reg.coef_)
print("intercept_removal: ", reg.intercept_)
print("train score_removal: ", reg.score(ages_train_new, net_worths_train_new))
print("test score_removal: ", reg.score(ages_test, net_worths_test))

try:
    plt.plot(ages_train_new, reg.predict(ages_train_new), color="blue")
except NameError:
    pass
plt.scatter(ages_train_new, net_worths_train_new)
plt.show()
```

```
slope_removal:  [[6.36859481]]
intercept_removal:  [-6.91861069]
train score_removal:  0.9513734907601892
test score_removal:  0.9831894553955322
```
After removing the outliers, the slope is about 6.37, the R-squared on the training set is 0.95, and the R-squared on the test set is 0.98: a clearly better fit.
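As an aside that is not part of the original exercise: scikit-learn also offers `RANSACRegressor`, which fits a line on randomly sampled inlier subsets and so copes with outliers without a manual cleaning step. A minimal sketch, assuming the uncleaned `ages_train`/`net_worths_train` split from above; the exact numbers will differ from the manual approach.

```python
# RANSAC repeatedly fits on random subsets and keeps the model with the
# most inliers, so extreme points get ignored automatically.
from sklearn.linear_model import RANSACRegressor

ransac = RANSACRegressor(random_state=42)  # default base estimator is LinearRegression
ransac.fit(ages_train, net_worths_train)

print("RANSAC slope: ", ransac.estimator_.coef_)
print("RANSAC test score: ", ransac.score(ages_test, net_worths_test))
```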
Processing the dataset by salary and bonus

Next we work with the salary and bonus features of the Enron dataset to check whether it contains any outliers. featureFormat (from the course's tools/ directory) converts the data dictionary into a NumPy array containing only the requested features.
```python
import pickle
import sys
import matplotlib.pyplot

sys.path.append("../tools/")
from feature_format import featureFormat, targetFeatureSplit


# python2-to-python3 shim, same as above
class StrToBytes:
    def __init__(self, fileobj):
        self.fileobj = fileobj

    def read(self, size):
        return self.fileobj.read(size).encode()

    def readline(self, size=-1):
        return self.fileobj.readline(size).encode()


### read in data dictionary, convert to numpy array
data_dict = pickle.load(StrToBytes(open("../final_project/final_project_dataset.pkl", "r")))
features = ["salary", "bonus"]
data = featureFormat(data_dict, features)
```

Visualizing the salary and bonus data
```python
for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter(salary, bonus)

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()
```
Clearly, the point in the upper-right corner is far away from all the others: it is an outlier.

Finding the salary and bonus outlier
```python
max_value = sorted(data, reverse=True, key=lambda sal: sal[0])[0]
print('the max_value is: ', max_value)
for i in data_dict:
    if data_dict[i]['salary'] == max_value[0]:
        print('Who is the max_value is: ', i)
```

```
the max_value is:  [26704229. 97343619.]
Who is the max_value is:  TOTAL
```
Re-visualizing salary and bonus after removing the outlier
The TOTAL entry is a spreadsheet aggregation row rather than a real person, so we drop it and re-plot:

```python
data_dict.pop('TOTAL', 0)
data = featureFormat(data_dict, features)

for point in data:
    salary = point[0]
    bonus = point[1]
    matplotlib.pyplot.scatter(salary, bonus)

matplotlib.pyplot.xlabel("salary")
matplotlib.pyplot.ylabel("bonus")
matplotlib.pyplot.show()
```

We think there are four more outliers worth investigating; let's take an example. Two people received bonuses of at least $5 million and salaries of more than $1 million; in other words, they made out like bandits.

What are the names associated with these points?
```python
for i in data_dict:
    if data_dict[i]['salary'] != 'NaN' and data_dict[i]['bonus'] != 'NaN':
        if data_dict[i]['salary'] > 1e6 and data_dict[i]['bonus'] > 5e6:
            print(i)
```

```
LAY KENNETH L
SKILLING JEFFREY K
```
Do you think these two outliers should be removed, or kept as data points?
- Keep them: they are valid data points
- Remove them: they are a spreadsheet quirk
- Remove them: they are an error

These two outliers should of course not be removed. The record shows that both men took enormous amounts of money illegally, which made them central figures in the subsequent legal investigation.
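To see why these points are legitimate data rather than errors, you can print their financial records directly. A minimal sketch, assuming `data_dict` from above and the standard field names used in the Udacity version of the dataset (`salary`, `bonus`, `total_payments`, `exercised_stock_options`).

```python
# Print a few financial fields for the two executives flagged above.
for name in ["LAY KENNETH L", "SKILLING JEFFREY K"]:
    record = data_dict[name]
    print(name)
    for field in ["salary", "bonus", "total_payments", "exercised_stock_options"]:
        print("  ", field, ":", record.get(field, "NaN"))
```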
Summary

Fitting a linear regression to the age/net-worth practice data and then dropping the 10% of points with the largest residuals raised the test R-squared from 0.878 to 0.983. In the Enron salary/bonus data, the TOTAL spreadsheet row is a genuine artifact and was removed, while LAY KENNETH L and SKILLING JEFFREY K are valid data points that stay in the dataset.