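Machine Learning (8): Regression and Outlier Handling (Enron Dataset)
Table of contents
- Regression on age and net worth
- Removing outliers from the age and net worth data
- Processing the dataset by salary and bonus
- Visualizing the salary and bonus data
- Finding the salary and bonus outlier
- Re-visualizing salary and bonus after removing the outlier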
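The code below was run under Python 3.6.
The Enron scandal led to what was, at the time, the largest corporate bankruptcy in history. In 2000 Enron was one of the largest energy companies in the United States, yet after its fraud was exposed it went bankrupt within a year.
We use the Enron data for this machine learning project because the Enron email corpus is available: it contains about 500,000 emails exchanged among roughly 150 former Enron employees, most of them senior executives. It is also the only large public database of real emails.
If you are interested, watch the Enron documentary, a classic and a sobering one: [Documentary] Enron: The Smartest Guys in the Room
Or read an article on the Enron scandal.
For an analysis of the Enron dataset, see the previous post:
Enron Dataset Analysis
Regression on age and net worth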
    #!/usr/bin/python
    import random
    import pickle

    import numpy
    import matplotlib.pyplot as plt

    from outlier_cleaner import outlierCleaner

    # Python 2 -> Python 3 shim: the pickle files are opened in text mode,
    # so whatever is read is re-encoded to bytes before pickle sees it.
    class StrToBytes:
        def __init__(self, fileobj):
            self.fileobj = fileobj
        def read(self, size):
            return self.fileobj.read(size).encode()
        def readline(self, size=-1):
            return self.fileobj.readline(size).encode()

    ### load up some practice data with outliers in it
    ages = pickle.load(StrToBytes(open("practice_outliers_ages.pkl", "r")))
    net_worths = pickle.load(StrToBytes(open("practice_outliers_net_worths.pkl", "r")))

    ### ages and net_worths need to be reshaped into 2D numpy arrays
    ### second argument of reshape command is a tuple of integers: (n_rows, n_columns)
    ### by convention, n_rows is the number of data points
    ### and n_columns is the number of features
    ages = numpy.reshape(numpy.array(ages), (len(ages), 1))
    net_worths = numpy.reshape(numpy.array(net_worths), (len(net_worths), 1))

    # from sklearn.cross_validation import train_test_split  (old module path)
    from sklearn.model_selection import train_test_split
    ages_train, ages_test, net_worths_train, net_worths_test = train_test_split(
        ages, net_worths, test_size=0.1, random_state=42)

    ### fill in a regression here!  Name the regression object reg so that
    ### the plotting code below works, and you can see what your regression looks like
    from sklearn.linear_model import LinearRegression
    reg = LinearRegression()
    reg.fit(ages_train, net_worths_train)

    print("slope: ", reg.coef_)
    print("intercept: ", reg.intercept_)
    print("train score: ", reg.score(ages_train, net_worths_train))
    print("test score: ", reg.score(ages_test, net_worths_test))

    try:
        plt.plot(ages, reg.predict(ages), color="blue")
    except NameError:
        pass
    plt.scatter(ages, net_worths)
    plt.show()

Output:

    slope: [[5.07793064]]
    intercept: [25.21002155]
    train score: 0.4898725961751499
    test score: 0.8782624703664671
At this point the slope is 5.08, the R² on the training set is 0.4899, and the R² on the test set is 0.878.
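The scores above come from reg.score(), which for LinearRegression reports the coefficient of determination R². As a minimal sketch of what that number measures, assuming the usual definition R² = 1 - SS_res / SS_tot, the helper below computes it in plain Python. The name r_squared and the toy numbers are made up for illustration and are not part of the original project code.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Sum of squared residuals between actual and predicted values.
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    # Total sum of squares around the mean of the actual values.
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


# Toy data (made up): perfect predictions give R^2 = 1,
# predicting the mean everywhere gives R^2 = 0.
y = [1.0, 2.0, 3.0, 4.0]
perfect = r_squared(y, y)
baseline = r_squared(y, [2.5, 2.5, 2.5, 2.5])
print(perfect, baseline)
```

Read this way, a training R² of about 0.49 means the fitted line explains roughly half of the variance in the training targets. R² can also be negative when the predictions are worse than always predicting the mean.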
Removing outliers from the age and net worth data
In outlier_cleaner.py, define the outlier-cleaning function outlierCleaner(), which removes the 10% of points with the largest residual errors.

    #!/usr/bin/python
    import math

    def outlierCleaner(predictions, ages, net_worths):
        """
            Clean away the 10% of points that have the largest
            residual errors (difference between the prediction
            and the actual net worth).

            Return a list of tuples named cleaned_data where
            each tuple is of the form (age, net_worth, error).
        """
        cleaned_data = []

        ### your code goes here
        errors = abs(predictions - net_worths)
        cleaned_data = zip(ages, net_worths, errors)
        # sort by error, smallest first, then keep the first 90%
        cleaned_data = sorted(cleaned_data, key=lambda clean: clean[2])
        clean_num = int(math.ceil(len(cleaned_data) * 0.9))
        cleaned_data = cleaned_data[:clean_num]
        print('data length: ', len(ages))
        print('cleaned_data length: ', len(cleaned_data))

        return cleaned_data

Call outlierCleaner() and refit the regression on the cleaned data:

    from outlier_cleaner import outlierCleaner

    cleaned_data = outlierCleaner(reg.predict(ages_train), ages_train, net_worths_train)
    ages_train_new, net_worths_train_new, e = zip(*cleaned_data)
    ages_train_new = numpy.reshape(numpy.array(ages_train_new), (len(ages_train_new), 1))
    net_worths_train_new = numpy.reshape(numpy.array(net_worths_train_new), (len(net_worths_train_new), 1))

    reg.fit(ages_train_new, net_worths_train_new)
    print("slope_removal: ", reg.coef_)
    print("intercept_removal: ", reg.intercept_)
    print("train score_removal: ", reg.score(ages_train_new, net_worths_train_new))
    print("test score_removal: ", reg.score(ages_test, net_worths_test))

    try:
        plt.plot(ages_train_new, reg.predict(ages_train_new), color="blue")
    except NameError:
        pass
    plt.scatter(ages_train_new, net_worths_train_new)
    plt.show()

Output:

    slope_removal: [[6.36859481]]
    intercept_removal: [-6.91861069]
    train score_removal: 0.9513734907601892
    test score_removal: 0.9831894553955322
After removing the outliers, the slope is 6.369, the R² on the training set is 0.95, and the R² on the test set is 0.98, a clear improvement.
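The fit, clean, refit cycle can also be tried without the practice pickle files. The following is a minimal sketch on made-up data, using only the standard library: fit_line and drop_largest_residuals are hypothetical helper names, the line is fitted with the closed-form least-squares formulas rather than scikit-learn, and the single injected outlier is an assumption chosen to make the effect easy to see.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def drop_largest_residuals(xs, ys, slope, intercept, keep=0.9):
    """Keep the `keep` fraction of points with the smallest absolute residual."""
    rows = sorted(zip(xs, ys),
                  key=lambda p: abs(p[1] - (slope * p[0] + intercept)))
    n_keep = int(math.ceil(len(rows) * keep))
    return rows[:n_keep]


# Made-up data: y = 2x exactly, except for one corrupted point.
xs = list(range(10))
ys = [2.0 * x for x in xs]
ys[9] = 100.0  # injected outlier

slope_before, intercept_before = fit_line(xs, ys)
kept = drop_largest_residuals(xs, ys, slope_before, intercept_before)
slope_after, intercept_after = fit_line([x for x, _ in kept],
                                        [y for _, y in kept])
print(slope_before, slope_after)
```

In this toy case the first fit is pulled well away from the true slope of 2, and the second fit recovers it once the worst 10% of points is dropped. Note that the residuals are measured against the first, distorted fit, so a single pass is not guaranteed to find every outlier in messier data.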
Processing the dataset by salary and bonus
Here we look at the salary and bonus features in the Enron dataset to check whether the data contains outliers.

    import pickle
    import sys
    import matplotlib.pyplot
    sys.path.append("../tools/")
    from feature_format import featureFormat, targetFeatureSplit

    # Python 2 -> Python 3 shim (same as above)
    class StrToBytes:
        def __init__(self, fileobj):
            self.fileobj = fileobj
        def read(self, size):
            return self.fileobj.read(size).encode()
        def readline(self, size=-1):
            return self.fileobj.readline(size).encode()

    ### read in data dictionary, convert to numpy array
    data_dict = pickle.load(StrToBytes(open("../final_project/final_project_dataset.pkl", "r")))
    features = ["salary", "bonus"]
    data = featureFormat(data_dict, features)

Visualizing the salary and bonus data

    for point in data:
        salary = point[0]
        bonus = point[1]
        matplotlib.pyplot.scatter(salary, bonus)

    matplotlib.pyplot.xlabel("salary")
    matplotlib.pyplot.ylabel("bonus")
    matplotlib.pyplot.show()
Clearly, the point in the upper right corner lies far from all the others; it is an outlier.
Finding the salary and bonus outlier
    # the row with the largest salary
    max_value = sorted(data, reverse=True, key=lambda sal: sal[0])[0]
    print('the max_value is: ', max_value)

    # look up which dictionary key that salary belongs to
    for i in data_dict:
        if data_dict[i]['salary'] == max_value[0]:
            print('Who is the max_value is: ', i)

Output:

    the max_value is: [26704229. 97343619.]
    Who is the max_value is: TOTAL
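The lookup above matches a value from the feature array back to a name in the dictionary. A minimal sketch of an alternative is to rank the dictionary entries directly and skip those whose value is the string 'NaN', which is how missing values appear in the check later in this post. The helper name top_by and the toy records are made up for illustration; only the key TOTAL echoes the real data.

```python
def top_by(records, field, n=3):
    """Return the n keys with the largest numeric value for `field`,
    ignoring entries where the value is the string 'NaN'."""
    valid = [(name, rec[field]) for name, rec in records.items()
             if rec.get(field, 'NaN') != 'NaN']
    valid.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in valid[:n]]


# Made-up records that mimic the shape of data_dict.
toy = {
    'PERSON A': {'salary': 200000, 'bonus': 500000},
    'PERSON B': {'salary': 'NaN', 'bonus': 100000},
    'PERSON C': {'salary': 350000, 'bonus': 'NaN'},
    'TOTAL': {'salary': 900000, 'bonus': 600000},
}
print(top_by(toy, 'salary', n=2))
```

Listing the top few names instead of only the maximum makes it easier to spot a spreadsheet artifact such as a totals row next to genuine high earners.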
Re-visualizing salary and bonus after removing the outlier

    # drop the spreadsheet totals row, then rebuild the feature array
    data_dict.pop('TOTAL', 0)
    data = featureFormat(data_dict, features)

    for point in data:
        salary = point[0]
        bonus = point[1]
        matplotlib.pyplot.scatter(salary, bonus)

    matplotlib.pyplot.xlabel("salary")
    matplotlib.pyplot.ylabel("bonus")
    matplotlib.pyplot.show()

We think there are 4 more outliers worth investigating; let's look at a couple of them. Two people received bonuses of at least 5 million dollars and salaries of over 1 million dollars; in other words, they made out like bandits.
What are the names associated with these points?

    for i in data_dict:
        # missing values are stored as the string 'NaN'
        if data_dict[i]['salary'] != 'NaN' and data_dict[i]['bonus'] != 'NaN':
            if data_dict[i]['salary'] > 1e6 and data_dict[i]['bonus'] > 5e6:
                print(i)

Output:

    LAY KENNETH L
    SKILLING JEFFREY K
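Do you think these two outliers should be removed, or kept as data points?
- Keep them; they are valid data points
- Remove them; they are a spreadsheet quirk
- Remove them; they are errors
These two outliers should of course not be removed. The facts show that both men illegally took large sums of money, and they were key targets of the judicial investigation.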