
Identifying and Handling Missing Values and Outliers: Identifying Outliers, Part One

Published: 2023/11/29


📈Python for finance series

Warning: There is no magic formula or Holy Grail here, though a new world might open its doors to you.


Identifying Outliers

Identifying Outliers: Part Two

Identifying Outliers: Part Three

Stylized Facts

Feature Engineering & Feature Selection

Data Transformation

Pandas has quite a few handy methods for cleaning up messy data, such as dropna and drop_duplicates. However, finding and removing outliers is a feature we would like to have that does not exist yet. Here I will show you how to do it step by step:

The key to defining an outlier lies in the boundary we employ. Here I will give three different ways to define the boundary: the mean, the moving average, and the exponentially weighted moving average, each combined with the matching standard deviation.
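To make the three boundary definitions concrete, here is a minimal sketch on synthetic returns. The random data, the 21-day window, and the variable names are assumptions made for illustration, not values taken from this series.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns stand in for real data (illustration only).
rng = np.random.default_rng(0)
rtn = pd.Series(rng.normal(0.0, 0.02, 500), name="simple_rtn")

window = 21    # hypothetical window length, roughly one trading month
n_sigmas = 3   # boundary width in standard deviations

# 1. Global mean and std: one pair of constant boundaries.
upper_global = rtn.mean() + n_sigmas * rtn.std()
lower_global = rtn.mean() - n_sigmas * rtn.std()

# 2. Moving average: boundaries follow a rolling window.
roll = rtn.rolling(window=window)
upper_roll = roll.mean() + n_sigmas * roll.std()
lower_roll = roll.mean() - n_sigmas * roll.std()

# 3. Exponentially weighted moving average: recent data weighs more.
ewm = rtn.ewm(span=window)
upper_ewm = ewm.mean() + n_sigmas * ewm.std()
lower_ewm = ewm.mean() - n_sigmas * ewm.std()

bounds = pd.DataFrame({
    "upper_global": upper_global, "lower_global": lower_global,
    "upper_roll": upper_roll, "lower_roll": lower_roll,
    "upper_ewm": upper_ewm, "lower_ewm": lower_ewm,
})
print(bounds.tail())
```

The global boundary is a single number, while the rolling and exponentially weighted boundaries are series that change over time, so the same return can be an outlier under one definition and not under another.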

1. Data preparation

Here I use Apple's historical stock prices and returns from Yahoo Finance (2000 through 2010 in the code below) as an example; of course, you can use any data.

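import pandas as pd
import yfinance as yf
import matplotlib.pyplot as plt

# Matplotlib 3.6 and later renamed this style to 'seaborn-v0_8'
plt.style.use('seaborn')
plt.rcParams['figure.dpi'] = 300

# Download daily AAPL prices from Yahoo Finance.
# Recent yfinance releases may adjust prices by default;
# pass auto_adjust=False if the 'Adj Close' column is missing.
df = yf.download('AAPL',
                 start='2000-01-01',
                 end='2010-12-31')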

As we only care about the returns, a new DataFrame (d1) is created to hold the adjusted price and returns.


d1 = pd.DataFrame(df['Adj Close'])
d1.rename(columns={'Adj Close':'adj_close'}, inplace=True)
d1['simple_rtn']=d1.adj_close.pct_change()
d1.head()
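For readers unfamiliar with pct_change, the simple return it computes is R_t = P_t / P_{t-1} - 1. The sketch below checks that on a few made-up prices; the numbers are illustrative and are not real AAPL data.

```python
import pandas as pd

# Made-up prices for illustration only (not real AAPL data).
prices = pd.Series([100.0, 102.0, 99.96, 104.958], name="adj_close")

# pct_change() divides each price by the previous one and subtracts 1.
via_pct_change = prices.pct_change()
by_hand = prices / prices.shift(1) - 1

# The first value is NaN because there is no previous price.
print(pd.DataFrame({"pct_change": via_pct_change, "by_hand": by_hand}))
```

The leading NaN matters later: comparisons against NaN evaluate to False, so the first row is never flagged as an outlier.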

2. Using mean and standard deviation as the boundary

Calculate the mean and std of the simple_rtn:


d1_mean = d1['simple_rtn'].agg(['mean', 'std'])

If we use the mean plus and minus one std as the boundary, the result looks like this:

fig, ax = plt.subplots(figsize=(10, 6))
d1['simple_rtn'].plot(label='simple_rtn', legend=True, ax=ax)
mean, std = d1_mean.loc['mean'], d1_mean.loc['std']
ax.axhline(y=mean, c='r', label='mean')
# The boundaries sit one std above and below the mean, not at +/- std around zero
ax.axhline(y=mean + std, c='c', linestyle='-.', label='mean + std')
ax.axhline(y=mean - std, c='c', linestyle='-.', label='mean - std')
ax.legend(loc='lower right')

What happens if I use three times the std instead?
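One way to see the effect of widening the boundary without a plot is to count the share of observations that fall outside it. The sketch below does this on synthetic, normally distributed returns; the data and the helper name share_outside are assumptions for illustration, and real stock returns will give different shares.

```python
import numpy as np
import pandas as pd

# Synthetic returns stand in for d1['simple_rtn'] (illustration only).
rng = np.random.default_rng(42)
rtn = pd.Series(rng.normal(0.0, 0.02, 2500))
mu, sigma = rtn.mean(), rtn.std()

def share_outside(series, mu, sigma, n_sigmas):
    """Fraction of observations outside mu +/- n_sigmas * sigma."""
    outside = (series > mu + n_sigmas * sigma) | (series < mu - n_sigmas * sigma)
    return outside.mean()

# A wider boundary flags fewer points.
for n in (1, 2, 3):
    print(n, round(share_outside(rtn, mu, sigma, n), 4))
```

With one std, a large share of ordinary days lands outside the band, which is why a wider boundary such as three std is the more usual choice for flagging outliers.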

Looks good! Now it is time to look for those outliers:

mu = d1_mean.loc['mean']
sigma = d1_mean.loc['std']

def get_outliers(row, mu=mu, sigma=sigma, n_sigmas=3):
    '''
    Flag a single row as an outlier (1) or not (0).

    row: one row of the DataFrame (apply with axis=1 passes rows)
    mu: mean
    sigma: std
    n_sigmas: number of std used as the boundary
    '''
    x = row['simple_rtn']

    # x is a scalar here, so plain `or` is enough
    if (x > mu + n_sigmas * sigma) or (x < mu - n_sigmas * sigma):
        return 1
    else:
        return 0

After applying the get_outliers rule to the stock returns, a new column is created:

d1['outlier'] = d1.apply(get_outliers, axis=1)
d1.head()

Tip!

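# The above code snippet can be refactored as follows (vectorized, no row-wise apply).
# The multiplier is 3 and the column is named 'outlier' to match get_outliers above.
import numpy as np

cond = (d1['simple_rtn'] > mu + sigma * 3) | (d1['simple_rtn'] < mu - sigma * 3)
d1['outlier'] = np.where(cond, 1, 0)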
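As a quick sanity check, the row-wise and vectorized versions should produce identical flags. The sketch below compares them on synthetic data with two planted extreme values; the data and the helper name flag_row are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic returns with two planted extreme values (illustration only).
rng = np.random.default_rng(1)
demo = pd.DataFrame({"simple_rtn": rng.normal(0.0, 0.02, 1000)})
demo.loc[[10, 500], "simple_rtn"] = [0.25, -0.30]
demo.loc[0, "simple_rtn"] = np.nan  # mimic the leading NaN from pct_change

mu = demo["simple_rtn"].mean()
sigma = demo["simple_rtn"].std()
n_sigmas = 3

def flag_row(row):
    """Row-wise rule, same logic as the apply-based approach."""
    x = row["simple_rtn"]
    return 1 if (x > mu + n_sigmas * sigma) or (x < mu - n_sigmas * sigma) else 0

row_wise = demo.apply(flag_row, axis=1)

# Vectorized rule: one boolean mask over the whole column.
cond = (demo["simple_rtn"] > mu + n_sigmas * sigma) | (demo["simple_rtn"] < mu - n_sigmas * sigma)
vectorized = pd.Series(np.where(cond, 1, 0), index=demo.index)

print((row_wise == vectorized).all())
```

Both versions treat the NaN row as a non-outlier, because any comparison with NaN is False. The vectorized form is usually much faster on long series since it avoids calling a Python function once per row.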

Let’s have a look at the outliers. We can check how many outliers we found by doing a value count.


d1.outlier.value_counts()

We found 30 outliers when using three times the std as the boundary. We can pick those outliers out, put them into another DataFrame, and show them in the graph:

outliers = d1.loc[d1['outlier'] == 1, ['simple_rtn']]

fig, ax = plt.subplots()
ax.plot(d1.index, d1.simple_rtn,
        color='blue', label='Normal')
ax.scatter(outliers.index, outliers.simple_rtn,
           color='red', label='Anomaly')
ax.set_title("Apple's stock returns")
ax.legend(loc='lower right')

plt.tight_layout()
plt.show()
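For context on the count of 30, it can help to compare it with what a normal distribution would predict for a three-std boundary. The sketch below is a rough back-of-the-envelope calculation: the sample size of 252 trading days per year over 11 years is an assumption, not a number taken from the downloaded data, and normality is only a benchmark.

```python
import math

def two_sided_tail_prob(n_sigmas):
    """P(|Z| > n_sigmas) for a standard normal Z, via the error function."""
    return 1.0 - math.erf(n_sigmas / math.sqrt(2.0))

n_obs = 252 * 11   # assumed sample size: about 252 trading days a year, 2000 through 2010
observed = 30      # outlier count reported in the text

p3 = two_sided_tail_prob(3)
expected = p3 * n_obs

print(f"P(|Z| > 3)     = {p3:.4%}")
print(f"expected count = {expected:.1f} of {n_obs}")
print(f"observed share = {observed / n_obs:.4%}")
```

Under these assumptions the observed count is well above the normal benchmark, which is consistent with the common observation that stock returns have fatter tails than a normal distribution.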

In the plot above, the outliers are marked with red dots. In the next post, I will show you how to use the moving average and moving standard deviation as the boundary.

Happy learning, happy coding!


Translated from: https://medium.com/python-in-plain-english/identifying-outliers-part-one-c0a31d9faefa
