Differences between numpy and pandas in Python: mean from pandas and numpy differ

Published: 2024/7/23 python 36 豆豆

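Short version:

The results differ because pandas uses bottleneck (if it is installed) when calling the mean operation, rather than relying on numpy alone. bottleneck is presumably used because it appears to be faster than numpy (at least on my machine), but the cost is precision. The two happen to match for the 64-bit version, but differ in 32-bit land (which is the interesting part).

Long version:

It is extremely difficult to tell what is going on just by inspecting the source code of these modules (they are quite complex, even for simple computations like mean; it turns out numerical computing is hard). It is best to use the debugger to avoid brain-compiling and the mistakes that come with it. The debugger won't make a mistake in logic; it will tell you exactly what is going on.

Here is some of my stack trace (values differ slightly because the RNG is not seeded):

Can reproduce (Windows):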

>>> import numpy as np; import pandas as pd
>>> x=np.random.normal(-9.,.005,size=900000)
>>> df=pd.DataFrame(x,dtype='float32',columns=['x'])
>>> df['x'].mean()
-9.0
>>> x.mean()
-9.0000037501099754
>>> x.astype(np.float32).mean()
-9.0000029

There is nothing special about the numpy version. It is the pandas version that is a little wacky.
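For a sense of scale, the sketch below looks at float32 resolution. It is illustrative only and not part of the original debugging session; the running total of -8,100,000 is an assumed round figure for roughly 900,000 values near -9. Near 9, adjacent float32 values are about 9.5e-7 apart, so -9.0000029 and -9.0000037 differ by about one representable step. Near the size of the full sum, the step is 0.5, so adding a value like -9.0000037 to a float32 total of that size can only move it by a multiple of 0.5.

```python
import numpy as np

# Gap between adjacent float32 values near 9.0 (the size of one sample).
ulp_near_9 = float(np.spacing(np.float32(9.0)))
# Gap between adjacent float32 values near the size of the full sum.
ulp_near_total = float(np.spacing(np.float32(8_100_000.0)))
print(ulp_near_9)
print(ulp_near_total)

# Adding one more sample to a float32 total of that size drops the fraction.
running_total = np.float32(-8_100_000.0)   # assumed illustrative total
next_value = np.float32(-9.0000037)
updated = running_total + next_value
print(updated == np.float32(-8_100_009.0))
```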

Let's have a look at what is going on inside df['x'].mean():

>>> def test_it_2():
...     import pdb; pdb.set_trace()
...     df['x'].mean()
...
>>> test_it_2()
... # Some stepping/poking around that isn't important
(Pdb) l
2307
2308            if we have an ndarray as a value, then simply perform the operation,
2309            otherwise delegate to the object
2310
2311            """
2312 ->         delegate = self._values
2313            if isinstance(delegate, np.ndarray):
2314                # Validate that 'axis' is consistent with Series's single axis.
2315                self._get_axis_number(axis)
2316                if numeric_only:
2317                    raise NotImplementedError('Series.{0} does not implement '
(Pdb) delegate.dtype
dtype('float32')
(Pdb) l
2315                self._get_axis_number(axis)
2316                if numeric_only:
2317                    raise NotImplementedError('Series.{0} does not implement '
2318                                              'numeric_only.'.format(name))
2319                with np.errstate(all='ignore'):
2320 ->                 return op(delegate, skipna=skipna, **kwds)
2321
2322            return delegate._reduce(op=op, name=name, axis=axis, skipna=skipna,
2323                                    numeric_only=numeric_only,
2324                                    filter_type=filter_type, **kwds)

So we have found the trouble spot, but now things get kind of weird:

(Pdb) op
(Pdb) op(delegate)
-9.0
(Pdb) delegate_64 = delegate.astype(np.float64)
(Pdb) op(delegate_64)
-9.000003749978807
(Pdb) delegate.mean()
-9.0000029
(Pdb) delegate_64.mean()
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float64)
-9.0000037499788075
(Pdb) np.nanmean(delegate, dtype=np.float32)
-9.0000029

Note that delegate.mean() and np.nanmean output -9.0000029 with type float32, not -9.0 as pandas nanmean does. With a bit of poking around, you can find the source of pandas nanmean in pandas.core.nanops. Interestingly, at first glance it looks like it should match numpy. Let's have a look at pandas nanmean:

(Pdb) import inspect
(Pdb) src = inspect.getsource(op).split("\n")
(Pdb) for line in src: print(line)
@disallow('M8')
@bottleneck_switch()
def nanmean(values, axis=None, skipna=True):
    values, mask, dtype, dtype_max = _get_values(values, skipna, 0)

    dtype_sum = dtype_max
    dtype_count = np.float64
    if is_integer_dtype(dtype) or is_timedelta64_dtype(dtype):
        dtype_sum = np.float64
    elif is_float_dtype(dtype):
        dtype_sum = dtype
        dtype_count = dtype
    count = _get_counts(mask, axis, dtype=dtype_count)
    the_sum = _ensure_numeric(values.sum(axis, dtype=dtype_sum))

    if axis is not None and getattr(the_sum, 'ndim', False):
        the_mean = the_sum / count
        ct_mask = count == 0
        if ct_mask.any():
            the_mean[ct_mask] = np.nan
    else:
        the_mean = the_sum / count if count > 0 else np.nan

    return _wrap_results(the_mean, dtype)

Here is a (short) version of the bottleneck_switch decorator:

import bottleneck as bn
...
class bottleneck_switch(object):

    def __init__(self, **kwargs):
        self.kwargs = kwargs

    def __call__(self, alt):
        bn_name = alt.__name__

        try:
            bn_func = getattr(bn, bn_name)
        except (AttributeError, NameError):  # pragma: no cover
            bn_func = None
        ...

                if (_USE_BOTTLENECK and skipna and
                        _bn_ok_dtype(values.dtype, bn_name)):
                    result = bn_func(values, axis=axis, **kwds)

The decorator is called with alt as the pandas nanmean function, so bn_name is 'nanmean', and that is the attribute grabbed from the bottleneck module:

(Pdb) l
 93                         result = np.empty(result_shape)
 94                         result.fill(0)
 95                         return result
 97                 if (_USE_BOTTLENECK and skipna and
 98  ->                     _bn_ok_dtype(values.dtype, bn_name)):
 99                     result = bn_func(values, axis=axis, **kwds)
100
101                     # prefer to treat inf/-inf as NA, but must compute the fun
102                     # twice :(
103                     if _has_infs(result):
(Pdb) n
> d:\anaconda3\lib\site-packages\pandas\core\nanops.py(99)f()
-> result = bn_func(values, axis=axis, **kwds)
(Pdb) alt
(Pdb) alt.__name__
'nanmean'
(Pdb) bn_func
(Pdb) bn_name
'nanmean'
(Pdb) bn_func(values, axis=axis, **kwds)
-9.0
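The dispatch pattern seen in this session can be sketched in a few lines. This is a simplified illustration, not the pandas implementation: backend_switch, fake_backend, and the toy mean and median functions are hypothetical names made up for the example. The decorator looks up a function with the same name as the decorated one in a backend namespace and prefers it when it exists.

```python
import functools
import types


def backend_switch(backend, enabled=True):
    """Prefer backend.<name> over the decorated function when available."""
    def decorator(alt):
        # Look up a same-named function on the backend; None if missing.
        fast = getattr(backend, alt.__name__, None)

        @functools.wraps(alt)
        def wrapper(values, **kwargs):
            if enabled and fast is not None:
                return fast(values, **kwargs)   # fast path replaces alt entirely
            return alt(values, **kwargs)        # fallback: the decorated function
        return wrapper
    return decorator


# Hypothetical backend that only provides `mean`.
fake_backend = types.SimpleNamespace(mean=lambda values: "fast path")


@backend_switch(fake_backend)
def mean(values):
    return sum(values) / len(values)


@backend_switch(fake_backend)
def median(values):
    ordered = sorted(values)
    return ordered[len(ordered) // 2]


print(mean([1, 2, 3]))     # served by the backend
print(median([3, 1, 2]))   # no backend version, so the fallback runs
```

The point of the sketch is that the body of the decorated function is never executed when the backend supplies a replacement, so reading that body tells you nothing about the result you actually get.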

Pretend for a moment that the bottleneck_switch() decorator does not exist. Stepping through this function manually (without bottleneck) gives the same result as numpy:

(Pdb) from pandas.core.nanops import _get_counts
(Pdb) from pandas.core.nanops import _get_values
(Pdb) from pandas.core.nanops import _ensure_numeric
(Pdb) values, mask, dtype, dtype_max = _get_values(delegate, skipna=skipna)
(Pdb) count = _get_counts(mask, axis=None, dtype=dtype)
(Pdb) count
900000.0
(Pdb) values.sum(axis=None, dtype=dtype) / count
-9.0000029
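Those manual steps boil down to: zero out the NaNs, sum in the array's own dtype, and divide by the number of valid entries. A minimal numpy-only sketch follows, assuming a 1-D float array and no axis argument; nanmean_numpy_path is a hypothetical helper name, not a pandas function, and it omits the integer, timedelta, and result-wrapping branches of the real code.

```python
import numpy as np


def nanmean_numpy_path(values):
    """NaN-skipping mean that sums and counts in the input's float dtype."""
    values = np.asarray(values)
    mask = np.isnan(values)
    filled = values.copy()
    filled[mask] = 0                                  # NaNs contribute nothing
    count = values.dtype.type(values.size - mask.sum())
    total = filled.sum(dtype=values.dtype)            # numpy's own reduction
    return total / count if count > 0 else values.dtype.type(np.nan)


sample = np.array([1.0, np.nan, 3.0], dtype=np.float32)
result = nanmean_numpy_path(sample)
print(result, type(result).__name__)
```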

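However, if bottleneck is installed, that code path is never called. Instead, the bottleneck_switch() decorator short-circuits the nanmean function with bottleneck's version. This is where the discrepancy lies (interestingly, the two match in the float64 case):

(Pdb) import bottleneck as bn
(Pdb) bn.nanmean(delegate)
-9.0
(Pdb) bn.nanmean(delegate.astype(np.float64))
-9.000003749978807

As far as I can tell, bottleneck is used solely for speed. I assume its nanmean function takes some kind of shortcut, but I did not look into it much (see @ead's answer for details on this topic). Its benchmarks show that it is typically a bit faster than numpy: https://github.com/kwgoodman/bottleneck. Clearly, the price paid for this speed is precision.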
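One plausible mechanism, assumed here rather than verified in this answer, is a plain running sum kept in float32: once the total grows large, each addition is rounded to a coarse grid and the low-order digits of every sample are lost. NumPy's float32 reduction uses pairwise summation, which keeps partial sums small. The sketch below uses deterministic stand-in data (900,000 copies of -9.8) and np.cumsum as a stand-in for a sequential float32 accumulator; it does not call bottleneck.

```python
import numpy as np

# Deterministic stand-in data: 900,000 copies of -9.8 stored as float32.
x32 = np.full(900_000, -9.8, dtype=np.float32)
n = x32.size

# Sequential float32 accumulation: cumsum adds one element at a time.
naive_mean = float(np.cumsum(x32)[-1]) / n
# NumPy's float32 reduction (pairwise summation for contiguous data).
pairwise_mean = float(x32.sum(dtype=np.float32)) / n
# Same data, but accumulated in float64.
wide_mean = float(x32.sum(dtype=np.float64)) / n

print("sequential float32:", naive_mean)
print("pairwise float32:  ", pairwise_mean)
print("float64 accumulator:", wide_mean)
```

If the assumption holds, the sequential result drifts visibly away from -9.8 while the other two stay close, which is the same pattern as the -9.0 versus -9.0000029 discrepancy above.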

Is bottleneck actually faster?

It sure looks like it (at least on my machine).

In [1]: import numpy as np; import pandas as pd
In [2]: x=np.random.normal(-9.8,.05,size=900000)
In [3]: y_32 = x.astype(np.float32)
In [13]: %timeit np.nanmean(y_32)
100 loops, best of 3: 5.72 ms per loop
In [14]: %timeit bn.nanmean(y_32)
1000 loops, best of 3: 854 µs per loop
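Outside IPython, a similar comparison can be sketched with the standard timeit module. This is an illustrative harness, not the measurement above: best_time is a hypothetical helper, the seed and repeat counts are arbitrary, and bottleneck is timed only if it happens to be installed. Absolute numbers will vary by machine and library version.

```python
import timeit

import numpy as np


def best_time(func, data, repeat=3, number=10):
    """Best average seconds per call over `repeat` runs of `number` calls."""
    timings = timeit.repeat(lambda: func(data), repeat=repeat, number=number)
    return min(timings) / number


rng = np.random.default_rng(0)   # seeded so the data is reproducible
y_32 = rng.normal(-9.8, 0.05, size=900_000).astype(np.float32)

results = {"np.nanmean": best_time(np.nanmean, y_32)}

try:
    import bottleneck as bn
except ImportError:
    bn = None   # bottleneck is optional; skip its timing if absent

if bn is not None:
    results["bn.nanmean"] = best_time(bn.nanmean, y_32)

for name, seconds in results.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```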

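It might be nice for pandas to introduce a flag here (one setting for speed, another for better precision, defaulting to speed since that is the current implementation). Some users care much more about the accuracy of the computation than about how fast it runs.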
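Until such a flag exists, two workarounds are available as a hedged sketch: accumulate in float64 explicitly, or turn off bottleneck through the pandas option compute.use_bottleneck. precise_mean is a hypothetical helper name, the data is a deterministic stand-in, and whether the option changes the result depends on the pandas and bottleneck versions installed. The pandas part is skipped if pandas is not available.

```python
import numpy as np


def precise_mean(values):
    """Mean computed with a float64 accumulator, whatever the input dtype."""
    return float(np.mean(np.asarray(values), dtype=np.float64))


# Deterministic stand-in data: 900,000 copies of -9.8 stored as float32.
x32 = np.full(900_000, -9.8, dtype=np.float32)
print(precise_mean(x32))

try:
    import pandas as pd
except ImportError:
    pd = None   # pandas is optional for this sketch

if pd is not None:
    s = pd.Series(x32)
    # Option 1: disable bottleneck for the duration of the block.
    with pd.option_context("compute.use_bottleneck", False):
        print(s.mean())
    # Option 2: upcast before reducing, trading memory for precision.
    print(s.astype("float64").mean())
```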

HTH.
