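Pandas CookBook -- 02 DataFrame Basic Operations
Pandas Basic Operations
This is a translation by SeanCheney of Jianshu. I adjusted some of the formatting and reorganized the section structure to suit my own reading and to make things easier to find when I come back to it later.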
import pandas as pd
import numpy as np
Set the maximum number of displayed columns and rows:
pd.set_option('display.max_columns', 5, 'display.max_rows', 5)
1 Selecting multiple DataFrame columns
1.1 Selecting multiple columns with a list
movie = pd.read_csv('data/movie.csv')
cols = ['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
movie_actor_director = movie[cols]
movie_actor_director
| actor_1_name | actor_2_name | actor_3_name | director_name |
| CCH Pounder | Joel David Moore | Wes Studi | James Cameron |
| Johnny Depp | Orlando Bloom | Jack Davenport | Gore Verbinski |
| ... | ... | ... | ... |
| Alan Ruck | Daniel Henney | Eliza Coupe | Daniel Hsia |
| John August | Brian Herzlinger | Jon Gunn | Jon Gunn |
4916 rows × 4 columns
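The list-based selection above can be tried without the CSV file. The following is a minimal sketch on a small made-up frame (the rows and the reduced column set are assumptions standing in for data/movie.csv); it shows that a list of labels returns a DataFrame while a single label returns a Series.

```python
import pandas as pd

# Hypothetical toy data standing in for data/movie.csv
movie = pd.DataFrame({
    "movie_title": ["Avatar", "Spectre"],
    "director_name": ["James Cameron", "Sam Mendes"],
    "actor_1_name": ["CCH Pounder", "Christoph Waltz"],
    "duration": [178, 148],
})

cols = ["actor_1_name", "director_name"]
subset = movie[cols]              # a list of labels selects a DataFrame
single = movie["director_name"]   # a single label selects a Series

print(type(subset).__name__, list(subset.columns))
print(type(single).__name__)
```

The columns come back in the order given in the list, not in the order they have in the original frame.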
1.2 Selecting columns by dtype with select_dtypes
select_dtypes(include=None, exclude=None)
- To select all numeric types, use np.number or 'number'
- To select strings you must use the object dtype, but note that this will return all object dtype columns. See the numpy dtype hierarchy
- To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
- To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
- To select Pandas categorical dtypes, use 'category'
1.2.1 Selecting integer columns
movie.select_dtypes(include=['int']).head()
| 886204 | 4834 | 33000 |
| 471220 | 48350 | 0 |
| 275868 | 11700 | 85000 |
| 1144337 | 106759 | 164000 |
| 8 | 143 | 0 |
1.2.2 Selecting non-integer columns
movie.select_dtypes(exclude=['int']).head()
| Color | James Cameron | ... | 7.9 | 1.78 |
| Color | Gore Verbinski | ... | 7.1 | 2.35 |
| Color | Sam Mendes | ... | 6.8 | 2.35 |
| Color | Christopher Nolan | ... | 8.5 | 2.35 |
| NaN | Doug Walker | ... | 7.1 | NaN |
5 rows × 25 columns
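The include and exclude parameters of select_dtypes can be compared on a small frame. This is a sketch with made-up columns (title, year, score are assumptions, not columns of the movie dataset). It spells the integer dtype as 'int64' because the width that the bare 'int' alias maps to can depend on the platform and NumPy version.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one text, one integer, and one float column
df = pd.DataFrame({
    "title": ["a", "b", "c"],
    "year": np.array([2009, 2015, 2012], dtype="int64"),
    "score": [7.9, 6.8, np.nan],
})

ints = df.select_dtypes(include=["int64"])         # integer columns only
numbers = df.select_dtypes(include=["number"])     # integers and floats
non_numbers = df.select_dtypes(exclude=["number"]) # everything else

print(list(ints.columns))
print(list(numbers.columns))
print(list(non_numbers.columns))
```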
1.2.3 Selecting multiple columns with filter
filter(items=None, like=None, regex=None, axis=None)
- items : list-like
  - List of info axis labels to restrict to (they need not all be present)
  - Pass a list of column or row labels
- like : string
  - Keep info axis where "arg in col == True"
  - Similar to Python's str.find(): a label is kept when col.find(arg) finds a match
- regex : string (regular expression)
  - Keep info axis with re.search(regex, col) == True
Select the columns whose names contain a given substring with filter():
movie.filter(like='facebook').head()
| 0.0 | 855.0 | ... | 936.0 | 33000 |
| 563.0 | 1000.0 | ... | 5000.0 | 0 |
| 0.0 | 161.0 | ... | 393.0 | 85000 |
| 22000.0 | 23000.0 | ... | 23000.0 | 164000 |
| 131.0 | NaN | ... | 12.0 | 0 |
5 rows × 6 columns
Select multiple columns with a regular expression (here, every column whose name contains a digit):
movie.filter(regex=r'\d').head()
| 855.0 | Joel David Moore | ... | Wes Studi | 936.0 |
| 1000.0 | Orlando Bloom | ... | Jack Davenport | 5000.0 |
| 161.0 | Rory Kinnear | ... | Stephanie Sigman | 393.0 |
| 23000.0 | Christian Bale | ... | Joseph Gordon-Levitt | 23000.0 |
| NaN | Rob Walker | ... | NaN | 12.0 |
5 rows × 6 columns
Pass a list to the items parameter of filter() to select multiple columns:
movie.filter(items=['actor_1_name', 'actor_3_name']).head()
| actor_1_name | actor_3_name |
| CCH Pounder | Wes Studi |
| Johnny Depp | Jack Davenport |
| Christoph Waltz | Stephanie Sigman |
| Tom Hardy | Joseph Gordon-Levitt |
| Doug Walker | NaN |
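The three filter() modes above can be placed side by side. This is a sketch on a one-row made-up frame (the values and the label no_such_column are assumptions for illustration). It also shows that items silently skips labels that do not exist, unlike indexing with a list.

```python
import pandas as pd

# Hypothetical toy data reusing a few column names from the movie dataset
df = pd.DataFrame({
    "actor_1_name": ["CCH Pounder"],
    "actor_1_facebook_likes": [1000.0],
    "director_name": ["James Cameron"],
    "director_facebook_likes": [0.0],
    "duration": [178],
})

by_like = df.filter(like="facebook")   # substring match on the label
by_regex = df.filter(regex=r"\d")      # labels containing a digit
# Labels that are absent are ignored rather than raising KeyError
by_items = df.filter(items=["actor_1_name", "director_name", "no_such_column"])

print(list(by_like.columns))
print(list(by_regex.columns))
print(list(by_items.columns))
```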
2 Operating on a DataFrame
2.1 Basic methods
The shape (rows, columns), the total number of elements, and the number of dimensions of the dataset:
movie.shape, movie.size, movie.ndim
((4916, 28), 137648, 2)
Number of non-missing values in each column:
movie.count()
color                   4897
director_name           4814
...
aspect_ratio            4590
movie_facebook_likes    4916
Length: 28, dtype: int64
2.2 Summary statistics
movie.shape
(4916, 28)
2.2.1 Maximum and minimum
2.2.1.1 Numeric types
# min max quantile
movie_min = movie.min()
movie_min.name = 'minimum'
movie_min
num_critic_for_reviews    1.00
duration                  7.00
...
aspect_ratio              1.18
movie_facebook_likes      0.00
Name: minimum, Length: 16, dtype: float64
By default the calculation skips missing values. Setting skipna=False includes them, but then every column that contains a missing value returns NaN, which is rarely useful:
movie.min(skipna=False)
num_critic_for_reviews    NaN
duration                  NaN
...
aspect_ratio              NaN
movie_facebook_likes      0.0
Length: 16, dtype: float64
2.2.1.2 String types
In the pandas version used here, when a string column contains missing values, the aggregation methods min, max, and sum return nothing for it.
movie[['color', 'movie_title', 'color']].max()
Series([], dtype: float64)
To force pandas to return a value for each column, the missing values must be filled in. Here they are filled with an empty string:
movie[['color', 'movie_title', 'color']].fillna('').max()
color          Color
movie_title    Æon Flux
color          Color
dtype: object
2.2.2 Descriptive statistics
2.2.2.1 Numeric
Use the percentiles parameter to specify the quantiles to report:
movie.describe(percentiles=[.01, .3, .99])
| 4867.000000 | 4901.000000 | ... | 4590.000000 | 4916.000000 |
| 137.988905 | 107.090798 | ... | 2.222349 | 7348.294142 |
| ... | ... | ... | ... | ... |
| 546.680000 | 189.000000 | ... | 4.000000 | 93850.000000 |
| 813.000000 | 511.000000 | ... | 16.000000 | 349000.000000 |
9 rows × 16 columns
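The effect of skipna and of the percentiles parameter can be checked on a small numeric frame. This is a sketch with made-up values (the two columns are assumptions, not the real movie data), restricted to numeric columns so that it behaves the same way across pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data with one missing duration
df = pd.DataFrame({
    "duration": [178.0, 169.0, np.nan, 164.0],
    "likes": [33000, 0, 85000, 164000],
})

default_min = df.min()             # missing values are skipped
strict_min = df.min(skipna=False)  # a column with any NaN yields NaN

# Requested percentiles appear as extra rows labelled '1%', '30%', '99%'
summary = df.describe(percentiles=[.01, .3, .99])

print(default_min)
print(strict_min)
print(summary.index.tolist())
```

With skipna=False only the likes column keeps a usable minimum, which mirrors the output shown for the movie dataset.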
2.2.2.2 String
movie.select_dtypes(include='object').describe()
| 4897 | 4814 | ... | 4911 | 4616 |
| 2 | 2397 | ... | 65 | 18 |
| Color | Steven Spielberg | ... | USA | R |
| 4693 | 26 | ... | 3710 | 2067 |
4 rows × 12 columns
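For text columns, describe() reports count, unique, top, and freq, which are the four rows in the table above. This is a sketch on a made-up frame that contains only text columns (the values are assumptions); it also repeats the fillna('') step used earlier before taking the maximum.

```python
import numpy as np
import pandas as pd

# Hypothetical text-only data with one missing value in 'color'
df = pd.DataFrame({
    "color": ["Color", "Color", np.nan, "Black and White"],
    "country": ["USA", "UK", "USA", "USA"],
})

# On a frame with no numeric columns, describe() summarizes the text columns
text_summary = df.describe()

# Fill missing values with '' so every entry is a comparable string
filled_max = df.fillna("").max()

print(text_summary)
print(filled_max)
```

String maxima are lexicographic, so 'Color' sorts after 'Black and White' and after the empty string.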
2.3 Chaining methods
Use the isnull method to convert every value to a Boolean:
movie.isnull().head()
| False | False | ... | False | False |
| False | False | ... | False | False |
| False | False | ... | False | False |
| False | False | ... | False | False |
| True | False | ... | True | False |
5 rows × 28 columns
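Because True counts as 1 and False as 0, the Boolean frame returned by isnull can be reduced step by step with sum and any. This is a sketch of that chain on a made-up frame (the three columns and their values are assumptions).

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing values
df = pd.DataFrame({
    "color": ["Color", np.nan, "Color"],
    "duration": [178.0, np.nan, 148.0],
    "likes": [33000, 0, 85000],
})

mask = df.isnull()              # DataFrame of Booleans, same shape as df
per_column = mask.sum()         # Series: missing count per column
total = mask.sum().sum()        # scalar: missing count in the whole frame
has_missing = mask.any().any()  # scalar: is anything missing at all?

print(per_column)
print(total, has_missing)
```

Each reduction removes one dimension: DataFrame to Series, then Series to scalar.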
sum counts the True values in each column and returns a Series:
movie.isnull().sum().head()
color                       19
director_name              102
num_critic_for_reviews      49
duration                    15
director_facebook_likes    102
dtype: int64
Calling sum again on this Series returns the number of missing values in the whole DataFrame, as a scalar:
movie.isnull().sum().sum()
2654
To check whether the DataFrame contains any missing values at all, chain two any calls:
movie.isnull().any().any()
True
2.4 Operators
Set INSTNM as the row index, then filter on UGDS_ to select the columns holding the undergraduate race proportions:
college = pd.read_csv('data/college.csv', index_col='INSTNM')
college_ugds_ = college.filter(like='UGDS_')
college_ugds_
| 0.0333 | 0.9353 | ... | 0.0059 | 0.0138 |
| 0.5922 | 0.2600 | ... | 0.0179 | 0.0100 |
| ... | ... | ... | ... | ... |
| NaN | NaN | ... | NaN | NaN |
| NaN | NaN | ... | NaN | NaN |
7535 rows × 9 columns
Every column of college_ugds_ is a float, so arithmetic with a scalar can be applied to the whole DataFrame:
college_ugds_.dtypes
UGDS_WHITE    float64
UGDS_BLACK    float64
...
UGDS_NRA      float64
UGDS_UNKN     float64
Length: 9, dtype: object
2.4.1 Addition, subtraction, multiplication, division
college_ugds_.head() + .00501
| 0.03831 | 0.94031 | ... | 0.01091 | 0.01881 |
| 0.59721 | 0.26501 | ... | 0.02291 | 0.01501 |
| 0.30401 | 0.42421 | ... | 0.00501 | 0.27651 |
| 0.70381 | 0.13051 | ... | 0.03821 | 0.04001 |
| 0.02081 | 0.92581 | ... | 0.02931 | 0.01871 |
5 rows × 9 columns
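A scalar operand is applied to every cell, and missing values stay missing. This is a sketch on a two-row made-up frame (the school names and values are assumptions standing in for data/college.csv).

```python
import numpy as np
import pandas as pd

# Hypothetical proportions; 'School B' has a missing value
df = pd.DataFrame(
    {"UGDS_WHITE": [0.0333, np.nan], "UGDS_BLACK": [0.9353, 0.2600]},
    index=["School A", "School B"],
)

shifted = df + .00501  # the scalar is broadcast to every cell

print(shifted)
```

The result keeps the index and columns of the original frame. Only the values change, and NaN plus a number is still NaN.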
2.4.2 Rounding the sample data to whole percentages
2.4.2.1 Method 1
college_ugds_op_round = (college_ugds_ + .00501) // .01 / 100
college_ugds_op_round.head()
| 0.03 | 0.94 | ... | 0.01 | 0.01 |
| 0.59 | 0.26 | ... | 0.02 | 0.01 |
| 0.30 | 0.42 | ... | 0.00 | 0.27 |
| 0.70 | 0.13 | ... | 0.03 | 0.04 |
| 0.02 | 0.92 | ... | 0.02 | 0.01 |
5 rows × 9 columns
2.4.2.2 Method 2
college_ugds_round = (college_ugds_ + .00001).round(2)
college_ugds_round.head()
| 0.03 | 0.94 | ... | 0.01 | 0.01 |
| 0.59 | 0.26 | ... | 0.02 | 0.01 |
| 0.30 | 0.42 | ... | 0.00 | 0.27 |
| 0.70 | 0.13 | ... | 0.03 | 0.04 |
| 0.02 | 0.92 | ... | 0.02 | 0.01 |
5 rows × 9 columns
2.4.2.3 Method 3
college_ugds_op_round_methods = college_ugds_.add(.00501).floordiv(.01).div(100)
college_ugds_op_round_methods.head()
| 0.03 | 0.94 | ... | 0.01 | 0.01 |
| 0.59 | 0.26 | ... | 0.02 | 0.01 |
| 0.30 | 0.42 | ... | 0.00 | 0.27 |
| 0.70 | 0.13 | ... | 0.03 | 0.04 |
| 0.02 | 0.92 | ... | 0.02 | 0.01 |
5 rows × 9 columns
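Method 1 and Method 3 perform the same three operations, once with operators and once with the equivalent DataFrame methods, so their results should match exactly. This is a sketch of that check on a made-up frame (the values are assumptions). The round(2) variant is computed as well but not asserted equal, since it takes a different floating-point path.

```python
import numpy as np
import pandas as pd

# Hypothetical proportions standing in for college_ugds_
df = pd.DataFrame({
    "UGDS_WHITE": [0.0333, 0.5922, np.nan],
    "UGDS_BLACK": [0.9353, 0.2600, 0.4220],
})

with_operators = (df + .00501) // .01 / 100
with_methods = df.add(.00501).floordiv(.01).div(100)
with_round = (df + .00001).round(2)

# equals() treats NaN values in the same position as equal
print(with_operators.equals(with_methods))
print(with_round)
```

The method form is longer, but unlike the operators it accepts extra parameters such as axis and fill_value.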
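3 Comparing missing values
Pandas uses the NumPy NaN (np.nan) object to represent missing values. It is a special object that is not equal to itself:
np.nan == np.nan
False
Every comparison with np.nan returns False, except not-equal:
5 > np.nan
False
5 != np.nan
True
For this reason, a direct comparison cannot tell whether two DataFrames that contain missing values are identical:
movie_equal = movie == movie
movie_equal.all().all()
False
movie_equal.size - movie_equal.sum().sum()
2654
movie.isnull().sum().sum()
2654
The most direct way to compare two DataFrames is the equals() method, which treats NaN values in the same position as equal. assert_frame_equal from pandas.testing performs the same kind of check and raises an AssertionError on a mismatch:
from pandas.testing import assert_frame_equal
assert_frame_equal(movie, movie)
Reposted from: https://www.cnblogs.com/shiyushiyu/p/9734621.html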
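The comparisons in section 3 can be reproduced on a small frame. This is a sketch with made-up float columns (the names a and b are assumptions). It contrasts the element-wise == with equals() and assert_frame_equal.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Hypothetical frame with exactly one missing value
df = pd.DataFrame({"a": [1.0, np.nan], "b": [2.0, 3.0]})

elementwise = df == df  # NaN == NaN is False, so one cell is False
mismatches = elementwise.size - elementwise.sum().sum()

print(np.nan == np.nan)         # False
print(elementwise.all().all())  # False
print(mismatches)               # 1, the number of NaN cells
print(df.equals(df))            # True

assert_frame_equal(df, df)      # passes silently; raises on a mismatch
```

The number of False cells in df == df equals the number of missing values, which is why the two counts for the movie dataset agree.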