Pandas CookBook -- 02 DataFrame Basic Operations

Published: 2025/3/20

Pandas Basic Operations

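This is based on a translation by SeanCheney on Jianshu. I adjusted some of the formatting and reorganized the section structure so that it suits my own reading and is easier to look things up in later.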

    import pandas as pd
    import numpy as np

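Set the maximum number of columns and rows to display. The full option names are used here because the short form 'max_columns' can be ambiguous in recent pandas versions.

    pd.set_option('display.max_columns', 5, 'display.max_rows', 5)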

1 Selecting Multiple DataFrame Columns

1.1 Selecting multiple columns with a list

    movie = pd.read_csv('data/movie.csv')
    cols = ['actor_1_name', 'actor_2_name', 'actor_3_name', 'director_name']
    movie_actor_director = movie[cols]
    movie_actor_director

          actor_1_name  actor_2_name      actor_3_name    director_name
    0     CCH Pounder   Joel David Moore  Wes Studi       James Cameron
    1     Johnny Depp   Orlando Bloom     Jack Davenport  Gore Verbinski
    ...   ...           ...               ...             ...
    4914  Alan Ruck     Daniel Henney     Eliza Coupe     Daniel Hsia
    4915  John August   Brian Herzlinger  Jon Gunn        Jon Gunn

    4916 rows × 4 columns
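As a small self-contained sketch of the same idea, the snippet below builds a toy DataFrame (the values are made up and only stand in for movie.csv) and shows that indexing with a list of labels returns a DataFrame, while indexing with a single label returns a Series.

```python
import pandas as pd

# Toy stand-in for movie.csv; the values are illustrative only.
toy = pd.DataFrame({
    'actor_1_name': ['A', 'B'],
    'actor_2_name': ['C', 'D'],
    'director_name': ['E', 'F'],
    'duration': [100, 120],
})

subset = toy[['actor_1_name', 'director_name']]  # list of labels -> DataFrame
single = toy['director_name']                    # single label -> Series
one_col_frame = toy[['director_name']]           # one-element list -> DataFrame

print(type(subset).__name__, subset.shape)
print(type(single).__name__, single.shape)
print(type(one_col_frame).__name__, one_col_frame.shape)
```

Passing a one-element list is a handy way to keep a single column as a DataFrame rather than a Series.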

1.2 Selecting columns by type with select_dtypes

select_dtypes(include=None, exclude=None)

  • To select all numeric types, use np.number or 'number'
  • To select strings you must use the object dtype, but note that this will return all object dtype columns. See the numpy dtype hierarchy
  • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'
  • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'
  • To select Pandas categorical dtypes, use 'category'

    movie.shape
    (4916, 28)
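The selectors listed above can be tried on a toy frame. This is only a sketch: the column names and values are invented for illustration and are not part of movie.csv.

```python
import numpy as np
import pandas as pd

# Invented data with one column per kind of dtype discussed above.
toy = pd.DataFrame({
    'votes': np.array([10, 20, 30], dtype='int64'),
    'score': [7.9, 7.1, np.nan],
    'title': ['a', 'b', 'c'],
    'released': pd.to_datetime(['2009-12-18', '2007-05-25', '2015-11-06']),
    'rating': pd.Categorical(['PG-13', 'PG-13', 'R']),
})

numeric_cols = toy.select_dtypes(include='number').columns.tolist()
int_cols = toy.select_dtypes(include='int64').columns.tolist()
datetime_cols = toy.select_dtypes(include='datetime').columns.tolist()
category_cols = toy.select_dtypes(include='category').columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude='number').columns.tolist()

print(numeric_cols)
print(int_cols)
print(datetime_cols)
print(category_cols)
print(non_numeric_cols)
```

The sketch uses 'int64' rather than 'int' because the width that 'int' maps to can depend on the platform.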

1.2.1 Selecting integer columns

    movie.select_dtypes(include=['int']).head()

       num_voted_users  cast_total_facebook_likes  movie_facebook_likes
    0           886204                       4834                 33000
    1           471220                      48350                     0
    2           275868                      11700                 85000
    3          1144337                     106759                164000
    4                8                        143                     0

1.2.2 Selecting non-integer columns

    movie.select_dtypes(exclude=['int']).head()

       color  director_name      ...  imdb_score  aspect_ratio
    0  Color  James Cameron      ...         7.9          1.78
    1  Color  Gore Verbinski     ...         7.1          2.35
    2  Color  Sam Mendes         ...         6.8          2.35
    3  Color  Christopher Nolan  ...         8.5          2.35
    4  NaN    Doug Walker        ...         7.1           NaN

    5 rows × 25 columns

1.2.3 Selecting multiple columns with the filter method

filter(items=None, like=None, regex=None, axis=None)

  • items : list-like
    • List of info axis to restrict to (must not all be present)
    • Pass a list of column or row labels
  • like : string
    • Keep info axis where "arg in col == True"
    • A substring match, similar to checking col.find(arg) != -1 with Python's str.find()
  • regex : string (regular expression)
    • Keep info axis with re.search(regex, col) == True

Selecting multiple columns by substring with filter()

    movie.filter(like='facebook').head()

       director_facebook_likes  actor_3_facebook_likes  ...  actor_2_facebook_likes  movie_facebook_likes
    0                      0.0                   855.0  ...                   936.0                 33000
    1                    563.0                  1000.0  ...                  5000.0                     0
    2                      0.0                   161.0  ...                   393.0                 85000
    3                  22000.0                 23000.0  ...                 23000.0                164000
    4                    131.0                     NaN  ...                    12.0                     0

    5 rows × 6 columns

Selecting multiple columns with a regular expression (a raw string keeps the backslash intact)

    movie.filter(regex=r'\d').head()

       actor_3_facebook_likes  actor_2_name      ...  actor_3_name          actor_2_facebook_likes
    0                   855.0  Joel David Moore  ...  Wes Studi                              936.0
    1                  1000.0  Orlando Bloom     ...  Jack Davenport                        5000.0
    2                   161.0  Rory Kinnear      ...  Stephanie Sigman                       393.0
    3                 23000.0  Christian Bale    ...  Joseph Gordon-Levitt                 23000.0
    4                     NaN  Rob Walker        ...  NaN                                     12.0

    5 rows × 6 columns

Selecting multiple columns by passing a list to the items parameter of filter()

    movie.filter(items=['actor_1_name', 'actor_3_name']).head()

       actor_1_name     actor_3_name
    0  CCH Pounder      Wes Studi
    1  Johnny Depp      Jack Davenport
    2  Christoph Waltz  Stephanie Sigman
    3  Tom Hardy        Joseph Gordon-Levitt
    4  Doug Walker      NaN
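To compare the three filter() parameters side by side, here is a minimal sketch on a toy frame. The column names are borrowed from the examples above, while 'not_a_column' is a made-up label included to show that items silently ignores labels that are not present.

```python
import pandas as pd

# One dummy row; only the column labels matter for filter().
toy = pd.DataFrame(0, index=[0], columns=[
    'actor_1_name', 'actor_2_name', 'actor_1_facebook_likes',
    'movie_facebook_likes', 'director_name',
])

# items: exact labels; labels that do not exist are skipped
by_items = toy.filter(items=['actor_1_name', 'not_a_column']).columns.tolist()
# like: substring match
by_like = toy.filter(like='facebook').columns.tolist()
# regex: re.search semantics, so a digit anywhere in the label matches
by_regex = toy.filter(regex=r'\d').columns.tolist()
# regex anchored at the end of the label
by_suffix = toy.filter(regex=r'_name$').columns.tolist()

print(by_items)
print(by_like)
print(by_regex)
print(by_suffix)
```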

2 Operating on a DataFrame

2.1 Basic methods

shape gives the (rows, columns) pair, size the total number of elements, and ndim the number of dimensions

    movie.shape, movie.size, movie.ndim
    ((4916, 28), 137648, 2)

Number of non-missing values in each column

    movie.count()

    color                   4897
    director_name           4814
    ...
    aspect_ratio            4590
    movie_facebook_likes    4916
    Length: 28, dtype: int64

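2.2 Statistics

    movie.shape
    (4916, 28)

2.2.1 Maximum and minimum

2.2.1.1 Numeric types

    # min shown here; max and quantile are called the same way.
    # numeric_only=True restricts the result to numeric columns, since recent
    # pandas versions no longer drop non-numeric columns silently.
    movie_min = movie.min(numeric_only=True)
    movie_min.name = 'minimum'
    movie_min

    num_critic_for_reviews    1.00
    duration                  7.00
    ...
    aspect_ratio              1.18
    movie_facebook_likes      0.00
    Name: minimum, Length: 16, dtype: float64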

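By default these computations skip missing values. Setting skipna=False makes them count, but then every column that contains a missing value returns NaN, which is rarely useful.

    movie.min(skipna=False, numeric_only=True)

    num_critic_for_reviews    NaN
    duration                  NaN
    ...
    aspect_ratio              NaN
    movie_facebook_likes      0.0
    Length: 16, dtype: float64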
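The effect of skipna is easy to see on a tiny frame. The sketch below uses made-up numbers: one column with a missing value and one without.

```python
import numpy as np
import pandas as pd

# Made-up values; 'duration' has one missing entry, 'likes' has none.
toy = pd.DataFrame({
    'duration': [178.0, np.nan, 148.0],
    'likes': [33000, 0, 85000],
})

default_min = toy.min()             # missing values are skipped
strict_min = toy.min(skipna=False)  # a single NaN makes the column's result NaN

print(default_min)
print(strict_min)
```

Only the column that actually contains a missing value is affected; 'likes' gives the same minimum either way.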

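2.2.1.2 String types

When a string column contains missing values, the aggregation methods min, max and sum return nothing for it. This is the behavior of the pandas version used in the original post; recent releases may raise a TypeError instead.

    movie[['color', 'movie_title', 'color']].max()

    Series([], dtype: float64)

To force pandas to return a value for every column, the missing values have to be filled first. Here they are filled with an empty string.

    movie[['color', 'movie_title', 'color']].fillna('').max()

    color             Color
    movie_title    Æon Flux
    color             Color
    dtype: object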

2.2.2 Descriptive statistics

2.2.2.1 Numeric columns

Use the percentiles parameter to choose which quantiles are reported

    movie.describe(percentiles=[.01, .3, .99])

           num_critic_for_reviews     duration  ...  aspect_ratio  movie_facebook_likes
    count             4867.000000  4901.000000  ...   4590.000000           4916.000000
    mean               137.988905   107.090798  ...      2.222349           7348.294142
    ...                       ...          ...  ...           ...                   ...
    99%                546.680000   189.000000  ...      4.000000          93850.000000
    max                813.000000   511.000000  ...     16.000000         349000.000000

    9 rows × 16 columns
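A minimal sketch of the percentiles parameter on a toy column holding the integers 1 through 11 (invented data), where the requested quantiles are easy to check by hand. The median is listed explicitly so the result does not depend on whether a given pandas version adds it automatically.

```python
import pandas as pd

# Eleven evenly spaced values, so the quantiles fall on exact data points.
toy = pd.DataFrame({'duration': range(1, 12)})

summary = toy.describe(percentiles=[.1, .5, .9])

print(summary.index.tolist())
print(summary.loc[['10%', '50%', '90%'], 'duration'].tolist())
```

Each requested percentile becomes a row labelled with its percentage, placed between min and max.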

2.2.2.2 String columns

    movie.select_dtypes(include='object').describe()

            color     director_name  ...  country  content_rating
    count    4897              4814  ...     4911            4616
    unique      2              2397  ...       65              18
    top     Color  Steven Spielberg  ...      USA               R
    freq     4693                26  ...     3710            2067

    4 rows × 12 columns

2.3 Chaining methods

Use the isnull method to turn every value into a boolean

    movie.isnull().head()

       color  director_name  ...  aspect_ratio  movie_facebook_likes
    0  False          False  ...         False                 False
    1  False          False  ...         False                 False
    2  False          False  ...         False                 False
    3  False          False  ...         False                 False
    4   True          False  ...          True                 False

    5 rows × 28 columns

sum counts the True values in each column and returns a Series

    movie.isnull().sum().head()

    color                       19
    director_name              102
    num_critic_for_reviews      49
    duration                    15
    director_facebook_likes    102
    dtype: int64

Calling sum again on that Series returns the number of missing values in the whole DataFrame as a scalar

    movie.isnull().sum().sum()
    2654

To check whether the DataFrame contains any missing value at all, chain two any calls

    movie.isnull().any().any()
    True
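The chain above can be followed step by step on a toy frame. The data below is made up for illustration; it has three missing values in total.

```python
import numpy as np
import pandas as pd

# Made-up data: 'color' has 1 missing value, 'duration' has 2, 'likes' has 0.
toy = pd.DataFrame({
    'color': ['Color', None, 'Color'],
    'duration': [178.0, np.nan, np.nan],
    'likes': [1, 2, 3],
})

per_column = toy.isnull().sum()      # Series: missing count per column
total = toy.isnull().sum().sum()     # scalar: missing count for the whole frame
has_any = toy.isnull().any().any()   # scalar: is anything missing at all?

print(per_column)
print(total)
print(has_any)
```

Each reduction removes one dimension: the DataFrame becomes a Series, and the Series becomes a scalar.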

2.4 Operators

Set the row index to INSTNM, then use filter(like='UGDS_') to keep the columns that hold the undergraduate racial proportions

    college = pd.read_csv('data/college.csv', index_col='INSTNM')
    college_ugds_ = college.filter(like='UGDS_')
    college_ugds_

                                                            UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM
    Alabama A & M University                                    0.0333      0.9353  ...    0.0059     0.0138
    University of Alabama at Birmingham                         0.5922      0.2600  ...    0.0179     0.0100
    ...                                                            ...         ...  ...       ...        ...
    Bay Area Medical Academy - San Jose Satellite Location         NaN         NaN  ...       NaN        NaN
    Excel Learning Center-San Antonio South                        NaN         NaN  ...       NaN        NaN

    7535 rows × 9 columns

All columns of college_ugds_ are float64, so arithmetic operators can be applied to the whole DataFrame

    college_ugds_.dtypes

    UGDS_WHITE    float64
    UGDS_BLACK    float64
    ...
    UGDS_NRA      float64
    UGDS_UNKN     float64
    Length: 9, dtype: object

2.4.1 Arithmetic operations

    college_ugds_.head() + .00501

                                         UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM
    Alabama A & M University                0.03831     0.94031  ...   0.01091    0.01881
    University of Alabama at Birmingham     0.59721     0.26501  ...   0.02291    0.01501
    Amridge University                      0.30401     0.42421  ...   0.00501    0.27651
    University of Alabama in Huntsville     0.70381     0.13051  ...   0.03821    0.04001
    Alabama State University                0.02081     0.92581  ...   0.02931    0.01871

    5 rows × 9 columns

2.4.2 Rounding the sample data to whole percentages

2.4.2.1 Method 1

    college_ugds_op_round = (college_ugds_ + .00501) // .01 / 100
    college_ugds_op_round.head()

                                         UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM
    Alabama A & M University                   0.03        0.94  ...      0.01       0.01
    University of Alabama at Birmingham        0.59        0.26  ...      0.02       0.01
    Amridge University                         0.30        0.42  ...      0.00       0.27
    University of Alabama in Huntsville        0.70        0.13  ...      0.03       0.04
    Alabama State University                   0.02        0.92  ...      0.02       0.01

    5 rows × 9 columns

2.4.2.2 Method 2

    college_ugds_round = (college_ugds_ + .00001).round(2)
    college_ugds_round.head()

                                         UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM
    Alabama A & M University                   0.03        0.94  ...      0.01       0.01
    University of Alabama at Birmingham        0.59        0.26  ...      0.02       0.01
    Amridge University                         0.30        0.42  ...      0.00       0.27
    University of Alabama in Huntsville        0.70        0.13  ...      0.03       0.04
    Alabama State University                   0.02        0.92  ...      0.02       0.01

    5 rows × 9 columns

2.4.2.3 Method 3

    college_ugds_op_round_methods = college_ugds_.add(.00501).floordiv(.01).div(100)
    college_ugds_op_round_methods.head()

                                         UGDS_WHITE  UGDS_BLACK  ...  UGDS_NRA  UGDS_UNKN
    INSTNM
    Alabama A & M University                   0.03        0.94  ...      0.01       0.01
    University of Alabama at Birmingham        0.59        0.26  ...      0.02       0.01
    Amridge University                         0.30        0.42  ...      0.00       0.27
    University of Alabama in Huntsville        0.70        0.13  ...      0.03       0.04
    Alabama State University                   0.02        0.92  ...      0.02       0.01

    5 rows × 9 columns
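The three methods can be checked against each other on a toy frame. This is a sketch under stated assumptions: the school names are hypothetical and the proportions are a few values borrowed from the tables above. Adding .00501 and then floor-dividing by .01 shifts each value by just over half a percent before truncating to whole percents, so it behaves like rounding to two decimals; the operator form and the method form perform the same computation.

```python
import numpy as np
import pandas as pd

# Hypothetical schools with proportions taken from the first rows above.
toy = pd.DataFrame(
    {'UGDS_WHITE': [0.0333, 0.5922, np.nan],
     'UGDS_BLACK': [0.9353, 0.2600, np.nan]},
    index=['School A', 'School B', 'School C'],
)

with_operators = (toy + .00501) // .01 / 100            # method 1
with_round = (toy + .00001).round(2)                    # method 2
with_methods = toy.add(.00501).floordiv(.01).div(100)   # method 3

print(with_operators)
print(with_operators.equals(with_methods))
print(np.allclose(with_operators, with_round, equal_nan=True))
```

The method form is convenient in a longer chain, and methods such as add and div also accept extra arguments (for example fill_value) that the operators cannot take.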

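3 Comparing missing values

Pandas uses the NumPy NaN object (np.nan) to represent missing values. It is a special object that is not equal to itself:

    np.nan == np.nan
    False

Every comparison with np.nan returns False, except not-equal:

    5 > np.nan
    False

    5 != np.nan
    True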

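For this reason, a direct == comparison cannot tell whether two DataFrames containing missing values are identical. The number of mismatched cells equals the number of missing values:

    movie_equal = movie == movie
    movie_equal.all().all()
    False

    movie_equal.size - movie_equal.sum().sum()
    2654

    movie.isnull().sum().sum()
    2654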

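The most direct way to compare two DataFrames is the equals() method, which treats missing values in the same positions as equal. In tests, assert_frame_equal from pandas.testing performs a similar comparison: it raises an AssertionError when the frames differ and returns nothing when they match.

    from pandas.testing import assert_frame_equal
    assert_frame_equal(movie, movie)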
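The difference between ==, equals() and assert_frame_equal can be seen on a toy frame with one missing value. The data here is made up for illustration.

```python
import numpy as np
import pandas as pd
from pandas.testing import assert_frame_equal

# Made-up frame with a single NaN.
toy = pd.DataFrame({'a': [1.0, np.nan], 'b': ['x', 'y']})

nan_equals_itself = (np.nan == np.nan)            # False
elementwise_all_equal = (toy == toy).all().all()  # False, because of the NaN cell
same = toy.equals(toy)                            # True, NaNs in the same place match

assert_frame_equal(toy, toy)                      # passes silently

print(nan_equals_itself, elementwise_all_equal, same)
```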

Reposted from: https://www.cnblogs.com/shiyushiyu/p/9734621.html
