當前位置：首頁 > 编程资源 > 编程问答 >内容正文

编程问答

数据分析第七讲 pandas练习数据的合并、分组聚合、时间序列、pandas绘图

發布時間：2023/12/15 编程问答 44 豆豆

生活随笔收集整理的這篇文章主要介紹了数据分析第七讲 pandas练习数据的合并、分组聚合、时间序列、pandas绘图小編覺得挺不錯的,現在分享給大家,幫大家做個參考.

文章目錄

數據分析第七講 pandas練習數據的合并和分組聚合
- 一、pandas-DataFrame
- - 練習1
  - - 對于這一組電影數據，如果我們想runtime(電影時長)的分布情況，應該如何呈現數據？
  - 練習2
  - - 全球食品數據分析
- 二、數據的合并和分組聚合
- - 1.字符串離散化
  - 2.數據合并
  - 3.數據的分組聚合
  - 4.索引和復合索引
  - - Series復合索引
    - DateFrame復合索引
  - 5.練習
  - - 1.使用matplotlib呈現出店鋪總數排名前10的國家
    - 2.使用matplotlib呈現出中國每個城市的店鋪數量
- 三、pandas中的時間序列
- - 1.時間范圍
  - 2.DataFrame中使用時間序列
  - 3.pandas重采樣
  - 4.練習
  - - 練習1：統計出911數據中不同月份的電話次數
    - 練習2現在我們有北上廣、深圳、和沈陽5個城市空氣質量數據，請繪制出5個城市的PM2.5隨時間的變化情況
- 四、pandas畫圖
- - 1.折線圖
  - 2.分組柱狀圖
  - 2.餅圖

數據分析第七講 pandas練習數據的合并和分組聚合

一、pandas-DataFrame

練習1

對于這一組電影數據，如果我們想runtime(電影時長)的分布情況，應該如何呈現數據？

數據分析

# 對于這一組電影數據，如果我們想runtime(電影時長)的分布情況，應該如何呈現數據？import pandas as pd from matplotlib import pyplot as plt import matplotlib font = { 'family':'SimHei', 'weight':'bold', 'size':12 } matplotlib.rc("font", **font) file_path = './IMDB-Movie-Data.csv' df = pd.read_csv(file_path) # print(df.info()) ''' <class 'pandas.core.frame.DataFrame'> RangeIndex: 1000 entries, 0 to 999 Data columns (total 12 columns):# Column Non-Null Count Dtype --- ------ -------------- ----- 0 Rank 1000 non-null int64 1 Title 1000 non-null object 2 Genre 1000 non-null object 3 Description 1000 non-null object 4 Director 1000 non-null object 5 Actors 1000 non-null object 6 Year 1000 non-null int64 7 Runtime (Minutes) 1000 non-null int64 8 Rating 1000 non-null float649 Votes 1000 non-null int64 10 Revenue (Millions) 872 non-null float6411 Metascore 936 non-null float64 dtypes: float64(3), int64(4), object(5) memory usage: 93.9+ KB None ''' # print(df.head(3)) # runtime_data = df.loc[:,'Runtime (Minutes)'] # 也可以直接取runtime_data = df['Runtime (Minutes)'] # print(type(runtime_data)) # <class 'pandas.core.series.Series'> runtime_data = df['Runtime (Minutes)'] print(type(runtime_data)) # <class 'pandas.core.series.Series'> print(max(runtime_data)) # 191 print(min(runtime_data)) # 66 # print(runtime_data) ''' 0 121 1 124 2 117 3 108 4 123... 995 111 996 94 997 98 998 93 999 87 Name: Runtime (Minutes), Length: 1000, dtype: int64 ''' runtime_data = df['Runtime (Minutes)'].values # 變成列表的形式 print(type(runtime_data)) # <class 'numpy.ndarray'> print(max(runtime_data)) # 191 print(min(runtime_data)) # 66 # print(runtime_data) ''' [121 124 117 108 123 103 128 89 141 116 133 127 133 107 109 87 139 123118 116 120 137 108 92 120 83 159 99 100 115 111 116 144 108 107 147169 115 132 113 89 111 73 115 99 136 132 91 122 130 136 91 118 101152 161 88 106 117 96 151 86 112 125 130 129 133 120 106 107 133 124108 97 108 169 143 153 151 116 148 118 180 149 137 124 129 162 187 128153 123 146 114 141 116 106 90 105 151 132 115 144 106 116 102 120 110105 108 89 134 118 117 130 105 118 161 104 97 127 139 98 86 164 106165 96 108 156 139 125 86 107 130 140 122 143 138 148 127 94 130 118165 144 104 162 113 121 117 142 88 121 91 94 131 118 112 121 106 90132 118 144 122 129 109 144 148 118 101 84 126 102 130 130 107 134 117118 92 105 112 124 135 113 119 100 125 133 94 128 92 140 124 95 148114 107 113 146 134 126 120 132 99 118 125 111 114 94 144 104 112 126136 104 100 117 96 117 100 158 110 163 119 107 97 102 118 95 139 131114 102 100 85 99 125 134 95 90 126 118 158 109 119 119 112 92 94147 142 112 100 131 105 81 118 119 108 108 117 112 99 102 172 107 85143 169 110 106 89 124 112 157 117 130 115 128 113 119 98 110 105 12795 99 118 112 92 107 143 111 94 109 127 158 132 121 95 97 104 148113 110 104 110 129 180 93 144 138 126 112 81 94 131 104 84 114 109120 106 110 103 95 133 87 133 117 150 123 97 122 88 126 117 107 119131 102 139 110 127 138 102 94 89 124 119 96 119 102 118 123 107 12398 100 132 88 106 120 115 117 136 123 89 113 113 110 111 99 123 13397 126 86 124 81 142 100 121 105 140 126 132 100 115 122 98 101 112112 113 85 110 91 91 92 95 118 100 157 100 137 99 93 115 104 13098 102 108 123 146 101 111 88 108 102 99 166 102 115 104 109 170 102116 132 102 97 123 114 103 88 130 117 100 112 81 125 95 101 102 100124 101 108 119 109 115 108 100 117 119 125 97 109 97 103 129 100 91112 107 157 123 158 153 120 106 137 94 96 113 111 97 115 97 97 94117 115 105 111 98 118 85 114 108 101 106 112 109 96 131 118 109 124141 110 131 95 94 91 94 124 91 132 115 92 150 120 161 111 120 117133 112 106 103 109 100 123 135 117 92 126 89 125 132 130 108 92 108128 105 107 126 103 112 92 108 98 100 106 123 100 104 106 91 108 92122 84 103 91 110 101 127 111 154 96 98 98 109 107 121 101 117 106117 125 146 101 90 103 94 110 133 114 137 110 107 93 152 112 106 10596 116 110 83 97 95 113 88 129 95 110 95 100 98 139 120 124 162135 160 90 92 101 139 95 114 103 88 108 97 109 118 104 88 93 98112 112 120 85 98 129 93 116 83 113 117 122 119 113 85 116 101 133101 104 116 101 103 87 83 140 165 101 100 96 111 95 114 117 102 129106 115 129 143 114 137 131 114 98 126 115 102 101 140 123 112 90 130123 118 96 88 103 122 96 151 104 105 127 93 100 106 128 94 111 106111 165 98 87 106 88 141 96 150 80 95 131 116 108 104 86 138 97123 115 102 89 140 113 109 110 117 100 119 104 100 149 104 126 93 128118 112 99 111 129 110 96 116 86 127 99 106 113 104 107 103 106 11192 87 113 116 114 114 102 118 85 153 114 122 109 119 106 95 98 11096 106 102 90 90 108 113 117 120 82 95 134 108 92 89 92 118 111101 66 102 108 120 98 114 109 88 110 115 105 104 111 116 95 111 12893 100 123 112 87 129 95 94 99 73 107 123 118 119 90 88 91 108191 87 102 94 120 112 83 108 128 113 102 115 108 125 135 97 98 13894 131 108 128 123 104 116 96 92 111 115 123 107 141 113 129 81 85108 120 110 122 99 126 116 85 93 107 114 100 98 98 110 99 120 125119 96 101 143 139 101 88 100 104 134 127 145 92 105 123 117 96 89101 105 98 102 117 110 95 122 95 106 133 87 96 89 122 156 103 91147 113 97 93 102 103 154 117 110 110 86 144 113 107 97 97 120 10892 96 108 109 98 124 86 106 119 99 132 91 100 80 105 115 146 95105 94 100 109 117 110 101 101 115 108 104 180 122 123 83 150 92 105123 111 124 94 113 92 97 120 109 118 133 104 111 102 92 104 99 12892 165 97 97 88 111 94 98 93 87]''' max_runtime = max(runtime_data) min_runtime = min(runtime_data)# 直方圖 # 組數 = (最大值-最小值)/組距 (191-66)//5 = 25 num_bins = (max_runtime - min_runtime) // 5 # print(num_bins) # 25plt.figure(figsize=(15,8),dpi=80) plt.hist(runtime_data,num_bins) plt.xticks(range(min_runtime,max_runtime+5,5)) plt.title('電影時長分布圖') plt.grid() plt.show()

練習2

全球食品數據分析

步驟分析
1.數據清洗
2.獲取國家列表
3.對各個國家進行統計
4.保存統計結果

import pandas as pd# df = pd.read_csv('FoodFacts.csv') # print(df.info()) ''' sys:1: DtypeWarning: Columns (0,3,5,27,36) have mixed types.Specify dtype option on import or set low_memory=False. <class 'pandas.core.frame.DataFrame'> RangeIndex: 65503 entries, 0 to 65502 Columns: 159 entries, code to nutrition_score_uk_100g dtypes: float64(103), object(56) memory usage: 79.5+ MB None'''def get_countries(countries_data):'''獲取國家數據'''low_countries_data = countries_data.str.lower() # 改為小寫# print(low_countries_data.head(30))# 很多行包含兩個國家france,united kingdom# contains 模糊查詢 ~ 非low_countries_data = low_countries_data[~low_countries_data.str.contains(',')]countries = low_countries_data.unique() # 返回列的所有唯一值# print(countries)'''['united kingdom' 'france' 'spain' 'germany' 'united states' 'belgium''australia' 'netherlands' 'cuba' 'canada' 'switzerland' 'austria''sweden' 'united arab emirates' 'saint pierre and miquelon' 'fr:quebec''italy' 'czech republic' 'china' 'cambodia' 'réunion' 'hong kong''brazil' 'japan' 'mexico' 'lebanon' 'philippines' 'guadeloupe' 'ireland''senegal' "c?te d'ivoire" 'togo' 'poland' 'india' 'south korea' 'turkey''french guiana' 'portugal' 'south africa' 'romania' 'denmark' 'greece''luxembourg' 'new caledonia' 'russia' 'french polynesia' 'tunisia''martinique' 'mayotte' 'hungary' 'bulgaria' 'slovenia' 'finland''republique-de-chine' 'taiwan' 'lithuania' 'belarus' 'cyprus' 'irlande''albania' 'malta' 'iceland' 'polska' 'kenya' 'mauritius' 'algeria' 'iran''qatar' 'thailand' 'colombia' 'norway' 'israel' 'venezuela' 'argentina''chile' 'new zealand' 'andorra' 'serbia' 'other-turquie' 'iraq''nederland' 'singapore' 'indonesia' 'burkina faso']'''# print("一共有%s個國家"%len(countries)) # 一共有84個國家return countriesdef get_additives_count(countries, data):count_list = []for country in countries:f_data = data[data['countries_en'].str.contains(country, case=False)] # 模糊查找，不區分大小寫# print(f_data.head(10))# exit()'''countries_en additives_n5 United Kingdom 0.010 United Kingdom 5.011 United Kingdom 5.014 United Kingdom 0.017 United Kingdom 0.018 United Kingdom 0.046 France,United Kingdom 1.060 United Kingdom 0.084 United Kingdom 0.097 United Kingdom 0.0'''count = f_data['additives_n'].mean() # 計算平均值count_list.append(count)# print(count_list)'''[1.259009009009009, 1.93042170523602, 0.930323846908734, 0.7779232111692844, 2.1806083650190113, 1.8857142857142857, 0.5796847635726795, 1.5714285714285714, 1.8333333333333333, 1.847457627118644, 1.7183673469387755, 0.6185567010309279, 0.18181818181818182, 1.7142857142857142, 2.4324324324324325, 1.1724137931034482, 0.9190751445086706, 1.2666666666666666, 1.2, 0.2222222222222222, 2.0625, 1.375, 0.898876404494382, 0.8666666666666667, 1.75, 0.6, 0.8, 1.6748466257668713, 0.7142857142857143, 1.9166666666666667, 0.0, 8.0, 2.310810810810811, 2.0, 2.0, 0.3, 1.7954545454545454, 1.8299492385786802, 1.7272727272727273, 1.7692307692307692, 1.5912408759124088, 1.1111111111111112, 2.1923076923076925, 1.3333333333333333, 0.125, 2.0, 2.246153846153846, 1.5454545454545454, 0.0, 0.5384615384615384, 1.25, 0.25, 0.4166666666666667, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.2857142857142857, 1.3333333333333333, 0.6666666666666666, 0.0, 1.5, 2.1666666666666665, 2.857142857142857, 0.0, 4.4, 0.6896551724137931, 0.5, 1.5714285714285714, 1.0, 0.0, 1.9090909090909092, 3.5, 2.6607142857142856, 0.2, 2.0, 0.0, 1.5, 0.0, 1.0, 2.125, 1.6666666666666667]'''# print(len(count_list)) # 84result_df = pd.DataFrame()result_df['country'] = countriesresult_df['count'] = count_listreturn result_dfdef main():'''主函數'''# 讀取數據df1 = pd.read_csv('FoodFacts.csv', usecols=['countries_en', 'additives_n'])# print(df1.info())'''<class 'pandas.core.frame.DataFrame'>RangeIndex: 65503 entries, 0 to 65502Data columns (total 2 columns):# Column Non-Null Count Dtype --- ------ -------------- ----- 0 countries_en 65292 non-null object 1 additives_n 43664 non-null float64dtypes: float64(1), object(1)memory usage: 1023.6+ KBNone'''# 數據清洗刪除缺失數據data = df1.dropna()# print(data.info())'''<class 'pandas.core.frame.DataFrame'>Int64Index: 43616 entries, 5 to 65501Data columns (total 2 columns):# Column Non-Null Count Dtype --- ------ -------------- ----- 0 countries_en 43616 non-null object 1 additives_n 43616 non-null float64dtypes: float64(1), object(1)memory usage: 1022.2+ KBNone'''# print(data.head())''' countries_en additives_n5 United Kingdom 0.06 France 0.08 France 0.010 United Kingdom 5.011 United Kingdom 5.0'''countries = get_countries(data['countries_en'])# print(countries)# 獲取每個國家添加劑的數量additives = get_additives_count(countries, data)additives.to_csv('result.csv', index=False) # 保存，不加索引if __name__ == "__main__":main()

二、數據的合并和分組聚合

1.字符串離散化

字符串離散化的案例
對于這一組電影數據，如果我們希望統計電影分類(genre)的情況，應該如何處理數據？
思路：重新構造一個全為0的數組，列名為分類，如果某一條數據中分類出現過，就讓0變為1

# 字符串離散化的案例 import pandas as pd import numpy as np from matplotlib import pyplot as plt import matplotlibfont = {'family': 'SimHei','weight': 'bold','size': 12 } matplotlib.rc("font", **font)pd.set_option('display.max_columns', None) pd.set_option('display.max_rows', 5) # 只看5行 pd.set_option('max_colwidth', 200)file_path = './IMDB-Movie-Data.csv'df = pd.read_csv(file_path) # print(df) # print(df.info()) # print(df.head(3)) g_list = df.loc[:, 'Genre'] # 獲取分類列 # print(g_list) ''' 0 Action,Adventure,Sci-Fi 1 Adventure,Mystery,Sci-Fi... 998 Adventure,Comedy 999 Comedy,Family,Fantasy Name: Genre, Length: 1000, dtype: object''' g_list = df["Genre"].str.split(",").tolist() # [[],...,[]] # print(g_list) # temp_list = [] # for i in g_list: # for j in i: # # print(j) # temp_list.append(j) # print(set(temp_list)) genre_list = list(set([j for i in g_list for j in i])) # 電影分類類名 # print(genre_list) '''['Action', 'Crime', 'Biography', 'Adventure', 'Thriller', 'Music', 'Romance', 'Sport', 'Family', 'Horror', 'Western', 'War', 'Animation', 'Mystery', 'Sci-Fi', 'History', 'Drama', 'Musical', 'Fantasy', 'Comedy']'''# 構造全為0的數組 zeros_data = pd.DataFrame(np.zeros((df.shape[0], len(genre_list))), columns=genre_list) # print(zeros_data) '''Action Horror Family Thriller Animation History Romance Fantasy \ 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 .. ... ... ... ... ... ... ... ... 998 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 999 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Biography Western Drama Comedy Sci-Fi Crime War Sport Mystery \ 0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 .. ... ... ... ... ... ... ... ... ... 998 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 999 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 Musical Music Adventure 0 0.0 0.0 0.0 1 0.0 0.0 0.0 .. ... ... ... 998 0.0 0.0 0.0 999 0.0 0.0 0.0 [1000 rows x 20 columns]'''# 給每個電影出現的位置賦值1 # print(df) # [1000 rows x 12 columns] for i in range(df.shape[0]):# zeros_data[0,[Action,Musical]] = 1zeros_data.loc[i, g_list[i]] = 1 # print(zeros_data.head(3)) '''Music Sport Mystery Adventure Musical Comedy Horror Fantasy Action \ 0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1 0.0 0.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 Biography Family Sci-Fi War Thriller Romance Drama Crime Animation \ 0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 2 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 Western History 0 0.0 0.0 1 0.0 0.0 2 0.0 0.0 ''' genre_count = zeros_data.sum(axis=0) # 對0軸進行求和 # print(genre_count) ''' Thriller 195.0 Biography 81.0... Action 303.0 Sci-Fi 120.0 Length: 20, dtype: float64'''x = genre_count.index y = genre_count.valuesplt.figure(figsize=(15, 8), dpi=80) plt.bar(x, y) plt.xticks(range(len(x)), x, rotation=45) plt.title("電影分類圖") plt.grid() plt.show()

2.數據合并

join:默認情況下它是把行索引相同的數據合并到一起
merge:按照指定的列把數據按照一定的方式合并到一起
示例1

示例2

# 數據合并 import numpy as np import pandas as pdt1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('ABC'), columns=list('DEFG')) t2 = pd.DataFrame(np.arange(10).reshape(2, 5), index=list('AB'), columns=list('DWXYZ')) print(t1) '''D E F G A 0 1 2 3 B 4 5 6 7 C 8 9 10 11''' print(t2) '''D W X Y Z A 0 1 2 3 4 B 5 6 7 8 9''' print(t1 + t2) '''D E F G W X Y Z A 0.0 NaN NaN NaN NaN NaN NaN NaN B 9.0 NaN NaN NaN NaN NaN NaN NaN C NaN NaN NaN NaN NaN NaN NaN NaN''' print(t1.merge(t2)) # merge:至少要有一個相同的列索引，相同的列下有相同的元素，就把該元素所在的行（去除相同的元素）合并到一行 '''D E F G W X Y Z 0 0 1 2 3 1 2 3 4''' t1.iloc[2, 0] = 5 print(t1) '''D E F G A 0 1 2 3 B 4 5 6 7 C 5 9 10 11''' print(t2) '''D W X Y Z A 0 1 2 3 4 B 5 6 7 8 9''' print(t1.merge(t2)) # 以t1為主，merge:至少要有一個相同的列索引，相同的列下有相同的元素，就把該元素所在的行（去除相同的元素）合并到一行，合并后的數據行索引從0開始 '''D E F G W X Y Z 0 0 1 2 3 1 2 3 4 1 5 9 10 11 6 7 8 9''' print(t2.merge(t1)) # 以t2為主，merge:至少要有一個相同的列索引，相同的列下有相同的元素，就把該元素所在的行（去除相同的元素）合并到一行，合并后的數據行索引從0開始 '''D W X Y Z E F G 0 0 1 2 3 4 1 2 3 1 5 6 7 8 9 9 10 11''' t1.iloc[[0, 1, 2], 0] = 1 print(t1) '''D E F G A 1 1 2 3 B 1 5 6 7 C 1 9 10 11''' print(t2) '''D W X Y Z A 0 1 2 3 4 B 5 6 7 8 9''' print(t2.merge(t1)) # 以t2為主，merge:至少要有一個相同的列索引，相同的列下沒有相同的元素，Empty DataFrame ''' Empty DataFrame Columns: [D, W, X, Y, Z, E, F, G] Index: []''' t1.iloc[[0, 1, 2], 0] = 0 print(t1) '''D E F G A 0 1 2 3 B 0 5 6 7 C 0 9 10 11''' print(t2) '''D W X Y Z A 0 1 2 3 4 B 5 6 7 8 9''' print(t2.merge(t1)) # 以t2為主，merge:至少要有一個相同的列索引，相同的列下有相同的元素，就把該元素所在的行（去除相同的元素）合并到一行，合并后的數據行索引從0開始 '''D W X Y Z E F G 0 0 1 2 3 4 1 2 3 1 0 1 2 3 4 5 6 7 2 0 1 2 3 4 9 10 11''' print(t2.merge(t1,how='left')) # 以t2為主，merge:至少要有一個相同的列索引，相同的列下有相同的元素，就把該元素所在的行（去除相同的元素）合并到一行，合并后的數據行索引從0開始 how='left'保留t2的原數據，后面用NaN填充 '''D W X Y Z E F G 0 0 1 2 3 4 1.0 2.0 3.0 1 0 1 2 3 4 5.0 6.0 7.0 2 0 1 2 3 4 9.0 10.0 11.0 3 5 6 7 8 9 NaN NaN NaN''' print(t2.merge(t1, on='D', how='outer')) '''D W X Y Z E F G 0 0 1 2 3 4 1.0 2.0 3.0 1 0 1 2 3 4 5.0 6.0 7.0 2 0 1 2 3 4 9.0 10.0 11.0 3 5 6 7 8 9 NaN NaN NaN'''t3 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('abc'), columns=list('MNOP')) t4 = pd.DataFrame(np.arange(10).reshape(2, 5), index=list('de'), columns=list('MWXYZ')) print(t3) '''M N O P a 0 1 2 3 b 4 5 6 7 c 8 9 10 11''' print(t4) '''M W X Y Z d 0 1 2 3 4 e 5 6 7 8 9 ''' print(t3.merge(t4,left_on='N',right_on='Z')) '''M_x N O P M_y W X Y Z 0 8 9 10 11 5 6 7 8 9'''

3.數據的分組聚合

df.groupby(by=“columns_name”)

dict_obj = {
‘key1’ : [‘a’, ‘b’, ‘a’, ‘b’,‘a’, ‘b’, ‘a’, ‘a’],
‘key2’ : [‘one’, ‘one’, ‘two’, ‘three’,‘two’, ‘two’, ‘one’, ‘three’],
‘data1’: np.arange(8),
‘data2’: np.arange(8)
}

# 分組聚合 import pandas as pd import numpy as npdict_obj = {'key1': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'a'],'key2': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'],'data1': np.arange(8),'data2': np.arange(8) } df = pd.DataFrame(dict_obj) print(df) '''key1 key2 data1 data2 0 a one 0 0 1 b one 1 1 2 a two 2 2 3 b three 3 3 4 a two 4 4 5 b two 5 5 6 a one 6 6 7 a three 7 7 ''' # 分組 groupby df.groupby(by="key1") print(df.groupby(by="key1")) # <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000000002CB9208> for i in df.groupby(by="key1"):print(i) ''' ('a', key1 key2 data1 data2 0 a one 0 0 2 a two 2 2 4 a two 4 4 6 a one 6 6 7 a three 7 7) ('b', key1 key2 data1 data2 1 b one 1 1 3 b three 3 3 5 b two 5 5) ''' print(df.groupby(by="key1").mean()) # 求平均值,key2不是數字類型直接忽略掉 '''data1 data2 key1 a 3.8 3.8 b 3.0 3.0''' print(df.groupby(by="key1").sum()) # 求和,key2不是數字類型直接忽略掉 '''data1 data2 key1 a 19 19 b 9 9''' print(df.groupby(by="key1").count()) # 記錄個數 '''key2 data1 data2 key1 a 5 5 5 b 3 3 3'''

練習
現在我們有一組關于全球星巴克店鋪的統計數據，如果我想知道美國的星巴克數量和中國的哪個多，或者我想知道中國每個省份星巴克的數量的情況，那么應該怎么辦？

# 現在我們有一組關于全球星巴克店鋪的統計數據，如果我想知道美國的星巴克數量和 # 中國的哪個多，或者我想知道中國每個省份星巴克的數量的情況，那么應該怎么辦？ # 思路：根據國家分組，統計每個國家的總數進行對比，取出中國的數據，按省份分組統計每個省份的總數進行對比 import numpy as np import pandas as pdfile_path = './starbucks_store_worldwide.csv' df = pd.read_csv(file_path) print(df.info()) ''' <class 'pandas.core.frame.DataFrame'> RangeIndex: 25600 entries, 0 to 25599 Data columns (total 13 columns):# Column Non-Null Count Dtype --- ------ -------------- ----- 0 Brand 25600 non-null object 1 Store Number 25600 non-null object 2 Store Name 25600 non-null object 3 Ownership Type 25600 non-null object 4 Street Address 25598 non-null object 5 City 25585 non-null object 6 State/Province 25600 non-null object 7 Country 25600 non-null object 8 Postcode 24078 non-null object 9 Phone Number 18739 non-null object 10 Timezone 25600 non-null object 11 Longitude 25599 non-null float6412 Latitude 25599 non-null float64 dtypes: float64(2), object(11) memory usage: 2.5+ MB None''' print(df.head()) '''Brand Store Number ... Longitude Latitude 0 Starbucks 47370-257954 ... 1.53 42.51 1 Starbucks 22331-212325 ... 55.47 25.42 2 Starbucks 47089-256771 ... 55.47 25.39 3 Starbucks 22126-218024 ... 54.38 24.48 4 Starbucks 17127-178586 ... 54.54 24.51 ''' pd.set_option('display.max_rows', 10) pd.set_option('display.max_columns', 13) print(df) '''City State/Province Country Postcode Phone Number \ 0 Andorra la Vella 7 AD AD500 376818720 1 Ajman AJ AE NaN NaN 2 Ajman AJ AE NaN NaN 3 Abu Dhabi AZ AE NaN NaN 4 Abu Dhabi AZ AE NaN NaN ... ... ... ... ... ... 25595 Thành Ph? H? Chí Minh SG VN 70000 08 3824 4668 25596 Thành Ph? H? Chí Minh SG VN 70000 08 5413 8292 25597 Johannesburg GT ZA 2194 27873500159 25598 Menlyn GT ZA 181 NaN 25599 Midrand GT ZA 1682 27873500215 Timezone Longitude Latitude 0 GMT+1:00 Europe/Andorra 1.53 42.51 1 GMT+04:00 Asia/Dubai 55.47 25.42 2 GMT+04:00 Asia/Dubai 55.47 25.39 3 GMT+04:00 Asia/Dubai 54.38 24.48 4 GMT+04:00 Asia/Dubai 54.54 24.51 ... ... ... ... 25595 GMT+000000 Asia/Saigon 106.70 10.78 25596 GMT+000000 Asia/Saigon 106.71 10.72 25597 GMT+000000 Africa/Johannesburg 28.04 -26.15 25598 GMT+000000 Africa/Johannesburg 28.28 -25.79 25599 GMT+000000 Africa/Johannesburg 28.11 -26.02 [25600 rows x 13 columns] ''' grouped = df.groupby(by='Country').count()['Brand'] print(type(grouped)) # <class 'pandas.core.series.Series'> print(grouped) ''' AD 1 AE 144 AR 108 AT 18 AU 22... TT 3 TW 394 US 13608 VN 25 ZA 3 Name: Brand, Length: 73, dtype: int64''' us_count = grouped['US'] # us_count = grouped.loc['US']也可以 print(us_count) # 13608 cn_count = grouped['CN'] # cn_count = grouped.loc['CN']也可以 print(cn_count) # 2734 print("美國的星巴克總數位%s,中國的星巴克為%s" % (us_count, cn_count)) '''美國的星巴克總數位13608,中國的星巴克為2734'''# 布爾索引 china_data = df[df['Country'] == 'CN'] print(china_data) '''Brand Store Number Store Name Ownership Type \ 2091 Starbucks 22901-225145 北京西站第一咖啡店 Company Owned 2092 Starbucks 32320-116537 北京華宇時尚店 Company Owned 2093 Starbucks 32447-132306 北京藍色港灣圣拉娜店 Company Owned 2094 Starbucks 17477-161286 北京太陽宮凱德嘉茂店 Company Owned 2095 Starbucks 24520-237564 北京東三環北店 Company Owned ... ... ... ... ... 4820 Starbucks 17872-186929 Sands Licensed 4821 Starbucks 24126-235784 Wynn II Licensed 4822 Starbucks 28490-242269 Wynn Palace BOH Licensed 4823 Starbucks 22210-218665 Sands Cotai Central Licensed 4824 Starbucks 17108-179449 One Central Licensed Street Address City State/Province \ 2091 豐臺區, 北京西站通廊7-1號, 中關村南大街2號北京市 11 2092 海淀區, 數碼大廈B座華宇時尚購物中心內, 藍色港灣國際商區1座C1-3單元首層、北京市 11 2093 朝陽區朝陽公園路6號, 二層C1-3單元及二層陽臺, 太陽宮中路12號北京市 11 2094 朝陽區, 太陽宮凱德嘉茂一層01-44/45號, 東三環北路27號北京市 11 2095 朝陽區, 嘉銘中心大廈A座B1層024商鋪, 金融大街7號北京市 11 ... ... ... ... 4820 Portion of Shop 04, Ground Floor, Sands, Largo... Macau 92 4821 Wynn Macau, Rua Cidada de Sintra, NAPE Macau 92 4822 Employee Entrance Outlet, Wynn Cotai, Resort Macau 92 4823 Shop K201 & K202, Level 02, Parcela 5&6, Estra... Macau 92 4824 Promenade Rd, Open Area, 2/F Macau 92 Country Postcode Phone Number Timezone Longitude \ 2091 CN 100073 NaN GMT+08:00 Asia/Beijing 116.32 2092 CN 100086 010-51626616 GMT+08:00 Asia/Beijing 116.32 2093 CN 100020 010-59056343 GMT+08:00 Asia/Beijing 116.47 2094 CN 100028 010-84150945 GMT+08:00 Asia/Beijing 116.45 2095 CN NaN NaN GMT+08:00 Asia/Beijing 116.46 ... ... ... ... ... ... 4820 CN NaN (853)28782773 GMT+08:00 Asia/Beijing 113.55 4821 CN NaN 85328723516 GMT+08:00 Asia/Beijing 113.55 4822 CN NaN NaN GMT+08:00 Asia/Beijing 113.54 4823 CN NaN 85328853439 GMT+08:00 Asia/Beijing 113.56 4824 CN NaN NaN GMT+08:00 Asia/Beijing 113.55 Latitude 2091 39.90 2092 39.97 2093 39.95 2094 39.97 2095 39.93 ... ... 4820 22.19 4821 22.19 4822 22.20 4823 22.15 4824 22.19 [2734 rows x 13 columns] ''' china_province_data = china_data.groupby(by='State/Province').count() print(china_province_data['Brand']) ''' State/Province 11 236 12 58 13 24 14 8 15 8... 62 3 63 3 64 2 91 162 92 13 Name: Brand, Length: 31, dtype: int64'''

DataFrameGroupBy對象方法
方法 ??????????說明
count?????????分組中非NA值的數量
sum???????????非NA值的和
mean?????????非NA值的平均值
min,max????非NA值的最小值和最大值

4.索引和復合索引

簡單的索引操作: t.index
指定index: t.index = [‘a’,‘b’,‘c’]
重新設置index : t.reindex([“a”,“e”])
指定某一列作為index : t.set_index(“M”,drop=False)
返回index的唯一值: t.set_index(“M”).index.unique()
設置兩個索引的時候會是什么樣子呢?
示例

# 索引和復合索引 import numpy as np import pandas as pdt1 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('ABC'), columns=list('WXYZ')) print(t1) '''W X Y Z A 0 1 2 3 B 4 5 6 7 C 8 9 10 11''' print(t1.index) # Index(['A', 'B', 'C'], dtype='object') # 重置索引 print(t1.reindex(['a','e'])) '''W X Y Z a NaN NaN NaN NaN e NaN NaN NaN NaN''' print(t1.reindex(['A','e'])) '''W X Y Z A 0.0 1.0 2.0 3.0 e NaN NaN NaN NaN''' # 指定某一列作為索引 print(t1.set_index('W')) # 默認刪除W列，以W作為行索引 '''X Y Z W 0 1 2 3 4 5 6 7 8 9 10 11''' print(t1.set_index('W',drop=False)) # 不刪除W列，以W作為行索引 '''W X Y Z W 0 0 1 2 3 4 4 5 6 7 8 8 9 10 11''' # 返回index的唯一值: t.set_index("M").index.unique() print(t1.set_index('W').index) '''Int64Index([0, 4, 8], dtype='int64', name='W')''' print(t1.set_index('W').index.unique()) '''Int64Index([0, 4, 8], dtype='int64', name='W')''' t1.loc['B','W'] = 8 print(t1) '''W X Y Z A 0 1 2 3 B 8 5 6 7 C 8 9 10 11''' print(t1.set_index('W').index.unique()) '''Int64Index([0, 8], dtype='int64', name='W')''' # 設置兩個索引復合索引 t2 = pd.DataFrame(np.arange(12).reshape(3, 4), index=list('ABC'), columns=list('WXYZ')) print(t2) '''W X Y Z A 0 1 2 3 B 4 5 6 7 C 8 9 10 11''' print(t2.set_index(['W','X'])) '''Y Z W X 0 1 2 3 4 5 6 7 8 9 10 11''' print(type(t2.set_index(['W','X']))) # <class 'pandas.core.frame.DataFrame'>

Series復合索引

a = pd.DataFrame({‘a’: range(7),‘b’: range(7, 0, -1),‘c’: [‘one’,‘one’,‘one’,‘two’,‘two’,‘two’, ‘two’],‘d’:
list(“hjklmno”)})

設置c,d為索引

# 索引和復合索引 import numpy as np import pandas as pda = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1), 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'], 'd':list("hjklmno")}) print(type(a)) # <class 'pandas.core.frame.DataFrame'> print(a) '''a b c d 0 0 7 one h 1 1 6 one j 2 2 5 one k 3 3 4 two l 4 4 3 two m 5 5 2 two n 6 6 1 two o''' b = a.set_index(['c','d']) print(b) '''a b c d one h 0 7j 1 6k 2 5 two l 3 4m 4 3n 5 2o 6 1 ''' print(b.loc['one']) '''a b d h 0 7 j 1 6 k 2 5''' print(b.loc['one'].loc['j']) ''' a 1 b 6 Name: j, dtype: int64''' print(b.loc['one'].loc['j']['a']) # 1 c = b['a'] print(type(c)) # <class 'pandas.core.series.Series'> print(c) ''' c d one h 0j 1k 2 two l 3m 4n 5o 6 Name: a, dtype: int64''' print(c['one']['j']) # 1 print(c['one','j']) # 1

設置d,c為索引

# 索引和復合索引 import numpy as np import pandas as pda = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1), 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'], 'd':list("hjklmno")}) print(type(a)) # <class 'pandas.core.frame.DataFrame'> print(a) '''a b c d 0 0 7 one h 1 1 6 one j 2 2 5 one k 3 3 4 two l 4 4 3 two m 5 5 2 two n 6 6 1 two o''' b = a.set_index(['d','c']) print(b) '''a b d c h one 0 7 j one 1 6 k one 2 5 l two 3 4 m two 4 3 n two 5 2 o two 6 1 ''' print(b.loc['j']) '''a b c one 1 6''' print(b.loc['j'].loc['one']) ''' a 1 b 6 Name: one, dtype: int64''' print(b.loc['j'].loc['one']['a']) # 1 print(b.loc['j'].loc['one','a']) # 1 # 復合索引交換 print(b.swaplevel()) '''a b c d one h 0 7j 1 6k 2 5 two l 3 4m 4 3n 5 2o 6 1'''

DateFrame復合索引

# 索引和復合索引 import numpy as np import pandas as pda = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1), 'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'], 'd':list("hjklmno")}) print(type(a)) # <class 'pandas.core.frame.DataFrame'> print(a) '''a b c d 0 0 7 one h 1 1 6 one j 2 2 5 one k 3 3 4 two l 4 4 3 two m 5 5 2 two n 6 6 1 two o''' b = a.set_index(['c','d']) print(b) '''a b c d one h 0 7j 1 6k 2 5 two l 3 4m 4 3n 5 2o 6 1 ''' # 從外層開始取值 print(b.loc['one'].loc['j','b']) # 6 # 從內層開始取值 print(b.swaplevel().loc['j'].loc['one','b']) # 6

5.練習

1.使用matplotlib呈現出店鋪總數排名前10的國家

# # 1.使用matplotlib呈現出店鋪總數排名前10的國家 # sort_values groupby import numpy as np import pandas as pd from matplotlib import pyplot as pltfile_path = './starbucks_store_worldwide.csv' df = pd.read_csv(file_path) print(df.info()) ''' <class 'pandas.core.frame.DataFrame'> RangeIndex: 25600 entries, 0 to 25599 Data columns (total 13 columns):# Column Non-Null Count Dtype --- ------ -------------- ----- 0 Brand 25600 non-null object 1 Store Number 25600 non-null object 2 Store Name 25600 non-null object 3 Ownership Type 25600 non-null object 4 Street Address 25598 non-null object 5 City 25585 non-null object 6 State/Province 25600 non-null object 7 Country 25600 non-null object 8 Postcode 24078 non-null object 9 Phone Number 18739 non-null object 10 Timezone 25600 non-null object 11 Longitude 25599 non-null float6412 Latitude 25599 non-null float64 dtypes: float64(2), object(11) memory usage: 2.5+ MB None''' data = df.groupby(by='Country').count() print(data.head(3)) '''Brand Store Number Store Name ... Timezone Longitude Latitude Country ... AD 1 1 1 ... 1 1 1 AE 144 144 144 ... 144 144 144 AR 108 108 108 ... 108 108 108[3 rows x 12 columns]''' data_sort = data.sort_values(by='Brand', ascending=False)[0:10] # 倒序 print(data_sort['Brand']) # data = df.groupby(by='Country').count()['Brand'].sort_values(ascending=False)[0:10] # print(data) # 這樣寫也可以 ''' Country US 13608 CN 2734 CA 1468 JP 1237 KR 993 GB 901 MX 579 TW 394 TR 326 PH 298 Name: Brand, dtype: int64''' x = data_sort['Brand'].index y = data_sort['Brand'].values# 設置圖片大小 plt.figure(figsize=(15, 8), dpi=80) # 直方圖 plt.bar(x, y) plt.show()

2.使用matplotlib呈現出中國每個城市的店鋪數量

# 2.使用matplotlib呈現出中國每個城市的店鋪數量import pandas as pd from matplotlib import pyplot as plt import matplotlibfont = {'family': 'SimHei','weight': 'bold','size': 12 } matplotlib.rc("font", **font)file_path = './starbucks_store_worldwide.csv' df = pd.read_csv(file_path) # print(df.info()) '''# Column Non-Null Count Dtype --- ------ -------------- ----- 0 Brand 25600 non-null object 1 Store Number 25600 non-null object 2 Store Name 25600 non-null object 3 Ownership Type 25600 non-null object 4 Street Address 25598 non-null object 5 City 25585 non-null object 6 State/Province 25600 non-null object 7 Country 25600 non-null object 8 Postcode 24078 non-null object 9 Phone Number 18739 non-null object 10 Timezone 25600 non-null object 11 Longitude 25599 non-null float6412 Latitude 25599 non-null float64 dtypes: float64(2), object(11) memory usage: 2.5+ MB None''' df = df[df['Country'] == 'CN'] # print(df.head(3)) '''Brand Store Number ... Longitude Latitude 2091 Starbucks 22901-225145 ... 116.32 39.90 2092 Starbucks 32320-116537 ... 116.32 39.97 2093 Starbucks 32447-132306 ... 116.47 39.95[3 rows x 13 columns]''' data = df.groupby(by='City').count()['Brand'].sort_values(ascending=False)[0:30] x = data.index y = data.values# 設置圖片大小 plt.figure(figsize=(15, 8), dpi=80) # 直方圖 plt.bar(x, y) # 設置x軸刻度 plt.xticks(rotation=45) plt.show()

三、pandas中的時間序列

1.時間范圍

時間范圍
pd.date_range(start=None, end=None, periods=None, freq=‘D’)
periods?? ? ? 時間范圍的個數
freq?? ? ? 頻率,以天為單位還是以月為單位
關于頻率的更多縮寫

示例

# 時間序列 import pandas as pdd = pd.date_range(start='20210101', end='20210201') print(d) # freq='D' 頻率默認是天 '''DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04','2021-01-05', '2021-01-06', '2021-01-07', '2021-01-08','2021-01-09', '2021-01-10', '2021-01-11', '2021-01-12','2021-01-13', '2021-01-14', '2021-01-15', '2021-01-16','2021-01-17', '2021-01-18', '2021-01-19', '2021-01-20','2021-01-21', '2021-01-22', '2021-01-23', '2021-01-24','2021-01-25', '2021-01-26', '2021-01-27', '2021-01-28','2021-01-29', '2021-01-30', '2021-01-31', '2021-02-01'],dtype='datetime64[ns]', freq='D')''' m = pd.date_range(start='20210101', end='20211231',freq='M') # 頻率是1月 print(m) ''' DatetimeIndex(['2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30','2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31','2021-09-30', '2021-10-31', '2021-11-30', '2021-12-31'],dtype='datetime64[ns]', freq='M')''' d10 = pd.date_range(start='20210101', end='20210201',freq='10D') print(d10) # freq='10D' 頻率是10天 ''' DatetimeIndex(['2021-01-01', '2021-01-11', '2021-01-21', '2021-01-31'], dtype='datetime64[ns]', freq='10D')''' d3 = pd.date_range(start='20210101', end='20210201',freq='3D') print(d3) # freq='10D' 頻率是3天 ''' DatetimeIndex(['2021-01-01', '2021-01-04', '2021-01-07', '2021-01-10','2021-01-13', '2021-01-16', '2021-01-19', '2021-01-22','2021-01-25', '2021-01-28', '2021-01-31'],dtype='datetime64[ns]', freq='3D')''' d3 = pd.date_range(start='20210101', periods=5, freq='3D') print(d3) # freq='10D' 頻率是3天 periods和end不要同時用 ''' DatetimeIndex(['2021-01-01', '2021-01-04', '2021-01-07', '2021-01-10','2021-01-13'],dtype='datetime64[ns]', freq='3D')'''

2.DataFrame中使用時間序列

index=pd.date_range(“20190101”,periods=10)
df = pd.DataFrame(np.arange(10),index=index) 作為行索引
df = pd.DataFrame(np.arange(10).reshape(1,10),columns=index) 作為列索引

示例

# 使用時間序列 import pandas as pd import numpy as np# 時間序列當行索引 dindex = pd.date_range(start='20210101', periods=10) df = pd.DataFrame(np.arange(10),index=dindex) print(df) '''0 2021-01-01 0 2021-01-02 1 2021-01-03 2 2021-01-04 3 2021-01-05 4 2021-01-06 5 2021-01-07 6 2021-01-08 7 2021-01-09 8 2021-01-10 9 ''' # 時間序列當列索引 d_index = pd.date_range(start='20210101', periods=10) df1 = pd.DataFrame(np.arange(10).reshape(1,10),columns=d_index) print(df1) '''2021-01-01 2021-01-02 2021-01-03 ... 2021-01-08 2021-01-09 2021-01-10 0 0 1 2 ... 7 8 9[1 rows x 10 columns]'''

3.pandas重采樣

重采樣：指的是將時間序列從一個頻率轉化為另一個頻率進行處理的過程，將高頻率數據轉化為低頻率數據為降采樣，低頻率轉化為高頻率為升采樣
pandas提供了一個resample的方法來幫助我們實現頻率轉化
示例

# 使用時間序列 import pandas as pd import numpy as npdindex = pd.date_range(start='2021-01-01',end='2021-02-24') t = pd.DataFrame(np.arange(55).reshape(55,1),index=dindex) print(t) '''0 2021-01-01 0 2021-01-02 1 2021-01-03 2 2021-01-04 3 2021-01-05 4 2021-01-06 5 2021-01-07 6 2021-01-08 7 2021-01-09 8 2021-01-10 9 2021-01-11 10 2021-01-12 11 2021-01-13 12 2021-01-14 13 2021-01-15 14 2021-01-16 15 2021-01-17 16 2021-01-18 17 2021-01-19 18 2021-01-20 19 2021-01-21 20 2021-01-22 21 2021-01-23 22 2021-01-24 23 2021-01-25 24 2021-01-26 25 2021-01-27 26 2021-01-28 27 2021-01-29 28 2021-01-30 29 2021-01-31 30 2021-02-01 31 2021-02-02 32 2021-02-03 33 2021-02-04 34 2021-02-05 35 2021-02-06 36 2021-02-07 37 2021-02-08 38 2021-02-09 39 2021-02-10 40 2021-02-11 41 2021-02-12 42 2021-02-13 43 2021-02-14 44 2021-02-15 45 2021-02-16 46 2021-02-17 47 2021-02-18 48 2021-02-19 49 2021-02-20 50 2021-02-21 51 2021-02-22 52 2021-02-23 53 2021-02-24 54 ''' print(t.resample('M').mean()) '''0 2021-01-31 15.0 2021-02-28 42.5''' print(t.resample('10D').mean()) '''0 2021-01-01 4.5 2021-01-11 14.5 2021-01-21 24.5 2021-01-31 34.5 2021-02-10 44.5 2021-02-20 52.0'''

4.練習

練習1：統計出911數據中不同月份的電話次數

# 練習1：統計出911數據中不同月份的電話次數 import pandas as pd from matplotlib import pyplot as pltfile_path = './911.csv' df = pd.read_csv(file_path) pd.set_option('display.max_rows', 30) pd.set_option('display.max_columns', 9) # print(df.info()) ''' <class 'pandas.core.frame.DataFrame'> RangeIndex: 249737 entries, 0 to 249736 Data columns (total 9 columns):# Column Non-Null Count Dtype --- ------ -------------- ----- 0 lat 249737 non-null float641 lng 249737 non-null float642 desc 249737 non-null object 3 zip 219391 non-null float644 title 249737 non-null object 5 timeStamp 249737 non-null object 6 twp 249644 non-null object 7 addr 249737 non-null object 8 e 249737 non-null int64 dtypes: float64(3), int64(1), object(5) memory usage: 17.1+ MB None ''' # print(df['timeStamp']) '''0 2015-12-10 17:10:52... 249736 2017-09-20 19:42:29 Name: timeStamp, Length: 249737, dtype: object''' df['timeStamp'] = pd.to_datetime(df['timeStamp']) # 把時間字符串轉為索引 # print(df['timeStamp']) ''' 0 2015-12-10 17:10:52... 249736 2017-09-20 19:42:29 Name: timeStamp, Length: 249737, dtype: datetime64[ns]''' df.set_index('timeStamp', inplace=True) # print(df) '''lat lng \ timeStamp 2015-12-10 17:10:52 40.297876 -75.581294 ... ... ... 2017-09-20 19:42:29 40.095206 -75.410735 desc \ timeStamp 2015-12-10 17:10:52 REINDEER CT & DEAD END; NEW HANOVER; Station ... ... ... 2017-09-20 19:42:29 1ST AVE & MOORE RD; UPPER MERION; 2017-09-20 @... zip title twp \ timeStamp 2015-12-10 17:10:52 19525.0 EMS: BACK PAINS/INJURY NEW HANOVER ... ... ... ... 2017-09-20 19:42:29 19406.0 Traffic: VEHICLE ACCIDENT - UPPER MERION addr e timeStamp 2015-12-10 17:10:52 REINDEER CT & DEAD END 1 ... ... .. 2017-09-20 19:42:29 1ST AVE & MOORE RD 1 [249737 rows x 8 columns] ''' count_by_month = df.resample('M').count()['lat'] print(count_by_month) ''' timeStamp 2015-12-31 7916 2016-01-31 13096 2016-02-29 11396 2016-03-31 11059 2016-04-30 11287 2016-05-31 11374 2016-06-30 11732 2016-07-31 12088 2016-08-31 11904 2016-09-30 11669 2016-10-31 12502 2016-11-30 12091 2016-12-31 12162 2017-01-31 11605 2017-02-28 10267 2017-03-31 11684 2017-04-30 11056 2017-05-31 11719 2017-06-30 12333 2017-07-31 11768 2017-08-31 11753 2017-09-30 7276 Freq: M, Name: lat, dtype: int64 ''' # 繪制折線圖分析變化趨勢 x = count_by_month.index y = count_by_month.values x = [i.strftime("%Y%m%d") for i in x] # 轉化為時間日期格式。 plt.figure(figsize=(15, 8), dpi=80) plt.plot(range(len(x)), y) plt.xticks(range(len(x)), x, rotation=45) plt.show()

https://www.kaggle.com/uciml/pm25-data-for-five-chinese-cities

練習2現在我們有北上廣、深圳、和沈陽5個城市空氣質量數據，請繪制出5個城市的PM2.5隨時間的變化情況

數據來源：https://www.kaggle.com/uciml/pm25-data-for-five-chinese-cities
觀察這組數據中的時間結構，并不是字符串，這個時候我們應該怎么辦？

之前所學習的DatetimeIndex可以理解為時間戳
那么現在我們要學習的PeriodIndex可以理解為時間段
periods = pd.PeriodIndex(year=df[“year”],month=df[“month”],day=df[“day”],hour=df[“hour”],freq=“H”)
那么如果給這個時間段降采樣呢？
data = df.set_index(periods).resample(“10D”).mean()

# 現在我們有北上廣、深圳、和沈陽5個城市空氣質量數據，請繪制出5個城市的PM2.5隨時間的變化情況 import pandas as pd from matplotlib import pyplot as pltdf = pd.read_csv('./PM2.5/BeijingPM20100101_20151231.csv') # print(df.info()) ''' <class 'pandas.core.frame.DataFrame'> RangeIndex: 52584 entries, 0 to 52583 Data columns (total 18 columns):# Column Non-Null Count Dtype --- ------ -------------- ----- 0 No 52584 non-null int64 1 year 52584 non-null int64 2 month 52584 non-null int64 3 day 52584 non-null int64 4 hour 52584 non-null int64 5 season 52584 non-null int64 6 PM_Dongsi 25052 non-null float647 PM_Dongsihuan 20508 non-null float648 PM_Nongzhanguan 24931 non-null float649 PM_US Post 50387 non-null float6410 DEWP 52579 non-null float6411 HUMI 52245 non-null float6412 PRES 52245 non-null float6413 TEMP 52579 non-null float6414 cbwd 52579 non-null object 15 Iws 52579 non-null float6416 precipitation 52100 non-null float6417 Iprec 52100 non-null float64 dtypes: float64(11), int64(6), object(1) memory usage: 7.2+ MB None''' pd.set_option('display.max_rows', 30) pd.set_option('display.max_columns', 18) # print(df.head()) '''No year month day hour season PM_Dongsi PM_Dongsihuan \ 0 1 2010 1 1 0 4 NaN NaN 1 2 2010 1 1 1 4 NaN NaN 2 3 2010 1 1 2 4 NaN NaN 3 4 2010 1 1 3 4 NaN NaN 4 5 2010 1 1 4 4 NaN NaN PM_Nongzhanguan PM_US Post DEWP HUMI PRES TEMP cbwd Iws \ 0 NaN NaN -21.0 43.0 1021.0 -11.0 NW 1.79 1 NaN NaN -21.0 47.0 1020.0 -12.0 NW 4.92 2 NaN NaN -21.0 43.0 1019.0 -11.0 NW 6.71 3 NaN NaN -21.0 55.0 1019.0 -14.0 NW 9.84 4 NaN NaN -20.0 51.0 1018.0 -12.0 NW 12.97 precipitation Iprec 0 0.0 0.0 1 0.0 0.0 2 0.0 0.0 3 0.0 0.0 4 0.0 0.0 ''' # pd.PeriodIndex 把分開的時間字符串通過PeriodIndex的方法轉換為pandas的時間類型 periods = pd.PeriodIndex(year=df["year"], month=df["month"], day=df["day"], hour=df["hour"], freq="H") # print(periods) ''' PeriodIndex(['2010-01-01 00:00', '2010-01-01 01:00', '2010-01-01 02:00','2010-01-01 03:00', '2010-01-01 04:00', '2010-01-01 05:00','2010-01-01 06:00', '2010-01-01 07:00', '2010-01-01 08:00','2010-01-01 09:00',...'2015-12-31 14:00', '2015-12-31 15:00', '2015-12-31 16:00','2015-12-31 17:00', '2015-12-31 18:00', '2015-12-31 19:00','2015-12-31 20:00', '2015-12-31 21:00', '2015-12-31 22:00','2015-12-31 23:00'],dtype='period[H]', length=52584, freq='H') ''' df['datetime'] = periods df.set_index('datetime',inplace=True) # print(df.head()) '''No year month day hour season PM_Dongsi \ datetime 2010-01-01 00:00 1 2010 1 1 0 4 NaN 2010-01-01 01:00 2 2010 1 1 1 4 NaN 2010-01-01 02:00 3 2010 1 1 2 4 NaN 2010-01-01 03:00 4 2010 1 1 3 4 NaN 2010-01-01 04:00 5 2010 1 1 4 4 NaN PM_Dongsihuan PM_Nongzhanguan PM_US Post DEWP HUMI \ datetime 2010-01-01 00:00 NaN NaN NaN -21.0 43.0 2010-01-01 01:00 NaN NaN NaN -21.0 47.0 2010-01-01 02:00 NaN NaN NaN -21.0 43.0 2010-01-01 03:00 NaN NaN NaN -21.0 55.0 2010-01-01 04:00 NaN NaN NaN -20.0 51.0 PRES TEMP cbwd Iws precipitation Iprec datetime 2010-01-01 00:00 1021.0 -11.0 NW 1.79 0.0 0.0 2010-01-01 01:00 1020.0 -12.0 NW 4.92 0.0 0.0 2010-01-01 02:00 1019.0 -11.0 NW 6.71 0.0 0.0 2010-01-01 03:00 1019.0 -14.0 NW 9.84 0.0 0.0 2010-01-01 04:00 1018.0 -12.0 NW 12.97 0.0 0.0 ''' # 進行降采樣 df = df.resample('10D').mean() # 處理缺失數據 data = df['PM_US Post'].dropna() # print(data) ''' datetime 2010-01-01 23:00 129.0 2010-01-02 00:00 148.0 2010-01-02 01:00 159.0 2010-01-02 02:00 181.0 2010-01-02 03:00 138.0... 2015-12-31 19:00 133.0 2015-12-31 20:00 169.0 2015-12-31 21:00 203.0 2015-12-31 22:00 212.0 2015-12-31 23:00 235.0 Freq: H, Name: PM_US Post, Length: 50387, dtype: float64 ''' x = data.index y = data.values plt.figure(figsize=(15,8),dpi=80) plt.plot(range(len(x)),y) plt.xticks(range(0,len(x),10),list(x)[::10],rotation=45) plt.show()

四、pandas畫圖

1.折線圖

# pandas繪圖 from matplotlib import pyplot as plt import pandas as pdiris_data = pd.read_csv('iris.csv') # print(iris_data.info()) ''' <class 'pandas.core.frame.DataFrame'> RangeIndex: 150 entries, 0 to 149 Data columns (total 5 columns):# Column Non-Null Count Dtype --- ------ -------------- ----- 0 SepalLength 150 non-null float641 SepalWidth 150 non-null float642 PetalLength 150 non-null float643 PetalWidth 150 non-null float644 Name 150 non-null object dtypes: float64(4), object(1) memory usage: 6.0+ KB None ''' # 折線圖 iris_data.plot() plt.show()

2.分組柱狀圖

# pandas繪圖 from matplotlib import pyplot as plt import pandas as pdiris_data = pd.read_csv('iris.csv') # print(iris_data.info())# 分組柱狀圖 iris_data.groupby('Name').mean().plot(kind='bar') plt.show()

2.餅圖

# pandas繪圖 from matplotlib import pyplot as plt import pandas as pdiris_data = pd.read_csv('iris.csv') # print(iris_data.info())# 餅圖 iris_data.groupby('Name').size().plot(kind='pie',autopct='%.2f%%') plt.show()

總結

以上是生活随笔為你收集整理的数据分析第七讲 pandas练习数据的合并、分组聚合、时间序列、pandas绘图的全部內容，希望文章能夠幫你解決所遇到的問題。

如果覺得生活随笔網站內容還不錯，歡迎將生活随笔推薦給好友。

上一篇：旧金山州立大学计算机科学专业,美国留学旧
下一篇： wap网站服务器要求,使您的WEB服务器