pandasStudyNoteBook

Published: 2023/12/13

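pandas introductory training

Introduction to pandas

- Official site: http://pandas.pydata.org/
- pandas = panel data + data analysis
- pandas is a data analysis package for Python. It was originally developed as a tool for financial data analysis, which is why it offers strong support for time series analysis.

Key features

- Data structures with labeled axes that support automatic or explicit data alignment
- Integrated time series functionality
- The same data structures handle both time series and non-time-series data
- Arithmetic operations and reductions (such as summing along an axis) are carried out according to the metadata (axis labels)
- Flexible handling of missing data
- Merge and other relational operations found in common databases (SQL-based ones, for example)

Data structures

Data structures: Series

- A Series is a one-dimensional array-like object made up of a set of data (of any NumPy data type) and an associated set of data labels (the index).
- The string representation of a Series shows the index on the left and the values on the right.

Code:

- Creating a Series
- From a list
- From a dict
- Reading and writing a Series
- Series arithmetic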

# -*- coding: utf-8 -*-
from pandas import Series
# from __future__ import print_function

print 'Build a Series from an array'
obj = Series([4, 7, -5, 3])  # build a Series from a list
print obj
print obj.values
print obj.index
print

print 'Specify the index of a Series'
obj2 = Series([4, 7, -5, 3], index = ['d', 'b', 'a', 'c'])  # the index keyword sets the labels
print obj2
print obj2.index
print obj2['a']
obj2['d'] = 100  # change an element through its label
print obj2[['c', 'a', 'd']]  # the label list sets the output order
print obj2[obj2 > 0]  # elements greater than 0
print 'b' in obj2  # check whether a label exists
print 'e' in obj2
print

print 'Build a Series from a dict'
sdata = {'Ohio':10000, 'Texas':20000, 'Oregon':16000, 'Utah':5000}
obj3 = Series(sdata)  # build a Series from a dict
print obj3
print

print 'Build a Series from a dict with an explicit index: unmatched labels become NaN, unlisted keys are dropped'
states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = Series(sdata, index = states)  # index selects the labels to keep
print obj4
print

print 'Adding Series: matching labels are added, the others become NaN; the result index is the union'
print obj3 + obj4
print

print 'Name a Series and its index'
obj4.name = 'population'  # name of the Series
obj4.index.name = 'state'  # name of the row index
print obj4
print

print 'Replace the index'
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
print obj

Output:

Build a Series from an array
0    4
1    7
2   -5
3    3
dtype: int64
[ 4  7 -5  3]
RangeIndex(start=0, stop=4, step=1)

Specify the index of a Series
d    4
b    7
a   -5
c    3
dtype: int64
Index([u'd', u'b', u'a', u'c'], dtype='object')
-5
c      3
a     -5
d    100
dtype: int64
d    100
b      7
c      3
dtype: int64
True
False

Build a Series from a dict
Ohio      10000
Oregon    16000
Texas     20000
Utah       5000
dtype: int64

Build a Series from a dict with an explicit index: unmatched labels become NaN, unlisted keys are dropped
California        NaN
Ohio          10000.0
Oregon        16000.0
Texas         20000.0
dtype: float64

Adding Series: matching labels are added, the others become NaN; the result index is the union
California        NaN
Ohio          20000.0
Oregon        32000.0
Texas         40000.0
Utah              NaN
dtype: float64

Name a Series and its index
state
California        NaN
Ohio          10000.0
Oregon        16000.0
Texas         20000.0
Name: population, dtype: float64

Replace the index
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64
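The label alignment shown above can be condensed into a few lines. This is a minimal sketch that assumes Python 3 and a current pandas release (the original cell ran on Python 2.7); the variable `total` is introduced here for illustration only.

```python
import pandas as pd

# Build a Series from a dict; the keys become the index labels.
sdata = {"Ohio": 10000, "Texas": 20000, "Oregon": 16000, "Utah": 5000}
obj3 = pd.Series(sdata)

# An explicit index keeps only the listed labels; unknown labels become NaN.
states = ["California", "Ohio", "Oregon", "Texas"]
obj4 = pd.Series(sdata, index=states)

# Arithmetic aligns on labels; labels present on one side only yield NaN.
total = obj3 + obj4
print(total)
```

Because the result index is the union of both indexes, `Utah` and `California` both appear in `total`, each with a missing value.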

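Data structures: DataFrame

- A DataFrame is a tabular data structure holding an ordered collection of columns, each of which can have a different value type (numeric, string, boolean, and so on).
- A DataFrame has both a row index and a column index; it can be viewed as a dict of Series that all share the same index.
- Data that can be passed to the DataFrame constructor

Code:

- Creation
- Reading and writing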

# -*- coding: utf-8 -*-
import numpy as np
from pandas import Series, DataFrame

print 'Build a DataFrame from a dict; the keys become the column names.'
data = {'state':['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],  # dict keys become the column labels
        'year':[2000, 2001, 2002, 2001, 2002],
        'pop':[1.5, 1.7, 3.6, 2.4, 2.9]}
print DataFrame(data)
print DataFrame(data, columns = ['year', 'state', 'pop'])  # set the column order (columns: columns, index: rows)
print

print 'Specify the index; a column missing from the data is filled with NaN.'
frame2 = DataFrame(data,
                   columns = ['year', 'state', 'pop', 'debt'],  # column labels
                   index = ['one', 'two', 'three', 'four', 'five'])  # row labels
print frame2
print frame2['state']  # the 'state' column
print frame2.year  # the 'year' column
print frame2.ix['three']  # .ix selects by row label
frame2['debt'] = 16.5  # assign a whole column
print frame2
frame2.debt = np.arange(5)  # assign from a NumPy array
print frame2
print

print 'Assign a Series: values are matched by label, unspecified labels get NaN.'
val = Series([-1.2, -1.5, -1.7], index = ['two', 'four', 'five'])  # rows two, four and five get new values; one and three become NaN
frame2['debt'] = val
print frame2
print

print 'Assign a new column'
frame2['eastern'] = (frame2.state == 'Ohio')  # new column, True where state equals Ohio
print frame2
print frame2.columns
print

print 'DataFrame transpose'
pop = {'Nevada':{2001:12.4, 2002:2.9},
       'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}
frame3 = DataFrame(pop)  # build a DataFrame from a dict of dicts
print "frame3"
print frame3
print frame3.T
print

print 'Specify the index order, and initialize from slices.'
print DataFrame(pop, index = [2001, 2002, 2003])
pdata = {'Ohio':frame3['Ohio'][:-1], 'Nevada':frame3['Nevada'][:2]}
print DataFrame(pdata)
print

print 'Name the index and the columns'
frame3.index.name = 'year'
frame3.columns.name = 'state'
print frame3
print frame3.values
print frame2.values

Output:

Build a DataFrame from a dict; the keys become the column names.
   pop   state  year
0  1.5    Ohio  2000
1  1.7    Ohio  2001
2  3.6    Ohio  2002
3  2.4  Nevada  2001
4  2.9  Nevada  2002
   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9

Specify the index; a column missing from the data is filled with NaN.
       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN
one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object
one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64
year     2002
state    Ohio
pop       3.6
debt      NaN
Name: three, dtype: object
       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5
       year   state  pop  debt
one    2000    Ohio  1.5     0
two    2001    Ohio  1.7     1
three  2002    Ohio  3.6     2
four   2001  Nevada  2.4     3
five   2002  Nevada  2.9     4

Assign a Series: values are matched by label, unspecified labels get NaN.
       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7

Assign a new column
       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False
Index([u'year', u'state', u'pop', u'debt', u'eastern'], dtype='object')

DataFrame transpose
frame3
      Nevada  Ohio
2000     NaN   1.5
2001    12.4   1.7
2002     2.9   3.6
        2000  2001  2002
Nevada   NaN  12.4   2.9
Ohio     1.5   1.7   3.6

Specify the index order, and initialize from slices.
      Nevada  Ohio
2001    12.4   1.7
2002     2.9   3.6
2003     NaN   NaN
      Nevada  Ohio
2000     NaN   1.5
2001    12.4   1.7

Name the index and the columns
state  Nevada  Ohio
year
2000      NaN   1.5
2001     12.4   1.7
2002      2.9   3.6
[[ nan  1.5]
 [12.4  1.7]
 [ 2.9  3.6]]
[[2000 'Ohio' 1.5 nan True]
 [2001 'Ohio' 1.7 -1.2 True]
 [2002 'Ohio' 3.6 nan True]
 [2001 'Nevada' 2.4 -1.5 False]
 [2002 'Nevada' 2.9 -1.7 False]]
/Users/robot1/wfy/soft/anconda/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:22: DeprecationWarning:
.ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
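The deprecation warning at the end of the output recommends `.loc` and `.iloc` instead of `.ix`. The following is a small sketch of that replacement, assuming Python 3 and a pandas release in which `.ix` is no longer available; `row`, `same_row` and `val` are illustrative names.

```python
import pandas as pd

data = {"state": ["Ohio", "Ohio", "Ohio", "Nevada", "Nevada"],
        "year": [2000, 2001, 2002, 2001, 2002],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9]}
frame2 = pd.DataFrame(data,
                      columns=["year", "state", "pop", "debt"],
                      index=["one", "two", "three", "four", "five"])

row = frame2.loc["three"]    # label-based lookup, stands in for frame2.ix['three']
same_row = frame2.iloc[2]    # position-based lookup of the same row

# Assigning a Series aligns on the row labels; unlisted rows become NaN.
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
frame2["debt"] = val
print(frame2)
```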

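Data structures: Index objects

- pandas Index objects hold the axis labels and other metadata (such as axis names). Any array or other sequence of labels used when constructing a Series or DataFrame is converted to an Index.
- Index objects are immutable, so users cannot modify them. Immutability matters because it lets Index objects be shared safely among several data structures.

- Main Index objects in pandas

- Index methods and properties I

- Index methods and properties II

Code: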

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import sys
from pandas import Series, DataFrame, Index

print 'Get the index'
obj = Series(range(3), index = ['a', 'b', 'c'])
index = obj.index  # the row index of the Series
print index[1:]
try:
    index[1] = 'd'  # Index objects are read-only, so assignment fails
except:
    print sys.exc_info()[0]
print

print 'Use an Index object'
index = Index(np.arange(3))  # build a row index
obj2 = Series([1.5, -2.5, 0], index = index)
print obj2
print obj2.index is index
print

print 'Check whether a column or index label exists'
pop = {'Nevada':{20001:2.4, 2002:2.9},
       'Ohio':{2000:1.5, 2001:1.7, 2002:3.6}}
frame3 = DataFrame(pop)
print frame3
print 'Ohio' in frame3.columns  # is it a column label?
print '2003' in frame3.index  # is it a row label?

Output:

Get the index
Index([u'b', u'c'], dtype='object')
<type 'exceptions.TypeError'>

Use an Index object
0    1.5
1   -2.5
2    0.0
dtype: float64
True

Check whether a column or index label exists
       Nevada  Ohio
2000      NaN   1.5
2001      NaN   1.7
2002      2.9   3.6
20001     2.4   NaN
True
False
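Since an Index cannot be modified in place, changing a label means producing a new object. A minimal sketch, assuming Python 3 and a current pandas release; the names `mutated` and `renamed` are hypothetical.

```python
import pandas as pd

obj = pd.Series(range(3), index=["a", "b", "c"])
index = obj.index

# In-place assignment on an Index raises TypeError.
try:
    index[1] = "d"
    mutated = True
except TypeError as exc:
    mutated = False
    print("Index is read-only:", exc)

# rename returns a new Series with a new Index; obj itself is untouched.
renamed = obj.rename({"b": "d"})
print(renamed.index)
```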

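Basic functionality

Basic functionality: reindexing

- reindex creates a new object conformed to a new index: the Series is rearranged according to the new index, and a missing value is introduced for any label that is not already present.
- For ordered data such as time series, some filling or interpolation may be needed when reindexing; the method option does this.

- Arguments of the reindex function

Code: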

# -*- coding: utf-8 -*-
import numpy as np
from pandas import DataFrame, Series

print 'Reindex with a new order'
obj = Series([4.5, 7.2, -5.3, 3.6], index = ['d', 'b', 'a', 'c'])
print obj
obj2 = obj.reindex(['a', 'b', 'd', 'c', 'e'])  # missing labels are filled with NaN by default
print obj2
print obj.reindex(['a', 'b', 'd', 'c', 'e'], fill_value = 0)  # fill value for missing labels
print

print 'Reindex with a fill method'
obj3 = Series(['blue', 'purple', 'yellow'], index = [0, 2, 4])
print obj3
print obj3.reindex(range(6), method = 'ffill')  # fill from the previous value
print

print 'Reindex a DataFrame'
frame = DataFrame(np.arange(9).reshape(3, 3),
                  index = ['a', 'c', 'd'],
                  columns = ['Ohio', 'Texas', 'California'])
print frame
frame2 = frame.reindex(['a', 'b', 'c', 'd'])  # rows are reindexed by default
print frame2
print

print 'Reindex the columns'
states = ['Texas', 'Utah', 'California']
print frame.reindex(columns = states)  # set the column order
print frame

print 'Reindex a DataFrame with a fill method'
print frame.reindex(index = ['a', 'b', 'c', 'd'], method = 'ffill')  # columns = states)
print frame.ix[['a', 'b', 'd', 'c'], states]  # .ix selects the rows by label, plus the listed columns

Output:

Reindex with a new order
d    4.5
b    7.2
a   -5.3
c    3.6
dtype: float64
a   -5.3
b    7.2
d    4.5
c    3.6
e    NaN
dtype: float64
a   -5.3
b    7.2
d    4.5
c    3.6
e    0.0
dtype: float64

Reindex with a fill method
0      blue
2    purple
4    yellow
dtype: object
0      blue
1      blue
2    purple
3    purple
4    yellow
5    yellow
dtype: object

Reindex a DataFrame
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
   Ohio  Texas  California
a   0.0    1.0         2.0
b   NaN    NaN         NaN
c   3.0    4.0         5.0
d   6.0    7.0         8.0

Reindex the columns
   Texas  Utah  California
a      1   NaN           2
c      4   NaN           5
d      7   NaN           8
   Ohio  Texas  California
a     0      1           2
c     3      4           5
d     6      7           8
Reindex a DataFrame with a fill method
   Ohio  Texas  California
a     0      1           2
b     0      1           2
c     3      4           5
d     6      7           8
   Texas  Utah  California
a    1.0   NaN         2.0
b    NaN   NaN         NaN
d    7.0   NaN         8.0
c    4.0   NaN         5.0
/Users/robot1/wfy/soft/anconda/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:38: DeprecationWarning:
.ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
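The last call in the cell above relies on `.ix` to reindex rows and columns at once. One way to get a similar result without `.ix`, sketched under the assumption of Python 3 and a current pandas release, is to pass both `index` and `columns` to `reindex`; `filled` and `sub` are illustrative names.

```python
import numpy as np
import pandas as pd

# Forward-fill while reindexing an ordered Series.
obj3 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])
filled = obj3.reindex(range(6), method="ffill")

frame = pd.DataFrame(np.arange(9).reshape(3, 3),
                     index=["a", "c", "d"],
                     columns=["Ohio", "Texas", "California"])
states = ["Texas", "Utah", "California"]

# Reindex rows and columns in one call; new labels are filled with NaN.
sub = frame.reindex(index=["a", "b", "c", "d"], columns=states)
print(filled)
print(sub)
```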

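Basic functionality: dropping entries from an axis

- Dropping one or more entries from an axis is easy given an index array or list. Because this takes some data munging and set logic, the drop method returns a new object with the specified values removed from the given axis.

Code: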

# -*- coding: utf-8 -*-
import numpy as np
from pandas import Series, DataFrame

# print 'Drop Series elements by label'
# obj = Series(np.arange(5.), index = ['a', 'b', 'c', 'd', 'e'])
# new_obj = obj.drop('c')  # drop one row by its label
# print new_obj
# obj = obj.drop(['d', 'c'])
# print obj
# print

print 'Drop from a DataFrame; rows or columns can be specified.'
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index = ['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns = ['one', 'two', 'three', 'four'])
print data
print data.drop(['Colorado', 'Ohio'])
print data.drop('two', axis = 1)  # axis = 1 drops a column
print data.drop(['two', 'four'], axis = 1)

Output:

Drop from a DataFrame; rows or columns can be specified.
          one  two  three  four
Ohio        0    1      2     3
Colorado    4    5      6     7
Utah        8    9     10    11
New York   12   13     14    15
          one  two  three  four
Utah        8    9     10    11
New York   12   13     14    15
          one  three  four
Ohio        0      2     3
Colorado    4      6     7
Utah        8     10    11
New York   12     14    15
          one  three
Ohio        0      2
Colorado    4      6
Utah        8     10
New York   12     14
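A short sketch of the same idea, assuming Python 3 and a current pandas release. It uses the `columns=` keyword of `drop` as an alternative spelling of `axis=1`; `rows_dropped` and `cols_dropped` are illustrative names.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

rows_dropped = data.drop(["Colorado", "Ohio"])     # drop rows by label
cols_dropped = data.drop(columns=["two", "four"])  # drop columns by label

# drop returns new objects; the original frame keeps its shape.
print(rows_dropped)
print(cols_dropped)
print(data.shape)
```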

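Basic functionality: indexing, selection and filtering

- Series indexing (obj[…]) works like NumPy array indexing, except that the index values of a Series are not limited to integers.
- Slicing with labels differs from ordinary Python slicing: the endpoint is inclusive, so the interval is closed.
- Indexing into a DataFrame retrieves one or more columns.
- For label-based indexing on the rows of a DataFrame, the special indexing field ix was introduced (later deprecated in favor of .loc and .iloc, as the warnings in the outputs of these notes show).

- Indexing options for DataFrame

Code:

  • Indexing with a list
  • Indexing with a slice
  • Row/column indexing
  • Conditional indexing

# -*- coding: utf-8 -*-

import numpy as np
from pandas import Series, DataFrame

print 'Series indexing; default integer positions also work.'
obj = Series(np.arange(4.), index = ['a', 'b', 'c', 'd'])
print obj['b']
print obj[3]
print obj[[1, 3]]  # index with a list (not a tuple) to select obj[1] and obj[3]
print obj[obj < 2]  # elements of obj smaller than 2
print

print 'Series slicing'
print obj['b':'d']  # closed interval [b, d]
obj['b':'c'] = 5
print obj
print

print 'DataFrame indexing'
data = DataFrame(np.arange(16).reshape((4, 4)),
                 index = ['Ohio', 'Colorado', 'Utah', 'New York'],
                 columns = ['one', 'two', 'three', 'four'])
print data
print data['two']  # plain [] indexing selects columns by default
print data[['three', 'one']]  # index with a list of columns
print data[:2]
print data.ix['Colorado', ['two', 'three']]  # row label plus column labels through .ix
print data.ix[['Colorado', 'Utah'], [3, 0, 1]]
print data.ix[2]  # row at position 2 (counting from 0)
print data.ix[:'Utah', 'two']  # rows from the start through Utah, column 'two'
print

print 'Select by condition'
print data[data.three > 5]
print data < 5  # prints True or False for each element
data[data < 5] = 0
print data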
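The `.ix` selections above mix labels and positions. Below is a hedged sketch of the same selections split into label-based `.loc` and position-based `.iloc`, assuming Python 3 and a current pandas release. The `mask` call is a non-mutating stand-in for `data[data < 5] = 0`, and all result names are illustrative.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=["Ohio", "Colorado", "Utah", "New York"],
                    columns=["one", "two", "three", "four"])

by_label = data.loc["Colorado", ["two", "three"]]   # row label, column labels
by_position = data.iloc[[1, 2], [3, 0, 1]]           # row and column positions
third_row = data.iloc[2]                             # row at position 2
label_slice = data.loc[:"Utah", "two"]               # label slices include the endpoint

# Replace every value below 5 with 0 without modifying data in place.
masked = data.mask(data < 5, 0)
print(masked)
```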

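Basic functionality: arithmetic and data alignment

- Arithmetic between objects with different indexes
- Automatic data alignment introduces NA values at labels that do not overlap; missing values then propagate through arithmetic operations.
- For a DataFrame, alignment happens on both the rows and the columns.
- The fill_value argument
- Operations between DataFrame and Series

Code: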

# -*- coding: utf-8 -*-
import numpy as np
from pandas import Series, DataFrame

print 'Addition'
s1 = Series([7.3, -2.5, 3.4, 1.5], index = ['a', 'c', 'd', 'e'])
s2 = Series([-2.1, 3.6, -1.5, 4, 3.1], index = ['a', 'c', 'e', 'f', 'g'])
print s1
print s2
print s1 + s2  # matching labels are added, the others become NaN; the result index is the union of both indexes
print

print 'DataFrame addition: both index and columns must match.'
df1 = DataFrame(np.arange(9.).reshape((3, 3)),
                columns = list('bcd'),
                index = ['Ohio', 'Texas', 'Colorado'])
df2 = DataFrame(np.arange(12).reshape((4, 3)),
                columns = list('bde'),
                index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
print df1
print df2
print df1 + df2  # DataFrame addition aligns on rows and columns: matching labels are added, the others become NaN
print

print 'Filling values'
df1 = DataFrame(np.arange(12.).reshape((3, 4)), columns = list('abcd'))
df2 = DataFrame(np.arange(20.).reshape((4, 5)), columns = list('abcde'))
print df1
print df2
print 'df1 + df2'
print df1 + df2
print df1.add(df2, fill_value = 0)  # add with fill_value gives a different result from the + operator
print df1.reindex(columns = df2.columns, fill_value = 0)  # reindex df1 with the columns of df2, filling missing ones with 0
print

print 'Operations between DataFrame and Series'
arr = np.arange(12.).reshape((3, 4))
print arr
print arr[0]
print arr - arr[0]
frame = DataFrame(np.arange(12).reshape((4, 3)),
                  columns = list('bde'),
                  index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
series = frame.ix[0]
print frame
print series
print frame - series  # treat the Series as a one-row DataFrame and apply the DataFrame subtraction rule to every row
series2 = Series(range(3), index = list('bef'))
print frame + series2
series3 = frame['d']
print frame.sub(series3, axis = 0)  # subtract a column, matching on the row index

Output:

Addition
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64
a   -2.1
c    3.6
e   -1.5
f    4.0
g    3.1
dtype: float64
a    5.2
c    1.1
d    NaN
e    0.0
f    NaN
g    NaN
dtype: float64

DataFrame addition: both index and columns must match.
            b    c    d
Ohio      0.0  1.0  2.0
Texas     3.0  4.0  5.0
Colorado  6.0  7.0  8.0
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11
            b   c     d   e
Colorado  NaN NaN   NaN NaN
Ohio      3.0 NaN   6.0 NaN
Oregon    NaN NaN   NaN NaN
Texas     9.0 NaN  12.0 NaN
Utah      NaN NaN   NaN NaN

Filling values
     a    b     c     d
0  0.0  1.0   2.0   3.0
1  4.0  5.0   6.0   7.0
2  8.0  9.0  10.0  11.0
      a     b     c     d     e
0   0.0   1.0   2.0   3.0   4.0
1   5.0   6.0   7.0   8.0   9.0
2  10.0  11.0  12.0  13.0  14.0
3  15.0  16.0  17.0  18.0  19.0
df1 + df2
      a     b     c     d   e
0   0.0   2.0   4.0   6.0 NaN
1   9.0  11.0  13.0  15.0 NaN
2  18.0  20.0  22.0  24.0 NaN
3   NaN   NaN   NaN   NaN NaN
      a     b     c     d     e
0   0.0   2.0   4.0   6.0   4.0
1   9.0  11.0  13.0  15.0   9.0
2  18.0  20.0  22.0  24.0  14.0
3  15.0  16.0  17.0  18.0  19.0
     a    b     c     d  e
0  0.0  1.0   2.0   3.0  0
1  4.0  5.0   6.0   7.0  0
2  8.0  9.0  10.0  11.0  0

Operations between DataFrame and Series
[[ 0.  1.  2.  3.]
 [ 4.  5.  6.  7.]
 [ 8.  9. 10. 11.]]
[0. 1. 2. 3.]
[[0. 0. 0. 0.]
 [4. 4. 4. 4.]
 [8. 8. 8. 8.]]
        b   d   e
Utah    0   1   2
Ohio    3   4   5
Texas   6   7   8
Oregon  9  10  11
b    0
d    1
e    2
Name: Utah, dtype: int64
        b  d  e
Utah    0  0  0
Ohio    3  3  3
Texas   6  6  6
Oregon  9  9  9
          b   d     e   f
Utah    0.0 NaN   3.0 NaN
Ohio    3.0 NaN   6.0 NaN
Texas   6.0 NaN   9.0 NaN
Oregon  9.0 NaN  12.0 NaN
        b  d  e
Utah   -1  0  1
Ohio   -1  0  1
Texas  -1  0  1
Oregon -1  0  1
/Users/robot1/wfy/soft/anconda/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:45: DeprecationWarning:
.ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
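The difference between `+` and `add(..., fill_value=0)`, and the two broadcasting directions for a Series, can be checked in a compact sketch. It assumes Python 3 and a current pandas release, and uses `.iloc[0]` where the original used `.ix[0]`; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list("abcd"))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list("abcde"))

plain = df1 + df2                    # non-overlapping cells become NaN
filled = df1.add(df2, fill_value=0)  # a value missing on one side is treated as 0

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     columns=list("bde"),
                     index=["Utah", "Ohio", "Texas", "Oregon"])

row_diff = frame - frame.iloc[0]          # Series matched on columns, applied to every row
col_diff = frame.sub(frame["d"], axis=0)  # Series matched on the row index instead
print(filled)
print(col_diff)
```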

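Basic functionality: function application and mapping

- NumPy ufuncs (element-wise array methods)
- The apply method of DataFrame
- The applymap method of DataFrame (so named because Series has an element-wise map method)
- Every NumPy function that operates element-wise can also be applied to a pandas DataFrame

Code: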

# -*- coding: utf-8 -*-
import numpy as np
from pandas import Series, DataFrame

print 'Functions'
frame = DataFrame(np.random.randn(4, 3),
                  columns = list('bde'),
                  index = ['Utah', 'Ohio', 'Texas', 'Oregon'])
print frame
print np.abs(frame)  # absolute value of every element of the DataFrame
print

print 'lambda and apply'
f = lambda x: x.max() - x.min()
print frame.apply(f)  # applied to each column by default
print frame.apply(f, axis = 1)  # applied to each row

def f(x):
    return Series([x.min(), x.max()], index = ['min', 'max'])
print frame.apply(f)
print

print 'applymap and map'
_format = lambda x: '%.2f' % x
print frame.applymap(_format)
print frame['e'].map(_format)

Output:

Functions
               b         d         e
Utah   -0.188935  0.298682  1.692648
Ohio   -0.666434 -0.102262 -0.172966
Texas  -1.103831 -1.324074 -1.024516
Oregon  1.354406 -0.564374 -0.967438
               b         d         e
Utah    0.188935  0.298682  1.692648
Ohio    0.666434  0.102262  0.172966
Texas   1.103831  1.324074  1.024516
Oregon  1.354406  0.564374  0.967438

lambda and apply
b    2.458237
d    1.622756
e    2.717164
dtype: float64
Utah      1.881583
Ohio      0.564172
Texas     0.299558
Oregon    2.321844
dtype: float64
            b         d         e
min -1.103831 -1.324074 -1.024516
max  1.354406  0.298682  1.692648

applymap and map
            b      d      e
Utah    -0.19   0.30   1.69
Ohio    -0.67  -0.10  -0.17
Texas   -1.10  -1.32  -1.02
Oregon   1.35  -0.56  -0.97
Utah       1.69
Ohio      -0.17
Texas     -1.02
Oregon    -0.97
Name: e, dtype: object
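A deterministic variant of the cell above makes the results easy to verify by hand. This is a sketch assuming Python 3 and a current pandas release; the small fixed frame and the names `spread`, `per_column`, `per_row` and `formatted` are made up for illustration. Element-wise formatting is done with `apply` plus `Series.map`, which sidesteps `applymap` (renamed to `DataFrame.map` in newer pandas releases).

```python
import pandas as pd

frame = pd.DataFrame([[1.0, -2.0], [3.5, 0.25]],
                     columns=["b", "d"],
                     index=["Utah", "Ohio"])

spread = lambda x: x.max() - x.min()

per_column = frame.apply(spread)      # default: the function receives each column
per_row = frame.apply(spread, axis=1)  # axis=1: the function receives each row

# Element-wise formatting: map every value of every column to a string.
formatted = frame.apply(lambda col: col.map("{:.2f}".format))
print(per_column)
print(per_row)
print(formatted)
```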

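Basic functionality: sorting and ranking

- Sorting by row or column index
- For a DataFrame, sorting by the index on either axis
- Ascending or descending order can be chosen
- Sorting by value
- For a DataFrame, the columns to sort by can be specified
- The rank function

Code: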

# -*- coding: utf-8 -*-
import numpy as np
from pandas import Series, DataFrame

print 'Sort by index; for a DataFrame the axis can be specified.'
obj = Series(range(4), index = ['d', 'a', 'b', 'c'])
print obj.sort_index()  # sort by index
frame = DataFrame(np.arange(8).reshape((2, 4)),
                  index = ['three', 'one'],
                  columns = list('dabc'))
print frame.sort_index()  # the row index is sorted by default
print frame.sort_index(axis = 1)  # sort the column index
print frame.sort_index(axis = 1, ascending = False)  # descending
print

print 'Sort by value'
obj = Series([4, 7, -3, 2])
print obj.sort_values()  # order is obsolete
print

print 'Sort a DataFrame by columns'
frame = DataFrame({'b':[4, 7, -3, 2], 'a':[0, 1, 0, 1]})
print frame
print frame.sort_values(by = 'b')  # sort_index(by = ...) is obsolete
print frame.sort_values(by = ['a', 'b'])
print

print 'rank: average rank of each value (starting from 1)'
obj = Series([7, -5, 7, 4, 2, 0, 4])
# ranks in sorted order: -5(1), 0(2), 2(3), 4(4), 4(5), 7(6), 7(7)
print obj.rank()
print obj.rank(method = 'first')  # ties are ranked in order of appearance, no averaging
print obj.rank(ascending = False, method = 'max')  # descending, ties take the largest rank, so -5 has rank 7
frame = DataFrame({'b':[4.3, 7, -3, 2],
                   'a':[0, 1, 0, 1],
                   'c':[-2, 5, 8, -2.5]})
print frame
print frame.rank(axis = 1)

Output:

Sort by index; for a DataFrame the axis can be specified.
a    1
b    2
c    3
d    0
dtype: int64
       d  a  b  c
one    4  5  6  7
three  0  1  2  3
       a  b  c  d
three  1  2  3  0
one    5  6  7  4
       d  c  b  a
three  0  3  2  1
one    4  7  6  5

Sort by value
2   -3
3    2
0    4
1    7
dtype: int64

Sort a DataFrame by columns
   a  b
0  0  4
1  1  7
2  0 -3
3  1  2
   a  b
2  0 -3
3  1  2
0  0  4
1  1  7
   a  b
2  0 -3
0  0  4
3  1  2
1  1  7

rank: average rank of each value (starting from 1)
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
0    2.0
1    7.0
2    2.0
3    4.0
4    5.0
5    6.0
6    4.0
dtype: float64
   a    b    c
0  0  4.3 -2.0
1  1  7.0  5.0
2  0 -3.0  8.0
3  1  2.0 -2.5
     a    b    c
0  2.0  3.0  1.0
1  1.0  3.0  2.0
2  2.0  1.0  3.0
3  2.0  3.0  1.0
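The three tie-breaking rules used above can be compared side by side. A minimal sketch, assuming Python 3 and a current pandas release, with the same sample values as the original cell; the result names are illustrative.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

avg = obj.rank()                                     # ties share the average rank
first = obj.rank(method="first")                     # ties ranked in order of appearance
desc_max = obj.rank(ascending=False, method="max")   # descending, ties take the largest rank

comparison = pd.DataFrame({"value": obj, "average": avg,
                           "first": first, "desc_max": desc_max})
print(comparison)
```

The two 7s occupy ranks 6 and 7, so `average` gives both 6.5 while `first` assigns 6 and 7 in order of appearance.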

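Basic functionality: indexes with duplicate values

- Indexing a label that occurs several times returns a Series; indexing a label that occurs once returns a scalar.

Code: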

# -*- coding: utf-8 -*-
import numpy as np
from pandas import Series, DataFrame

print 'Duplicate index labels'
obj = Series(range(5), index = ['a', 'a', 'b', 'b', 'c'])
print obj
print obj.index.is_unique  # check whether the index labels are unique
print obj['a'][0], obj.a[1]
df = DataFrame(np.random.randn(4, 3), index = ['a', 'a', 'b', 'b'])
print df
print df.ix['b'].ix[0]
print df.ix['b'].ix[1]

Output:

Duplicate index labels
a    0
a    1
b    2
b    3
c    4
dtype: int64
False
0 1
          0         1         2
a  1.166285  0.600093  1.043009
a  0.791440  0.764078  1.136826
b -1.624025 -0.384034  1.255976
b  0.164236 -0.181083  0.131282
0   -1.624025
1   -0.384034
2    1.255976
Name: b, dtype: float64
0    0.164236
1   -0.181083
2    0.131282
Name: b, dtype: float64
/Users/robot1/wfy/soft/anconda/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:13: DeprecationWarning:
.ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  del sys.path[0]
/Users/robot1/wfy/soft/anconda/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:14: DeprecationWarning:
.ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
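The return type therefore depends on the data, which is worth checking explicitly in code that may meet duplicate labels. A short sketch, assuming Python 3 and a current pandas release; `dup` and `single` are illustrative names.

```python
import pandas as pd

obj = pd.Series(range(5), index=["a", "a", "b", "b", "c"])

dup = obj["a"]      # label occurs twice: a Series with both entries
single = obj["c"]   # label occurs once: a scalar

print(obj.index.is_unique)
print(type(dup).__name__, len(dup))
print(single)
```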

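Summary and descriptive statistics

Summary and descriptive statistics: summarizing and computing descriptive statistics

- Common method options

- Common descriptive and summary statistics functions I

- Common descriptive and summary statistics functions II

- Differences between numeric and non-numeric data
- NA values are excluded automatically unless this is disabled through the skipna option

Code: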

# -*- coding: utf-8 -*-
import numpy as np
from pandas import Series, DataFrame

print 'Sum'
df = DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
               index = ['a', 'b', 'c', 'd'],
               columns = ['one', 'two'])
print df
print df.sum()  # sums each column, which is the default
print df.sum(axis = 1)  # the axis keyword switches to summing each row
print

print 'Mean'
print df.mean(axis = 1, skipna = False)  # mean of each row without skipping NaN
print df.mean(axis = 1)  # NaN is skipped by default
print

print 'Other'
print df.idxmax()  # operates on each column by default
print df.idxmax(axis = 1)  # axis = 1 operates on each row
print df.cumsum()  # operates on each column by default
print df.describe()  # operates on each column by default
obj = Series(['a', 'a', 'b', 'c'] * 4)
print obj
print obj.describe()

Output:

Sum
    one  two
a  1.40  NaN
b  7.10 -4.5
c   NaN  NaN
d  0.75 -1.3
one    9.25
two   -5.80
dtype: float64
a    1.40
b    2.60
c    0.00
d   -0.55
dtype: float64

Mean
a      NaN
b    1.300
c      NaN
d   -0.275
dtype: float64
a    1.400
b    1.300
c      NaN
d   -0.275
dtype: float64

Other
one    b
two    d
dtype: object
a    one
b    one
c    NaN
d    one
dtype: object
    one  two
a  1.40  NaN
b  8.50 -4.5
c   NaN  NaN
d  9.25 -5.8
            one       two
count  3.000000  2.000000
mean   3.083333 -2.900000
std    3.493685  2.262742
min    0.750000 -4.500000
25%    1.075000 -3.700000
50%    1.400000 -2.900000
75%    4.250000 -2.100000
max    7.100000 -1.300000
0     a
1     a
2     b
3     c
4     a
5     a
6     b
7     c
8     a
9     a
10    b
11    c
12    a
13    a
14    b
15    c
dtype: object
count     16
unique     3
top        a
freq       8
dtype: object
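The effect of `skipna` is easiest to see on the row that is partly missing (`a`) and the row that is entirely missing (`c`). A minimal sketch with the same data, assuming Python 3 and a current pandas release; the result names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.4, np.nan], [7.1, -4.5], [np.nan, np.nan], [0.75, -1.3]],
                  index=["a", "b", "c", "d"],
                  columns=["one", "two"])

col_sum = df.sum()                              # per column, NaN skipped
row_sum = df.sum(axis=1)                        # per row; an all-NaN row sums to 0
row_mean = df.mean(axis=1)                      # NaN skipped by default
row_mean_strict = df.mean(axis=1, skipna=False)  # any NaN makes the result NaN

print(pd.DataFrame({"sum": row_sum, "mean": row_mean, "strict": row_mean_strict}))
```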

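Summary and descriptive statistics: correlation and covariance

- Correlation coefficient: a statistical indicator of how closely variables are related (source: Baidu Baike).
- Covariance: intuitively, covariance is the expected value of the joint deviation of two variables from their means. If the two variables move in the same direction, that is, one tends to be above its expected value when the other is above its own, the covariance is positive. If they move in opposite directions, that is, one tends to be above its expected value when the other is below its own, the covariance is negative.

Code: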

# -*- coding: utf-8 -*-
import numpy as np
# from pandas_datareader import data , web
import pandas.io.data as web
from pandas import DataFrame

print 'Correlation and covariance'
# Covariance: https://zh.wikipedia.org/wiki/%E5%8D%8F%E6%96%B9%E5%B7%AE
all_data = {}
for ticker in ['AAPL', 'IBM', 'MSFT', 'GOOG']:
    all_data[ticker] = web.get_data_yahoo(ticker, '4/1/2016', '7/15/2015')

price = DataFrame({tic: data['Adj Close'] for tic, data in all_data.iteritems()})
volume = DataFrame({tic: data['Volume'] for tic, data in all_data.iteritems()})
returns = price.pct_change()
print returns.tail()
print returns.MSFT.corr(returns.IBM)
print returns.corr()  # correlation; every series has correlation 1 with itself
print returns.cov()  # covariance
print returns.corrwith(returns.IBM)
print returns.corrwith(volume)  # returns has no volume column, so correlate with the volume frame

Output:

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-61-a72f5c63b2a8> in <module>()
      3 import numpy as np
      4 # from pandas_datareader import data , web
----> 5 import pandas.io.data as web
      6 from pandas import DataFrame
      7

/Users/robot1/wfy/soft/anconda/anaconda2/lib/python2.7/site-packages/pandas/io/data.py in <module>()
      1 raise ImportError(
----> 2     "The pandas.io.data module is moved to a separate package "
      3     "(pandas-datareader). After installing the pandas-datareader package "
      4     "(https://github.com/pydata/pandas-datareader), you can change "
      5     "the import ``from pandas.io import data, wb`` to "

ImportError: The pandas.io.data module is moved to a separate package (pandas-datareader). After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.
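Because the cell above stops at the import, none of the correlation calls actually ran. The sketch below exercises the same calls on synthetic data, so that it works offline; it assumes Python 3 and a current pandas release. The tickers `AAA`, `BBB`, `CCC` and the random prices and volumes are invented for illustration and are not market data.

```python
import numpy as np
import pandas as pd

# Synthetic random-walk prices and random volumes (illustrative only).
rng = np.random.default_rng(0)
tickers = ["AAA", "BBB", "CCC"]
price = pd.DataFrame(100 + rng.normal(size=(50, 3)).cumsum(axis=0), columns=tickers)
volume = pd.DataFrame(rng.integers(1000, 5000, size=(50, 3)), columns=tickers)

returns = price.pct_change().dropna()   # first row has no previous price

pair = returns["AAA"].corr(returns["BBB"])   # correlation of two Series
corr_matrix = returns.corr()                 # pairwise correlation matrix
cov_matrix = returns.cov()                   # pairwise covariance matrix
vs_one = returns.corrwith(returns["AAA"])    # every column against one Series
vs_volume = returns.corrwith(volume)         # column by column against another frame

print(corr_matrix)
print(vs_volume)
```

Passing a DataFrame to `corrwith` matches columns by name and rows by index, which is why `volume` needs the same column labels as `returns`.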

Summary and descriptive statistics: unique values and membership

- Common methods

Code:

# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from pandas import Series, DataFrame

print 'Unique values'
obj = Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])
print obj
print obj.unique()  # the distinct values
print obj.value_counts()  # number of occurrences of each value
print

print 'Membership test'
mask = obj.isin(['b', 'c'])
print mask
print obj[mask]  # only the elements b and c
data = DataFrame({'Qu1':[1, 3, 4, 3, 4],
                  'Qu2':[2, 3, 1, 2, 3],
                  'Qu3':[1, 5, 2, 4, 4]})
print data
print data.apply(pd.value_counts).fillna(0)
print data.apply(pd.value_counts, axis = 1).fillna(0)

Output:

Unique values
0    c
1    a
2    d
3    a
4    a
5    b
6    b
7    c
8    c
dtype: object
['c' 'a' 'd' 'b']
c    3
a    3
b    2
d    1
dtype: int64

Membership test
0     True
1    False
2    False
3    False
4    False
5     True
6     True
7     True
8     True
dtype: bool
0    c
5    b
6    b
7    c
8    c
dtype: object
   Qu1  Qu2  Qu3
0    1    2    1
1    3    3    5
2    4    1    2
3    3    2    4
4    4    3    4
   Qu1  Qu2  Qu3
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
     1    2    3    4    5
0  2.0  1.0  0.0  0.0  0.0
1  0.0  0.0  2.0  0.0  1.0
2  1.0  1.0  0.0  1.0  0.0
3  0.0  1.0  1.0  1.0  0.0
4  0.0  0.0  1.0  2.0  0.0
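A compact version of the same steps, assuming Python 3 and a current pandas release. It calls the `Series.value_counts` method through a lambda instead of the top-level `pd.value_counts` function used in the original; `uniques`, `counts` and `per_column` are illustrative names.

```python
import pandas as pd

obj = pd.Series(list("cadaabbcc"))

uniques = obj.unique()        # distinct values in order of first appearance
counts = obj.value_counts()   # occurrences of each value
mask = obj.isin(["b", "c"])   # boolean membership mask

data = pd.DataFrame({"Qu1": [1, 3, 4, 3, 4],
                     "Qu2": [2, 3, 1, 2, 3],
                     "Qu3": [1, 5, 2, 4, 4]})

# Histogram of answers per column; values absent from a column count as 0.
per_column = data.apply(lambda col: col.value_counts()).fillna(0)
print(per_column)
```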

Handling missing data

- NA handling methods

- NaN (Not a Number) marks missing data in both floating-point and non-floating-point arrays
- None is also treated as NA

Code:

# -*- coding: utf-8 -*-
import numpy as np
from pandas import Series

print 'Values treated as null'
string_data = Series(['aardvark', 'artichoke', np.nan, 'avocado'])
print string_data
print string_data.isnull()  # test for missing values
string_data[0] = None
print string_data.isnull()

Output:

Values treated as null
0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object
0    False
1    False
2     True
3    False
dtype: bool
0     True
1    False
2     True
3    False
dtype: bool

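Handling missing data: filtering out missing data

- dropna
- Boolean indexing
- By default, DataFrame.dropna drops every row that contains a missing value
- The how argument controls the behavior, the axis argument selects the axis, and the thresh argument sets how many non-NA values a row must have to be kept

Code: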

# -*- coding: utf-8 -*-
import numpy as np
from numpy import nan as NA
from pandas import Series, DataFrame

# print 'Drop NA'
# data = Series([1, NA, 3.5, NA, 7 , None])
# print data.dropna()  # remove the NA values from the Series
# print data[data.notnull()]
# print

print 'Dropping NA from a DataFrame'
data = DataFrame([[1., 6.5, 3.], [1., NA, NA],
                  [NA, NA, NA], [NA, 6.5, 3.]])
print data
print data.dropna()  # by default a row is dropped as soon as it contains one NA
print data.dropna(how = 'all')  # how = 'all' drops a row only when every value is NA
data[4] = NA  # add a new column
print data.dropna(axis = 1, how = 'all')  # rows are the default; axis = 1 drops columns instead
data = DataFrame(np.random.randn(7, 3))
data.ix[:4, 1] = NA
data.ix[:2, 2] = NA
print data
print data.dropna(thresh = 2)  # each row needs at least 2 non-NA values

Output:

Dropping NA from a DataFrame
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
     0    1    2
0  1.0  6.5  3.0
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
          0         1         2
0 -0.181398       NaN       NaN
1 -1.153083       NaN       NaN
2 -0.072996       NaN       NaN
3  0.783739       NaN  0.324288
4 -1.277365       NaN -1.683068
5  2.305280  0.082071  0.175902
6 -0.167521 -0.043577 -0.959134
          0         1         2
3  0.783739       NaN  0.324288
4 -1.277365       NaN -1.683068
5  2.305280  0.082071  0.175902
6 -0.167521 -0.043577 -0.959134
/Users/robot1/wfy/soft/anconda/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:22: DeprecationWarning:
.ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
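The three row-dropping rules can be compared on the small fixed frame from the cell above. A minimal sketch, assuming Python 3 and a current pandas release; the result names are illustrative.

```python
import numpy as np
import pandas as pd

NA = np.nan
data = pd.DataFrame([[1., 6.5, 3.], [1., NA, NA],
                     [NA, NA, NA], [NA, 6.5, 3.]])

any_na = data.dropna()                # drop rows containing at least one NA
all_na = data.dropna(how="all")       # drop rows where every value is NA
at_least_two = data.dropna(thresh=2)  # keep rows with at least 2 non-NA values

print(list(any_na.index), list(all_na.index), list(at_least_two.index))
```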

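Handling missing data: filling in missing data

- fillna
- The inplace argument controls whether a new object is returned or the existing one is modified in place

Code: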

# -*- coding: utf-8 -*-
import numpy as np
from numpy import nan as NA
import pandas as pd
from pandas import Series, DataFrame, Index

print 'Fill with 0'
df = DataFrame(np.random.randn(7, 3))
print df
df.ix[:4, 1] = NA
df.ix[:2, 2] = NA
print df
print df.fillna(0)
df.fillna(0, inplace = False)  # leaves the original object unchanged
df.fillna(0, inplace = True)  # modifies the original object
print df
print

print 'Different fill values for different columns'
print df.fillna({1:0.5, 3:-1})  # column 3 does not exist
print

print 'Different fill methods'
df = DataFrame(np.random.randn(6, 3))
df.ix[2:, 1] = NA
df.ix[4:, 2] = NA
print df
print df.fillna(method = 'ffill')
print df.fillna(method = 'ffill', limit = 2)
print

print 'Fill with a statistic'
data = Series([1., NA, 3.5, NA, 7])
print data.fillna(data.mean())

Output:

Fill with 0
          0         1         2
0 -0.747530  0.733795  0.207921
1  0.329993 -0.092622 -0.274532
2 -0.498705  1.097721 -0.248666
3 -1.072368  1.281738  1.143063
4 -0.838184 -1.229197 -1.588577
5  0.386622 -1.056740  0.120941
6 -0.104685  0.062590 -0.682652
          0        1         2
0 -0.747530      NaN       NaN
1  0.329993      NaN       NaN
2 -0.498705      NaN       NaN
3 -1.072368      NaN  1.143063
4 -0.838184      NaN -1.588577
5  0.386622 -1.05674  0.120941
6 -0.104685  0.06259 -0.682652
          0        1         2
0 -0.747530  0.00000  0.000000
1  0.329993  0.00000  0.000000
2 -0.498705  0.00000  0.000000
3 -1.072368  0.00000  1.143063
4 -0.838184  0.00000 -1.588577
5  0.386622 -1.05674  0.120941
6 -0.104685  0.06259 -0.682652
          0        1         2
0 -0.747530  0.00000  0.000000
1  0.329993  0.00000  0.000000
2 -0.498705  0.00000  0.000000
3 -1.072368  0.00000  1.143063
4 -0.838184  0.00000 -1.588577
5  0.386622 -1.05674  0.120941
6 -0.104685  0.06259 -0.682652

Different fill values for different columns
          0        1         2
0 -0.747530  0.00000  0.000000
1  0.329993  0.00000  0.000000
2 -0.498705  0.00000  0.000000
3 -1.072368  0.00000  1.143063
4 -0.838184  0.00000 -1.588577
5  0.386622 -1.05674  0.120941
6 -0.104685  0.06259 -0.682652

Different fill methods
          0         1         2
0  0.037005 -0.554357 -0.968951
1  0.600986 -0.564576 -0.718096
2  1.268549       NaN  1.006229
3  0.813411       NaN  0.451489
4  0.097840       NaN       NaN
5 -1.944482       NaN       NaN
          0         1         2
0  0.037005 -0.554357 -0.968951
1  0.600986 -0.564576 -0.718096
2  1.268549 -0.564576  1.006229
3  0.813411 -0.564576  0.451489
4  0.097840 -0.564576  0.451489
5 -1.944482 -0.564576  0.451489
          0         1         2
0  0.037005 -0.554357 -0.968951
1  0.600986 -0.564576 -0.718096
2  1.268549 -0.564576  1.006229
3  0.813411 -0.564576  0.451489
4  0.097840       NaN  0.451489
5 -1.944482       NaN  0.451489

Fill with a statistic
0    1.000000
1    3.833333
2    3.500000
3    3.833333
4    7.000000
dtype: float64
/Users/robot1/wfy/soft/anconda/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:11: DeprecationWarning:
.ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  # This is added back by InteractiveShellApp.init_path()
/Users/robot1/wfy/soft/anconda/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:26: DeprecationWarning:
.ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
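The fill strategies can be checked on a small fixed frame instead of random numbers. This sketch assumes Python 3 and a current pandas release and uses the `ffill` method in place of `fillna(method='ffill')`; the frame contents and result names are made up for illustration.

```python
import numpy as np
import pandas as pd

NA = np.nan
df = pd.DataFrame({0: [1.0, 2.0, 3.0, 4.0],
                   1: [0.5, NA, NA, NA],
                   2: [NA, 7.0, NA, NA]})

zero = df.fillna(0)                 # one value for every gap
per_col = df.fillna({1: 0.5, 2: -1})  # a different value per column
forward = df.ffill()                # carry the last valid value forward
forward_limited = df.ffill(limit=2)  # fill at most 2 consecutive gaps

# Fill a Series with its own mean (NaN is skipped when computing the mean).
data = pd.Series([1., NA, 3.5, NA, 7.])
mean_filled = data.fillna(data.mean())

print(forward_limited)
print(mean_filled)
```

Forward filling cannot fill a gap at the top of a column, so the first value of column `2` stays missing.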

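Hierarchical indexing

- Lets you have multiple (two or more) index levels on one axis. Put abstractly, it lets you work with higher-dimensional data in a lower-dimensional form.
- A DataFrame can be reshaped with stack and unstack

Code: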

# -*- coding: utf-8 -*-
import numpy as np
from pandas import Series, DataFrame, MultiIndex

# print 'Hierarchical index on a Series'
# data = Series(np.random.randn(10),
#               index = [['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'd', 'd'],
#                        [1, 2, 3, 1, 2, 3, 1, 2, 2, 3]])
# print data
# print data.index
# print data.b
# print data['b':'c']
# print data[:2]
# print data.unstack()
# print data.unstack().stack()
# print

print 'Hierarchical index on a DataFrame'
frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns = [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
print frame
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']
print frame
print frame.ix['a', 1]
print frame.ix['a', 2]['Colorado']
print frame.ix['a', 2]['Ohio']['Red']
print

print 'Create a hierarchical index directly with MultiIndex'
print MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'], ['Gree', 'Red', 'Green']],
                             names = ['state', 'color'])

Output:

Hierarchical index on a DataFrame
     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11
state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11
state     color
Ohio      Green    0
          Red      1
Colorado  Green    2
Name: (a, 1), dtype: int64
color
Green    5
Name: (a, 2), dtype: int64
4

Create a hierarchical index directly with MultiIndex
MultiIndex(levels=[[u'Colorado', u'Ohio'], [u'Gree', u'Green', u'Red']],
           labels=[[1, 1, 0], [0, 2, 1]],
           names=[u'state', u'color'])
/Users/robot1/wfy/soft/anconda/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:27: DeprecationWarning:
.ix is deprecated. Please use .loc for label based indexing or .iloc for positional indexing
See the documentation here: http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
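With `.loc`, a position in a hierarchical index is addressed by a tuple holding one label per level. A sketch of the selections above under that convention, assuming Python 3 and a current pandas release; `row`, `cell` and `mi` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]])
frame.index.names = ["key1", "key2"]
frame.columns.names = ["state", "color"]

row = frame.loc[("a", 1)]                    # one row, addressed by both index levels
cell = frame.loc[("a", 2), ("Ohio", "Red")]  # one cell: row tuple and column tuple

# Build a MultiIndex directly from one array per level.
mi = pd.MultiIndex.from_arrays([["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
                               names=["state", "color"])
print(row)
print(cell)
print(mi)
```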

Hierarchical indexing: reordering levels

- Swapping index levels
- Re-sorting by index level

Code:

# -*- coding: utf-8 -*-
import numpy as np
from pandas import Series, DataFrame

print 'Swap index levels'
frame = DataFrame(np.arange(12).reshape((4, 3)),
                  index = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                  columns = [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
frame.index.names = ['key1', 'key2']
frame_swapped = frame.swaplevel('key1', 'key2')
print frame_swapped
print frame_swapped.swaplevel(0, 1)
print

print 'Sort by index level'
print frame.sortlevel('key2')
print frame.swaplevel(0, 1).sortlevel(0)

Output:

Swap index levels
           Ohio     Colorado
          Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11
           Ohio     Colorado
          Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

Sort by index level
           Ohio     Colorado
          Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11
           Ohio     Colorado
          Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
/Users/robot1/wfy/soft/anconda/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:17: FutureWarning: sortlevel is deprecated, use sort_index(level= ...)
/Users/robot1/wfy/soft/anconda/anaconda2/lib/python2.7/site-packages/ipykernel_launcher.py:18: FutureWarning: sortlevel is deprecated, use sort_index(level= ...)
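The FutureWarning above points to `sort_index(level=...)` as the replacement for `sortlevel`. A minimal sketch of that form, assuming Python 3 and a current pandas release; `by_key2` and `swapped` are illustrative names.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[["a", "a", "b", "b"], [1, 2, 1, 2]],
                     columns=[["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]])
frame.index.names = ["key1", "key2"]

# Sort rows by the key2 level; the remaining level breaks ties.
by_key2 = frame.sort_index(level="key2")

# Swap the two levels, then sort by the new outer level.
swapped = frame.swaplevel("key1", "key2").sort_index(level=0)

print(by_key2)
print(swapped)
```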

Hierarchical indexing: summary statistics by level


- Specify the index level and the axis

Code:

    # -*- coding: utf-8 -*-
    import numpy as np
    from pandas import DataFrame

    print 'Computing statistics by a given key'
    frame = DataFrame(np.arange(12).reshape((4, 3)),
                      index = [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                      columns = [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']])
    frame.index.names = ['key1', 'key2']
    print frame
    print frame.sum(level = 'key2')   # add up the rows that share the same key2 value

Output:

    Computing statistics by a given key
               Ohio     Colorado
              Green Red    Green
    key1 key2
    a    1        0   1        2
         2        3   4        5
    b    1        6   7        8
         2        9  10       11
          Ohio     Colorado
         Green Red    Green
    key2
    1        6   8       10
    2       12  14       16
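In recent pandas releases the `level` argument of `sum` is no longer available, and the same aggregation is written with `groupby(level=...)`. The sketch below assumes Python 3 and a recent pandas. The column-wise variant goes through a transpose, which is one possible way to aggregate along the other axis rather than the only one, and the column level names (`state`, `color`) are set here because the sketch needs them.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
    columns=[['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
)
frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

# Row axis: add up the rows that share the same key2 value.
by_key2 = frame.groupby(level='key2').sum()

# Column axis: transpose, group the former columns by color, transpose back.
by_color = frame.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```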

Hierarchical indexing: using DataFrame columns


- Turn the specified columns into the index
- Drop or keep the original columns
- Undo with reset_index

Code:

    # -*- coding: utf-8 -*-
    import numpy as np
    from pandas import DataFrame

    print 'Building a hierarchical index from columns'
    frame = DataFrame({'a':range(7),
                       'b':range(7, 0, -1),
                       'c':['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                       'd':[0, 1, 2, 0, 1, 2, 3]})
    print frame
    print frame.set_index(['c', 'd'])                # turn columns c and d into the index
    print frame.set_index(['c', 'd'], drop = False)  # the columns are kept as well
    frame2 = frame.set_index(['c', 'd'])
    print frame2.reset_index()                       # move the index levels back into columns

Output:

    Building a hierarchical index from columns
       a  b    c  d
    0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
    3  3  4  two  0
    4  4  3  two  1
    5  5  2  two  2
    6  6  1  two  3
           a  b
    c   d
    one 0  0  7
        1  1  6
        2  2  5
    two 0  3  4
        1  4  3
        2  5  2
        3  6  1
           a  b    c  d
    c   d
    one 0  0  7  one  0
        1  1  6  one  1
        2  2  5  one  2
    two 0  3  4  two  0
        1  4  3  two  1
        2  5  2  two  2
        3  6  1  two  3
         c  d  a  b
    0  one  0  0  7
    1  one  1  1  6
    2  one  2  2  5
    3  two  0  3  4
    4  two  1  4  3
    5  two  2  5  2
    6  two  3  6  1
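As the last table in the output shows, `reset_index` puts the former index levels back as the leading columns, so the column order differs from the original frame. The short sketch below checks that the round trip preserves the data once the columns are put back in order. It assumes Python 3 and a recent pandas; the name `restored` is illustrative.

```python
import pandas as pd

frame = pd.DataFrame({
    'a': range(7),
    'b': range(7, 0, -1),
    'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
    'd': [0, 1, 2, 0, 1, 2, 3],
})

indexed = frame.set_index(['c', 'd'])   # c and d become the two index levels
restored = indexed.reset_index()        # c and d return as the first two columns

print(list(restored.columns))
# Reordering the columns recovers the original frame.
print(restored[['a', 'b', 'c', 'd']].equals(frame))
```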

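Other topics

Other topics: integer indexing


- How the ambiguity arises
- Reliable, position-based indexing that does not depend on the index type

Code: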

    # -*- coding: utf-8 -*-
    import numpy as np
    import sys
    from pandas import Series, DataFrame

    print 'Integer indexing'
    ser = Series(np.arange(3.))
    print ser
    try:
        print ser[-1]  # ambiguous: is -1 a label or a position?
    except KeyError:
        print sys.exc_info()[0]
    ser2 = Series(np.arange(3.), index = ['a', 'b', 'c'])
    print ser2[-1]       # no ambiguity with a non-integer index: -1 is a position
    ser3 = Series(range(3), index = [-5, 1, 3])
    print ser3.iloc[2]   # avoids the ambiguity of a plain [2]
    print

    print 'Integer indexing on a DataFrame'
    frame = DataFrame(np.arange(6).reshape((3, 2)), index = [2, 0, 1])
    print frame
    print frame.iloc[0]      # first row by position
    print frame.iloc[:, 1]   # second column by position

Output:

    Integer indexing
    0    0.0
    1    1.0
    2    2.0
    dtype: float64
    <type 'exceptions.KeyError'>
    2.0
    2

    Integer indexing on a DataFrame
       0  1
    2  0  1
    0  2  3
    1  4  5
    0    0
    1    1
    Name: 2, dtype: int64
    2    1
    0    3
    1    5
    Name: 1, dtype: int64
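One way to sidestep the ambiguity entirely is to state the intent on every lookup: `.loc` for labels and `.iloc` for positions. The sketch below contrasts the two on the same data as above. It assumes Python 3 and a recent pandas; the flag `label_lookup_failed` is an illustrative name.

```python
import numpy as np
import pandas as pd

ser = pd.Series(np.arange(3.))      # integer labels 0, 1, 2

last_by_position = ser.iloc[-1]     # position -1 is the last element

# Label -1 does not exist in the index, so a label lookup raises KeyError.
try:
    ser.loc[-1]
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# With a shuffled integer index, label 0 and position 0 are different rows.
frame = pd.DataFrame(np.arange(6).reshape((3, 2)), index=[2, 0, 1])
row_labelled_0 = frame.loc[0]       # the row whose label is 0
row_at_position_0 = frame.iloc[0]   # the first row, whose label is 2

print(last_by_position, label_lookup_failed)
print(row_labelled_0.tolist(), row_at_position_0.tolist())
```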

Other topics: Panel data


- Create a Panel object from a three-dimensional ndarray
- Select the data you need with ix[…]
- Access order: item -> major -> minor
- Present panel data in stacked form

Code:

    # -*- coding: utf-8 -*-
    import numpy as np
    import pandas as pd
    import pandas.io.data as web
    from pandas import Series, DataFrame, Index, Panel

    pdata = Panel(dict((stk, web.get_data_yahoo(stk, '1/1/2016', '1/15/2016'))
                       for stk in ['AAPL', 'GOOG', 'BIDU', 'MSFT']))
    print pdata
    pdata = pdata.swapaxes('items', 'minor')
    print pdata
    print

    print "Access order: # Item -> Major -> Minor"
    print pdata['Adj Close']
    print pdata[:, '1/5/2016', :]
    print pdata['Adj Close', '1/6/2016', :]
    print

    print 'Converting between Panel and DataFrame'
    stacked = pdata.ix[:, '1/7/2016':, :].to_frame()
    print stacked
    print stacked.to_panel()

Output:

    ---------------------------------------------------------------------------
    ImportError                               Traceback (most recent call last)
    <ipython-input-83-82a16090a331> in <module>()
          3 import numpy as np
          4 import pandas as pd
    ----> 5 import pandas.io.data as web
          6 from pandas import Series, DataFrame, Index, Panel
          7 
    /Users/robot1/wfy/soft/anconda/anaconda2/lib/python2.7/site-packages/pandas/io/data.py in <module>()
          1 raise ImportError(
    ----> 2     "The pandas.io.data module is moved to a separate package "
          3     "(pandas-datareader). After installing the pandas-datareader package "
          4     "(https://github.com/pydata/pandas-datareader), you can change "
          5     "the import ``from pandas.io import data, wb`` to "

    ImportError: The pandas.io.data module is moved to a separate package (pandas-datareader). After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.

The example stops at the import: as the error message states, pandas.io.data has moved to the separate pandas-datareader package, so after installing that package the import should be changed to the `from pandas_datareader import ...` form it gives.
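`Panel` itself was removed from pandas in later releases (0.25 and up), so the code above cannot run on a current installation even with the import fixed. A DataFrame with a two-level row index holds the same information as the stacked form that `to_frame()` produces. The following is a sketch of that idea only, assuming Python 3 and a recent pandas: the numbers are made-up placeholders rather than real market data, no download is attempted, and names such as `frames`, `stacked`, `adj_close` and `one_day` are hypothetical.

```python
import numpy as np
import pandas as pd

dates = pd.date_range('2016-01-04', periods=3, freq='D')
tickers = ['AAPL', 'GOOG']

# Placeholder values, NOT real prices: one small frame per ticker.
frames = {
    tkr: pd.DataFrame(
        {'Adj Close': np.arange(3.0) + 10 * i,
         'Volume': np.arange(3) + 100 * i},
        index=dates,
    )
    for i, tkr in enumerate(tickers)
}

# Stack the per-ticker frames into one frame indexed by (date, ticker),
# which mirrors the (major, minor) layout of Panel.to_frame().
stacked = pd.concat(frames, names=['ticker', 'date']).swaplevel().sort_index()

# Counterpart of pdata['Adj Close']: one field as a date x ticker table.
adj_close = stacked['Adj Close'].unstack('ticker')

# Counterpart of pdata[:, date, :]: every field for every ticker on one day.
one_day = stacked.xs(pd.Timestamp('2016-01-05'), level='date')

print(stacked)
print(adj_close)
print(one_day)
```

Keeping the fields as columns and (date, ticker) as the row index means the three Panel axes map onto ordinary `loc`, `xs` and `unstack` calls.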
