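A 10-Minute Quick Start to Pandas
Pandas is one of my favorite libraries. With labeled columns and indexes, Pandas lets us work with data in a way that everyone can understand. It lets us import data effortlessly from files such as csv, and quickly apply complex transformations and filters to it. Pandas really is superb.
I think that together with Numpy and Matplotlib it forms a powerful foundation for data exploration and analysis in Python. Scipy (covered in the next post) is of course a major player too and an absolutely brilliant library, but I feel the first three are the real pillars of scientific computing in Python.
So let's get straight into the third post in the Python scientific computing series and take a look at Pandas. If you haven't read the other posts yet, don't forget to check them out.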
Importing Pandas
The first thing to do, of course, is to bring on our star: Pandas.

```python
import pandas as pd  # This is the standard
```

This is the standard way to import pandas. We don't want to write the full name pandas all the time, but keeping code concise and avoiding naming conflicts both matter, so we compromise on pd. If you look at other people's code that uses pandas, you will see this import.
Data types in Pandas
Pandas is built on two data types: the series and the dataframe.
A series is a one-dimensional data type in which every element has its own label. If you read the Numpy post in this series, you can think of it as a numpy array of labeled elements. Labels can be numbers or strings.
A dataframe is a two-dimensional, tabular data structure. A Pandas dataframe can store many different types of data, and each axis has labels. You can think of it as a dictionary of series.
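To make the two types concrete, here is a minimal sketch that builds a series and then a dataframe from two series. The labels and numbers are made up for illustration and are not taken from the rainfall dataset.

```python
import pandas as pd

# A series: one-dimensional, every element has a label (illustrative values).
rain = pd.Series([1182, 1098, 1156], index=['1980/81', '1981/82', '1982/83'])

# A dataframe: a table whose columns are series that share one index.
frame = pd.DataFrame({
    'rain': rain,
    'outflow': pd.Series([5408, 5112, 5701], index=rain.index),
})

print(rain['1981/82'])      # look up one element by its label
print(type(frame['rain']))  # each column is itself a series
print(frame.shape)          # (rows, columns)
```

Pulling a column back out of the dataframe returns a series, which is why the "dictionary of series" picture is a handy one.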
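Importing data into Pandas
Before we can modify, explore and analyze data, we have to import it. Thanks to Pandas, this is even easier than in Numpy.
I encourage you to find data that interests you and practice with it. Your own (or another) country's data website is a good source. If I had to name examples, the UK government data and US government data sites would be my first picks. Kaggle is a good source too.
I will be using UK rainfall data, which is easy to download from the UK government website. I have also downloaded some data on rainfall in Japan.
UK rainfall data: download link. Sorry, I really could not find the Japanese data.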
```python
# Reading a csv into Pandas.
df = pd.read_csv('uk_rain_2014.csv', header=0)
```

Translator's note: if your dataset contains Chinese text, it is best to pass an explicit encoding that matches the file, for example encoding='gbk' for a GBK-encoded file, to avoid garbled characters. The same applies when exporting data later.
Here we imported data from a csv file and stored it in a dataframe. This step is very simple: you just call read_csv and pass in the path to the file. The header keyword tells Pandas which row holds the column names. If there are no column names, set it to None. Pandas is quite smart, so this argument can often be omitted.
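The effect of header is easiest to see on a tiny file. The sketch below uses io.StringIO as a stand-in for files on disk, and the csv contents are invented for illustration. The names argument is one way to supply column labels when the file has none.

```python
import io
import pandas as pd

# Two in-memory "files": one with a header row, one without (made-up data).
with_header = io.StringIO("year,rain\n1980,1182\n1981,1098\n")
without_header = io.StringIO("1980,1182\n1981,1098\n")

# header=0: the first row holds the column names.
df_a = pd.read_csv(with_header, header=0)

# header=None: every row is data, so we provide the names ourselves.
df_b = pd.read_csv(without_header, header=None, names=['year', 'rain'])

print(df_a.columns.tolist())
print(df_b.columns.tolist())
print(len(df_a), len(df_b))
```

Both calls should give the same two-row dataframe.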
Getting your data ready to explore and analyze
Now that the data is in Pandas, we probably want a quick look at it to get some basic information and find some direction before we really start exploring.
To look at the first x rows of data:

```python
# Getting first x rows.
df.head(5)
```

We just call the head() function and pass in the number of rows we want to see.
This returns the first five rows of the dataframe.
You might also want to see the last few rows:

```python
# Getting last x rows.
df.tail(5)
```

As with head, we just call tail and pass in the number of rows we want to see. Note that it does not start at the last row and work backwards; the rows are shown in their original order.
This returns the last five rows of the dataframe.
You usually look up columns in Pandas by name. This is nice and easy to use, but sometimes column names are too long, such as the full text of a survey question. Things get much easier once you shorten them.

```python
# Changing column labels.
df.columns = ['water_year', 'rain_octsep', 'outflow_octsep',
              'rain_decfeb', 'outflow_decfeb', 'rain_junaug', 'outflow_junaug']

df.head(5)
```

One thing to note is that I deliberately used no spaces or dashes in the column labels. You will see later that naming columns this way saves us some typing.
The data is the same as before; only the column names have changed.
Another thing you will usually want to know about your data is how many records it has. In Pandas, one record corresponds to one row, so we can call the len function on the dataset, which returns its total number of rows:

```python
# Finding out how many rows dataset has.
len(df)
```

The code above returns an integer giving the number of rows. For my dataset, the value is 33.
You may also want some basic statistics on your dataset. In Pandas this is ridiculously simple:

```python
# Finding out basic statistical information on your dataset.
pd.options.display.float_format = '{:,.3f}'.format  # Limit output to 3 decimal places.
df.describe()
```

This returns a table of statistics such as the count, mean and standard deviation.
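As a small self-contained check of what len and describe report, here is a sketch on a three-row dataframe with invented numbers. The shape attribute, which is not used in the post, gives rows and columns in one go.

```python
import pandas as pd

# Made-up numbers, chosen so the statistics are easy to verify by hand.
demo = pd.DataFrame({'rain': [1000, 1100, 1200], 'outflow': [5000, 5200, 5400]})

print(len(demo))   # number of rows
print(demo.shape)  # (rows, columns)

# describe() returns a dataframe, so individual statistics can be looked up.
stats = demo.describe()
print(stats.loc[['count', 'mean', 'std'], 'rain'])
```

Because describe returns an ordinary dataframe, you can select from it with the same tools used on the data itself.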
Filtering
When exploring data, you will often want to pull out a particular sample of it. For example, if you have a survey on job satisfaction, you might want to extract the data for people in a particular industry or age group.
Pandas offers several ways to extract the information we want:
Sometimes you want an entire column. Using the column label makes this very easy:

```python
# Getting a column by label
df['rain_octsep']
```

Note that when we extract a column we get a series, not a dataframe. Remember that you can think of a dataframe as a dictionary of series, so pulling out a column gives us a series.
Remember what I pointed out when naming the column labels? With no spaces, dashes or similar characters in the names, we can access the columns of the dataset the same way we access object attributes: with just a dot.

```python
# Getting a column by label using .
df.rain_octsep
```

This returns exactly the same result as the previous example: the column we selected.
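Dot access is a convenience, and it has limits worth knowing. The sketch below, on an invented dataframe, shows that both forms return the same series, and that a column whose name collides with a dataframe method (here a hypothetical column called count) can only be reached with brackets. A label containing a space cannot be written after a dot at all.

```python
import pandas as pd

# 'count' is deliberately chosen to clash with the DataFrame.count method.
demo = pd.DataFrame({'rain_octsep': [1182, 1098], 'count': [3, 4]})

# Dot access and bracket access give the same series.
same = demo.rain_octsep.equals(demo['rain_octsep'])
print(same)

# The attribute is the method, not the column; brackets always mean the column.
print(callable(demo.count))
print(demo['count'].tolist())
```

Bracket access is the safer default in scripts; dot access is handy when exploring interactively.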
If you read the Numpy post in this series, you may remember a technique called boolean masking, where running a conditional on an array gives us an array of booleans. We can do this in Pandas too.

```python
# Creating a series of booleans based on a conditional
df.rain_octsep < 1000  # Or df['rain_octsep'] < 1000
```

The code above returns a series of booleans: True where the October to September rainfall was less than 1000 mm, False where it was 1000 mm or more.
We can use these conditional expressions to filter the existing dataframe.

```python
# Using a series of booleans to filter
df[df.rain_octsep < 1000]
```

This returns only the records where the October to September rainfall was less than 1000 mm.
You can also filter with compound conditional expressions:

```python
# Filtering by multiple conditionals
df[(df.rain_octsep < 1000) & (df.outflow_octsep < 4000)]  # Can't use the keyword 'and'
```

This returns only the records where rain_octsep is less than 1000 and outflow_octsep is less than 4000.
One important note: you cannot use the keyword and here. Python's and asks each series for a single truth value, which Pandas refuses to give, so you must use & instead. The parentheses are required because & binds more tightly than comparison operators such as <.
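The difference between and and & can be checked directly. This is a minimal sketch on invented numbers; it also shows | (or) and ~ (not), which work the same way.

```python
import pandas as pd

demo = pd.DataFrame({
    'rain_octsep': [900, 1100, 950],
    'outflow_octsep': [3500, 3900, 4500],
})

low_rain = demo.rain_octsep < 1000        # True, False, True
low_outflow = demo.outflow_octsep < 4000  # True, True, False

# 'and' needs one truth value per operand, so pandas raises ValueError.
try:
    demo[low_rain and low_outflow]
    and_error = None
except ValueError as exc:
    and_error = exc

both = demo[low_rain & low_outflow]        # element-wise "and"
either = demo[low_rain | low_outflow]      # element-wise "or"
neither = demo[~(low_rain | low_outflow)]  # element-wise "not"

print(type(and_error).__name__)
print(both.index.tolist(), len(either), len(neither))
```

Naming the masks first, as done here, keeps long filters readable and sidesteps the precedence problem.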
If your data contains strings, good news: you can filter with string methods too:

```python
# Filtering by string methods
df[df.water_year.str.startswith('199')]
```

Note that you have to go through .str.[string method]; you cannot call the string method directly on the series. The code above returns all the records from the 1990s.
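Here is a small sketch of the .str accessor on invented water_year labels. Besides startswith, it uses endswith and slice, which are other methods available through the same accessor.

```python
import pandas as pd

demo = pd.DataFrame({'water_year': ['1989/90', '1990/91', '1999/00', '2000/01']})

# Each .str method runs on every element and returns a new series.
nineties = demo[demo.water_year.str.startswith('199')]
ends_in_00 = demo[demo.water_year.str.endswith('/00')]
first_years = demo.water_year.str.slice(0, 4)

print(nineties.water_year.tolist())
print(ends_in_00.water_year.tolist())
print(first_years.tolist())
```

The boolean results can be combined with & and | just like the numeric conditions earlier.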
Indexing
The previous section showed how to get data through operations on columns, but rows in Pandas have labels too. Row labels can be integers or strings, and the way to retrieve a row depends on which kind of reference you use.
If your rows have the default numeric labels, you can reference them by position with iloc:

```python
# Getting a row via a numerical index
df.iloc[30]
```

iloc only works with integer positions. It returns the given row as a series, with each column of the row as one element of that series.
Perhaps your dataset has a column of years or ages, and you would like to reference rows by those values. In that case we can set a new index (or several):

```python
# Setting a new index from an existing column
df = df.set_index(['water_year'])
df.head(5)
```

The code above sets the water_year column as the index. Note that the column name is passed in a list, even though there is only one element in this example. If you want a multi-level index, just add more column names to the list.
In the example above, the index we set consists of strings, which means iloc cannot look rows up by these labels (it still works by position). So what do we use? loc.

```python
# Getting a row via a label-based index
df.loc['2000/01']
```

Like iloc, loc returns the row you referenced. The only difference is that you are now referencing by label rather than by position.
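The sketch below puts set_index, loc and iloc side by side on an invented dataframe, and also shows the multi-column case mentioned above. The region column is hypothetical and exists only to give the second index level something to hold.

```python
import pandas as pd

demo = pd.DataFrame({
    'water_year': ['1999/00', '2000/01', '2001/02'],
    'region': ['north', 'north', 'south'],  # hypothetical extra column
    'rain': [1200, 1300, 1100],
})

by_year = demo.set_index(['water_year'])
row_by_label = by_year.loc['2000/01']  # look up by index label
row_by_position = by_year.iloc[1]      # second row, whatever its label is

# Two columns in the list give a two-level index; loc then takes a tuple.
by_year_region = demo.set_index(['water_year', 'region'])
rain_value = by_year_region.loc[('2001/02', 'south'), 'rain']

print(row_by_label['rain'], row_by_position['rain'], rain_value)
```

Here the label lookup and the position lookup land on the same row, which will not be true in general once the data is sorted or filtered.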
There is one more commonly used way to reference rows: ix. If loc is label-based and iloc is position-based, what is ix? In fact, ix is label-based, but it falls back to integer positions.

```python
# Getting a row via a label-based or numerical index
# Note: .ix was removed in pandas 1.0; on current versions use df.loc['1999/00'].
df.ix['1999/00']  # Label based with numerical index fallback *Not recommended
```

Like iloc and loc, it returns the row you asked for.
If ix can do the job of both loc and iloc, why use the other two? The main reason is that ix is slightly unpredictable. Remember that integer positions are only a fallback? That can lead ix to do odd things, such as interpreting a number as a position. Using iloc and loc is safe, predictable and reassuring. I should point out, though, that ix was a little faster than iloc and loc. Also note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so current code has to use loc or iloc.
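The ambiguity that made ix risky shows up whenever the labels are themselves integers. A minimal sketch, with an invented series whose labels run in the opposite order to the positions:

```python
import pandas as pd

# Labels 2, 1, 0 sit at positions 0, 1, 2.
s = pd.Series(['a', 'b', 'c'], index=[2, 1, 0])

by_label = s.loc[0]      # the element whose label is 0
by_position = s.iloc[0]  # the first element

print(by_label, by_position)
```

With loc and iloc the intent is spelled out in the code, so the number 0 can never be silently read the wrong way.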
Sorting the index is often useful. In Pandas we can call sort_index on a dataframe to do it.

```python
df.sort_index(ascending=False).head(5)  # inplace=True to apply the sorting in place
```

My index was already in order, so for demonstration I set the ascending argument to False, which puts my data in descending order.
When you set a column as the index, it is no longer part of the data. If you want to turn the index back into data, call reset_index, the opposite of set_index:

```python
# Returning an index to data
df = df.reset_index('water_year')
df.head(5)
```

This statement turns the index back into an ordinary column.
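The round trip from column to index and back can be followed on a small invented dataframe. The sketch assumes the default behavior in which sort_index returns a new object and leaves the original untouched.

```python
import pandas as pd

demo = pd.DataFrame({
    'water_year': ['1980/81', '1981/82', '1982/83'],
    'rain': [1182, 1098, 1156],
})

indexed = demo.set_index('water_year')             # column becomes the index
descending = indexed.sort_index(ascending=False)   # new, sorted dataframe
restored = descending.sort_index().reset_index()   # index becomes a column again

print(indexed.columns.tolist())     # water_year is no longer a column
print(descending.index.tolist())
print(restored.columns.tolist())
```

Since nothing is modified in place, each intermediate result has to be assigned to a name if you want to keep it.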
Applying functions to datasets
Sometimes you want to change the data in your dataset or run some operation over it. Say you have a column of years and you need a new column giving the decade each year belongs to. Pandas has two very useful functions for this, apply and applymap.

```python
# Applying a function to a column
def base_year(year):
    base_year = year[:4]
    base_year = pd.to_datetime(base_year).year
    return base_year

df['year'] = df.water_year.apply(base_year)
df.head(5)
```

The code above creates a column called year that holds only the first year from each water_year entry. This is how apply is used: it applies a function to a column of data. If you want to apply a function to every element of the whole dataset, use applymap.
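To contrast the two, here is a hedged sketch on invented data. It simplifies base_year to a plain int conversion rather than pd.to_datetime, and it accounts for the fact that pandas 2.1 renamed DataFrame.applymap to DataFrame.map, picking whichever one the installed version provides.

```python
import pandas as pd

demo = pd.DataFrame({
    'water_year': ['1980/81', '1981/82'],
    'rain': [1182.4, 1098.6],
    'outflow': [5408.2, 5112.7],
})

def base_year(water_year):
    """Return the starting year of a label such as '1980/81' as an int."""
    return int(water_year[:4])

# apply: run a function over one column (a series).
demo['year'] = demo.water_year.apply(base_year)

# Element-wise over a whole dataframe: DataFrame.map on pandas >= 2.1,
# DataFrame.applymap on older versions.
numbers = demo[['rain', 'outflow']]
elementwise = numbers.map if hasattr(numbers, 'map') else numbers.applymap
rounded = elementwise(round)

print(demo['year'].tolist())
print(rounded)
```

For simple arithmetic, vectorized expressions such as demo.rain * 2 are usually faster than either function; apply is most useful when the logic does not vectorize.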
Manipulating a dataset's structure
Another common task is to restructure your data so that the dataset takes a more convenient and/or more useful shape.
The easiest way to get a grip on these transformations is to watch them happen. More than anything else in this post, the next few operations need you to follow along and practice in order to grasp them.
First up, groupby:

```python
# Manipulating structure (groupby, unstack, pivot)
# Groupby
df.groupby(df.year // 10 * 10).max()
```

groupby groups the dataset by the column you choose. The example above groups by decade. On its own that is not much use, though; we have to call a function on the groups, such as max, min, mean and so on. In this example we get the maximum values for each decade.
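The decade trick (integer-divide by 10, multiply by 10) is easy to verify on a handful of invented rows. This sketch computes both the maximum and the mean per decade.

```python
import pandas as pd

demo = pd.DataFrame({
    'year': [1988, 1989, 1990, 1995, 2001],
    'rain': [1100, 1300, 1000, 900, 1250],
})

# 1988 -> 1980, 1995 -> 1990, 2001 -> 2000
decade = demo.year // 10 * 10

wettest = demo.groupby(decade)['rain'].max()
average = demo.groupby(decade)['rain'].mean()

print(wettest)
print(average)
```

The grouping key does not have to be an existing column; any series aligned with the dataframe, like decade here, will do.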
You can also group by multiple columns:

```python
# Grouping by multiple columns
decade_rain = df.groupby([df.year // 10 * 10, df.rain_octsep // 1000 * 1000])[
    ['outflow_octsep', 'outflow_decfeb', 'outflow_junaug']].mean()
decade_rain
```

Next is unstack, which can be a little confusing at first. It takes one level of the row index and turns it into column labels. It is best seen in action:

```python
# Unstacking
decade_rain.unstack(0)
```

This statement reshapes the dataframe from the previous example. It takes level 0 of the index, the year level, and turns it into column labels.
Let's do it once more, this time with level 1, the rain_octsep level:

```python
# More unstacking
decade_rain.unstack(1)
```
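Since the output tables are not reproduced here, a tiny example may help. This sketch groups invented data by two keys and then unstacks each level in turn; the column names decade and rain_band are made up for the illustration.

```python
import pandas as pd

demo = pd.DataFrame({
    'decade': [1980, 1980, 1990, 1990],
    'rain_band': [0, 1000, 0, 1000],
    'outflow': [4000.0, 5500.0, 4200.0, 5300.0],
})

# Grouping by two keys gives a series with a two-level index.
grouped = demo.groupby(['decade', 'rain_band'])['outflow'].mean()

decades_as_columns = grouped.unstack(0)  # level 0 (decade) moves to the columns
bands_as_columns = grouped.unstack(1)    # level 1 (rain_band) moves to the columns

print(decades_as_columns)
print(bands_as_columns)
```

Whichever level is unstacked becomes the column header, and the remaining level stays as the row index.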
Before the next operation, let's create a dataframe to demonstrate on:

```python
# Create a new dataframe containing entries which
# have rain_octsep values of greater than 1250
high_rain = df[df.rain_octsep > 1250]
high_rain
```

The code above produces the dataframe we will use to demonstrate pivoting.
Pivoting is really just a combination of operations we have already seen. First it sets a new index (set_index()), then it sorts that index (sort_index()), and finally it calls unstack. Put together, those steps are pivot. See if you can work out what the code below is doing:

```python
# Pivoting
# does set_index, sort_index and unstack in a row
high_rain.pivot(index='year', columns='rain_octsep')[
    ['outflow_octsep', 'outflow_decfeb', 'outflow_junaug']].fillna('')
```

Note the .fillna('') at the end. pivot produces a lot of empty entries, that is, entries whose value is NaN. I personally find a dataset full of NaN annoying, so I used fillna(''). You could fill with something else instead, say 0. We could also use dropna(how='any') to drop the rows that contain NaN, but here that would delete all of our data, so we don't.
The resulting dataframe shows the outflow for every record with rainfall over 1250. Admittedly this is not the best example of a practical use of pivot, but hopefully you get the idea. See what you can come up with on your own dataset.
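The claim that pivot is set_index, sort_index and unstack rolled together can be checked on a small invented table. This sketch uses keyword arguments for pivot, which recent pandas versions require, and a hypothetical season column.

```python
import pandas as pd

demo = pd.DataFrame({
    'year': [2001, 2000, 2001, 2000],
    'season': ['winter', 'winter', 'summer', 'summer'],  # hypothetical column
    'outflow': [5400, 5100, 3200, 3000],
})

pivoted = demo.pivot(index='year', columns='season', values='outflow')

# The same reshaping, spelled out step by step.
by_hand = (demo.set_index(['year', 'season'])['outflow']
               .sort_index()
               .unstack('season'))

print(pivoted)
print(by_hand)
```

Both routes should produce one row per year and one column per season. pivot raises an error if an index/column pair occurs more than once, in which case pivot_table with an aggregation function is the usual alternative.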
Merging datasets
Sometimes you have two related datasets that you want to compare side by side or combine. No problem, that is easy in Pandas:

```python
# Merging two datasets together
rain_jpn = pd.read_csv('jpn_rain.csv')
rain_jpn.columns = ['year', 'jpn_rainfall']

uk_jpn_rain = df.merge(rain_jpn, on='year')
uk_jpn_rain.head(5)
```

First you specify the column to merge on with the on keyword. You can often omit this argument, in which case Pandas merges on the columns the two dataframes have in common.
The two datasets are merged on the year column. The jpn_rain dataset has only a year column and a rainfall column, so after merging on year, only its rainfall column is added to the UK_rain dataset.
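Since the Japanese file is not available, here is a sketch of the same merge on two invented dataframes. It also shows the how argument, which is not used in the post and controls what happens to years present in only one dataset.

```python
import pandas as pd

# Invented values; the two tables overlap only on 1981 and 1982.
uk = pd.DataFrame({'year': [1980, 1981, 1982], 'rain_octsep': [1182, 1098, 1156]})
jpn = pd.DataFrame({'year': [1981, 1982, 1983], 'jpn_rainfall': [1600, 1700, 1650]})

inner = uk.merge(jpn, on='year')               # default how='inner': matching years only
implicit = uk.merge(jpn)                       # no 'on': uses the shared 'year' column
outer = uk.merge(jpn, on='year', how='outer')  # keeps unmatched years, filled with NaN

print(inner)
print(len(outer))
```

With the default inner join, rows without a partner are dropped silently, so it is worth comparing row counts before and after a merge.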
Quick plotting with Pandas
Matplotlib is great, but producing a halfway decent chart takes quite a bit of code, and sometimes you just want to knock out a rough plot to explore the data and figure out what it means. Pandas covers this with plot:

```python
# Using pandas to quickly plot graphs
uk_jpn_rain.plot(x='year', y=['rain_octsep', 'jpn_rainfall'])
```

This calls Matplotlib to draw a plot of your data quickly and easily. With the plot you can analyze the data visually, and it can give you some direction while exploring. Looking at the plot of my data, for instance, you can see that there seems to have been a drought in the UK in 1995.
You can also see that rainfall in the UK is clearly lower than in Japan, yet people say it always rains in the UK.
Saving your dataset
After cleaning, reshaping and exploring your data, the final dataset may look very different from what you started with, and be more useful. You should keep the original dataset, but you should also save the processed data.

```python
# Saving your data to a csv
df.to_csv('uk_rain.csv')
```

The code above saves your data to a csv file so that it is ready for next time.
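One detail worth knowing is that to_csv also writes the row index by default. The sketch below writes to an in-memory buffer instead of a file so that the output can be inspected, and uses index=False to leave the row labels out; the data is invented.

```python
import io
import pandas as pd

demo = pd.DataFrame({'year': [1980, 1981], 'rain': [1182, 1098]})

buffer = io.StringIO()
demo.to_csv(buffer, index=False)  # leave the 0, 1, ... row labels out of the file
csv_text = buffer.getvalue()

# Reading the text back should reproduce the original dataframe.
reloaded = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(reloaded.equals(demo))
```

If the index carries real information, such as water_year after set_index, keep the default index=True so that it is saved as well.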
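That wraps up our introduction to Pandas. As I said before, Pandas is extremely powerful and we have only scratched the surface, but you should now know enough to start cleaning and exploring data.
As always, I suggest you practice on a dataset that interests you. Sit down, grab a beer and dig into the data. It really is the only way to get familiar with Pandas and the other libraries in this series. And you might just find something interesting.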