Impressive: Querying Data in Pandas with SQL, Highly Efficient
Today we continue looking at how Pandas and SQL work together. We can filter data in Pandas with SQL statements by using the pandasql module. First, let's install it:

pip install pandasql

If you are working in Jupyter Notebook, you can install it from a cell instead:

!pip install pandasql

Importing the data
We start by importing the data:

import pandas as pd
from pandasql import sqldf

df = pd.read_csv("Dummy_Sales_Data_v1.csv", sep=",")
df.head()
Next, we take a quick exploratory look at the imported dataset:

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 12 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   OrderID              9999 non-null   int64
 1   Quantity             9999 non-null   int64
 2   UnitPrice(USD)       9999 non-null   int64
 3   Status               9999 non-null   object
 4   OrderDate            9999 non-null   object
 5   Product_Category     9963 non-null   object
 6   Sales_Manager        9999 non-null   object
 7   Shipping_Cost(USD)   9999 non-null   int64
 8   Delivery_Time(Days)  9948 non-null   float64
 9   Shipping_Address     9999 non-null   object
 10  Product_Code         9999 non-null   object
 11  OrderCode            9999 non-null   int64
dtypes: float64(1), int64(5), object(6)
memory usage: 937.5+ KB

Before filtering any further, we rename the three columns whose names contain parentheses, since parentheses are awkward to use in SQL column names:

df.rename(
    columns={
        "Shipping_Cost(USD)": "ShippingCost_USD",
        "UnitPrice(USD)": "UnitPrice_USD",
        "Delivery_Time(Days)": "Delivery_Time_Days",
    },
    inplace=True,
)
df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9999 entries, 0 to 9998
Data columns (total 12 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   OrderID             9999 non-null   int64
 1   Quantity            9999 non-null   int64
 2   UnitPrice_USD       9999 non-null   int64
 3   Status              9999 non-null   object
 4   OrderDate           9999 non-null   object
 5   Product_Category    9963 non-null   object
 6   Sales_Manager       9999 non-null   object
 7   ShippingCost_USD    9999 non-null   int64
 8   Delivery_Time_Days  9948 non-null   float64
 9   Shipping_Address    9999 non-null   object
 10  Product_Code        9999 non-null   object
 11  OrderCode           9999 non-null   int64
dtypes: float64(1), int64(5), object(6)
memory usage: 937.5+ KB

Selecting columns with SQL
Let's first select a handful of columns such as OrderID, Quantity, Sales_Manager and Status. In plain SQL the statement looks like this:

SELECT OrderID, Quantity, Sales_Manager,
       Status, Shipping_Address, ShippingCost_USD
FROM df

Used together with Pandas, it is written like this:

query = """
SELECT OrderID, Quantity, Sales_Manager,
       Status, Shipping_Address, ShippingCost_USD
FROM df
"""
df_orders = sqldf(query)
df_orders.head()
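pandasql runs its queries on SQLite by default, so the same SELECT can be tried outside Pandas with the standard-library sqlite3 module. The following is only a minimal sketch: the two-row `df` table and its values are invented for illustration and are not taken from Dummy_Sales_Data_v1.csv.

```python
import sqlite3

# Hypothetical stand-in for the article's df; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df "
    "(OrderID INTEGER, Quantity INTEGER, Sales_Manager TEXT, "
    "Status TEXT, Product_Code TEXT)"
)
conn.executemany(
    "INSERT INTO df VALUES (?, ?, ?, ?, ?)",
    [(1, 10, "Ann", "Shipped", "P-1"), (2, 25, "Bob", "Delivered", "P-2")],
)

# Only the columns named in SELECT come back; Product_Code is left out.
cursor = conn.execute("SELECT OrderID, Quantity, Sales_Manager, Status FROM df")
selected_columns = [col[0] for col in cursor.description]
rows = cursor.fetchall()
print(selected_columns)
print(rows)
conn.close()
```

The result keeps every row but only the listed columns, which is what `sqldf(query)` hands back as a DataFrame in the article.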
Filtering with a WHERE clause
We can add a condition to the SQL statement to filter the data:

query = """
SELECT *
FROM df_orders
WHERE Shipping_Address = 'Kenya'
"""
df_kenya = sqldf(query)
df_kenya.head()

If there is more than one condition, join them with AND:

query = """
SELECT *
FROM df_orders
WHERE Shipping_Address = 'Kenya'
  AND Quantity < 40
  AND Status IN ('Shipped', 'Delivered')
"""
df_kenya = sqldf(query)
df_kenya.head()
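To see how the three conditions combine, here is a small sketch that runs the same WHERE clause with the standard-library sqlite3 module (SQLite is the engine pandasql uses by default). The five rows are invented so that each one either passes or fails exactly one condition; they are not real data from the CSV.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df_orders "
    "(OrderID INTEGER, Quantity INTEGER, Status TEXT, Shipping_Address TEXT)"
)
conn.executemany(
    "INSERT INTO df_orders VALUES (?, ?, ?, ?)",
    [
        (1, 10, "Shipped", "Kenya"),      # passes all three conditions
        (2, 50, "Shipped", "Kenya"),      # fails Quantity < 40
        (3, 20, "Delivered", "Kenya"),    # passes all three conditions
        (4, 15, "Not Shipped", "Kenya"),  # fails the IN list
        (5, 5, "Shipped", "India"),       # fails the address check
    ],
)

query = """
SELECT OrderID
FROM df_orders
WHERE Shipping_Address = 'Kenya'
  AND Quantity < 40
  AND Status IN ('Shipped', 'Delivered')
ORDER BY OrderID
"""
matching_ids = [row[0] for row in conn.execute(query)]
print(matching_ids)
conn.close()
```

A row is kept only when every AND-ed condition holds, so orders 1 and 3 are the only ones returned.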
Grouping
In the same way, we can use SQL's GROUP BY to group the selected data:

query = """
SELECT Shipping_Address,
       COUNT(OrderID) AS Orders
FROM df_orders
GROUP BY Shipping_Address
"""
df_group = sqldf(query)
df_group.head(10)
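One detail worth knowing when counting per group: COUNT(column) skips NULL values, while COUNT(*) counts every row. This matters for columns such as Product_Category, which has missing values in this dataset. A minimal sketch with the standard-library sqlite3 module, using three invented rows (one with a missing category):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE df (OrderID INTEGER, Shipping_Address TEXT, Product_Category TEXT)"
)
conn.executemany(
    "INSERT INTO df VALUES (?, ?, ?)",
    [
        (1, "Kenya", "Books"),
        (2, "Kenya", None),   # missing category, stored as NULL
        (3, "India", "Toys"),
    ],
)

query = """
SELECT Shipping_Address,
       COUNT(*) AS Orders,
       COUNT(Product_Category) AS OrdersWithCategory
FROM df
GROUP BY Shipping_Address
ORDER BY Shipping_Address
"""
grouped = conn.execute(query).fetchall()
print(grouped)
conn.close()
```

Kenya has two orders but only one with a category, so the two counts differ for that group. OrderID has no missing values in the article's data, which is why COUNT(OrderID) there equals the number of rows per address.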
Sorting
Sorting in SQL is done with ORDER BY:

query = """
SELECT Shipping_Address,
       COUNT(OrderID) AS Orders
FROM df_orders
GROUP BY Shipping_Address
ORDER BY Orders
"""
df_group = sqldf(query)
df_group.head(10)
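ORDER BY sorts in ascending order unless told otherwise, so the query above lists the addresses with the fewest orders first. Adding DESC reverses that. A small sketch with the standard-library sqlite3 module; the six rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df_orders (OrderID INTEGER, Shipping_Address TEXT)")
conn.executemany(
    "INSERT INTO df_orders VALUES (?, ?)",
    [(1, "Kenya"), (2, "Kenya"), (3, "Kenya"),
     (4, "Egypt"), (5, "Egypt"), (6, "India")],
)

base = """
SELECT Shipping_Address, COUNT(OrderID) AS Orders
FROM df_orders
GROUP BY Shipping_Address
ORDER BY Orders
"""
ascending = conn.execute(base).fetchall()             # default: smallest first
descending = conn.execute(base + " DESC").fetchall()  # largest first
print(ascending)
print(descending)
conn.close()
```

The alias Orders defined in the SELECT list can be used directly in ORDER BY, as the article's query does.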
Merging data
We first create a second dataset that we will merge with the first one:

query = """
SELECT OrderID,
       Quantity,
       Product_Code,
       Product_Category,
       UnitPrice_USD
FROM df
"""
df_products = sqldf(query)
df_products.head()

Here we want the intersection of the two datasets, so we use INNER JOIN:

query = """
SELECT T1.OrderID,
       T1.Shipping_Address,
       T2.Product_Category
FROM df_orders T1
INNER JOIN df_products T2
ON T1.OrderID = T2.OrderID
"""
df_combined = sqldf(query)
df_combined.head()
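In the article both tables come from the same df, so every OrderID finds a partner. To make the "intersection" behavior of INNER JOIN visible, the sketch below uses two small invented tables whose OrderIDs only partly overlap, again with the standard-library sqlite3 module standing in for pandasql's default SQLite engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df_orders (OrderID INTEGER, Shipping_Address TEXT)")
conn.execute("CREATE TABLE df_products (OrderID INTEGER, Product_Category TEXT)")
conn.executemany(
    "INSERT INTO df_orders VALUES (?, ?)",
    [(1, "Kenya"), (2, "India"), (3, "Egypt")],   # order 3 has no product row
)
conn.executemany(
    "INSERT INTO df_products VALUES (?, ?)",
    [(1, "Books"), (2, "Toys"), (4, "Games")],    # product 4 has no order row
)

query = """
SELECT T1.OrderID, T1.Shipping_Address, T2.Product_Category
FROM df_orders T1
INNER JOIN df_products T2
ON T1.OrderID = T2.OrderID
ORDER BY T1.OrderID
"""
combined = conn.execute(query).fetchall()
print(combined)
conn.close()
```

Only OrderIDs 1 and 2 appear in both tables, so order 3 and product 4 are dropped from the result.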
Using LIMIT
LIMIT in SQL restricts the number of rows a query returns. To see the first 10 rows of the result:

query = """
SELECT OrderID, Quantity, Sales_Manager,
       Status, Shipping_Address,
       ShippingCost_USD
FROM df
LIMIT 10
"""
df_orders_limit = sqldf(query)
df_orders_limit
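LIMIT on its own does not promise which rows come back, so it is usually paired with ORDER BY, and SQLite also accepts an OFFSET to step through later "pages" of the result. A minimal sketch with the standard-library sqlite3 module, assuming an invented seven-row table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE df (OrderID INTEGER, Quantity INTEGER)")
conn.executemany(
    "INSERT INTO df VALUES (?, ?)",
    [(order_id, order_id * 10) for order_id in range(1, 8)],  # OrderIDs 1..7
)

# First three rows in a well-defined order.
first_page = [
    row[0]
    for row in conn.execute("SELECT OrderID FROM df ORDER BY OrderID LIMIT 3")
]
# Skip the first three rows, then take the next three.
second_page = [
    row[0]
    for row in conn.execute(
        "SELECT OrderID FROM df ORDER BY OrderID LIMIT 3 OFFSET 3"
    )
]
print(first_page)
print(second_page)
conn.close()
```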