Nothing Fancy, but 16 Essential Operations to Get You Started With Pandas
Python has become the go-to programming language for many data scientists and machine learning researchers. One essential data processing tool behind that choice is the pandas library. It is versatile enough to handle almost all of the initial data manipulation needed to get data ready for statistical analysis or for building machine learning models.
However, that same versatility can make it overwhelming to start using pandas comfortably. If you're struggling to get started, this article is for you. Rather than covering so much that the main points get lost, it gives an overview of the key operations you'll want in your daily data processing tasks. Each operation comes with highlights of the essential parameters to consider.
The first step is to install pandas, which you can do with pip or conda by following the official installation instructions. For a coding environment, I recommend Visual Studio Code, JupyterLab, or Google Colab, all of which take little effort to set up. Once the installation is done, you should be able to import pandas into your project in your preferred coding environment.
import pandas as pd

If you run the above code without encountering any errors, you're good to go.
1. Read External Data
In most cases, we read data from external sources. If the data are in a spreadsheet-like format, the following functions should serve the purpose.
# Read a comma-separated file
df = pd.read_csv("the_data.csv")

# Read an Excel spreadsheet
df = pd.read_excel("the_data.xlsx")

The header should be handled correctly. By default, the first line of the file is taken as the column names. If there is no header, you have to say so (e.g., header=None).
If you're reading a tab-delimited file, you can use read_csv by specifying the tab as the delimiter (e.g., sep="\t").
When you read a large file, it's a good idea to start by reading a small portion of the data. In this case, you can set the number of rows to be read (e.g., nrows=1000).
If your data involve dates, consider setting arguments that get the dates right, such as parse_dates and infer_datetime_format (the latter is deprecated as of pandas 2.0, which infers a consistent format by default).
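The reading options above can be combined in a single call. The following is a minimal sketch, assuming a small made-up tab-delimited text with no header row; the column names "date" and "value" are hypothetical and only serve the illustration.

```python
import io

import pandas as pd

# Hypothetical tab-delimited data without a header line (made up for illustration)
raw = "2021-01-01\t3.5\n2021-01-02\t4.0\n2021-01-03\t2.5\n"

df = pd.read_csv(
    io.StringIO(raw),          # a file path would work the same way
    sep="\t",                  # tab as the delimiter
    header=None,               # the data has no header line
    names=["date", "value"],   # so we supply the column names ourselves
    nrows=2,                   # read only the first two rows
    parse_dates=["date"],      # convert the "date" column to datetime
)
print(df.dtypes)
```

Reading from an in-memory string keeps the sketch self-contained; in practice the first argument is usually a path to the file.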
2. Create Series
While cleaning up your data, you may need to create Series objects yourself. In most cases, you'll simply pass an iterable to create a Series object.
# Create a Series from an iterable
integers_s = pd.Series(range(10))

# Create a Series from a dictionary object
squares = {x: x*x for x in range(1, 5)}
squares_s = pd.Series(squares)
You can assign a name to the Series object by setting the name argument. This name becomes the column name if the Series becomes part of a DataFrame object.
You can also assign an index to the Series (by setting the index argument) if you find it more useful than the default 0-based index. Note that the index's length should match your data's length.
If you create a Series object from a dict, the keys will become the index.
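As a small sketch of the name and index arguments described above (the values and the name "score" are made up for illustration):

```python
import pandas as pd

# A Series with a custom index and a name (illustrative values)
letters_s = pd.Series([10, 20, 30], index=["a", "b", "c"], name="score")

# A Series built from a dict: the keys become the index
squares_s = pd.Series({x: x * x for x in range(1, 5)})

# When the Series is turned into a DataFrame, its name becomes the column name
score_df = letters_s.to_frame()
print(score_df)
```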
3. Construct DataFrame
Oftentimes, you need to create DataFrame objects from Python built-in objects, such as lists and dictionaries. The following code snippet highlights two common use scenarios.
# Create a DataFrame from a dictionary of lists as values
data_dict = {'a': [1, 2, 3], 'b': [4, 5, 6], 'c': [7, 8, 9]}
data_df0 = pd.DataFrame(data_dict)

# Create a DataFrame from a list
data_list = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
data_df1 = pd.DataFrame(data_list, columns=tuple('abc'))
The first one uses a dict object. Its keys become the column names, while its values become the values of the corresponding columns.
The second one uses a list of lists. Unlike the previous method, the constructed DataFrame uses the data row-wise, which means that each inner list becomes a row of the created DataFrame object.
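To see the column-wise versus row-wise difference concretely, the sketch below builds the same table both ways and compares the results. The variable names from_dict and from_rows are introduced here only for illustration.

```python
import pandas as pd

data_dict = {"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}   # one list per column
data_list = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]                  # one list per row

from_dict = pd.DataFrame(data_dict)
from_rows = pd.DataFrame(data_list, columns=["a", "b", "c"])

# Both constructions should describe the same table
print(from_dict.equals(from_rows))
```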
4. Overview of DataFrame
Once we have a DataFrame to work with, we may want to look at the dataset from the 30,000-foot level. There are several common ways to do so, as shown below.
# Find out how many rows and columns the DataFrame has
df.shape

# Take a quick peek at the beginning and the end of the data
df.head()
df.tail()

# Get a random sample
df.sample(5)

# Get the information of the dataset
df.info()

# Get the descriptive stats of the numeric values
df.describe()
It's important to check both the head and the tail, especially if you work with a large amount of data, because you want to make sure that all of your data have been read completely.
The info() function gives you an overview of the columns in terms of data types and non-null item counts.
It's also useful to draw a random sample of your data with the sample() function to check your data's integrity.
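The overview calls can be tried on any DataFrame. Here is a minimal sketch on a made-up 100-row frame; the random_state argument is added only so that the sample is reproducible.

```python
import pandas as pd

# A made-up DataFrame with 100 rows and 2 columns
df = pd.DataFrame({"col0": range(100), "col1": [x % 3 for x in range(100)]})

n_rows, n_cols = df.shape               # (100, 2)
first_rows = df.head(3)                 # the first three rows
last_rows = df.tail(3)                  # the last three rows
sampled = df.sample(5, random_state=0)  # five random rows, reproducible
summary = df.describe()                 # count, mean, std, min, quartiles, max

df.info()                               # prints dtypes and non-null counts
```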
5. Rename Columns
You may notice that some column names don't make much sense or are too long to work with, and you want to rename these columns.
# Rename columns using mapping
df.rename({'old_col0': 'col0', 'old_col1': 'col1'}, axis=1)

# Rename columns by specifying columns directly
df.rename(columns={'old_col0': 'col0', 'old_col1': 'col1'})

You need to specify axis=1 to rename columns if you simply provide a mapping object (e.g., dict).
Alternatively, you can pass the mapping object to the columns argument explicitly.
The rename function creates a new DataFrame by default. If you want to rename the columns in place, you need to specify inplace=True.
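A short sketch of the default (non in-place) behavior described above, using a made-up two-column DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"old_col0": [1, 2], "old_col1": [3, 4]})

# rename returns a new DataFrame; the original keeps its column names
renamed = df.rename(columns={"old_col0": "col0", "old_col1": "col1"})

print(list(df.columns))
print(list(renamed.columns))
```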
6. Sort Data
To give your data more structure, you often need to sort the DataFrame object.
# Sort data
df.sort_values(by=['col0', 'col1'])

By default, the sort_values function sorts the rows (axis=0). In most cases, we use columns as the sorting keys.
The sort_values function creates a new sorted DataFrame object by default. To sort the original in place instead, use inplace=True.
By default, all sorting keys are sorted in ascending order. If you want descending order, specify ascending=False. If you want mixed orders (e.g., some keys ascending and some descending), you can pass a list of boolean values that matches the number of keys, like by=['col0', 'col1', 'col2'], ascending=[True, False, True].
The original index labels stay with their data rows. In many cases, you need to re-index. Instead of calling the reset_index function afterwards, you can set ignore_index to True, which resets the index for you once the sorting is completed.
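The mixed-order sorting and the ignore_index argument can be sketched as follows, on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"col0": [2, 1, 2, 1], "col1": [10, 20, 30, 40]})

# col0 ascending, then col1 descending within each col0 value;
# ignore_index=True relabels the rows 0..n-1 after sorting
sorted_df = df.sort_values(
    by=["col0", "col1"],
    ascending=[True, False],
    ignore_index=True,
)
print(sorted_df)
```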
7. Deal With Duplicates
Real-life datasets commonly contain duplicate records, caused either by human mistakes or by database glitches. We want to remove these duplicates because they can cause unexpected problems later on.
# To examine whether there are duplicates using all columns
df.duplicated().any()

# To examine whether there are duplicates using particular columns
df.duplicated(['col0', 'col1']).any()

The calls above return a boolean value telling you whether any duplicate records exist in your dataset. To find out the exact number of duplicate records, you can call sum() on the Series of boolean values returned by duplicated() (Python treats True as 1), as shown below. One additional thing to note is the keep argument. When keep=False, every record that has a duplicate is marked as True. Suppose that three records are identical: with keep=False, all three are marked as True. With keep="first" (the default) or keep="last", every duplicate except the first or the last occurrence, respectively, is marked as True.

# To find out the number of duplicates
df.duplicated().sum()
df.duplicated(keep=False).sum()
To actually view the duplicated records, you need to select data from the original dataset using the boolean Series returned by duplicated(), as shown below.

# Get the duplicate records
duplicated_indices = df.duplicated(['col0', 'col1'], keep=False)
duplicates = df.loc[duplicated_indices, :].sort_values(by=['col0', 'col1'], ignore_index=True)

We get all duplicate records by setting the keep argument to False.
To better view the duplicate records, you may want to sort the resulting DataFrame using the same set of keys.
Once you have a good idea about the duplicate records in your dataset, you can drop them as shown below.
# Drop the duplicate records
df.drop_duplicates(['col0', 'col1'], keep="first", inplace=True, ignore_index=True)

By default, the kept record will be the first of the duplicates.
You'll need to specify inplace=True if you want to update the DataFrame in place. By the way, many other functions have this option, and I'll skip discussing it most of the time.
As with the sort_values() function, you may want to reset the index afterwards by specifying the ignore_index argument (a new feature in pandas 1.0).
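The effect of the keep argument is easiest to see on a tiny example. The sketch below uses a made-up DataFrame in which three of the four rows are identical.

```python
import pandas as pd

# Three identical rows plus one distinct row (made up for illustration)
df = pd.DataFrame({"col0": ["a", "a", "a", "b"], "col1": [1, 1, 1, 2]})

# Default keep="first": the first occurrence is not flagged, the later two are
n_extra = df.duplicated().sum()

# keep=False: every row that has a duplicate is flagged
n_all = df.duplicated(keep=False).sum()

# Keep the first occurrence of each record and relabel the rows
deduped = df.drop_duplicates(keep="first", ignore_index=True)
print(n_extra, n_all, len(deduped))
```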
8. Handle Missing Data
Missing data are common in real-life datasets, either because measures were not available or because human data entry mistakes produced meaningless values that are treated as missing. For an overall idea of how many missing values your dataset has, you've seen that the info() function tells us how many non-null values each column has. We can get information about missingness in a more structured fashion, as shown below.
# Find out how many missing values for each column
df.isnull().sum()

# Find out how many missing values for the entire dataset
df.isnull().sum().sum()

The isnull() function creates a DataFrame of the same shape as your original DataFrame, with each value indicating whether the original value is missing (True) or not (False). As a related note, you can use the notnull() function if you want a DataFrame indicating the non-null values.
As mentioned previously, a True value in Python is arithmetically equal to 1. The sum() function computes the sum of these boolean values for each column (by default, it sums column-wise), which gives the number of missing values per column.
With some idea of the missingness in your dataset, you usually want to deal with it. Possible solutions include dropping the records with any missing values or filling them with applicable values.
# Drop the rows with any missing values
df.dropna(axis=0, how="any")

# Drop the rows that have fewer than 2 non-null values
df.dropna(thresh=2)

# Drop the columns with all values missing
df.dropna(axis=1, how="all")

By default, the dropna() function drops rows (i.e., axis=0). With how="any", which is the default, rows with any missing values will be dropped.
When you set the thresh argument, a row (or a column when axis=1) is kept only if it has at least that many non-missing values.
As with many other functions, when you set axis=1, the operation applies to columns. In this case, the last call above removes the columns that have all of their values missing.
Besides dropping data rows or columns with missing values, it's also possible to fill the missing values with some values, as shown below.
# Fill missing values with 0 or any other applicable value
df.fillna(value=0)

# Fill the missing values with customized mapping for columns
df.fillna(value={"col0": 0, "col1": 999})

# Fill missing values with the next valid observation
df.fillna(method="bfill")

# Fill missing values with the last valid observation
df.fillna(method="ffill")

To fill the missing values with specified values, set the value argument either to a fixed value for all columns or to a dict object that specifies the fill value for each column.
Alternatively, you can fill the missing values using the existing observations surrounding the gaps, with either a back fill or a forward fill. In pandas 2.1 and later, the method argument of fillna() is deprecated in favor of the equivalent df.bfill() and df.ffill() methods.
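The counting, dropping, and filling operations from this section can be sketched together on a small made-up DataFrame. The sketch uses df.ffill() and df.bfill() rather than fillna(method=...) so that it also runs on recent pandas versions.

```python
import numpy as np
import pandas as pd

# A made-up DataFrame with three missing values
df = pd.DataFrame({"col0": [1.0, np.nan, 3.0], "col1": [np.nan, np.nan, 6.0]})

missing_per_column = df.isnull().sum()   # number of missing values per column
rows_complete = df.dropna()              # keep only rows without missing values

# Fill with a per-column constant
filled_const = df.fillna(value={"col0": 0, "col1": 999})

# Fill from neighboring observations
filled_forward = df.ffill()    # carry the last valid observation forward
filled_backward = df.bfill()   # pull the next valid observation backward
```

Note that a forward fill cannot fill leading gaps: the first two values of col1 stay missing because no earlier observation exists.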
9. Descriptive Statistics by Group
When you conduct machine learning research or data analysis, you often need to perform particular operations by some grouping variables. In this case, we need to use the groupby() function. The following code snippet shows you some common scenarios.
# Get the count by group, a 2 by 2 example
df.groupby(['col0', 'col1']).size()

# Get the mean of all applicable columns by group
# (numeric_only=True skips non-numeric columns)
df.groupby(['col0']).mean(numeric_only=True)

# Get the mean for a particular column
df.groupby(['col0'])['col1'].mean()

# Request multiple descriptive stats
df.groupby(['col0', 'col1']).agg({
    'col2': ['min', 'max', 'mean'],
    'col3': ['nunique', 'mean']
})
The groupby() function itself returns a GroupBy object, and the aggregated result uses the grouping keys as its index. If you want the keys back as regular columns, you can call reset_index() on the result. Alternatively, you can specify as_index=False in the groupby() function call to get them as columns directly.
The size() function is useful if you want to know the frequency of each group.
The agg() function allows you to generate multiple descriptive statistics. You can simply pass a list of function names, which will apply to all columns. Alternatively, you can pass a dict object that maps specific columns to the functions to apply.
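A minimal sketch of size(), as_index=False, and a dict-based agg() on a made-up DataFrame follows; the column names match the placeholders used above, and the values are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "col0": ["x", "x", "y", "y"],
    "col1": [1, 2, 3, 4],
    "col2": [10.0, 20.0, 30.0, 40.0],
})

# Frequency of each group
sizes = df.groupby("col0").size()

# as_index=False keeps the grouping key as a regular column
means = df.groupby("col0", as_index=False)["col2"].mean()

# Different statistics for different columns; the result has two-level column labels
stats = df.groupby("col0").agg({"col1": ["min", "max"], "col2": ["mean"]})
print(stats)
```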
10. Wide to Long Format Transformation
Depending on how the data are collected, the original dataset may be in the "wide" format, where each row represents a data record with multiple measures (e.g., different time points for a subject in a research study). If we want to convert the "wide" format to the "long" format (e.g., each time point becomes a data row and thus a subject has multiple rows), we can use the melt() function, as shown below.
[Figure: Wide to Long Transformation]
The melt() function is essentially "unpivoting" a data table (we'll talk about pivoting next). You specify id_vars as the columns that serve as identifiers in the original dataset.
The value_vars argument is set to the columns that contain the values. By default, the names of these columns become the values of the var_name column in the melted dataset.
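As a hedged sketch of melt(), the example below assumes a wide DataFrame with the columns subject, before_meds, and after_meds that appear later in this article. The data values and the names time_point and measure are made up for illustration.

```python
import pandas as pd

# A made-up wide dataset: one row per subject, one column per time point
df_wide = pd.DataFrame({
    "subject": [100, 101, 102],
    "before_meds": [7, 8, 9],
    "after_meds": [4, 5, 6],
})

df_long = df_wide.melt(
    id_vars="subject",                          # identifier column(s)
    value_vars=["before_meds", "after_meds"],   # columns holding the measures
    var_name="time_point",                      # new column with the old column names
    value_name="measure",                       # new column with the values
)
print(df_long)
```

Each subject now occupies two rows, one per time point.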
11. Long to Wide Format Transformation
The opposite operation to melt() is called pivoting, which we can carry out with the pivot() function. Suppose that the "long" format DataFrame created in the previous section is called df_long. The following code shows how we can convert the long format to the wide format, basically reversing the process from the previous section.
[Figure: Long to Wide Transformation]
Besides the pivot() function, a closely related function is pivot_table(), which is more general than pivot() because it allows duplicate index or column values by aggregating them (see the pandas documentation for a more detailed discussion).
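The reverse transformation can be sketched with pivot(). The long DataFrame below is made up, and its column names (time_point, measure) are assumptions chosen for illustration.

```python
import pandas as pd

# A made-up long dataset: one row per subject and time point
df_long = pd.DataFrame({
    "subject": [100, 100, 101, 101],
    "time_point": ["before_meds", "after_meds", "before_meds", "after_meds"],
    "measure": [7, 4, 8, 5],
})

df_wide = df_long.pivot(
    index="subject",        # one row per subject
    columns="time_point",   # one column per time point
    values="measure",       # cell values
).reset_index()             # turn the subject index back into a column

df_wide.columns.name = None  # drop the leftover "time_point" label on the columns
print(df_wide)
```

pivot() raises an error if a subject has two rows for the same time point; pivot_table() would aggregate them instead.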
12. Select Data
When we work with a complex dataset, we need to select a subset of it for particular operations based on some criteria. If you want to select some columns, the following code shows you how to do it. The selected data will include all the rows.
# Select a column
df_wide['subject']

# Select multiple columns
df_wide[['subject', 'before_meds']]

If you want to select certain rows with all columns, do the following.

# Select rows with a specific condition
df_wide[df_wide['subject'] == 100]

If you want to select certain rows and columns, consider using the iloc or loc indexers. The major difference between them is that iloc uses 0-based positions, while loc uses labels.
[Figure: Data Selection]
The above pairs of calls create the same output. For clarity, only one output is listed.
When you use slice objects with iloc, the stop index isn't included, just as with regular Python slice objects. However, slice objects include the stop index with loc. See Lines 15–17.
As noted in Line 22, when you use a boolean array with iloc, you need to pass the actual values (using the values attribute, which returns the underlying NumPy array). If you don't do that, you'll probably encounter the following error: NotImplementedError: iLocation based boolean indexing on an integer type is not available.
In this example, the labels used with loc happen to match the positions used with iloc when selecting rows, because the default index labels are the same as the 0-based positions. In general, iloc always selects by 0-based position, regardless of the values of the index.
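The sketch below illustrates the loc and iloc points on a made-up DataFrame. It deliberately uses non-default index labels (10, 20, 30, 40), an assumption made here so that labels and positions no longer coincide.

```python
import pandas as pd

df = pd.DataFrame(
    {"subject": [100, 101, 102, 103], "before_meds": [7, 8, 9, 6]},
    index=[10, 20, 30, 40],   # labels differ from the 0-based positions
)

# iloc: positions 1 and 2 (the stop position 3 is excluded)
by_position = df.iloc[1:3, 0:2]

# loc: labels 20 through 30 (the stop label is included)
by_label = df.loc[20:30, ["subject", "before_meds"]]

# Boolean selection: loc accepts the boolean Series directly,
# while iloc is given the underlying NumPy array via .values
mask = df["before_meds"] > 6
masked_loc = df.loc[mask, "subject"]
masked_iloc = df.iloc[mask.values, 0]
```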
13. New Columns Using Existing Data (map and apply)
Existing columns don't always present the data in the format we want. Thus, we often need to generate new columns using existing data. Two functions are particularly useful in this case: map() and apply(). There are too many ways to use them to cover here. For instance, apply() can take a more complex mapping function and can create multiple columns. I'll just show you the two most common use cases with the following rules of thumb. Let's keep our goal simple: create just one column in either use case.
If your data conversion involves just one column, simply use the map() function on the column (in essence, it's a Series object).
If your data conversion involves multiple columns, use the apply() function.
In both cases, I used lambda functions. However, you can use regular functions. It's also possible to provide a dict object to the map() function, which maps the old values to the new values based on the key-value pairs, with the keys being the old values and the values being the new values.
For the apply() function, when we create a new column, we need to specify axis=1, because we're accessing data row-wise.
For the apply() function, the example shown is intended for demonstration purposes, because I could've used the original columns to do a simpler arithmetic subtraction like this: df_wide['change'] = df_wide['before_meds'] - df_wide['after_meds'].
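The two rules of thumb can be sketched as follows. The DataFrame values and the new column names subject_id and group are hypothetical, chosen only to show map() with a lambda, map() with a dict, and a row-wise apply().

```python
import pandas as pd

df_wide = pd.DataFrame({
    "subject": [100, 101],
    "before_meds": [7, 8],
    "after_meds": [4, 6],
})

# One source column: map() with a lambda function
df_wide["subject_id"] = df_wide["subject"].map(lambda x: f"#{x}")

# One source column: map() with a dict (old value -> new value)
df_wide["group"] = df_wide["subject"].map({100: "control", 101: "treatment"})

# Several source columns: apply() row by row, hence axis=1
df_wide["change"] = df_wide.apply(
    lambda row: row["before_meds"] - row["after_meds"], axis=1
)
print(df_wide)
```

For plain arithmetic like this, the vectorized subtraction of the two columns is simpler and faster than apply().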
14. Concatenation and Merging
When we have multiple datasets, it's necessary to put them together from time to time. There are two common scenarios. The first is when you have datasets of similar shape, sharing either the same index or the same columns, in which case you can consider concatenating them directly. The following code shows you some possible concatenations.
# When the data have the same columns, concatenate them vertically
dfs_a = [df0a, df1a, df2a]
pd.concat(dfs_a, axis=0)

# When the data have the same index, concatenate them horizontally
dfs_b = [df0b, df1b, df2b]
pd.concat(dfs_b, axis=1)

By default, the concatenation performs an "outer" join, which means that if there are any non-overlapping index labels or columns, all of them will be kept. In other words, it's like creating a union of two sets.
Another thing to remember is that if you need to concatenate multiple DataFrame objects, it's recommended to collect them in a list and concatenate just once, which avoids generating the intermediate DataFrame objects that sequential concatenation would create.
If you want to reset the index for the concatenated DataFrame, you can set the ignore_index=True argument.
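A minimal sketch of vertical and horizontal concatenation on made-up DataFrames follows; the variable names follow the placeholders used above.

```python
import pandas as pd

df0a = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df1a = pd.DataFrame({"a": [5], "b": [6]})

# Same columns: stack vertically and relabel the rows 0..n-1
stacked = pd.concat([df0a, df1a], axis=0, ignore_index=True)

# Same index: place side by side
df0b = pd.DataFrame({"c": [7, 8]})
side_by_side = pd.concat([df0a, df0b], axis=1)

print(stacked)
print(side_by_side)
```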
The other scenario is to merge datasets that share one or more identifiers. For instance, one DataFrame has id number, name and gender, and the other has id number and transaction records. You can merge them using the id number column. The following code shows you how to merge them.
# Merge DataFrames that have the same merging keys
df_a0 = pd.DataFrame(dict(), columns=['id', 'name', 'gender'])
df_b0 = pd.DataFrame(dict(), columns=['id', 'name', 'transaction'])
merged0 = df_a0.merge(df_b0, how="inner", on=["id", "name"])

# Merge DataFrames that have different merging keys
df_a1 = pd.DataFrame(dict(), columns=['id_a', 'name', 'gender'])
df_b1 = pd.DataFrame(dict(), columns=['id_b', 'transaction'])
merged1 = df_a1.merge(df_b1, how="outer", left_on="id_a", right_on="id_b")
When both DataFrame objects share the same key or keys, you can simply specify them (either one or multiple is fine) using the on argument.
When the keys have different names, you can specify which one to use for the left DataFrame (left_on) and which one for the right DataFrame (right_on).
By default, the merging uses the inner join method. When you want another join method (e.g., left, right, outer), set the proper value for the how argument.
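The difference between inner and outer joins shows up once the two DataFrames contain data. The sketch below uses made-up records with partly overlapping ids.

```python
import pandas as pd

# Made-up data: ids 2 and 3 appear in both DataFrames
df_a = pd.DataFrame({"id_a": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
df_b = pd.DataFrame({"id_b": [2, 3, 4], "transaction": [20.0, 30.0, 40.0]})

# Inner join: only ids present on both sides
inner = df_a.merge(df_b, how="inner", left_on="id_a", right_on="id_b")

# Outer join: all ids from either side, with missing values where there is no match
outer = df_a.merge(df_b, how="outer", left_on="id_a", right_on="id_b")

print(inner)
print(outer)
```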
15. Drop Columns
Although you can keep all the columns in the DataFrame by renaming them to avoid conflicts, sometimes you'd like to drop some columns to keep the dataset clean. In this case, you should use the drop() function.
# Drop the unneeded columns
df.drop(['col0', 'col1'], axis=1)

By default, the drop() function uses labels to refer to columns or index, and thus you may want to make sure that the labels are contained in the DataFrame object.
To drop rows by index label, use axis=0. To drop columns, which I find more common, use axis=1.
Again, this operation creates a new DataFrame object, and if you prefer changing the original DataFrame, specify inplace=True.
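As a brief sketch on a made-up DataFrame, the axis=1 form and the columns keyword of drop() give the same result:

```python
import pandas as pd

df = pd.DataFrame({"col0": [1, 2], "col1": [3, 4], "col2": [5, 6]})

# Drop columns by label with axis=1
trimmed = df.drop(["col0", "col1"], axis=1)

# The columns keyword expresses the same thing without axis
same = df.drop(columns=["col0", "col1"])

print(list(trimmed.columns))
```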
16. Write to External Files
When you want to share data with your collaborators or teammates, you need to write your DataFrame objects to external files. In most cases, comma-delimited files should serve the purpose.
# Write to a csv file, which will keep the index
df.to_csv("filename.csv")

# Write to a csv file without the index
df.to_csv("filename.csv", index=False)

# Write to a csv file without the header
df.to_csv("filename.csv", header=False)

By default, the generated file will keep the index. You need to specify index=False to remove the index from the output.
By default, the generated file will keep the header (e.g., column names). You need to specify header=False to remove the headers.
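To check what index=False does without touching the disk, the sketch below writes a made-up DataFrame to an in-memory buffer and reads it back. Using io.StringIO instead of a file name is a choice made here to keep the example self-contained.

```python
import io

import pandas as pd

df = pd.DataFrame({"col0": [1, 2], "col1": ["x", "y"]})

# Write without the index to an in-memory text buffer
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the text back to confirm that the columns survive the round trip
round_trip = pd.read_csv(io.StringIO(csv_text))
```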
Conclusion
In this article, we reviewed the basic operations that you'll find useful for getting started with the pandas library. As the article's title indicates, these techniques aren't intended to handle data in a fancy way. Instead, they're all basic techniques that let you process the data the way you want. Later on, you can probably find fancier ways to get some operations done.
Translated from: https://towardsdatascience.com/nothing-fancy-but-16-essential-operations-to-get-you-started-with-pandas-5b0c2f649068