Python: Handling Missing Values in a Data Frame

Published: 2023/11/29

Introduction

In the last article we went through how to find missing values in a data frame. The details are at this link: https://medium.com/@kallepalliravi/python-finding-missing-values-in-a-data-frame-3030aaf0e4fd

Now that you have identified all the missing values, what should you do with them? In this article we will go over how to handle missing data in a data frame.

There are multiple ways of handling missing data, and the right choice varies case by case. There is no universally best way of dealing with missing data. Use your best judgement and explore different options to determine which method suits your data set.

  • Deleting all rows/columns with missing data: This can be used when the majority of the data in a row or column is missing. Deleting rows/columns can discard valuable information and lead to biased models, so analyze your data before deleting and check whether there is a particular reason the data is missing.

  • Imputing data: This is by far the most common way of handling missing data. In this method you fill in a value where data is missing. Imputation can introduce bias into the data set, and it can be done in multiple ways:

    a. You can impute the mean, median or mode of a column into that column's missing values.

    b. You can use predictive algorithms to impute missing values.

    c. For categorical variables you can label missing data as a category of its own.

For this exercise we will use the Seattle Airbnb data set, which can be found at the link below. https://www.kaggle.com/airbnb/seattle?select=listings.csv

Load the data and find the missing values. The details of these steps can be found in the previous post: https://medium.com/@kallepalliravi/python-finding-missing-values-in-a-data-frame-3030aaf0e4fd

[Figure: Load the data file and check the structure of the data]
[Figure: % of missing data in each numerical column]
[Figure: % of missing data in categorical columns]
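The original screenshots for these steps are not available, so here is a minimal sketch of how the percentages could be computed. The helper name `missing_percent` and the tiny stand-in data frame are assumptions made for illustration; with the real file you would start from `pd.read_csv("listings.csv")` instead.

```python
import numpy as np
import pandas as pd


def missing_percent(df: pd.DataFrame) -> pd.Series:
    """Return the percentage of missing values per column, highest first."""
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Hypothetical stand-in for the listings data (values are made up).
df = pd.DataFrame(
    {
        "license": [np.nan, np.nan, np.nan, np.nan],
        "square_feet": [np.nan, np.nan, np.nan, 800.0],
        "price": [85.0, 150.0, 975.0, 100.0],
        "weekly_price": ["$500.00", np.nan, np.nan, "$650.00"],
    }
)

# Split the columns into numerical and non-numerical (categorical) groups.
numeric_df = df.select_dtypes(include="number")
categorical_df = df.select_dtypes(exclude="number")

print(df.info())                        # structure of the data
print(missing_percent(numeric_df))      # % missing per numerical column
print(missing_percent(categorical_df))  # % missing per categorical column
```

`isna().mean()` works because the mean of a boolean column is the fraction of `True` values.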

1. Deleting rows/columns with missing data

Deleting specific rows/columns

From the above you can see that, among the numerical columns, 100% of the values in the license column and 97% of the values in the square_feet column are missing.

60% of the values in monthly_price, 51% of the values in security_deposit and 47% of the values in weekly_price are missing.

Let's try deleting these 5 columns.

The pandas drop function can be used to delete rows and columns. Full details of this function can be found at the link below. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html

All columns that should be deleted go in the columns parameter. axis=1 refers to columns and axis=0 refers to rows. In this case we are telling pandas to delete all the columns specified in the columns parameter.

After the drop, the data frame no longer contains the deleted columns.
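As a rough sketch of the step described above, the snippet below drops the five columns from a small stand-in data frame. The data values are made up for illustration; only the column names come from the article. Note that when `columns=` is given, `axis` does not need to be passed; `df.drop(cols_to_drop, axis=1)` is the equivalent positional form.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the listings data (values are made up).
df = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "license": [np.nan, np.nan, np.nan],
        "square_feet": [np.nan, np.nan, 800.0],
        "monthly_price": [np.nan, "$3,000.00", np.nan],
        "security_deposit": ["$100.00", np.nan, np.nan],
        "weekly_price": [np.nan, "$650.00", np.nan],
        "price": ["$85.00", "$150.00", "$975.00"],
    }
)

cols_to_drop = [
    "license",
    "square_feet",
    "monthly_price",
    "security_deposit",
    "weekly_price",
]

# drop() returns a new data frame; df itself is left unchanged.
df_dropped = df.drop(columns=cols_to_drop)

print(df_dropped.columns.tolist())
```

Assigning the result to a new name keeps the original data frame around, which is handy while you are still deciding which columns are safe to delete.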

Deleting rows/columns with NA

If you want to delete rows/columns that contain NA, you can use the dropna function in pandas. Details of this function can be found at the link below. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

dropna has multiple parameters; the 3 main ones are:

  • how: this has 2 options, "any" or "all". With "any", a row or column is deleted if even one of its values is NA. With "all", it is deleted only if all of its values are NA.
  • axis: this can be set to 0 or 1. If 0, rows with NA values are dropped; if 1, columns with NA values are dropped.
  • subset: if you want the operation to consider only certain columns, list the column names in subset. If subset is not defined, the operation considers all the columns.
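The three parameters above can be sketched on a small example. The data frame and its values are assumptions made for illustration, not taken from the Airbnb data.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "license" is entirely NA and the last row is entirely NA.
df = pd.DataFrame(
    {
        "price": [85.0, np.nan, 975.0, np.nan],
        "bathrooms": [1.0, 2.0, np.nan, np.nan],
        "license": [np.nan, np.nan, np.nan, np.nan],
    }
)

# how="all", axis=0: drop rows in which every value is NA (only the last row).
rows_all = df.dropna(how="all", axis=0)

# how="all", axis=1: drop columns in which every value is NA (only "license").
cols_all = df.dropna(how="all", axis=1)

# subset: drop rows with NA in "price", ignoring NA in the other columns.
rows_subset = df.dropna(how="any", axis=0, subset=["price"])

# how="any" over all columns removes every row here,
# because "license" is NA in each of them.
rows_any = df.dropna(how="any", axis=0)

print(len(rows_all), list(cols_all.columns), list(rows_subset.index), len(rows_any))
```

The last call shows why it is worth checking the result of `how="any"`: a single fully empty column is enough to wipe out every row.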
2. Imputing Data

With imputation you assign a value by inference from the other values of the feature it belongs to. In this case you put a value in the place of a missing value by applying different methods to the feature that has the missing value. Methods can be as simple as assigning the mean, median or mode of the column to the missing values, or you can use machine learning techniques to predict them. Imputation methods can differ for numerical and categorical variables.

Imputation for numerical values:

With numerical columns, the most common approach is to impute the mean, median or mode of the column in place of the missing values.

To do that we will write a function that fills NA with the mean/median/mode and then apply that function to all the columns.

My example fills the missing data with the mean of the column.

The fill_mean function takes a column of the data frame and fills its NA values with the column mean.

You can then use the apply() function to apply fill_mean to one column or to multiple columns in a data frame.

This example uses the mean; you can use the median() or mode() function in place of mean() if you want to impute the median or mode of the column.
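The original code for this step is not available, so the following is a minimal sketch of what a `fill_mean` function and its use with `apply()` could look like. The column names and values are made up, and `fill_mode` is an extra helper added here for illustration. One detail to keep in mind: `mode()` returns a Series (there can be ties), so the sketch takes its first element.

```python
import numpy as np
import pandas as pd


def fill_mean(col: pd.Series) -> pd.Series:
    """Replace NA values in one column with that column's mean."""
    return col.fillna(col.mean())


def fill_mode(col: pd.Series) -> pd.Series:
    """Replace NA values with the most frequent value of the column.

    Assumes the column has at least one non-NA value.
    """
    return col.fillna(col.mode().iloc[0])


# Hypothetical numerical columns (values are made up).
df = pd.DataFrame(
    {
        "bathrooms": [1.0, 2.0, np.nan, 3.0],
        "bedrooms": [1.0, np.nan, np.nan, 3.0],
    }
)

num_cols = ["bathrooms", "bedrooms"]

# apply() calls fill_mean once per column and reassembles the results.
df[num_cols] = df[num_cols].apply(fill_mean)

print(df)
```

Swapping `col.mean()` for `col.median()` gives the median variant without any other change.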

Imputation for categorical values:

For categorical variables you clearly cannot use the mean or median for imputation. But you can use the mode, that is, the most frequently occurring value, or alternatively treat missing data as a category by itself.

Since I have already gone through how to impute the most frequent value, in this step I will show how to make missing data a category. This is very straightforward: you just replace NA with a "missing data" category. "missing data" will then be one of the levels in each categorical variable.
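A short sketch of this replacement is shown below, assuming the categorical columns are stored as plain strings. The column names and values are made up for illustration. If a column uses the pandas `category` dtype instead, the new label would first have to be added with `cat.add_categories`, which this sketch does not cover.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two string columns and one numerical column.
df = pd.DataFrame(
    {
        "property_type": ["House", np.nan, "Apartment", "House"],
        "bed_type": ["Real Bed", "Futon", np.nan, np.nan],
        "price": [85.0, 150.0, np.nan, 100.0],
    }
)

# Select the non-numerical columns and replace NA with an explicit label.
cat_cols = df.select_dtypes(exclude="number").columns
df[cat_cols] = df[cat_cols].fillna("missing data")

# "missing data" now shows up as one of the levels of each categorical column.
print(df["bed_type"].value_counts())
```

The numerical `price` column is deliberately left alone, since a text label would not make sense there.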

Imputation using a model to predict missing values:

One more option is to use a model to predict the missing values. To perform this task you can use IterativeImputer from the sklearn library. You can find details at the link below.

https://scikit-learn.org/stable/modules/generated/sklearn.impute.IterativeImputer.html

IterativeImputer takes each feature with missing values and models it as a function of the other features. It then estimates the missing values and imputes them.

It does this iteratively: it takes the first feature with missing values, treats it as the response variable and treats all the other features as input variables. Using these input variables it estimates the missing values in the response variable. In the next step it treats the second feature with missing values as the response variable, uses all the other features as input variables and estimates its missing values. This process continues until all the features with missing values are addressed.

In my example I use a random forest as the estimator inside the imputer to estimate the missing values, and fit the imputer to a data frame.
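The original code is not available, so this is a minimal sketch of such a setup on synthetic numerical data. The feature names, the generated values and the hyperparameters (`n_estimators=20`, `max_iter=5`) are assumptions chosen to keep the example small. In scikit-learn, IterativeImputer is still marked experimental, which is why the `enable_iterative_imputer` import is required. The round-robin pass over the features is repeated up to `max_iter` times.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Synthetic stand-in data: price depends on the number of bedrooms.
rng = np.random.default_rng(0)
n = 60
bedrooms = rng.integers(1, 5, size=n).astype(float)
bathrooms = np.ceil(bedrooms / 2)
price = 50 * bedrooms + rng.normal(0, 5, size=n)
df = pd.DataFrame({"bedrooms": bedrooms, "bathrooms": bathrooms, "price": price})

# Knock out some values so there is something to impute.
df.loc[::7, "price"] = np.nan
df.loc[3::11, "bathrooms"] = np.nan

# Each feature with missing values is predicted from the other features
# with a random forest.
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)

# fit_transform returns a NumPy array, so wrap it back into a data frame.
imputed = pd.DataFrame(
    imputer.fit_transform(df), columns=df.columns, index=df.index
)

print(imputed.isna().sum())
```

IterativeImputer expects numerical input, so categorical columns would need to be encoded or handled separately before this step.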

Conclusion

In this article we went through how to handle the missing values in a data frame:

  • Delete the rows/columns with missing values.
  • Impute the missing values with a statistic such as the mean, median or mode.
  • For categorical variables, make missing data a category.
  • Use IterativeImputer to develop a model that predicts the missing values in each feature.
Translated from: https://medium.com/analytics-vidhya/python-handling-missing-values-in-a-data-frame-4156dac4399
