EDA Analysis: The EDA Theoretical Guide

Published: 2023/11/29

Most data analysis problems start with understanding the data. This is the most crucial and often the most complicated step, and it shapes the later decisions in a predictive modeling problem, including which algorithm to choose.

This article is a theoretical guide to that step, exploratory data analysis (EDA), from start to finish.

Contents

  • Reading Data
  • Variable Identification
  • Univariate Analysis
  • Bivariate Analysis
  • Missing Values: Types and Analysis
  • Outlier Treatment
  • Variable Transformation

Reading Data and Variable Identification

Reading the data means answering the following questions:

  • What is the shape of the data?
  • How many features does it contain?
  • What does it look like?
  • What are the types of the variables?

Guide 1: Types of Variables
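The four questions above map onto a few pandas calls. The following is a minimal sketch, assuming pandas is installed. The toy DataFrame and its column names are hypothetical, and in practice the data would usually come from a file, for example through `pd.read_csv`.

```python
import pandas as pd

# Hypothetical toy dataset; the column names and values are made up for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "income": [30000.0, 52000.0, 61000.0, 75000.0],
})

n_rows, n_features = df.shape   # shape of the data: (rows, columns)
print(n_rows, n_features)
print(df.head())                # what the data looks like
print(df.dtypes)                # type of each variable

# Split the variables into numeric ones and everything else (treated as categorical here).
numeric_cols = df.select_dtypes(include="number").columns.tolist()
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, categorical_cols)
```

Treating every non-numeric column as categorical is a simplification. A numeric column with only a handful of distinct values may also be better handled as categorical.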

Univariate Analysis (UA)

What is UA?

Exploring one variable at a time from the list of features is called univariate analysis. Summarizing each variable on its own helps us understand the data better.

In UA we look at the following:

  • Central tendency (mean, median, mode) and dispersion of the variable
  • Shape of the distribution: symmetric, right-skewed, or left-skewed
  • Missing values and outliers
  • Count and count percent: the frequency of each category in a categorical variable helps us understand and handle that variable.
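As an illustration of the first two points, the sketch below computes central tendency, dispersion, and a moment-based skewness coefficient using only the Python standard library. The sample values are hypothetical, and the skewness formula is one common definition rather than the only one.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, one is large.
values = [2, 3, 3, 4, 5, 6, 30]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)
std_dev = statistics.pstdev(values)   # population standard deviation (dispersion)

# Moment-based skewness: positive means right-skewed, negative means left-skewed.
n = len(values)
skewness = sum((v - mean) ** 3 for v in values) / n / std_dev ** 3

print(mean, median, mode, round(std_dev, 2), round(skewness, 2))
```

A mean well above the median, as in this sample, is a quick hint of right skew even before the coefficient is computed.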

Why UA?

UA lets us explore each variable and check it for anomalies such as outliers and missing values, both of which are covered later in this guide.

Methods for UA

For continuous variables:

  • Tabular method: describes central tendency, dispersion, and missing values.
  • Graphical method: shows the distribution and reveals outliers. Histograms help with understanding the distribution, and box plots with detecting outliers.
  • A violin plot combines the two views: it draws a box plot together with a smoothed density curve of the distribution.

Guide 2: Methods of Univariate Analysis for continuous variables
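One possible form of the tabular method is sketched below, assuming pandas is installed. The series is hypothetical. The plotting calls are left as comments because they additionally require matplotlib.

```python
import pandas as pd

# Hypothetical continuous variable with one missing value.
income = pd.Series([30.0, 52.0, 61.0, None, 75.0, 58.0], name="income")

summary = income.describe()            # count, mean, std, min, quartiles, max
n_missing = int(income.isna().sum())   # number of missing values

print(summary)
print("missing:", n_missing)

# Graphical method (needs matplotlib installed, so left commented out here):
# income.plot.hist()   # distribution
# income.plot.box()    # outliers
```

Note that `describe()` ignores missing values, so its `count` is the number of non-missing observations.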

For categorical variables:

  • Tabular method: the “.value_counts()” method of a pandas Series in Python returns a table of frequencies.
  • Graphical method: a bar plot is usually the most suitable graph for a categorical variable.

Guide 3: Methods of Univariate Analysis for categorical variables
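A short sketch of the tabular method for a categorical variable, assuming pandas is installed. The category values are hypothetical.

```python
import pandas as pd

# Hypothetical categorical variable.
city = pd.Series(["Pune", "Delhi", "Pune", "Mumbai", "Pune", "Delhi"], name="city")

counts = city.value_counts()                  # frequency of each category
percents = city.value_counts(normalize=True)  # share of each category (sums to 1)

print(counts)
print(percents)

# Graphical method (needs matplotlib, so left commented out here):
# counts.plot.bar()
```

Passing `normalize=True` gives the count percent mentioned earlier as a fraction between 0 and 1.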

Bivariate Analysis (BA)

What is BA?

Studying the empirical relationship between two variables is called bivariate analysis.

Why BA?

It helps detect anomalies, understand how two variables depend on each other, and measure the impact of each variable on the target variable.

Methods for BA

1. For continuous-continuous types: there are two ways to study the relationship between two continuous variables, a scatter plot and correlation analysis.

Guide 4: Bivariate analysis for continuous-continuous type variables
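For correlation analysis, a common choice is the Pearson correlation coefficient, which ranges from -1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). The sketch below computes it with the standard library only. The helper name `pearson_r` and the data are hypothetical, and the source does not say which correlation measure it has in mind.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: house area and price.
area = [50, 60, 80, 100, 120]
price = [150, 175, 240, 310, 350]

r = pearson_r(area, price)
print(round(r, 3))
```

A value near +1 or -1 indicates a strong linear relationship. A value near 0 only rules out a linear one, so the scatter plot is still worth inspecting.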

2. For categorical-continuous types: here we can use bar plots and t-tests.

The t-test is an inferential statistical test used to determine whether there is a significant difference between the means of two groups or categories; comparing more than two groups calls for a different test, such as analysis of variance (ANOVA). Calculating the t statistic requires the difference between the group means, together with the standard deviation and the number of observations in each group.

Guide 5: Bivariate analysis for categorical-continuous type variables
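To make the ingredients of the t statistic concrete, here is a minimal sketch of Welch's form of the two-sample t statistic, which does not assume equal variances. The function name `welch_t` and the data are hypothetical. The sketch stops at the statistic; obtaining a p-value needs the t distribution, for which a library routine such as SciPy's `scipy.stats.ttest_ind` with `equal_var=False` would normally be used.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """t statistic for the difference in means of two groups (Welch's form)."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # Sample variances (squared sample standard deviations) of each group.
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical incomes (continuous) split by a two-level categorical variable.
urban = [52, 58, 61, 65, 70]
rural = [40, 44, 47, 50, 53]

t = welch_t(urban, rural)
print(round(t, 2))
```

The larger the absolute value of t, the less plausible it is that the two group means are equal.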

3. For categorical-categorical types: a two-way table and the chi-square test are used to analyze the relationship between two categorical variables.
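The sketch below builds a two-way table and computes the Pearson chi-square statistic by hand, using only the standard library. The variables and values are hypothetical, no continuity correction is applied, and turning the statistic into a p-value would require the chi-square distribution.

```python
from collections import Counter

# Hypothetical paired observations of two categorical variables.
gender = ["M", "M", "M", "M", "F", "F", "F", "F"]
bought = ["yes", "yes", "yes", "no", "no", "no", "no", "yes"]

# Two-way table: observed count for every (gender, bought) pair.
observed = Counter(zip(gender, bought))
row_totals = Counter(gender)
col_totals = Counter(bought)
n = len(gender)

# Chi-square statistic: sum over cells of (observed - expected)^2 / expected,
# where expected = row total * column total / n if the variables are independent.
chi_square = 0.0
for row in row_totals:
    for col in col_totals:
        expected = row_totals[row] * col_totals[col] / n
        chi_square += (observed[(row, col)] - expected) ** 2 / expected

print(dict(observed))
print(chi_square)
```

A statistic of 0 would mean the observed table matches independence exactly. Larger values indicate a stronger departure from it.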

Missing Values

Reasons for missing values

Values can be missing from the data for various reasons, for example:

  • No response may have been recorded.
  • An error may have occurred while recording the data.
  • An error may have occurred while reading the data.

Types of missing values

  • Missing Completely at Random (MCAR): the missingness is unrelated to any other variable and to the values of the variable in which it occurs.

  • Missing at Random (MAR): the missingness is unrelated to the values of the variable itself but shows an observable pattern in other variables. E.g., income data can be missing for people older than 60 because people of that age are generally retired.

  • Missing Not at Random (MNAR): the missingness is related to the values of the variable itself. E.g., houses priced above Rs. 2 crore can be missing from the database because there are few buyers at that price.
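One rough way to look for the MAR pattern in the income example is to compare the missing rate of one variable across groups defined by another. The records below are hypothetical, and a difference in rates is only suggestive: it is consistent with MAR but does not prove it.

```python
# Hypothetical records: (age, income); None marks a missing income.
records = [(25, 30), (34, 45), (41, None), (48, 60),
           (63, None), (67, None), (70, 20), (72, None)]

def missing_rate(rows):
    """Fraction of rows whose income is missing."""
    return sum(income is None for _, income in rows) / len(rows)

older = [r for r in records if r[0] > 60]
younger = [r for r in records if r[0] <= 60]

rate_older = missing_rate(older)
rate_younger = missing_rate(younger)
print(rate_older, rate_younger)
```

MNAR cannot be checked this way, because the values that would reveal the pattern are exactly the ones that are missing.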

Methods of dealing with missing values

There are two basic ways to deal with missing values:

  • Deletion: we delete every row that contains a missing value before training the model.

  • Imputation: we fill in the missing values, for which there are various methods.

Guide 6: Treating missing values
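Both approaches can be sketched in a few lines, assuming pandas is installed. The DataFrame is hypothetical, and median and mode imputation are just two simple options among the many imputation methods.

```python
import pandas as pd

# Hypothetical dataset with missing values in both a numeric and a categorical column.
df = pd.DataFrame({
    "age": [25, 32, None, 51, 47],
    "city": ["Pune", None, "Pune", "Mumbai", "Delhi"],
})

# Deletion: drop every row that has at least one missing value.
deleted = df.dropna()

# Imputation: fill numeric gaps with the median and categorical gaps with the mode.
imputed = df.copy()
imputed["age"] = imputed["age"].fillna(imputed["age"].median())
imputed["city"] = imputed["city"].fillna(imputed["city"].mode()[0])

print(len(deleted))
print(imputed)
```

Deletion is simple but shrinks the dataset, here from five rows to three. Imputation keeps every row at the cost of introducing estimated values.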

Outliers

Types of outliers and their identification

There are two types of outliers:

  • Univariate outliers: these can be identified using a box plot.

  • Bivariate outliers: these can be identified using a scatter plot of the two variables.

Criteria for an outlier

Sort the observations in ascending order and define:

Q1: first quartile, the value below which 25% of the observations fall (the median of the lower half)
Q2: second quartile, the median of all observations
Q3: third quartile, the value below which 75% of the observations fall (the median of the upper half)
IQR: interquartile range = Q3 - Q1

X is an outlier if it satisfies: X > (Q3 + 1.5*IQR) OR X < (Q1 - 1.5*IQR)
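The criterion translates directly into code. The sketch below uses `statistics.quantiles` from the Python standard library (Python 3.8 or later). The function name `iqr_outliers` and the sample are hypothetical. Libraries compute quartiles with slightly different interpolation rules, so borderline points may be classified differently elsewhere.

```python
import statistics

def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # quantiles(n=4) returns the three cut points Q1, Q2, Q3.
    q1, _q2, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical sample with one obviously extreme value.
data = [10, 12, 12, 13, 14, 15, 16, 18, 95]
print(iqr_outliers(data))
```

These bounds are the same ones a box plot uses for its whiskers, which is why a box plot shows univariate outliers as isolated points.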

Treatment of outliers

  • We can delete the observation.
  • We can impute the outlier's value using the methods discussed for imputing missing values.
  • We can apply a transformation (discussed next).

Variable Transformation

Normalization often improves the accuracy of a model. But what exactly is normalization? It is one technique of variable transformation.

In variable transformation, we replace a variable with a function of it; for example, we replace the variable x with its log value.
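The text does not define normalization precisely. One common reading is min-max scaling, which maps a variable linearly onto the range 0 to 1. The sketch below assumes that reading. The function name `min_max_normalize` and the data are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly so the smallest becomes 0 and the largest becomes 1."""
    low, high = min(values), max(values)
    if high == low:
        # A constant variable has no spread to rescale; map everything to 0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

# Hypothetical variable on an arbitrary scale.
ages = [20, 30, 40, 60]
print(min_max_normalize(ages))
```

Because the mapping is linear, it changes the scale of the variable but not the shape of its distribution.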

Variable transformation can address the following issues found during the earlier EDA steps:

  • Scale: we can change the scale of a variable (redefining its limits).
  • Non-linearity: we can convert a non-linear relationship into a linear one.
  • Skewness: many algorithms perform better on symmetrically distributed variables than on skewed ones, so we can convert a skewed distribution into a more symmetric one.

Methods of Variable Transformation

  • Non-linear transformation: we can replace the variable with its log value, square root, or cube root. Because these transformations are non-linear, they can help with each of the points above.

  • Binning: we can divide continuous values into bins, converting a continuous variable into a categorical one. This can place outliers into categories that the model can handle.
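Both methods can be illustrated with the standard library alone. In the sketch below, the data, the bin edges, and the labels are all hypothetical choices made for the example.

```python
import math
from bisect import bisect_right

# Hypothetical right-skewed variable with one extreme value.
incomes = [20, 25, 30, 40, 55, 80, 1000]

# Non-linear transformation: the log compresses large values far more than small ones.
log_incomes = [math.log(v) for v in incomes]

# Binning: assign each value to a labelled interval using hypothetical bin edges.
# The bins are: below 30, from 30 up to (not including) 100, and 100 or above.
edges = [30, 100]
labels = ["low", "medium", "high"]   # one more label than edges
binned = [labels[bisect_right(edges, v)] for v in incomes]

print([round(v, 2) for v in log_incomes])
print(binned)
```

After the log transform the extreme value sits much closer to the rest of the data, and after binning it simply falls into the "high" category.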

Summing up

This is an extensive guide to Exploratory Data Analysis. It covers not only how to detect anomalies but also how to deal with them. It takes a very basic approach to EDA, so the topics are covered only at an introductory level.

Translated from: https://towardsdatascience.com/the-eda-theoretical-guide-b7cef7653f0d
