DataCamp course <Tidyverse> Chapter 3: Grouping and summarizing

Published: 2023/12/20

Tidyverse course contents

Chapter 1. Data wrangling
Chapter 2. Data visualization
Chapter 3. Grouping and summarizing
Chapter 4. Types of visualizations

Chapter 3. Grouping and summarizing

Descriptive statistics with summarize

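summarize collapses a variable into a single summary value, using whichever statistic you specify (for example, the mean or the median).
As an example, let's look at the median of lifeExp.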

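# Load the data set and dplyr (provides %>%, filter, group_by, summarize)
library(gapminder)
library(dplyr)

# Summarize to find the median life expectancy
gapminder %>%
  summarize(medianLifeExp = median(lifeExp))

# A tibble: 1 x 1
  medianLifeExp
          <dbl>
1          60.7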

Next, combine this with filter, which we learned earlier, to compute the median lifeExp for the rows where year is 1957.

# Filter for 1957 then summarize the median life expectancy
gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp))

# A tibble: 1 x 1
  medianLifeExp
          <dbl>
1          48.4

You can also summarize two variables at once, for example using max() to get a maximum value alongside the median.

# Filter for 1957 then summarize the median life expectancy and the maximum GDP per capita
gapminder %>%
  filter(year == 1957) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

# A tibble: 1 x 2
  medianLifeExp maxGdpPercap
          <dbl>        <dbl>
1          48.4      113523.

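Grouped descriptive statistics with group_by

Calling group_by before summarize computes the summary statistics separately for each value of the grouping variable. For example, the code below finds the median lifeExp and the maximum gdpPercap for each year.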

# Find median life expectancy and maximum GDP per capita in each year
gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

# A tibble: 12 x 3
    year medianLifeExp maxGdpPercap
 * <int>         <dbl>        <dbl>
 1  1952          45.1      108382.
 2  1957          48.4      113523.
 3  1962          50.9       95458.
 4  1967          53.8       80895.
 5  1972          56.5      109348.
 6  1977          59.7       59265.
 7  1982          62.4       33693.
 8  1987          65.8       31541.
 9  1992          67.7       34933.
10  1997          69.4       41283.
11  2002          70.8       44684.
12  2007          71.9       49357.
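To make the "one row per group" behaviour easy to check by hand, here is a minimal sketch on a small made-up table. The toy tibble and its values are hypothetical and are not taken from gapminder; only group_by and summarize are the functions used in this chapter.

```r
library(dplyr)

# Hypothetical toy data (made-up values, not from gapminder)
toy <- tibble(
  year      = c(1952, 1952, 1952, 1957, 1957, 1957),
  lifeExp   = c(40, 50, 60, 45, 55, 65),
  gdpPercap = c(1000, 3000, 2000, 1500, 2500, 4000)
)

# Six input rows in two groups collapse to two output rows
by_year_toy <- toy %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap  = max(gdpPercap))

by_year_toy
# 1952: median 50, max 3000
# 1957: median 55, max 4000
```

Each summary is computed only from the rows of its own group, which is why the 1957 maximum is not affected by the 1952 rows.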

Combining this with filter, this time we find the median lifeExp and the maximum gdpPercap for each continent, using only the rows where year is 1957.

# Find median life expectancy and maximum GDP per capita in each continent in 1957
gapminder %>%
  filter(year == 1957) %>%
  group_by(continent) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

group_by also accepts more than one variable, for example:

# Find median life expectancy and maximum GDP per capita in each continent/year combination
gapminder %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

`summarise()` has grouped output by 'continent'. You can override using the `.groups` argument.
# A tibble: 60 x 4
# Groups:   continent [5]
   continent  year medianLifeExp maxGdpPercap
   <fct>     <int>         <dbl>        <dbl>
 1 Africa     1952          38.8        4725.
 2 Africa     1957          40.6        5487.
 3 Africa     1962          42.6        6757.
 4 Africa     1967          44.7       18773.
 5 Africa     1972          47.0       21011.
 6 Africa     1977          49.3       21951.
 7 Africa     1982          50.8       17364.
 8 Africa     1987          51.6       11864.
 9 Africa     1992          52.4       13522.
10 Africa     1997          52.8       14723.
# … with 50 more rows
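The message in the output above mentions the `.groups` argument. The following sketch, which assumes dplyr 1.0.0 or later and uses a hypothetical toy tibble with made-up values, shows what the message refers to: by default summarize drops only the last grouping variable, while `.groups = "drop"` returns an ungrouped result.

```r
library(dplyr)

# Hypothetical toy data standing in for gapminder (made-up values)
toy <- tibble(
  continent = c("Asia", "Asia", "Asia", "Europe", "Europe", "Europe"),
  year      = c(1952, 1952, 1957, 1952, 1957, 1957),
  lifeExp   = c(40, 50, 55, 65, 68, 70)
)

# Default: only the last grouping variable (year) is dropped,
# so the result is still grouped by continent
still_grouped <- toy %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp))

# .groups = "drop" returns an ungrouped tibble and suppresses the message
ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarize(medianLifeExp = median(lifeExp), .groups = "drop")

group_vars(still_grouped)  # the remaining grouping variable: "continent"
group_vars(ungrouped)      # no grouping variables left
```

Dropping the groups explicitly is one way to avoid surprises if later steps such as mutate or another summarize are added to the pipeline; whether that matters depends on what follows.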

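Visualizing descriptive statistics

First, summarize by year to get the median lifeExp and the maximum gdpPercap for each year, then visualize the result with ggplot2. The code adds expand_limits(y = 0) so that the y axis includes 0.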

library(ggplot2)

by_year <- gapminder %>%
  group_by(year) %>%
  summarize(medianLifeExp = median(lifeExp),
            maxGdpPercap = max(gdpPercap))

# Create a scatter plot showing the change in medianLifeExp over time
ggplot(by_year, aes(x = year, y = medianLifeExp)) +
  geom_point() +
  expand_limits(y = 0)
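To see what expand_limits(y = 0) changes, the sketch below builds the same kind of plot with and without it and compares the trained y-scale limits. The toy data frame and its values are hypothetical; inspecting the limits through layer_scales() is just one convenient way to check the effect and is not part of the course code.

```r
library(ggplot2)

# Hypothetical toy summary table (made-up values)
toy_by_year <- data.frame(
  year          = c(1952, 1957, 1962),
  medianLifeExp = c(45, 48, 51)
)

# Without expand_limits, the y scale covers only the data range
p_default <- ggplot(toy_by_year, aes(x = year, y = medianLifeExp)) +
  geom_point()

# With expand_limits(y = 0), the y scale is extended to include 0
p_zero <- p_default + expand_limits(y = 0)

# Inspect the trained y-scale limits of each plot
y_default <- layer_scales(p_default)$y$get_limits()
y_zero    <- layer_scales(p_zero)$y$get_limits()
```

Here y_default should start at the smallest data value, while y_zero should start at 0, which keeps small year-to-year changes from looking larger than they are.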

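Next, let's draw a slightly more complex plot, using what we learned in Chapter 2 (Data visualization). First group the data by year and continent and compute the median gdpPercap. Then plot the result with year on the x axis and medianGdpPercap on the y axis, colored by continent.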

# Summarize medianGdpPercap within each continent within each year: by_year_continent
by_year_continent <- gapminder %>%
  group_by(year, continent) %>%
  summarize(medianGdpPercap = median(gdpPercap))

# Plot the change in medianGdpPercap in each continent over time
ggplot(by_year_continent, aes(x = year, y = medianGdpPercap, color = continent)) +
  geom_point() +
  expand_limits(y = 0)

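You can also visualize the relationship between the summary statistics of two variables. For example, the code below plots the median gdpPercap against the median lifeExp for 2007, with one point per continent, colored by continent.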

# Summarize the median GDP and median life expectancy per continent in 2007
by_continent_2007 <- gapminder %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(medianGdpPercap = median(gdpPercap),
            medianLifeExp = median(lifeExp))

# Use a scatter plot to compare the median GDP and median life expectancy
ggplot(by_continent_2007, aes(x = medianGdpPercap, y = medianLifeExp, color = continent)) +
  geom_point() +
  expand_limits(y = 0)
