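`summarise()` is almost always used together with `group_by()`. Character columns cannot be summarised statistically, but integer, floating-point, and factor columns can. The basic summary statistics are the following six:
Min. >> smallest value in the column
1st Qu. >> first quartile
Median >> median
Mean >> arithmetic mean
3rd Qu. >> third quartile
Max. >> largest value in the column
To summarise a whole dataset without grouping it through `group_by()`, use `summary()`, which reports these same six statistics for each numeric column. `summarise()` also has variants such as `summarise_all()`. The sections below walk through a few of them step by step.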
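As a quick check of what the six statistics mean, the following sketch recomputes them by hand with base R and compares them with `summary()`. The vector `x` is an illustrative example, not data from this article.

```r
# Illustrative vector (hypothetical values, chosen so every statistic is exact)
x <- c(2, 4, 6, 8, 10)

# Recompute the six statistics one by one
manual <- c(
  min(x),
  quantile(x, 0.25, names = FALSE),  # 1st Qu.
  median(x),
  mean(x),
  quantile(x, 0.75, names = FALSE),  # 3rd Qu.
  max(x)
)

# summary() reports the same six values, in the same order
builtin <- as.numeric(summary(x))

print(manual)
print(builtin)
```

Both vectors should hold the same six numbers, in the order Min., 1st Qu., Median, Mean, 3rd Qu., Max.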
Basic syntax of summarise
dataset %>%
  group_by(grouping_column) %>%
  summarise(new_column = fn(column))
Basic syntax of summary
summary(dataset) or summary(dataset$column)
Helper functions
Center : `mean()`, `median()`..
Spread : `sd()`, `mad()`..
Range : `min()`, `max()`, `quantile()`..
Count : `n()`..
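The helper functions listed above can all be combined in a single `summarise()` call. The following is a minimal sketch, assuming dplyr is installed; the tibble `scores` and its column names are hypothetical and only serve as an illustration.

```r
library(dplyr)

# Hypothetical example data, not taken from this article
scores <- tibble(
  team  = c("x", "x", "x", "y", "y"),
  score = c(1, 2, 6, 10, 20)
)

helper_demo <- scores %>%
  group_by(team) %>%
  summarise(
    center_mean   = mean(score),                          # center
    center_median = median(score),
    spread_sd     = sd(score),                            # spread
    spread_mad    = mad(score),
    range_min     = min(score),                           # range
    range_max     = max(score),
    range_q3      = quantile(score, 0.75, names = FALSE),
    rows          = n()                                   # count
  )

print(helper_demo)
```

Each helper returns one value per group, so the result has one row per `team` and one column per statistic.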
First, create a data frame that contains integer, floating-point, character, and factor columns.
# Load dplyr, which provides tibble(), group_by(), summarise(), and %>%
library(dplyr)

# Create the data frame
summarise_df_1 <- tibble(
  number = as.integer(c(1, 2, -3, 4, -5, 1, 1, -5, 7)),
  float = as.numeric(c(1.1, 2.6, 5.8, 3.9, 55.77, 82.34, 99.1, 23.6, 45.4)),
  letter = c("ap", "ef", "tg", "hk", "bu", "xi", "ux", "it", NA),
  weekday = c("Sunday", "Wednesday", "Monday", "Friday", "Saturday",
              "Tuesday", "Thursday", "Sunday", "Wednesday"),
  factor = as.factor(c("a", "b", "c", "a", "c", "a", "d", "k", "e"))
)
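Start by summarising the whole dataset with `summary()`. Every column except the character ones is summarised directly; for the factor column, the output is the number of rows at each level.
# Summarise the whole data frame with summary()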
summary(summarise_df_1)
r$> summary(summarise_df_1)
     number            float          letter            weekday          factor
 Min.   :-5.0000   Min.   : 1.10   Length:9           Length:9           a:3
 1st Qu.:-3.0000   1st Qu.: 3.90   Class :character   Class :character   b:1
 Median : 1.0000   Median :23.60   Mode  :character   Mode  :character   c:2
 Mean   : 0.3333   Mean   :35.51                                         d:1
 3rd Qu.: 2.0000   3rd Qu.:55.77                                         e:1
 Max.   : 7.0000   Max.   :99.10                                         k:1
Next, group by `weekday` and compute the mean of `number` and the maximum of `float` within each group.
summarise_weekday <- summarise_df_1 %>%
  group_by(weekday) %>%
  summarise(
    number = mean(number),
    float = max(float)
  )
r$> summarise_weekday
# A tibble: 7 x 3
  weekday   number float
  <chr>      <dbl> <dbl>
1 Friday       4     3.9
2 Monday      -3     5.8
3 Saturday    -5    55.8
4 Sunday      -2    23.6
5 Thursday     1    99.1
6 Tuesday      1    82.3
7 Wednesday    4.5  45.4
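Next, group by `weekday` again and look at the standard deviation of `float` together with the row count. `n()` returns the number of rows in each `weekday` group. Groups with only one row have a standard deviation of NA, because the sample standard deviation is undefined for a single observation.
summarise_weekday <- summarise_df_1 %>%
  group_by(weekday) %>%
  summarise(
    float = sd(float),
    total = n()
  )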
r$> summarise_weekday
# A tibble: 7 x 3
  weekday   float total
  <chr>     <dbl> <int>
1 Friday     NA       1
2 Monday     NA       1
3 Saturday   NA       1
4 Sunday     15.9     2
5 Thursday   NA       1
6 Tuesday    NA       1
7 Wednesday  30.3     2
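The NA values in the `float` column above come from groups that contain only one row: the sample standard deviation divides by n - 1, which is zero for a single observation. A base R sketch that reuses the `float` values of the Friday and Sunday rows from the data frame:

```r
# Friday has a single float value, so its sample standard deviation is NA
sd_friday <- sd(c(3.9))

# Sunday has two float values (1.1 and 23.6), so sd() is defined
sd_sunday <- sd(c(1.1, 23.6))

print(sd_friday)
print(round(sd_sunday, 1))
```

If NA is not wanted for single-row groups, one option is to filter on the `n()` count before interpreting the standard deviation.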
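What if there is more than one grouping variable? The output then holds one row per unique combination of the grouping values, and `n()` counts the rows in each combination.
summarise_weekday <- summarise_df_1 %>%
  group_by(factor, letter) %>%
  summarise(
    total = n()
  )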
r$> summarise_weekday
# A tibble: 9 x 3
# Groups:   factor [6]
  factor letter total
  <fct>  <chr>  <int>
1 a      ap         1
2 a      hk         1
3 a      xi         1
4 b      ef         1
5 c      bu         1
6 c      tg         1
7 d      ux         1
8 e      NA         1
9 k      it         1
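The `# Groups: factor [6]` line shows that the result is still grouped by `factor`: by default `summarise()` drops only the last grouping variable. The following is a minimal sketch, assuming dplyr 1.0.0 or later (where the `.groups` argument is available), of returning an ungrouped result, with `count()` as a shorthand for the same tally. The tibble `pairs_df` is a hypothetical stand-in for the data frame used in this article.

```r
library(dplyr)

# Hypothetical example data with one repeated (factor, letter) pair
pairs_df <- tibble(
  factor = as.factor(c("a", "a", "b", "b", "b")),
  letter = c("ap", "hk", "ef", "ef", NA)
)

# .groups = "drop" returns a plain, ungrouped tibble
by_pair <- pairs_df %>%
  group_by(factor, letter) %>%
  summarise(total = n(), .groups = "drop")

# count() groups, tallies, and ungroups in one step;
# name = sets the name of the count column
by_pair_count <- pairs_df %>%
  count(factor, letter, name = "total")

print(by_pair)
print(is_grouped_df(by_pair))
```

Dropping the grouping explicitly avoids surprises when further verbs are chained onto the summarised result.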
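Finally, if every column should get the same statistic after grouping, use `summarise_all()`. Note that a function such as `mean()` only works on numeric values: any non-numeric column that remains after grouping comes out as NA.
summarise_weekday <- summarise_df_1 %>%
  group_by(weekday) %>%
  summarise_all(mean)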
r$> summarise_weekday
# A tibble: 7 x 5
  weekday   number float letter factor
  <chr>      <dbl> <dbl>  <dbl>  <dbl>
1 Friday       4     3.9     NA     NA
2 Monday      -3     5.8     NA     NA
3 Saturday    -5    55.8     NA     NA
4 Sunday      -2    12.4     NA     NA
5 Thursday     1    99.1     NA     NA
6 Tuesday      1    82.3     NA     NA
7 Wednesday    4.5  24       NA     NA
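The NA columns arise because `mean()` returns NA, with a warning rather than an error, for any input that is not numeric or logical. That includes character vectors and factors. A base R sketch with illustrative values:

```r
# mean() on a character vector: NA plus a warning, not an error
chr_mean <- suppressWarnings(mean(c("ap", "ef")))

# mean() on a factor behaves the same way
fct_mean <- suppressWarnings(mean(factor(c("a", "b", "a"))))

print(chr_mean)
print(fct_mean)
```

`suppressWarnings()` is used here only to keep the console output short; in the grouped call the same warning is raised once per affected group and column.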
To avoid this, either add more grouping variables or drop the non-numeric columns before summarising.
summarise_weekday <- summarise_df_1 %>%
  group_by(weekday) %>%
  select(-c("letter", "factor")) %>%
  summarise_all(mean)
r$> summarise_weekday
# A tibble: 7 x 3
  weekday   number float
  <chr>      <dbl> <dbl>
1 Friday       4     3.9
2 Monday      -3     5.8
3 Saturday    -5    55.8
4 Sunday      -2    12.4
5 Thursday     1    99.1
6 Tuesday      1    82.3
7 Wednesday    4.5  24
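In dplyr 1.0.0 and later, `summarise_all()` is superseded by `across()`. The following sketch, which assumes such a dplyr version, gets the same kind of result without a separate `select()` step: `where(is.numeric)` keeps only the numeric columns. The tibble `df` is a shortened, hypothetical version of the data frame above.

```r
library(dplyr)

# Shortened, hypothetical version of the article's data frame
df <- tibble(
  number  = c(1L, 2L, -5L, 7L),
  float   = c(1.1, 2.6, 23.6, 45.4),
  letter  = c("ap", "ef", "it", NA),
  weekday = c("Sunday", "Wednesday", "Sunday", "Wednesday")
)

# Apply mean() to every numeric column; non-numeric columns are left out
numeric_means <- df %>%
  group_by(weekday) %>%
  summarise(across(where(is.numeric), mean))

print(numeric_means)
```

Because the selection happens inside `across()`, the character column `letter` never reaches `mean()`, so no NA columns or warnings are produced.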