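filter() has a somewhat more involved syntax than the other verbs. Besides plain equality filters, it can be combined with comparison and logical operators, range checks, and approximate-value checks to produce many different subsets of the data.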
Basic syntax
filter(dataset, condition, ...)
Operators and functions that can be used in the conditions
- Comparison operators `==`, `!=`, `>`, `<`, `>=`, `<=`
- Logical operators `&`, `|`, `!`, `xor()`
- Range and approximate-value filtering `between()`, `near()`
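To make the logical operators in the list above concrete, here is a minimal sketch on a small made-up data frame. The `rides` table and its `minutes` column are hypothetical and only stand in for the real dataset.

```r
library(dplyr)

# Hypothetical toy data, not the Cyclistic dataset
rides <- data.frame(
  member_casual = c("member", "casual", "member", "casual"),
  minutes       = c(5, 45, 40, 12)
)

# `|`: casual riders OR rides of 30 minutes and longer
either <- filter(rides, member_casual == "casual" | minutes >= 30)

# `!`: negate a condition (everything that is not casual)
not_casual <- filter(rides, !(member_casual == "casual"))

# xor(): keep rows where exactly one of the two conditions holds
exactly_one <- filter(rides, xor(member_casual == "casual", minutes >= 30))

print(either)
print(not_casual)
print(exactly_one)
```

Note that `xor()` drops the 45-minute casual ride because both conditions are true for that row.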
First, load dplyr
library(dplyr)
Some of the examples use the Cyclistic dataset, so load it first and take a look at its contents
# Load the data
cyclistic_dplyr <- read.csv("combine_datas_clearn.csv")
dplyr_operate <- cyclistic_dplyr
# Inspect the data
r$> str(dplyr_operate)
'data.frame': 3876042 obs. of 16 variables:
$ ride_id : chr "22178529" "22178530" "22178531" "22178532" ...
$ started_at : chr "2019-04-01 00:02:22" "2019-04-01 00:03:02" "2019-04-01 00:11:07" "2019-04-01 00:13:01" ...
$ ended_at : chr "2019-04-01 00:09:48" "2019-04-01 00:20:30" "2019-04-01 00:15:19" "2019-04-01 00:18:58" ...
$ rideable_type : chr "6251" "6226" "5649" "4151" ...
$ start_station_id : int 81 317 283 26 202 420 503 260 211 211 ...
$ start_station_name : chr "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
$ end_station_id : int 56 59 174 133 129 426 500 499 211 211 ...
$ end_station_name : chr "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
$ member_casual : chr "member" "member" "member" "member" ...
$ year : int 2019 2019 2019 2019 2019 2019 2019 2019 2019 2019 ...
$ month : chr "April" "April" "April" "April" ...
$ week : chr "Monday" "Monday" "Monday" "Monday" ...
$ day : int 1 1 1 1 1 1 1 1 1 1 ...
$ hour : int 0 0 0 0 0 0 0 0 0 0 ...
$ ride_length : int 446 1048 252 357 1007 257 548 383 2137 2120 ...
$ ride_length_minutes: num 7.43 17.47 4.2 5.95 16.78 ...
A plain filter: keep only the rows where the member type is casual
# There are two member types
unique(dplyr_operate$member_casual)
# Output
r$> unique(dplyr_operate$member_casual)
[1] "member" "casual"
# Keep only casual riders
filter_casual <- filter(
dplyr_operate,
member_casual == "casual"
)
unique(filter_casual$member_casual)
# Output
r$> unique(filter_casual$member_casual)
[1] "casual"
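Note that the comparison uses `==`. Inside a function call a single `=` names an argument, so filter() receives a named argument instead of a logical condition and raises an error. Helpfully, the error message suggests what you probably meant, so the fix is quick.
# Wrong: `=` used instead of `==`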
error_df <- filter(
dplyr_operate,
member_casual = "casual"
)
# Output: an error
Error: Problem with `filter()` input `..1`.
x Input `..1` is named.
i This usually means that you've used `=` instead of `==`.
i Did you mean `member_casual == "casual"`?
Adding comparison conditions: the member type must not be casual, and the ride must last at least 30 minutes
# Add comparison operators
filter_compare <- filter(
dplyr_operate,
member_casual != "casual",
ride_length_minutes >= 30
)
# Check that the filter worked
max(filter_compare$ride_length_minutes)
min(filter_compare$ride_length_minutes)
unique(filter_compare$member_casual)
# Output
r$> max(filter_compare$ride_length_minutes)
min(filter_compare$ride_length_minutes)
unique(filter_compare$member_casual)
[1] 150943.9
[1] 30
[1] "member"
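Conditions separated by commas in filter() are combined with AND, so the same kind of filter can also be written with an explicit `&`. The sketch below shows this on a small made-up data frame; `rides` and `minutes` are hypothetical names, not part of the Cyclistic data.

```r
library(dplyr)

# Hypothetical toy data, not the Cyclistic dataset
rides <- data.frame(
  member_casual = c("member", "casual", "member", "casual"),
  minutes       = c(5, 45, 40, 12)
)

# Two conditions separated by a comma
by_comma <- filter(rides, member_casual != "casual", minutes >= 30)

# The same two conditions joined with `&`
by_and <- filter(rides, member_casual != "casual" & minutes >= 30)

# Both forms keep the same rows
print(all.equal(by_comma, by_and))
```

Which form to use is mostly a matter of readability: commas keep long conditions on separate lines, while `&` is needed when AND has to be mixed with `|` inside one expression.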
Adding a logical operator: keep Monday rides that last between 1 and 2 minutes, inclusive
# Add a logical operator
filter_logic <- filter(
dplyr_operate,
week == "Monday",
ride_length_minutes >= 1 & ride_length_minutes <= 2
)
# Output
r$> max(filter_logic$ride_length_minutes)
min(filter_logic$ride_length_minutes)
unique(filter_logic$week)
[1] 2
[1] 1
[1] "Monday"
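between() syntax
between(variable, left, right)
It matches every value between the two bounds, and the bounds themselves are included, so it behaves like `>=` on the left and `<=` on the right.
The example below keeps rides lasting between 10 and 15 minutes
# Using between()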
filter_between <- filter(
dplyr_operate,
between(ride_length_minutes, 10, 15)
)
# Check the range of the result
max(filter_between$ride_length_minutes)
min(filter_between$ride_length_minutes)
# Output
r$> max(filter_between$ride_length_minutes)
min(filter_between$ride_length_minutes)
[1] 15
[1] 10
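Because both bounds are inclusive, between() can be read as shorthand for a `>=` and `<=` pair. A minimal sketch on a made-up numeric vector (the `minutes` values are hypothetical):

```r
library(dplyr)

# Hypothetical ride lengths in minutes, chosen to sit on and around the bounds
minutes <- c(9.99, 10, 12.5, 15, 15.01)

# between() includes both bounds
in_range <- between(minutes, 10, 15)

# The equivalent condition written with comparison operators
in_range_manual <- minutes >= 10 & minutes <= 15

print(in_range)
print(identical(in_range, in_range_manual))
```

The values exactly at 10 and 15 are kept, while 9.99 and 15.01 are dropped.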
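The next function, near(), is a little different. R stores decimals with finite floating-point precision, so a computed result can differ slightly from the exact value and `==` may unexpectedly return FALSE. near() instead tests whether two numbers are equal within a small tolerance, as shown below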
(1 / 2) * 2 == 1
[1] TRUE
(1 / 49) * 49 == 1
[1] FALSE
The same comparison using near()
near((1 / 49) * 49, 1)
r$> near((1 / 49) * 49, 1)
[1] TRUE
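The same idea carries over to filter(). The sketch below uses a made-up one-column data frame (`values` and `x` are hypothetical names) to compare `==` with near(), and also shows near()'s `tol` argument, which sets how far apart two numbers may be and still count as equal.

```r
library(dplyr)

# Hypothetical data: sqrt(2)^2 is mathematically 2 but not exactly 2 in floating point
values <- data.frame(x = c(sqrt(2)^2, 2, 2.1))

# `==` keeps only the value stored as exactly 2
exact <- filter(values, x == 2)

# near() also keeps the value that is off by a rounding error
approx <- filter(values, near(x, 2))

# A wider tolerance treats 2.1 as near 2 as well
loose <- filter(values, near(x, 2, tol = 0.2))

print(nrow(exact))
print(nrow(approx))
print(nrow(loose))
```

A wide `tol` is rarely what you want for this purpose; for a deliberate numeric range, between() states the intent more clearly.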