這篇發的間隔有點久,慢慢的加入了比較多的編程與練習,像是HackerRank解題,也接了一些數據分析的案子在執行,上完Google數據分析的課程之後,一直想找個時間把Study 1也完成,邊處理其他事情也邊進行研究,除了加入雙語的版本之外,也多了其他的函數輔助分析,當然也還是遇到不少問題,像是用VScode、RStudio寫的程式碼都沒問題,偏偏丟到Kaggle上就出現錯誤,連lubridate都無法使用,換了好多種方式目前都還無法解決,為了之後一直有這種問題產生,之後小雷除了部落格之外,也會把完整的Rmarkdown報告放置在Google雲端中,有興趣的朋友除了Github之外,也可以下載瀏覽,載點就統一放在各個分析文章的末端。
至於為什麼沒有一開始就選這個Study來研究,因為小雷不愛騎腳踏車..雙語的版本,小雷的英文能力並沒有太好,所以其中若是有文法上的錯誤,還請見諒,分析報告中覺得好用的文法,也會寫成語法解析。
小雷的Kaggle >> 點我連結
小雷的Github >> 點我連結
觀看完整報告,請至雲端下載PDF : 點我連結View the full report, please go to the google cloud to download the PDF
---------------------------------------------做個分隔線------------------------------------------------
共享單車Cyclistic案例分析 | Data Analysis Cyclistic by R
關於Cyclistic | About Cyclistic
========================================================================================================
2016年Cyclistic推出共享單車的產品,截至目前共擁有超過5824輛自行車,與超過692個站點。
In 2016, Cyclistic launched a bike-sharing product. Up to now, it has more than 5,824 bikes and more than 692 sites。
營運模式 | Business Model
========================================================================================================
Cyclistic會員 : 購買年度會員。
休閒顧客 : 購買單程通行票、全天通行票。
輔助自行車(斜倚自行車、手三輪車和貨運自行車)約8%使用者。
約30%單車使用者每天為通勤需求。
Cyclistic member : Annual membersgip。
Casual customer : One-way passes、All-day passes。
About 8% users of assistive bicycles(reslining bicycles, hand tricycles and cargo bicycles)。
About 30% users of cycliists commute every work day。
主要目標 | Main Target
========================================================================================================
公司未來增長的關鍵為年度會員。
最大化將休閒顧客轉化年度會員。
分析Cyclistic共享單車數據與趨勢識別。
The key to the company’s future growth is annual membership。
Maximize the conversion of casual customers into annual membership。
Analysis of Cyclistic shared bicycle data and trend identification。
七個分析階段與流程 | Into Seven Analysis Phases and Processes
========================================================================================================
問題解析。
數據檢視。
數據清洗。
數據統整。
趨勢分析
可視化圖表。
結論與建議。
Problem analysis。
Data Set View。
Clearn data。
Data consolidation。
Analysis and trends。
Visualization chart。
Conclusion and suggertion。
第一階段 : 問題解析 | Phase One : Problem Analysis
1. 確定利益相關人與團隊 | Identify stakeholders and teams
主要利益關係者 : Cyclistic、Lily Moreno。
次要利益相關者 : 循環營銷分析團隊、循環執行團隊。
Primary stakeholders : Cyclistic、Lily Moreno。
Secondary stakeholders : Cyclistic marketing analytics team、Cyclistic executive team。
2. 主要業務目標 | Main bussiness objective
年會會員和休閒騎手使用 Cyclistic 自行車有何不同?
數據中的趨勢如何提供建議使Cyclistic進行策略研擬?
What the difference between membership and casual riders useing cyclistic bike?
How trends in data advise cyclistic for strategy development?
3. 數據使用範圍 | Data usage range
禁止使用使用者的身分個人資料、通行證購買是否多次。
數據操作範圍為2019_Q2至2020_Q1共一年數據操作使用。
數據格式、欄位以最新數據2020_Q1為基準。
Prohibit the use of the user’s identity personal data, whether the passes has been purchased times。
The data range chosenfrom 2019_Q2 to 2020_Q1。
The data type and field are based on the latest data 2020_Q1。
4. 需要解決的問題 | Issues that need resolving
使用者標籤紀錄共有四個,需要一致性處理。
數據中因紀錄問題而產生負數,需要刪除調整。
數據時間範圍劃分需要其他細部數據供彙整。
數據欄位與2020_Q1不符處需做調整。
There are four user tag records in total, which need to be processed consistently。
Negative numbers are generated in the data set to recording problem and need to bd deleted and adjusted。
Data fields that don’t match 2020_Q1 need to be adjusted。
第二階段 : 數據檢視 | Phase Two : Data Set View
特定數據集出處引用 : https://www.kaggle.com/malhaddi/divvy-trips
來源取得由Kaggle,檔案為可信任之公開數據集。
數據內容作者 : Mohammed
數據存放 : MySQL、Kaggle。
編碼位置 : Kaggle、Github。
編碼語言 : R。
IDE : RStudio、VScode。
References to specific datasets。
The source is obtained from Kaggle, the file is a trusted public dataset。
Data storage : MySQL、Kaggle。
Coding position : Kaggle、Github。
Coding language : R。
第三階段 : 數據清洗 | Phase Three : Clearn data
1. 數據確認 | Check data
- 載入所需模組。
- Load the required modules。
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.6 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.1.1 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(lubridate)
##
## 載入套件:'lubridate'
## 下列物件被遮斷自 'package:base':
##
## date, intersect, setdiff, union
library(dplyr)
library(tidyr)
library(gcookbook)
- 數據檔案導入。
- Data file import。
# --------------------- Data Location ---------------------------------
setwd("D:/Github_version_file_R/data_set/clylistic_data")
# --------------------- import data set ---------------------------------
# 2019_data
q2_19 <- read.csv("Divvy_Trips_2019_Q2.csv")
q3_19 <- read.csv("Divvy_Trips_2019_Q3.csv")
q4_19 <- read.csv("Divvy_Trips_2019_Q4.csv")
# 2020_data
q1_20 <- read.csv("Divvy_Trips_2020_Q1.csv")
- 數據內容預覽。
- Data view。
str(q2_19)
## 'data.frame': 1108163 obs. of 12 variables:
## $ X01...Rental.Details.Rental.ID : int 22178529 22178530 22178531 22178532 22178533 22178534 22178535 22178536 22178537 22178538 ...
## $ X01...Rental.Details.Local.Start.Time : chr "2019-04-01 00:02:22" "2019-04-01 00:03:02" "2019-04-01 00:11:07" "2019-04-01 00:13:01" ...
## $ X01...Rental.Details.Local.End.Time : chr "2019-04-01 00:09:48" "2019-04-01 00:20:30" "2019-04-01 00:15:19" "2019-04-01 00:18:58" ...
## $ X01...Rental.Details.Bike.ID : int 6251 6226 5649 4151 3270 3123 6418 4513 3280 5534 ...
## $ X01...Rental.Details.Duration.In.Seconds.Uncapped: chr "446.0" "1,048.0" "252.0" "357.0" ...
## $ X03...Rental.Start.Station.ID : int 81 317 283 26 202 420 503 260 211 211 ...
## $ X03...Rental.Start.Station.Name : chr "Daley Center Plaza" "Wood St & Taylor St" "LaSalle St & Jackson Blvd" "McClurg Ct & Illinois St" ...
## $ X02...Rental.End.Station.ID : int 56 59 174 133 129 426 500 499 211 211 ...
## $ X02...Rental.End.Station.Name : chr "Desplaines St & Kinzie St" "Wabash Ave & Roosevelt Rd" "Canal St & Madison St" "Kingsbury St & Kinzie St" ...
## $ User.Type : chr "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
## $ Member.Gender : chr "Male" "Female" "Male" "Male" ...
## $ X05...Member.Details.Member.Birthday.Year : int 1975 1984 1990 1993 1992 1999 1969 1991 NA NA ...
str(q3_19)
## 'data.frame': 1640718 obs. of 12 variables:
## $ trip_id : int 23479388 23479389 23479390 23479391 23479392 23479393 23479394 23479395 23479396 23479397 ...
## $ start_time : chr "2019-07-01 00:00:27" "2019-07-01 00:01:16" "2019-07-01 00:01:48" "2019-07-01 00:02:07" ...
## $ end_time : chr "2019-07-01 00:20:41" "2019-07-01 00:18:44" "2019-07-01 00:27:42" "2019-07-01 00:27:10" ...
## $ bikeid : int 3591 5353 6180 5540 6014 4941 3770 5442 2957 6091 ...
## $ tripduration : chr "1,214.0" "1,048.0" "1,554.0" "1,503.0" ...
## $ from_station_id : int 117 381 313 313 168 300 168 313 43 43 ...
## $ from_station_name: chr "Wilton Ave & Belmont Ave" "Western Ave & Monroe St" "Lakeview Ave & Fullerton Pkwy" "Lakeview Ave & Fullerton Pkwy" ...
## $ to_station_id : int 497 203 144 144 62 232 62 144 195 195 ...
## $ to_station_name : chr "Kimball Ave & Belmont Ave" "Western Ave & 21st St" "Larrabee St & Webster Ave" "Larrabee St & Webster Ave" ...
## $ usertype : chr "Subscriber" "Customer" "Customer" "Customer" ...
## $ gender : chr "Male" "" "" "" ...
## $ birthyear : int 1992 NA NA NA NA 1990 NA NA NA NA ...
str(q4_19)
## 'data.frame': 704054 obs. of 12 variables:
## $ trip_id : int 25223640 25223641 25223642 25223643 25223644 25223645 25223646 25223647 25223648 25223649 ...
## $ start_time : chr "2019-10-01 00:01:39" "2019-10-01 00:02:16" "2019-10-01 00:04:32" "2019-10-01 00:04:32" ...
## $ end_time : chr "2019-10-01 00:17:20" "2019-10-01 00:06:34" "2019-10-01 00:18:43" "2019-10-01 00:43:43" ...
## $ bikeid : int 2215 6328 3003 3275 5294 1891 1061 1274 6011 2957 ...
## $ tripduration : chr "940.0" "258.0" "850.0" "2,350.0" ...
## $ from_station_id : int 20 19 84 313 210 156 84 156 156 336 ...
## $ from_station_name: chr "Sheffield Ave & Kingsbury St" "Throop (Loomis) St & Taylor St" "Milwaukee Ave & Grand Ave" "Lakeview Ave & Fullerton Pkwy" ...
## $ to_station_id : int 309 241 199 290 382 226 142 463 463 336 ...
## $ to_station_name : chr "Leavitt St & Armitage Ave" "Morgan St & Polk St" "Wabash Ave & Grand Ave" "Kedzie Ave & Palmer Ct" ...
## $ usertype : chr "Subscriber" "Subscriber" "Subscriber" "Subscriber" ...
## $ gender : chr "Male" "Male" "Female" "Male" ...
## $ birthyear : int 1987 1998 1991 1990 1987 1994 1991 1995 1993 NA ...
str(q1_20)
## 'data.frame': 426887 obs. of 13 variables:
## $ ride_id : chr "EACB19130B0CDA4A" "8FED874C809DC021" "789F3C21E472CA96" "C9A388DAC6ABF313" ...
## $ rideable_type : chr "docked_bike" "docked_bike" "docked_bike" "docked_bike" ...
## $ started_at : chr "2020-01-21 20:06:59" "2020-01-30 14:22:39" "2020-01-09 19:29:26" "2020-01-06 16:17:07" ...
## $ ended_at : chr "2020-01-21 20:14:30" "2020-01-30 14:26:22" "2020-01-09 19:32:17" "2020-01-06 16:25:56" ...
## $ start_station_name: chr "Western Ave & Leland Ave" "Clark St & Montrose Ave" "Broadway & Belmont Ave" "Clark St & Randolph St" ...
## $ start_station_id : int 239 234 296 51 66 212 96 96 212 38 ...
## $ end_station_name : chr "Clark St & Leland Ave" "Southport Ave & Irving Park Rd" "Wilton Ave & Belmont Ave" "Fairbanks Ct & Grand Ave" ...
## $ end_station_id : int 326 318 117 24 212 96 212 212 96 100 ...
## $ start_lat : num 42 42 41.9 41.9 41.9 ...
## $ start_lng : num -87.7 -87.7 -87.6 -87.6 -87.6 ...
## $ end_lat : num 42 42 41.9 41.9 41.9 ...
## $ end_lng : num -87.7 -87.7 -87.7 -87.6 -87.6 ...
## $ member_casual : chr "member" "member" "member" "member" ...
colnames(q2_19)
## [1] "X01...Rental.Details.Rental.ID"
## [2] "X01...Rental.Details.Local.Start.Time"
## [3] "X01...Rental.Details.Local.End.Time"
## [4] "X01...Rental.Details.Bike.ID"
## [5] "X01...Rental.Details.Duration.In.Seconds.Uncapped"
## [6] "X03...Rental.Start.Station.ID"
## [7] "X03...Rental.Start.Station.Name"
## [8] "X02...Rental.End.Station.ID"
## [9] "X02...Rental.End.Station.Name"
## [10] "User.Type"
## [11] "Member.Gender"
## [12] "X05...Member.Details.Member.Birthday.Year"
colnames(q3_19)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(q4_19)
## [1] "trip_id" "start_time" "end_time"
## [4] "bikeid" "tripduration" "from_station_id"
## [7] "from_station_name" "to_station_id" "to_station_name"
## [10] "usertype" "gender" "birthyear"
colnames(q1_20)
## [1] "ride_id" "rideable_type" "started_at"
## [4] "ended_at" "start_station_name" "start_station_id"
## [7] "end_station_name" "end_station_id" "start_lat"
## [10] "start_lng" "end_lat" "end_lng"
## [13] "member_casual"
2. 數據一致性整理與合併 | Data consistency sorting and merging
- 欄位名稱一致性調整,變更符合最新數據之欄位名稱。
- Col name consistency adjustment, change the col name to mach the latest data。
# 調整名稱與2020數據一致 rename consistent to the 2020 data colnames
q4_19 <- q4_19 %>%
rename(
ride_id = trip_id,
started_at = start_time,
ended_at = end_time,
rideable_type = bikeid,
start_station_id = from_station_id,
start_station_name = from_station_name,
end_station_id = to_station_id,
end_station_name = to_station_name,
member_casual = usertype
)
q3_19 <- q3_19 %>%
rename(
ride_id = trip_id,
started_at = start_time,
ended_at = end_time,
rideable_type = bikeid,
start_station_id = from_station_id,
start_station_name = from_station_name,
end_station_id = to_station_id,
end_station_name = to_station_name,
member_casual = usertype
)
q2_19 <- q2_19 %>%
rename(
ride_id = X01...Rental.Details.Rental.ID,
rideable_type = X01...Rental.Details.Bike.ID,
started_at = X01...Rental.Details.Local.Start.Time,
ended_at = X01...Rental.Details.Local.End.Time,
start_station_name = X03...Rental.Start.Station.Name,
start_station_id = X03...Rental.Start.Station.ID,
end_station_name = X02...Rental.End.Station.Name,
end_station_id = X02...Rental.End.Station.ID,
member_casual = User.Type
)
- 變更欄位型態,進行合併前準備動作。
- Change the data type and perform pre-merge preparation actions。
# 轉換數據型態準備整合 change data type before combine
q2_19 <- q2_19 %>%
mutate(
ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type)
)
q3_19 <- q3_19 %>%
mutate(
ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type)
)
q4_19 <- q4_19 %>%
mutate(
ride_id = as.character(ride_id),
rideable_type = as.character(rideable_type)
)
- 數據集合併,依照欄位進行。
- Data set merging, according to the col。
# 依照欄位進行合併 merge data set by rows
combine_datas <- bind_rows(q2_19, q3_19, q4_19, q1_20)
- 移除過舊欄位。
- Remove outdated col。
# 移除最新的檔案(2020)中沒有的項目 Remove items is not in the 2020 data set
combine_datas <- combine_datas %>%
select(-c(
"X01...Rental.Details.Duration.In.Seconds.Uncapped",
"Member.Gender",
"X05...Member.Details.Member.Birthday.Year",
"tripduration",
"gender",
"birthyear",
"start_lat",
"start_lng",
"end_lat",
"end_lng"
))
- 檢視合併後數據內容。
- View the merged data content。
# 檢查合併後數據內容 Check data values
dim(combine_datas)
## [1] 3879822 9
tail(combine_datas)
## ride_id started_at ended_at rideable_type
## 3879817 6F4D221BDDFD943F 2020-03-10 10:40:27 2020-03-10 10:40:29 docked_bike
## 3879818 ADDAA33CEBCAE733 2020-03-10 10:40:06 2020-03-10 10:40:07 docked_bike
## 3879819 82B10FA3994BC66A 2020-03-07 15:25:55 2020-03-07 16:14:03 docked_bike
## 3879820 AA0D5AAA0B59C8AA 2020-03-01 13:12:38 2020-03-01 13:38:29 docked_bike
## 3879821 3296360A7BC20FB8 2020-03-07 18:02:45 2020-03-07 18:13:18 docked_bike
## 3879822 064EC7698E4FF9B3 2020-03-08 13:03:57 2020-03-08 13:32:27 docked_bike
## start_station_id start_station_name end_station_id
## 3879817 675 HQ QR 675
## 3879818 675 HQ QR 675
## 3879819 161 Rush St & Superior St 240
## 3879820 141 Clark St & Lincoln Ave 210
## 3879821 672 Franklin St & Illinois St 264
## 3879822 110 Dearborn St & Erie St 85
## end_station_name member_casual
## 3879817 HQ QR casual
## 3879818 HQ QR casual
## 3879819 Sheridan Rd & Irving Park Rd member
## 3879820 Ashland Ave & Division St casual
## 3879821 Stetson Ave & South Water St member
## 3879822 Michigan Ave & Oak St casual
summary(combine_datas)
## ride_id started_at ended_at rideable_type
## Length:3879822 Length:3879822 Length:3879822 Length:3879822
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_station_id start_station_name end_station_id end_station_name
## Min. : 1.0 Length:3879822 Min. : 1.0 Length:3879822
## 1st Qu.: 77.0 Class :character 1st Qu.: 77.0 Class :character
## Median :174.0 Mode :character Median :174.0 Mode :character
## Mean :202.9 Mean :203.8
## 3rd Qu.:291.0 3rd Qu.:291.0
## Max. :675.0 Max. :675.0
## NA's :1
## member_casual
## Length:3879822
## Class :character
## Mode :character
##
##
##
##
3. 數據清理 | Clearn data
- 檢查使用者標籤與變更重複命名之欄位。
- Check user labels and change duplicated named fileds。
# 檢查member內的值 Check values in merber in the table
table(combine_datas$member_casual)
##
## casual Customer member Subscriber
## 48480 857474 378407 2595461
# 重複命名的rows變更 >> Subscriber與Customer Fix duplicate naming in Subscriber & Customer
combine_datas <- combine_datas %>%
mutate(member_casual = recode(member_casual, "Subscriber" = "member", "Customer" = "casual"))
- 增加日期格式與檢察數據值。
- Add date format and check data values。
combine_datas <- combine_datas %>%
mutate(
year = year(combine_datas$started_at),
month = month(combine_datas$started_at, label = TRUE),
week = wday(combine_datas$started_at, label = TRUE),
day = mday(combine_datas$started_at),
hour = hour(combine_datas$started_at)
)
# 檢查變更後的值 Check value
unique(combine_datas$year)
## [1] 2019 2020
unique(combine_datas$month)
## [1] 四月 五月 六月 七月 八月 九月 十月 十一月 十二月 一月
## [11] 二月 三月
## 12 Levels: 一月 < 二月 < 三月 < 四月 < 五月 < 六月 < 七月 < 八月 < ... < 十二月
unique(combine_datas$week)
## [1] 週一 週二 週三 週四 週五 週六 週日
## Levels: 週日 < 週一 < 週二 < 週三 < 週四 < 週五 < 週六
unique(combine_datas$day)
## [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31
unique(combine_datas$hour)
## [1] 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
- 重新給定相關日期值。
- Reassign the relevant date values。
# 變更日期格式,新增欄位月、日、年、周 Format date type in month、day、year、week
# 重新給定星期的值 Change chinese to english in weeks
combine_datas <- combine_datas %>%
mutate(week = recode(week, "週日" = "Sunday", "週一" = "Monday", "週二" = "Tuesday", "週三" = "Wednesday", "週四" = "Thursday", "週五" = "Friday", "週六" = "Saturday"))
# 重新給定月份的值 Change chinese to english in month
combine_datas <- combine_datas %>%
mutate(month = recode(month, "一月" = "January", "二月" = "February", "三月" = "March", "四月" = "April", "五月" = "May", "六月" = "June", "七月" = "July", "八月" = "August", "九月" = "September", "十月" = "October", "十一月" = "November", "十二月" = "December"))
- 增加騎乘期間長度值,並轉換型態供統計使用。
- Increase the length value of the riding period and convert the type for statistical use。
# 增加欄位 >> 騎乘長度(單位 : 秒), Add Rows >> ride_length (in seconds)
combine_datas$ride_length <- difftime(combine_datas$ended_at, combine_datas$started_at)
# 轉換為數字作為統計使用 Change numeric type for statistics use
combine_datas$ride_length <- as.numeric(as.character(combine_datas$ride_length))
# 增加欄位 >> 騎乘長度(單位 : 分鐘),Add Rows >> ride_length (in minutes)
combine_datas <- combine_datas %>%
mutate(ride_length_minutes = (ride_length / 60))
# 轉換為數字作為統計使用 Change numeric type for statistics use
combine_datas$ride_length_minutes <- as.numeric(as.character(combine_datas$ride_length_minutes))
# 確認是否為數字 Check num type
is.numeric(combine_datas$ride_length_minutes)
## [1] TRUE
is.numeric(combine_datas$ride_length)
## [1] TRUE
- 移除錯誤數據、負值,並檢查是否乾淨。
- Remove bad data、negative values, and check for cleanliness。
# 移除錯誤數據 Remove bad data
combine_datas <- combine_datas[!(combine_datas$start_station_name == "HQ QR" | combine_datas$ride_length < 0 | combine_datas$ride_length_minutes < 0), ]
# 檢查是否移除乾淨 Check clearn or not
sum(combine_datas$start_station_name == "HQ QR")
## [1] 0
sum(combine_datas$ride_length <= 0)
## [1] 0
sum(combine_datas$ride_length_minutes <= 0)
## [1] 0
- 日期中星期、月份順序調整。
- Order of the week、month to adjusted。
# 存成新檔案 Save data sat as new file
combine_datas_clearn <- combine_datas
# 星期進行排序 Sort Week
combine_datas_clearn$week <- ordered(combine_datas_clearn$week, levels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
# 月份進行排序 Sort month
combine_datas_clearn$month <- ordered(combine_datas_clearn$month, levels = c("January", "February", "March", "April", "May", "June", "July", "August", "September", "October", "November", "December"))
第四階段 : 數據統整 | Phase Four : Data consolidation
1. 使用者數量統計 | Statistics of the number of users。
# 使用者數量統計並顯示比例
user_sum <- combine_datas_clearn %>%
group_by(member_casual) %>%
summarise(total_riders = n())
user_sum
## # A tibble: 2 x 2
## member_casual total_riders
## <chr> <int>
## 1 casual 902182
## 2 member 2973860
per_rider <- paste(
round(user_sum$total_riders / sum(user_sum$total_riders) * 100, digits = 2), "%"
)
per_rider
## [1] "23.28 %" "76.72 %"
2. 以星期、型態進行數據彙整 | Data aggregation by week and member type。
# 依照星期、型態分組進行騎乘者分析 Data aggregation by week and member type。
riders_weeks <- combine_datas_clearn %>%
group_by(member_casual, week) %>%
summarise(
total_riders = n(),
avg_length_minutes = mean(ride_length_minutes)
)
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
riders_weeks
## # A tibble: 14 x 4
## # Groups: member_casual [2]
## member_casual week total_riders avg_length_minutes
## <chr> <ord> <int> <dbl>
## 1 casual Sunday 181293 59.7
## 2 casual Monday 103296 56.2
## 3 casual Tuesday 90510 59.9
## 4 casual Wednesday 92457 62.0
## 5 casual Thursday 102679 61.4
## 6 casual Friday 122404 62.9
## 7 casual Saturday 209543 55.5
## 8 member Sunday 267965 15.3
## 9 member Monday 472196 14.0
## 10 member Tuesday 508445 13.8
## 11 member Wednesday 500329 13.7
## 12 member Thursday 484177 13.7
## 13 member Friday 452790 13.7
## 14 member Saturday 287958 16.1
3. 以小時、會員總數進行數據彙整 | Data aggregation by hour and total number of members。
# 以小時、會員總數進行觀察 Data aggregation by hour and total number of members
riders_day_of_hour <- combine_datas_clearn %>%
group_by(member_casual, hour) %>%
summarise(
total_riders = n(),
avg_length_minutes = mean(ride_length_minutes)
)
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
riders_day_of_hour
## # A tibble: 48 x 4
## # Groups: member_casual [2]
## member_casual hour total_riders avg_length_minutes
## <chr> <int> <int> <dbl>
## 1 casual 0 8363 104.
## 2 casual 1 5495 104.
## 3 casual 2 3361 104.
## 4 casual 3 1982 170.
## 5 casual 4 1196 127.
## 6 casual 5 2690 99.0
## 7 casual 6 6291 66.4
## 8 casual 7 13302 32.2
## 9 casual 8 22304 54.8
## 10 casual 9 29057 68.2
## # ... with 38 more rows
4. 以月份、會員數進行數據彙整 | Data aggregation by month and total number of members。
# 以月、會員總數進行觀察 Data aggregation by month and total number of members
riders_month <- combine_datas_clearn %>%
group_by(member_casual, month) %>%
summarise(
total_riders = n(),
avg_length_minutes = mean(ride_length_minutes),
avg_length_seconds = mean(ride_length)
) %>%
arrange(month)
## `summarise()` has grouped output by 'member_casual'. You can override using the `.groups` argument.
riders_month
## # A tibble: 24 x 5
## # Groups: member_casual [2]
## member_casual month total_riders avg_length_minutes avg_length_seconds
## <chr> <ord> <int> <dbl> <dbl>
## 1 casual January 7785 162. 9699.
## 2 member January 136099 11.1 669.
## 3 casual February 12314 133. 7997.
## 4 member February 126714 12.8 768.
## 5 casual March 24615 70.8 4250.
## 6 member March 115593 14.3 860.
## 7 casual April 47744 50.9 3057.
## 8 member April 217566 13.5 811.
## 9 casual May 81624 51.2 3074.
## 10 member May 285834 13.8 831.
## # ... with 14 more rows
5. 站點使用次數彙整,取出使用最高與最低的站點比較 | Summarize the number of sites used, and compare top and last sites。
# 會員與站點次數,取出使用最高與最低的站點比較 Summarize the number of sites used, and compare top and last sites
# Top Station
member_station_top <- combine_datas_clearn %>%
group_by(start_station_name, member_casual) %>%
summarise(
total_riders = n(),
avg_length_minutes = mean(ride_length_minutes)
) %>%
arrange(desc(total_riders))
## `summarise()` has grouped output by 'start_station_name'. You can override using the `.groups` argument.
member_station_top[1:10, ]
## # A tibble: 10 x 4
## # Groups: start_station_name [10]
## start_station_name member_casual total_riders avg_length_minutes
## <chr> <chr> <int> <dbl>
## 1 Streeter Dr & Grand Ave casual 53415 49.1
## 2 Canal St & Adams St member 51948 12.6
## 3 Clinton St & Madison St member 46191 11.8
## 4 Clinton St & Washington Blvd member 43590 11.8
## 5 Lake Shore Dr & Monroe St casual 39686 49.2
## 6 Columbus Dr & Randolph St member 31053 12.0
## 7 Franklin St & Monroe St member 30982 14.0
## 8 Kingsbury St & Kinzie St member 30916 9.99
## 9 Daley Center Plaza member 30611 13.8
## 10 Canal St & Madison St member 26173 15.9
# Bottom Station
member_station_bottom <- combine_datas_clearn %>%
group_by(start_station_name, member_casual) %>%
summarise(
total_riders = n(),
avg_length_minutes = mean(ride_length_minutes)
) %>%
arrange(total_riders)
## `summarise()` has grouped output by 'start_station_name'. You can override using the `.groups` argument.
member_station_bottom[1:10, ]
## # A tibble: 10 x 4
## # Groups: start_station_name [10]
## start_station_name member_casual total_riders avg_length_minut~
## <chr> <chr> <int> <dbl>
## 1 LBS - BBB La Magie casual 1 2.5
## 2 Special Events casual 1 9.25
## 3 DIVVY Map Frame B/C Station casual 2 2.58
## 4 MTL-ECO5.1-01 casual 2 1.92
## 5 South Chicago Ave & Elliot Ave member 2 23.3
## 6 DIVVY CASSETTE REPAIR MOBILE ST~ member 3 6.98
## 7 Stewart Ave & 63rd St (*) casual 3 25.0
## 8 Vincennes Ave & 75th St member 3 22.4
## 9 Racine Ave & 61st St member 4 10367.
## 10 Racine Ave & 65th St member 4 20.6
第五階段 : 趨勢分析 | Phase Five : Analysis and trend
綜合以上數據清理、聚合結果,針對使用者人數、日期、比例、騎乘時間長度以及站點使用情況等五點進行交叉分析
1. 從使用者人數趨勢觀察,在行為訴求與外在條件的影響下,活躍的使用人數有不小的落差趨勢存在,統整以下兩點作為人數分析依據
星期統計數據顯示,周末休閒騎手活躍人數最多(超過18萬人次),平日則呈現下滑的趨勢(9萬至12萬人次),年度會員則是星期一至星期五活躍人數最多(45至50萬人次),周末則是減少了近50%的人次,顯示年度會員與休閒騎手在使用訴求上有明顯的差異性。
月統計數據顯示,6、7、8、9月為休閒騎手最活躍的月份(超過10萬人次以上),最低的月份則為1、2、11、12月(不滿2萬人次),年度會員活躍的月份則為4至10月(20至40萬人次),11月至3月則是呈現下滑(11萬至15萬人次不等),氣候的穩定度將會影響使用者(無論休閒騎手或是年度會員)騎乘意願。
2. 年度會員與休閒騎手比例有令人驚訝的部分,在於人數統計的差距,休閒騎手的占比並沒有預期中的高,以下為人數占比的相關數據統整
總使用者為3,876,042(人/次),其中年度會員占比約76.72%,休閒會員占約23.28%,顯示大部分使用者都已經是年度會員。
3. 在騎行時間的數據統計上,呈現極大的落差趨勢,在休閒騎手與年度會員的選擇上,有起到一定程度的決定因子,以下為統整後的數據狀況
騎行長度與星期比對數據顯示休閒騎手平均騎行時間約為60分鐘,年度會員則為14分鐘左右,在不同的訴求情況下大幅度的影響了騎行者的時間長短。
月統計數據顯示於1、2月份有較特別的騎行長度,最低的休閒人數卻有最久的騎行長度,表示有特殊選擇的使用者會在天氣寒冷的情況進行長時間的訓練或是休閒騎乘。
4. 最後則是各個站點於年度會員與休閒騎手的使用情形,使用率越高的站點,年度會員人數越密集,休閒騎手的密集程度則降低,使用率越低的站點,密度越趨平均,有幾種可以思考的假設,一為環境因素影響,二為心理情緒所影響,使用率越高的站點,顯示周遭交通與人流越頻繁程度,無論是空氣品質、周遭安全性與放鬆程度,都有可能降低休閒者的使用意願。
Based on the above data cleaning and aggregation results, cross-analysis is carried out on five points, including the number of users, date, ratio, length of riding time, and site usage.
1. From the observation of the trend of the number of users, under the influence of behavioral demands and external conditions, there is a large gap in the number of active users. The following two points are unified as the basis for the analysis of the number of users
Weekly statistics show that recreational riders are the most active on weekends (over 180,000), while on weekdays there is a downward trend (90,000 to 120,000), and annual members are the most active from Monday to Friday (450,000 to 500,000). ), and the number of membership decreased by nearly 50% on weekends, showing that there is a clear difference in the usage demands of annual members and recreational riders.
Monthly statistics show that June, July, August, and September are the most active months for recreational riders (more than 100,000 person-times), while the lowest months are January, February, November, and December (less than 20,000 person-times). The active months are April to October (200,000 to 400,000 passengers), and November to March shows a decline (110,000 to 150,000 passengers), and the stability of the climate will affect users (regardless of recreational riders). or Annual Member) willingness to use.
2. The ratio of annual members to recreational riders is surprising. It is due to the difference in statistics in the number of people. The proportion of recreational riders is not as high as expected. The following is the relevant data of the proportion of the number of people.
The total number of users is 3,876,042 (person/time), of which annual members account for about 76.72% and leisure members account for about 23.28%, indicating that most users are already annual members.
3. In the statistics of riding time, there is a huge gap trend. There is a certain degree of determinant in the choice of recreational riders and annual members. The following is the consolidated data status
The comparison data of riding length and week shows that the average riding time of leisure riders is about 60 minutes, and that of annual members is about 14 minutes, which greatly affects the length of riding time under different demands.
Monthly statistics show that in January and February, there are more special riding lengths, and the least number of casual riders but has the longest riding length, which means that users with special choices will conduct long-term training or leisure riding in cold weather.
4. The last is the usage of annual members and casual riders at each site. The higher the usage rate, the denser the annual membership, and the lower the density of recreational riders. The lower the usage rate, the more average the density. several hypotheses can be considered, one is the influence of environmental factors, and the other is the influence of psychological emotions. The higher the usage rate of the site, the more frequent the traffic and the flow of people around, whether it is the air quality, the surrounding safety, and the degree of relaxation, which may reduce the willingness of leisure users to use.
第六階段 : 可視化圖表 | Phase Six : Visualization chart
1. 統計所有使用者圖表 | Statistics of all users graph
# 使用者數量統計
user_sum %>%
ggplot(mapping = aes(
x = member_casual,
y = total_riders,
fill = member_casual
)) +
geom_col(position = "dodge") +
scale_y_continuous(
breaks = c(0, max(user_sum$total_riders), 500000),
name = "Total Riders"
) +
scale_x_discrete(
name = "Member & Casual"
) +
scale_fill_manual(values = c("tomato1", "springgreen3")) +
labs(title = "使用者總人數比對 Comparison of the Total Number of Users")
# 轉化圓餅圖
pie_percent <- pie(user_sum$total_riders,
labels = per_rider,
col = c("yellow2", "violetred"),
main = "年度會員與休閒會員比例 | Annual membership to the casual ratio"
)
# 添加說明
legend("topright", legend = c("Casual", "Member"), cex = 1, fill = c("yellow2", "violetred"))
2. 依照星期、平均騎乘時間進行使用者可視化 | User visualzation by week、average ride time。
# 依照星期進行騎乘者分析
riders_weeks %>%
ggplot(mapping = aes(
x = week,
y = total_riders,
fill = member_casual
)) +
geom_col(position = "dodge") +
scale_y_continuous(
breaks = c(seq(0, max(riders_weeks$total_riders), 50000)),
name = "Total Riders"
) +
scale_x_discrete(name = "Week Of Day") +
scale_fill_manual(values = c("violetred4", "springgreen4")) +
labs(
title = "星期與使用者 Week & Riders",
subtitle = "用戶數於週統計 Number of Users by Weeks"
)
# 平均騎乘時間
riders_weeks %>%
ggplot(mapping = aes(
x = week,
y = avg_length_minutes,
fill = member_casual
)) +
geom_col(position = "dodge") +
scale_y_continuous(
name = "Average Length (Unit : Minutes)"
) +
scale_x_discrete("Week Of Day") +
scale_fill_manual(values = c("violetred4", "springgreen4")) +
labs(
title = "各使用者平均騎乘時間 Average Ride Length by User",
subtitle = "統計於週的時間分布表 Statistical Time Distribution Table by Week"
)
3. 依照小時進行可視化 | Visualzation by hour。
# 依照小時進行騎乘者分析
riders_day_of_hour %>%
ggplot(mapping = aes(
x = hour,
y = total_riders,
fill = member_casual
)) +
geom_col(position = "dodge") +
scale_y_continuous(
breaks = c(seq(0, max(riders_day_of_hour$total_riders), 50000)),
name = "Total Riders"
) +
scale_x_continuous(
name = "Hours"
) +
scale_fill_manual(values = c("royalblue3", "orange2")) +
labs(
title = "不同時段人數比較 The Number of Users in Different Time Periods",
subtitle = " 使用者時段統計 User Period Statistics"
)
# 平均騎乘時間
riders_day_of_hour %>%
ggplot(mapping = aes(
x = avg_length_minutes,
y = total_riders,
fill = member_casual
)) +
geom_violin() +
scale_y_continuous(
breaks = c(seq(0, max(riders_day_of_hour$total_riders), 50000)),
name = "Total Riders"
) +
scale_x_continuous(
name = "Averge Length (Unit : Minute)"
) +
scale_fill_manual(values = c("yellow3", "purple3")) +
labs(title = "使用者統計比對騎乘時間 User Statistics vs. Ride Time")
4. 以月份可視化圖表 | Visualzation by month。
# 依照月進行騎乘者分析
riders_month %>%
ggplot(mapping = aes(
x = month,
y = total_riders,
fill = member_casual
)) +
geom_col(position = "dodge") +
scale_y_continuous(
breaks = c(seq(0, max(riders_month$total_riders), 50000)),
name = "Total Riders"
) +
scale_x_discrete("Month", labels = abbreviate) + # 自動縮減labels長度
scale_fill_manual(values = c("lightsalmon3", "cornflowerblue")) +
labs(title = "以月份觀察使用者使用情形 Observing User Usage by Month")
# 平均騎乘時間
riders_month %>%
ggplot(mapping = aes(
x = avg_length_minutes,
y = month,
fill = member_casual
)) +
geom_col(position = "dodge") +
xlab("Average Length (Unit : Minutes)") +
ylab("Month") +
scale_fill_manual(values = c("lightsalmon3", "cornflowerblue")) +
labs(title = "使用者於各月份騎乘情形統計 Statistics of Users Situation in Month")
5. 比對排名為前二十、後二十個站點使用情況 | The comparison ranking is the useage of the top 20 and the last 20 sites。
# 熱門站點使用狀況
member_station_top[1:20, ] %>%
ggplot(mapping = aes(
x = start_station_name,
y = total_riders,
fill = avg_length_minutes
)) +
geom_col() +
scale_y_continuous(
breaks = c(seq(0, max(member_station_top$total_riders), 5000)),
name = "Total Riders"
) +
scale_x_discrete("Start Station (Unit : Top 20)", labels = abbreviate) +
theme(axis.text.x = element_text(angle = 45)) +
labs(title = "熱門站點使用者人數比對 Comparison of Users vs Popular Station")
# 冷門站點使用狀況
member_station_bottom[1:20, ] %>%
ggplot(mapping = aes(
x = start_station_name,
y = total_riders,
fill = avg_length_minutes
)) +
geom_col(position = "dodge") +
scale_y_continuous(
name = "Total Riders"
) +
scale_x_discrete("Start Station", labels = abbreviate) +
theme(axis.text.x = element_text(angle = 45)) +
labs(title = "冷門站點使用者人數比對 Comparison of Users vs Unpopular Station")
6. 使用分布顯示前後500個站點使用情形 | Usage distribution shows the usage of top and bottom 500 sites。
# 熱門站點使用狀況
member_station_top[1:500, ] %>%
ggplot(mapping = aes(
x = total_riders,
y = avg_length_minutes,
color = member_casual,
shape = member_casual,
)) +
geom_point(position = "jitter") +
facet_wrap(~member_casual) +
ylab("Average Length (Unit : Minutes)") +
xlab("Total Riders") +
labs(title = "使用最多的500個站點使用情形 Top 500 sites used")
# 冷門站點使用狀況
member_station_bottom[1:500, ] %>%
ggplot(mapping = aes(
x = total_riders,
y = avg_length_minutes,
color = member_casual,
shape = member_casual
)) +
geom_point(position = "jitter") +
theme(axis.text.x = element_text(angle = 30)) +
facet_wrap(~member_casual) +
ylab("Average Length (Unit : Minutes)") +
xlab("Total Riders") +
labs(title = "使用最少的500個站點使用情形 Minimum used 500 sites")
第七階段 : 結論與建議 | Phase Seven : Conclusion and suggertion
分析後結論
Cyclistic的財務策略認為年度會員比休閒騎手營利更可觀,但是針對以上所有數據統整、分析結果顯示,休閒騎手與年度會員不同之處不只在於資格花費上,更多的是使用訴求的差異(訓練用途、休閒用途、固定路徑),資格價值的認定(方便性、機動性)、氣候穩定度(寒冷、炎熱、潮濕)、周遭安全(人流量、車流量)與放鬆程度都將影響休閒騎手轉換為年度會員需要考量的因素。
Conclusion after analysis
Cyclistic’s financial strategy believes that annual membership is more profitable than recreational riders, but the results of the above data integration and analysis show that the difference between recreational riders and annual memberships is not only in the cost of qualifications but more in the difference in usage demands (training, leisure, fixed path), determination of membership value (convenience, mobility), climate stability (cold, hot, humid), surrounding safety (crowd flow, traffic flow), and degree of relaxation will all effect casual rider convert to an annual membership.
建議事項
1. 營銷策略建議著重於創造休閒騎手持續使用的習慣,持續習慣的培養建立,有助於休閒騎手添加其他不同於休閒的用途,進而創造資格價值產生的落差,自然的轉換為年度會員。
2. 行銷操作上建議不只是價格上的露出,媒體、網路、固定式廣告、各個站點中放入更多關於多工使用的標語、圖像等,引發休閒騎手對於其他騎乘方式的興趣,更好的達到宣傳的效果。
3. 除了媒體之外,建議搭配APP、穿戴式裝置等,結合休閒、通勤、臨時機動、運動、訓練等模式,並且提醒宣達使用者穩定騎乘的重要,不僅休閒會員的轉換,連帶年度會員也能產生更好的忠誠黏度,一舉數得。
Suggestions
1. Marketing strategy recommendations focus on creating the habit of continuous use by recreational riders. The cultivation and establishment of continuous habits will help recreational riders add other uses that are different from leisure, thereby creating a gap in qualification value, and naturally converting to annual membership.
2. In terms of marketing operations, it is recommended More than just show price differences. but also to put more slogans and images about multi-tasking in the media, the Internet, stationary advertisements, various sites, etc., to arouse the interest of recreational riders in other riding methods. Better to achieve the effect of publicity.
3. In addition to the media, it is recommended to use APPs, wearable devices, etc., combined with leisure, commuting, temporary maneuvering, sports, training, and other modes, and remind users of the importance of stable riding, not only the conversion of casual members but also the annual membership. Can generate better loyalty.
========================================================================================================