Hotel Bookings Analytic – R 語法解析
上一篇我將Hotel Bookings透過R進行分析的文章放了上來,這篇我想分享一下開始分析時我的流程(英文好的大大這部分可以省去很多麻煩..)、遇到的疑難問題 ,以及我選擇的解決方案,也會針對裡面有用到的一些Function進行解析與使用時機,對於跟我一樣的新手數據分析師來說可以當作分析實作的參考
若大大們有更好的建議,請不吝指教!!若有需要返回看Hotel Bookings的大大~
上一篇我將Hotel Bookings透過R進行分析的文章放了上來,這篇我想分享一下開始分析時我的流程(英文好的大大這部分可以省去很多麻煩..)、遇到的疑難問題 ,以及我選擇的解決方案,也會針對裡面有用到的一些Function進行解析與使用時機,對於跟我一樣的新手數據分析師來說可以當作分析實作的參考
若大大們有更好的建議,請不吝指教!!若有需要返回看Hotel Bookings的大大~
這是上完google data analytic 課程後自我進行的數據分析實例,放上來讓大家可以做個參考。
並沒有像網路上的神人分析的那麼徹底,很多還加上了機器學習的領域....讓我這個小小的數據分析新手自嘆不如,但多看多學還是好事~對於google data analytic課程感興趣的朋友,也放上相關的連結,感興趣的朋友可以先行查閱,網路上也有不少的資訊分享,至於自己進修數據分析課程的相關內容,整理好後也會逐步po出來
google課程連結 >> coursera - google data analyze
夜深了,還是趕緊進入主題比較好,免得肚子裡的深夜食堂開張又得去搜刮食物了。
這次分析的主軸是兩家飯店,Resort Hotel與City Hotel
整個數據集包含了32列與119390行,說真的在Kaggle或是其他數據集取得的數據量體都不會太小,直接用excel打開的話一定跑不出全部的數據
我是用RStudio寫Markdown報告,用VScode寫R編程,數據集就存回MySQL可以空閒時練習Query語法,至於為什麼要這麼麻煩呢,因為VScode與RStudio兩者都有我覺得好用的地方,當然還是會慢慢摸索找出個最好的整合辦法,只是現階段來說先以兩者的優點來進行分析與報告產出還是較方便,畢竟花不到一個月的時間上完Google Data Analyze課程,對於英文不擅長的我來說已經很吃力了,還要快速的整合這些..那乾脆先睡比較實在
以下是我用R Markdown寫完的分析報告,我的Kggle上也有相同的報告書
若大大們對於編程的內容有更好的建議,請不吝指教,分析的內容並不是全面性的,而是以飯店基本面的客流量、銷售管道、旅客重複與否性質做分析
未來也可以後續再做進化,納入更多的因素,相信會有更深一層的分析報告。
---------------------------分隔線-----------------------------------
透過兩家不同類型的飯店,Resort Hotel與City Hotel,進行分析彼此的基本情況、來源管道與旅客交互情形。
library(tidyverse)
library(dplyr)
library(tidyr)
library(here)
library(skimr)
library(janitor)
library(lubridate)
library(ggplot2)
# Data Set Locarion
setwd("D:/Github_version_file_R/data_set/data_frame_hotel")
# File Check
hotel_df <- read.csv("hotel_bookings.csv")
預覽數據結果 :
1. NA值 : 4個,位於children處。
2. NULL值 : 112593個,位於company處。
3. NULL值 : 16340個,位於agent處。
4. NUll值 : 488個,位於country處。
5. Undefined不明確紀錄值 : 1169個,位於meal處。
6. Undefined不明確紀錄值 : 5個,位於distribution_channel處。
7. 拼寫檢查 : 無錯誤。
8. 確認數據為119390行,32列。
str(hotel_df)
## 'data.frame': 119390 obs. of 32 variables:
## $ hotel : chr "Resort Hotel" "Resort Hotel" "Resort Hotel" "Resort Hotel" ...
## $ is_canceled : int 0 0 0 0 0 0 0 0 1 1 ...
## $ lead_time : int 342 737 7 13 14 14 0 9 85 75 ...
## $ arrival_date_year : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ arrival_date_month : chr "July" "July" "July" "July" ...
## $ arrival_date_week_number : int 27 27 27 27 27 27 27 27 27 27 ...
## $ arrival_date_day_of_month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stays_in_weekend_nights : int 0 0 0 0 0 0 0 0 0 0 ...
## $ stays_in_week_nights : int 0 0 1 1 2 2 2 2 3 3 ...
## $ adults : int 2 2 1 1 2 2 2 2 2 2 ...
## $ children : int 0 0 0 0 0 0 0 0 0 0 ...
## $ babies : int 0 0 0 0 0 0 0 0 0 0 ...
## $ meal : chr "BB" "BB" "BB" "BB" ...
## $ country : chr "PRT" "PRT" "GBR" "GBR" ...
## $ market_segment : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ distribution_channel : chr "Direct" "Direct" "Direct" "Corporate" ...
## $ is_repeated_guest : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_cancellations : int 0 0 0 0 0 0 0 0 0 0 ...
## $ previous_bookings_not_canceled: int 0 0 0 0 0 0 0 0 0 0 ...
## $ reserved_room_type : chr "C" "C" "A" "A" ...
## $ assigned_room_type : chr "C" "C" "C" "A" ...
## $ booking_changes : int 3 4 0 0 0 0 0 0 0 0 ...
## $ deposit_type : chr "No Deposit" "No Deposit" "No Deposit" "No Deposit" ...
## $ agent : chr "NULL" "NULL" "NULL" "304" ...
## $ company : chr "NULL" "NULL" "NULL" "NULL" ...
## $ days_in_waiting_list : int 0 0 0 0 0 0 0 0 0 0 ...
## $ customer_type : chr "Transient" "Transient" "Transient" "Transient" ...
## $ adr : num 0 0 75 75 98 ...
## $ required_car_parking_spaces : int 0 0 0 0 0 0 0 0 0 0 ...
## $ total_of_special_requests : int 0 0 0 0 1 1 0 1 1 0 ...
## $ reservation_status : chr "Check-Out" "Check-Out" "Check-Out" "Check-Out" ...
## $ reservation_status_date : chr "2015-07-01" "2015-07-01" "2015-07-02" "2015-07-02" ...
head(hotel_df)
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## arrival_date_week_number arrival_date_day_of_month stays_in_weekend_nights
## 1 27 1 0
## 2 27 1 0
## 3 27 1 0
## 4 27 1 0
## 5 27 1 0
## 6 27 1 0
## stays_in_week_nights adults children babies meal country market_segment
## 1 0 2 0 0 BB PRT Direct
## 2 0 2 0 0 BB PRT Direct
## 3 1 1 0 0 BB GBR Direct
## 4 1 1 0 0 BB GBR Corporate
## 5 2 2 0 0 BB GBR Online TA
## 6 2 2 0 0 BB GBR Online TA
## distribution_channel is_repeated_guest previous_cancellations
## 1 Direct 0 0
## 2 Direct 0 0
## 3 Direct 0 0
## 4 Corporate 0 0
## 5 TA/TO 0 0
## 6 TA/TO 0 0
## previous_bookings_not_canceled reserved_room_type assigned_room_type
## 1 0 C C
## 2 0 C C
## 3 0 A C
## 4 0 A A
## 5 0 A A
## 6 0 A A
## booking_changes deposit_type agent company days_in_waiting_list customer_type
## 1 3 No Deposit NULL NULL 0 Transient
## 2 4 No Deposit NULL NULL 0 Transient
## 3 0 No Deposit NULL NULL 0 Transient
## 4 0 No Deposit 304 NULL 0 Transient
## 5 0 No Deposit 240 NULL 0 Transient
## 6 0 No Deposit 240 NULL 0 Transient
## adr required_car_parking_spaces total_of_special_requests reservation_status
## 1 0 0 0 Check-Out
## 2 0 0 0 Check-Out
## 3 75 0 0 Check-Out
## 4 75 0 0 Check-Out
## 5 98 0 1 Check-Out
## 6 98 0 1 Check-Out
## reservation_status_date
## 1 2015-07-01
## 2 2015-07-01
## 3 2015-07-02
## 4 2015-07-02
## 5 2015-07-03
## 6 2015-07-03
colnames(hotel_df)
## [1] "hotel" "is_canceled"
## [3] "lead_time" "arrival_date_year"
## [5] "arrival_date_month" "arrival_date_week_number"
## [7] "arrival_date_day_of_month" "stays_in_weekend_nights"
## [9] "stays_in_week_nights" "adults"
## [11] "children" "babies"
## [13] "meal" "country"
## [15] "market_segment" "distribution_channel"
## [17] "is_repeated_guest" "previous_cancellations"
## [19] "previous_bookings_not_canceled" "reserved_room_type"
## [21] "assigned_room_type" "booking_changes"
## [23] "deposit_type" "agent"
## [25] "company" "days_in_waiting_list"
## [27] "customer_type" "adr"
## [29] "required_car_parking_spaces" "total_of_special_requests"
## [31] "reservation_status" "reservation_status_date"
nrow(hotel_df) # 119390行
## [1] 119390
ncol(hotel_df) # 32列
## [1] 32
if (sum(is.na(hotel_df)) > 0) {
print(sum(is.na(hotel_df)))
} else {
print("not na")
}
## [1] 4
# check unique >> chr col
unique(hotel_df$hotel)
## [1] "Resort Hotel" "City Hotel"
unique(hotel_df$meal)
## [1] "BB" "FB" "HB" "SC" "Undefined"
unique(hotel_df$country)
## [1] "PRT" "GBR" "USA" "ESP" "IRL" "FRA" "NULL" "ROU" "NOR" "OMN"
## [11] "ARG" "POL" "DEU" "BEL" "CHE" "CN" "GRC" "ITA" "NLD" "DNK"
## [21] "RUS" "SWE" "AUS" "EST" "CZE" "BRA" "FIN" "MOZ" "BWA" "LUX"
## [31] "SVN" "ALB" "IND" "CHN" "MEX" "MAR" "UKR" "SMR" "LVA" "PRI"
## [41] "SRB" "CHL" "AUT" "BLR" "LTU" "TUR" "ZAF" "AGO" "ISR" "CYM"
## [51] "ZMB" "CPV" "ZWE" "DZA" "KOR" "CRI" "HUN" "ARE" "TUN" "JAM"
## [61] "HRV" "HKG" "IRN" "GEO" "AND" "GIB" "URY" "JEY" "CAF" "CYP"
## [71] "COL" "GGY" "KWT" "NGA" "MDV" "VEN" "SVK" "FJI" "KAZ" "PAK"
## [81] "IDN" "LBN" "PHL" "SEN" "SYC" "AZE" "BHR" "NZL" "THA" "DOM"
## [91] "MKD" "MYS" "ARM" "JPN" "LKA" "CUB" "CMR" "BIH" "MUS" "COM"
## [101] "SUR" "UGA" "BGR" "CIV" "JOR" "SYR" "SGP" "BDI" "SAU" "VNM"
## [111] "PLW" "QAT" "EGY" "PER" "MLT" "MWI" "ECU" "MDG" "ISL" "UZB"
## [121] "NPL" "BHS" "MAC" "TGO" "TWN" "DJI" "STP" "KNA" "ETH" "IRQ"
## [131] "HND" "RWA" "KHM" "MCO" "BGD" "IMN" "TJK" "NIC" "BEN" "VGB"
## [141] "TZA" "GAB" "GHA" "TMP" "GLP" "KEN" "LIE" "GNB" "MNE" "UMI"
## [151] "MYT" "FRO" "MMR" "PAN" "BFA" "LBY" "MLI" "NAM" "BOL" "PRY"
## [161] "BRB" "ABW" "AIA" "SLV" "DMA" "PYF" "GUY" "LCA" "ATA" "GTM"
## [171] "ASM" "MRT" "NCL" "KIR" "SDN" "ATF" "SLE" "LAO"
unique(hotel_df$market_segment)
## [1] "Direct" "Corporate" "Online TA" "Offline TA/TO"
## [5] "Complementary" "Groups" "Undefined" "Aviation"
unique(hotel_df$distribution_channel)
## [1] "Direct" "Corporate" "TA/TO" "Undefined" "GDS"
unique(hotel_df$deposit_type)
## [1] "No Deposit" "Refundable" "Non Refund"
unique(hotel_df$reserved_room_type)
## [1] "C" "A" "D" "E" "G" "F" "H" "L" "P" "B"
unique(hotel_df$assigned_room_type)
## [1] "C" "A" "D" "E" "G" "F" "I" "B" "H" "P" "L" "K"
unique(hotel_df$agent)
## [1] "NULL" "304" "240" "303" "15" "241" "8" "250" "115" "5"
## [11] "175" "134" "156" "243" "242" "3" "105" "40" "147" "306"
## [21] "184" "96" "2" "127" "95" "146" "9" "177" "6" "143"
## [31] "244" "149" "167" "300" "171" "305" "67" "196" "152" "142"
## [41] "261" "104" "36" "26" "29" "258" "110" "71" "181" "88"
## [51] "251" "275" "69" "248" "208" "256" "314" "126" "281" "273"
## [61] "253" "185" "330" "334" "328" "326" "321" "324" "313" "38"
## [71] "155" "68" "335" "308" "332" "94" "348" "310" "339" "375"
## [81] "66" "327" "387" "298" "91" "245" "385" "257" "393" "168"
## [91] "405" "249" "315" "75" "128" "307" "11" "436" "1" "201"
## [101] "183" "223" "368" "336" "291" "464" "411" "481" "10" "154"
## [111] "468" "410" "390" "440" "495" "492" "493" "434" "57" "531"
## [121] "420" "483" "526" "472" "429" "16" "446" "34" "78" "139"
## [131] "252" "270" "47" "114" "301" "193" "182" "135" "350" "195"
## [141] "352" "355" "159" "363" "384" "360" "331" "367" "64" "406"
## [151] "163" "414" "333" "427" "431" "430" "426" "438" "433" "418"
## [161] "441" "282" "432" "72" "450" "180" "454" "455" "59" "451"
## [171] "254" "358" "469" "165" "467" "510" "337" "476" "502" "527"
## [181] "479" "508" "535" "302" "497" "187" "13" "7" "27" "14"
## [191] "22" "17" "28" "42" "20" "19" "45" "37" "61" "39"
## [201] "21" "24" "41" "50" "30" "54" "52" "12" "44" "31"
## [211] "83" "32" "63" "60" "55" "56" "89" "87" "118" "86"
## [221] "85" "210" "214" "129" "179" "138" "174" "170" "153" "93"
## [231] "151" "119" "35" "173" "58" "53" "133" "79" "235" "192"
## [241] "191" "236" "162" "215" "157" "287" "132" "234" "98" "77"
## [251] "103" "107" "262" "220" "121" "205" "378" "23" "296" "290"
## [261] "229" "33" "286" "276" "425" "484" "323" "403" "219" "394"
## [271] "509" "111" "423" "4" "70" "82" "81" "74" "92" "99"
## [281] "90" "112" "117" "106" "148" "158" "144" "211" "213" "216"
## [291] "232" "150" "267" "227" "247" "278" "280" "285" "289" "269"
## [301] "295" "265" "288" "122" "294" "325" "341" "344" "346" "359"
## [311] "283" "364" "370" "371" "25" "141" "391" "397" "416" "404"
## [321] "299" "197" "73" "354" "444" "408" "461" "388" "453" "459"
## [331] "474" "475" "480" "449"
unique(hotel_df$company)
## [1] "NULL" "110" "113" "270" "178" "240" "154" "144" "307" "268"
## [11] "59" "204" "312" "318" "94" "174" "274" "195" "223" "317"
## [21] "281" "118" "53" "286" "12" "47" "324" "342" "373" "371"
## [31] "383" "86" "82" "218" "88" "31" "397" "392" "405" "331"
## [41] "367" "20" "83" "416" "51" "395" "102" "34" "84" "360"
## [51] "394" "457" "382" "461" "478" "386" "112" "486" "421" "9"
## [61] "308" "135" "224" "504" "269" "356" "498" "390" "513" "203"
## [71] "263" "477" "521" "169" "515" "445" "337" "251" "428" "292"
## [81] "388" "130" "250" "355" "254" "543" "531" "528" "62" "120"
## [91] "42" "81" "116" "530" "103" "39" "16" "92" "61" "501"
## [101] "165" "291" "290" "43" "325" "192" "108" "200" "465" "287"
## [111] "297" "490" "482" "207" "282" "437" "225" "329" "272" "28"
## [121] "77" "338" "72" "246" "319" "146" "159" "380" "323" "511"
## [131] "407" "278" "80" "403" "399" "14" "137" "343" "346" "347"
## [141] "349" "289" "351" "353" "54" "99" "358" "361" "362" "366"
## [151] "372" "365" "277" "109" "377" "379" "22" "378" "330" "364"
## [161] "401" "232" "255" "384" "167" "212" "514" "391" "400" "376"
## [171] "402" "396" "302" "398" "6" "370" "369" "409" "168" "104"
## [181] "408" "413" "148" "10" "333" "419" "415" "424" "425" "423"
## [191] "422" "435" "439" "442" "448" "443" "454" "444" "52" "459"
## [201] "458" "456" "460" "447" "470" "466" "484" "184" "485" "32"
## [211] "487" "491" "494" "193" "516" "496" "499" "29" "78" "520"
## [221] "507" "506" "512" "126" "64" "242" "518" "523" "539" "534"
## [231] "436" "525" "541" "40" "455" "410" "45" "38" "49" "48"
## [241] "67" "68" "65" "91" "37" "8" "179" "209" "219" "221"
## [251] "227" "153" "186" "253" "202" "216" "275" "233" "280" "309"
## [261] "321" "93" "316" "85" "107" "350" "279" "334" "348" "150"
## [271] "73" "385" "418" "197" "450" "452" "115" "46" "76" "96"
## [281] "100" "105" "101" "122" "11" "139" "142" "127" "143" "140"
## [291] "149" "163" "160" "180" "238" "183" "222" "185" "217" "215"
## [301] "213" "237" "230" "234" "35" "245" "158" "258" "259" "260"
## [311] "411" "257" "271" "18" "106" "210" "273" "71" "284" "301"
## [321] "305" "293" "264" "311" "304" "313" "288" "320" "314" "332"
## [331] "341" "352" "243" "368" "393" "132" "220" "412" "420" "426"
## [341] "417" "429" "433" "446" "357" "479" "483" "489" "229" "481"
## [351] "497" "451" "492"
unique(hotel_df$customer_type)
## [1] "Transient" "Contract" "Transient-Party" "Group"
unique(hotel_df$reservation_status)
## [1] "Check-Out" "Canceled" "No-Show"
unique(hotel_df$is_repeated_guest)
## [1] 0 1
1. 修正NA值 :
- children僅有4個NA,以0作為修復填充不影響整體數據分析。
hotel_df[is.na(hotel_df)] <- 0
sum(is.na(hotel_df))
## [1] 0
2. 進行欄位屬性調整 :
- 將chr調整為factor,進行類別整併與確認欄位levels,方便後續進行分析作業。
# 將chr轉為factor確認類別levels
hotel_df <- hotel_df %>%
mutate(
hotel = as.factor(hotel),
is_canceled = as.factor(is_canceled),
meal = as.factor(meal),
country = as.factor(country),
market_segment = as.factor(market_segment),
reserved_room_type = as.factor(reserved_room_type),
assigned_room_type = as.factor(assigned_room_type),
deposit_type = as.factor(deposit_type),
agent = as.factor(agent),
company = as.factor(company),
customer_type = as.factor(customer_type),
reservation_status = as.factor(reservation_status),
reservation_status_date = as.factor(reservation_status_date),
is_repeated_guest = as.factor(is_repeated_guest)
)
# 確認col levels
levels(hotel_df$agent) # null level >> 334
## [1] "1" "10" "103" "104" "105" "106" "107" "11" "110" "111"
## [11] "112" "114" "115" "117" "118" "119" "12" "121" "122" "126"
## [21] "127" "128" "129" "13" "132" "133" "134" "135" "138" "139"
## [31] "14" "141" "142" "143" "144" "146" "147" "148" "149" "15"
## [41] "150" "151" "152" "153" "154" "155" "156" "157" "158" "159"
## [51] "16" "162" "163" "165" "167" "168" "17" "170" "171" "173"
## [61] "174" "175" "177" "179" "180" "181" "182" "183" "184" "185"
## [71] "187" "19" "191" "192" "193" "195" "196" "197" "2" "20"
## [81] "201" "205" "208" "21" "210" "211" "213" "214" "215" "216"
## [91] "219" "22" "220" "223" "227" "229" "23" "232" "234" "235"
## [101] "236" "24" "240" "241" "242" "243" "244" "245" "247" "248"
## [111] "249" "25" "250" "251" "252" "253" "254" "256" "257" "258"
## [121] "26" "261" "262" "265" "267" "269" "27" "270" "273" "275"
## [131] "276" "278" "28" "280" "281" "282" "283" "285" "286" "287"
## [141] "288" "289" "29" "290" "291" "294" "295" "296" "298" "299"
## [151] "3" "30" "300" "301" "302" "303" "304" "305" "306" "307"
## [161] "308" "31" "310" "313" "314" "315" "32" "321" "323" "324"
## [171] "325" "326" "327" "328" "33" "330" "331" "332" "333" "334"
## [181] "335" "336" "337" "339" "34" "341" "344" "346" "348" "35"
## [191] "350" "352" "354" "355" "358" "359" "36" "360" "363" "364"
## [201] "367" "368" "37" "370" "371" "375" "378" "38" "384" "385"
## [211] "387" "388" "39" "390" "391" "393" "394" "397" "4" "40"
## [221] "403" "404" "405" "406" "408" "41" "410" "411" "414" "416"
## [231] "418" "42" "420" "423" "425" "426" "427" "429" "430" "431"
## [241] "432" "433" "434" "436" "438" "44" "440" "441" "444" "446"
## [251] "449" "45" "450" "451" "453" "454" "455" "459" "461" "464"
## [261] "467" "468" "469" "47" "472" "474" "475" "476" "479" "480"
## [271] "481" "483" "484" "492" "493" "495" "497" "5" "50" "502"
## [281] "508" "509" "510" "52" "526" "527" "53" "531" "535" "54"
## [291] "55" "56" "57" "58" "59" "6" "60" "61" "63" "64"
## [301] "66" "67" "68" "69" "7" "70" "71" "72" "73" "74"
## [311] "75" "77" "78" "79" "8" "81" "82" "83" "85" "86"
## [321] "87" "88" "89" "9" "90" "91" "92" "93" "94" "95"
## [331] "96" "98" "99" "NULL"
levels(hotel_df$company) # null level >> 353
## [1] "10" "100" "101" "102" "103" "104" "105" "106" "107" "108"
## [11] "109" "11" "110" "112" "113" "115" "116" "118" "12" "120"
## [21] "122" "126" "127" "130" "132" "135" "137" "139" "14" "140"
## [31] "142" "143" "144" "146" "148" "149" "150" "153" "154" "158"
## [41] "159" "16" "160" "163" "165" "167" "168" "169" "174" "178"
## [51] "179" "18" "180" "183" "184" "185" "186" "192" "193" "195"
## [61] "197" "20" "200" "202" "203" "204" "207" "209" "210" "212"
## [71] "213" "215" "216" "217" "218" "219" "22" "220" "221" "222"
## [81] "223" "224" "225" "227" "229" "230" "232" "233" "234" "237"
## [91] "238" "240" "242" "243" "245" "246" "250" "251" "253" "254"
## [101] "255" "257" "258" "259" "260" "263" "264" "268" "269" "270"
## [111] "271" "272" "273" "274" "275" "277" "278" "279" "28" "280"
## [121] "281" "282" "284" "286" "287" "288" "289" "29" "290" "291"
## [131] "292" "293" "297" "301" "302" "304" "305" "307" "308" "309"
## [141] "31" "311" "312" "313" "314" "316" "317" "318" "319" "32"
## [151] "320" "321" "323" "324" "325" "329" "330" "331" "332" "333"
## [161] "334" "337" "338" "34" "341" "342" "343" "346" "347" "348"
## [171] "349" "35" "350" "351" "352" "353" "355" "356" "357" "358"
## [181] "360" "361" "362" "364" "365" "366" "367" "368" "369" "37"
## [191] "370" "371" "372" "373" "376" "377" "378" "379" "38" "380"
## [201] "382" "383" "384" "385" "386" "388" "39" "390" "391" "392"
## [211] "393" "394" "395" "396" "397" "398" "399" "40" "400" "401"
## [221] "402" "403" "405" "407" "408" "409" "410" "411" "412" "413"
## [231] "415" "416" "417" "418" "419" "42" "420" "421" "422" "423"
## [241] "424" "425" "426" "428" "429" "43" "433" "435" "436" "437"
## [251] "439" "442" "443" "444" "445" "446" "447" "448" "45" "450"
## [261] "451" "452" "454" "455" "456" "457" "458" "459" "46" "460"
## [271] "461" "465" "466" "47" "470" "477" "478" "479" "48" "481"
## [281] "482" "483" "484" "485" "486" "487" "489" "49" "490" "491"
## [291] "492" "494" "496" "497" "498" "499" "501" "504" "506" "507"
## [301] "51" "511" "512" "513" "514" "515" "516" "518" "52" "520"
## [311] "521" "523" "525" "528" "53" "530" "531" "534" "539" "54"
## [321] "541" "543" "59" "6" "61" "62" "64" "65" "67" "68"
## [331] "71" "72" "73" "76" "77" "78" "8" "80" "81" "82"
## [341] "83" "84" "85" "86" "88" "9" "91" "92" "93" "94"
## [351] "96" "99" "NULL"
levels(hotel_df$is_canceled)
## [1] "0" "1"
levels(hotel_df$is_repeated_guest)
## [1] "0" "1"
#進行數據彙總預覽
summary(hotel_df)
## hotel is_canceled lead_time arrival_date_year
## City Hotel :79330 0:75166 Min. : 0 Min. :2015
## Resort Hotel:40060 1:44224 1st Qu.: 18 1st Qu.:2016
## Median : 69 Median :2016
## Mean :104 Mean :2016
## 3rd Qu.:160 3rd Qu.:2017
## Max. :737 Max. :2017
##
## arrival_date_month arrival_date_week_number arrival_date_day_of_month
## Length:119390 Min. : 1.00 Min. : 1.0
## Class :character 1st Qu.:16.00 1st Qu.: 8.0
## Mode :character Median :28.00 Median :16.0
## Mean :27.17 Mean :15.8
## 3rd Qu.:38.00 3rd Qu.:23.0
## Max. :53.00 Max. :31.0
##
## stays_in_weekend_nights stays_in_week_nights adults
## Min. : 0.0000 Min. : 0.0 Min. : 0.000
## 1st Qu.: 0.0000 1st Qu.: 1.0 1st Qu.: 2.000
## Median : 1.0000 Median : 2.0 Median : 2.000
## Mean : 0.9276 Mean : 2.5 Mean : 1.856
## 3rd Qu.: 2.0000 3rd Qu.: 3.0 3rd Qu.: 2.000
## Max. :19.0000 Max. :50.0 Max. :55.000
##
## children babies meal country
## Min. : 0.0000 Min. : 0.000000 BB :92310 PRT :48590
## 1st Qu.: 0.0000 1st Qu.: 0.000000 FB : 798 GBR :12129
## Median : 0.0000 Median : 0.000000 HB :14463 FRA :10415
## Mean : 0.1039 Mean : 0.007949 SC :10650 ESP : 8568
## 3rd Qu.: 0.0000 3rd Qu.: 0.000000 Undefined: 1169 DEU : 7287
## Max. :10.0000 Max. :10.000000 ITA : 3766
## (Other):28635
## market_segment distribution_channel is_repeated_guest
## Online TA :56477 Length:119390 0:115580
## Offline TA/TO:24219 Class :character 1: 3810
## Groups :19811 Mode :character
## Direct :12606
## Corporate : 5295
## Complementary: 743
## (Other) : 239
## previous_cancellations previous_bookings_not_canceled reserved_room_type
## Min. : 0.00000 Min. : 0.0000 A :85994
## 1st Qu.: 0.00000 1st Qu.: 0.0000 D :19201
## Median : 0.00000 Median : 0.0000 E : 6535
## Mean : 0.08712 Mean : 0.1371 F : 2897
## 3rd Qu.: 0.00000 3rd Qu.: 0.0000 G : 2094
## Max. :26.00000 Max. :72.0000 B : 1118
## (Other): 1551
## assigned_room_type booking_changes deposit_type agent
## A :74053 Min. : 0.0000 No Deposit:104641 9 :31961
## D :25322 1st Qu.: 0.0000 Non Refund: 14587 NULL :16340
## E : 7806 Median : 0.0000 Refundable: 162 240 :13922
## F : 3751 Mean : 0.2211 1 : 7191
## G : 2553 3rd Qu.: 0.0000 14 : 3640
## C : 2375 Max. :21.0000 7 : 3539
## (Other): 3530 (Other):42797
## company days_in_waiting_list customer_type
## NULL :112593 Min. : 0.000 Contract : 4076
## 40 : 927 1st Qu.: 0.000 Group : 577
## 223 : 784 Median : 0.000 Transient :89613
## 67 : 267 Mean : 2.321 Transient-Party:25124
## 45 : 250 3rd Qu.: 0.000
## 153 : 215 Max. :391.000
## (Other): 4354
## adr required_car_parking_spaces total_of_special_requests
## Min. : -6.38 Min. :0.00000 Min. :0.0000
## 1st Qu.: 69.29 1st Qu.:0.00000 1st Qu.:0.0000
## Median : 94.58 Median :0.00000 Median :0.0000
## Mean : 101.83 Mean :0.06252 Mean :0.5714
## 3rd Qu.: 126.00 3rd Qu.:0.00000 3rd Qu.:1.0000
## Max. :5400.00 Max. :8.00000 Max. :5.0000
##
## reservation_status reservation_status_date
## Canceled :43017 2015-10-21: 1461
## Check-Out:75166 2015-07-06: 805
## No-Show : 1207 2016-11-25: 790
## 2015-01-01: 763
## 2016-01-18: 625
## 2015-07-02: 469
## (Other) :114477
3. 進行NULL值與其他值調整 :
- agent欄位原始NULL值不以填充值或刪除的方式處理,以免影響整體數據集數值分析,選擇將NULL值轉換為no_record(沒有紀錄),維持原始數據集的內容,僅做單純的變更紀錄。
- company欄位NULL值處理方式與agent欄位相同。
- is_canceled欄位調整紀錄模式,0調整為check_in,1為canceled,方便閱覽與分析。
- is_repeated_guest欄位調整模式與is_canceled欄位相同。
# null值轉換,agent、Company
null_agent <- which(levels(hotel_df$agent) == "NULL")
levels(hotel_df$agent)[null_agent] <- "no_record"
null_company <- which(levels(hotel_df$company) == "NULL")
levels(hotel_df$company)[null_company] <- "no_record"
# 數值調整
null_check <- which(levels(hotel_df$is_canceled) == "0")
levels(hotel_df$is_canceled)[null_check] <- "check_in"
null_canceled <- which(levels(hotel_df$is_canceled) == "1")
levels(hotel_df$is_canceled)[null_canceled] <- "canceled"
null_customer <- which(levels(hotel_df$is_repeated_guest) == "0")
levels(hotel_df$is_repeated_guest)[null_customer] <- "new_customer"
null_repeated <- which(levels(hotel_df$is_repeated_guest) == "1")
levels(hotel_df$is_repeated_guest)[null_repeated] <- "repeated_customer"
4. 增加欄位 :
- 增加顧客總數(包含嬰兒)、孩童總數(不包含嬰兒)、平均客單(不含嬰兒)。
# 增加顧客總數、孩童總數、平均客單(小孩加大人,不含嬰兒)
hotel_df <- hotel_df %>%
mutate(all_guest = adults + children + babies) %>%
mutate(all_kids = children + babies) %>%
mutate(avg_adr = (adr / (adults + children)))
# 飯店取消率
ggplot(data = hotel_df) +
geom_bar(mapping = aes(x = is_canceled, fill = hotel)) +
scale_fill_manual(values = c("#ffd700", "steelblue")) +
labs(title = "甚麼樣的原因促使客戶取消?", subtitle = "各飯店取消率") +
xlab("是否取消") +
ylab("數量計算")
# 飯店預訂情形
ggplot(data = hotel_df) +
geom_bar(mapping = aes(x = reservation_status, fill = hotel)) +
labs(title = "預定後實際入住情形?", subtitle = "各飯店預定關係圖") +
scale_fill_manual(values = c("deeppink", "mediumseagreen")) +
xlab("預訂後入住情形") +
ylab("數量")
# 總客戶量 & 月份
ggplot(data = hotel_df) +
geom_point(mapping = aes(x = arrival_date_month, y = all_guest, color = hotel, shape = hotel)) +
labs(title = "不同飯店每月來客是否平均?", subtitle = "飯店總來客與月份關係") +
xlab("月份") +
ylab("旅客總數") +
theme(axis.text.x = element_text(angle = 60)) +
facet_wrap(~hotel)
# 不同來源市場的訂單量
ggplot(data = hotel_df) +
geom_point(mapping = aes(x = market_segment, y = all_guest, color = hotel, shape = hotel)) +
labs(title = "客戶總量&來源", subtitle = "訂單來源分布情形") +
xlab("旅客來源") +
ylab("旅客總量") +
theme(axis.text.x = element_text(angle = 60)) +
facet_wrap(~hotel)
# 不同來源市場的取消率
ggplot(data = hotel_df) +
geom_bar(mapping = aes(x = is_canceled, fill = market_segment)) +
labs(title = "訂單增加時入住率是否提升?", subtitle = "來源市場間的取消率") +
xlab("是否取消") +
ylab("數量統計")
# 各月分預定時間分布
ggplot(data = hotel_df) +
geom_boxplot(mapping = aes(x = arrival_date_month, y = lead_time, fill = hotel)) +
labs(title = "預定時間與準備作業", subtitle = "訂房日期至到達日期間的時間") +
scale_fill_manual(values = c("yellowgreen", "orange3")) +
xlab("月份") +
ylab("預定時間") +
theme(axis.text.x = element_text(angle = 45))
# 均客單價
ggplot(data = hotel_df) +
geom_point(mapping = aes(x = arrival_date_month, y = avg_adr, color = hotel, shape = hotel)) +
labs(title = "各月客單是否穩定?", subtitle = "平均客單價 / 月") +
xlab("月份") +
ylab("平均客單(不含嬰兒)") +
theme(axis.text.x = element_text(angle = 45)) +
facet_wrap(~hotel)
# 假日家庭客流量
ggplot(data = hotel_df) +
geom_point(mapping = aes(x = children, y = stays_in_weekend_nights, fill = children)) +
labs(title = "周末住宿是否以孩子居多?", subtitle = "孩童總數與周末關係圖") +
xlab("孩童總數") +
ylab("周末住宿")
# 平日家庭客流量
ggplot(data = hotel_df) +
geom_point(mapping = aes(x = children, y = stays_in_week_nights, fill = children)) +
labs(title = "平日是否均為非家庭旅客?", subtitle = "平日入住與孩童總數關係圖") +
xlab("孩童總數") +
ylab("平日住宿")
# 是否重複客
ggplot(data = hotel_df) +
geom_bar(mapping = aes(x = is_repeated_guest, fill = hotel)) +
labs(title = "是否不斷有回流客?", subtitle = "各飯店新舊旅客關係圖") +
scale_fill_manual(values = c("mediumblue", "sienna4")) +
xlab("新/舊客戶") +
ylab("數量")
# 旅客預付情形
ggplot(data = hotel_df) +
geom_point(mapping = aes(x = deposit_type, y = adults, color = hotel, shape = hotel)) +
labs(title = "旅客付訂與退款方式", subtitle = "預付方式與是否退款關係圖") +
xlab("預付型態") +
ylab("旅客(不含孩童)") +
facet_wrap(~hotel)
- City Hotel的取消情形明顯大於Resort Hotel,預定之後的實際入住Check Out與Canceled的數量也幾乎持平上升,大量的訂單湧入,卻也大量的取消流失訂單,不僅沒有實際收益,工作人員進行房務準備時也容易手忙腳亂。
- Ciry Hotel每月來客幾乎維持持平,Resort Hotel則是於九、十月會有爆發性的來客增加。
- 旅客來源的部分City Hotel較多的來源來自線上訂購與公司訂單,Resort Hotel則是臨櫃客戶與代理旅行社。
- 旅客與來源的關係線上預訂、團體的取消率最高,相反的臨櫃旅客、旅行社、公司預定擁有較好的入住率。
- 均客單City Hotel二月最差,Resort Hotel則是於一、二月為最低點。
- 周末與平日家庭來客,排除沒有孩童的家庭、情侶、單人等情形,1 ~ 2個孩童為比例最高的家庭來客,無關乎平日或假日。
- 重複性客人無論City Hotel或是Resort Hotel均表現不理想,依靠完全的新旅客維持營運,對於持續性營收來說並不是理想的情況。
- 預付款與退款,City Hotel幾乎維持等比例,Resort Hotel則是高比例的臨櫃訂房,提升實際入住情形。
- 在均客單下降的月份,可以看到比對月來客時並無明顯降低趨勢,可能在均客單下降的月份,做了過多的促銷活動。
- 建議增加預收訂金程序,並明訂告知不退款情況,使旅客下訂之前自我進行確認,減少後續預約取消意願。
- 進行訂房來源比例分配,活動促銷範圍可以增加旅行社配合模式與現場來客活動。
- 廣告投放與房型促銷,建議以沒有孩童的家庭、單人或情侶客層,以及1~2位孩童家庭作為主要客層。
- 單純依靠新旅客並非理想營運模式,建議逐步增加回流旅客住宿活動,提升回流旅客二次消費。
- 建議旅客預訂後可以增加Chick In活動,或是其他能夠提升旅客住宿前的附加價值感,降低預定後的取消率。
- 旅客意見回條,建議可新增對於取消預定可能的原因與旅客實際建議,逐步調整旅客入住體驗。
- 來客於九月、十月有爆發性成長,但均客單卻沒有與之增幅,建議可以調整低客單月份促銷的力度,以附加價值增加旅客體驗,而不是以價格作為促銷首選。
- 單純依靠新旅客並非理想營運模式,建議逐步增加回流旅客住宿活動,提升回流旅客二次消費。
- 跨國旅客比率過低,建議增加非本地旅行社合作、廣告投放範圍,凸顯區域觀光價值,配合接駁行程,增加透過航空旅遊的旅客比率。
- 廣告投放與房型促銷,建議以沒有孩童的家庭、單人或情侶客層,以及1~2位孩童家庭作為主要客層。
- 數據的NULL值過多,顯示紀錄旅客基本資料時,大多數不會理會包含國家、公司、旅行社代號的欄位填寫,也可能內部進行紀錄時常遺漏,建議於填寫資訊或內部紀錄時添加其他選項,以免造成數據過多空值產生。
- 數據集內若為類別,以數字形式填入雖減少輸入時間與工作壓力,建議可以增加備註,例如"0" > "check_in","1" > "canceled",對於內部檢視數據時能更快速理解數據內容。
這次 Nashik 的房價分析有上傳至 Kaggle ,有興趣的朋友可以前往閱覽, RMarkdown PDF 報告存放在 Google 雲端,程式碼則是存放於 Github ,照慣例會分享好用的函式語法,雖說基本的 Packages 與語法可能很多人都會完整的閱覽,但是實際...