Friday, June 17, 2022

Nashik Apartment Price Analysis - Syntax Notes (Part 1)

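This Nashik housing price analysis has been uploaded to Kaggle for anyone interested. The RMarkdown PDF report is stored on Google Drive and the code is on Github. As usual, I will share some handy functions. Many readers have probably gone through the basic packages and syntax in full, but in practice I still forget things, so being able to find a solution quickly matters a lot to me.

In these syntax notes I prefer not to go into too much detail. The basic function syntax serves as a quick guide to how and when to use each function.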

Nashik apartment price analyze article >> click here

My Kaggle >> click here

My Github >> click here

To view the full report, download the PDF from Google Drive : click here

---------------------------------------------------------------------------------------------

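complete.cases() : When a data set contains NA values and you are sure they do not matter for the later analysis, you can either remove them or fill them in, for example with approximate values. complete.cases() returns a logical vector that marks the rows without any NA, so it can be used to keep only the complete rows. It is simple to use and has few arguments.

Package : stats

Basic syntax : complete.cases(x)

When to use : When removing NA values will not affect the analysis results.

Example :

# keep the rows that have no NA in any column
house_df[complete.cases(house_df), ]

# keep the rows that have no NA in columns 1 to 5
house_df[complete.cases(house_df[, 1:5]), ]

# keep the rows that have no NA in column 10
house_df[complete.cases(house_df[, 10]), ]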
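To make the row filtering concrete, here is a minimal sketch on a toy data frame. The data frame `toy_df` and its values are made up for illustration and are not taken from the Nashik data set.

```r
# Toy data frame (hypothetical values, not from the Nashik data set)
toy_df <- data.frame(
    price = c(75, NA, 53.4, 55),
    total_sqft = c(1550, 1000, NA, 1000),
    owners = c("A", "B", "C", NA)
)

# Logical vector: TRUE for rows that contain no NA at all
keep_all <- complete.cases(toy_df)

# Keep only the rows that are complete in every column
clean_all <- toy_df[keep_all, ]

# Keep the rows that are complete in columns 1 to 2 (NA in "owners" is ignored)
clean_12 <- toy_df[complete.cases(toy_df[, 1:2]), ]

print(keep_all)
print(clean_all)
print(clean_12)
```

Only the first row survives the full check, while the check on columns 1 to 2 also keeps the fourth row because its only NA is in `owners`.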

paste() : Joins strings into a new string.

Package : base

Basic syntax : see the post "R basic operation - paste"

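gsub() : Replaces every match of "this" with "that". In the previous post I used it to remove the space between a number and the percent sign.

Package : base

Basic syntax : gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE)

When to use : When the raw data contains errors that need correcting.

Example :

# original data: there is a space between the number and the % sign
r$> house_df_v2$price_percentage[1:5]
[1] "1.75 %" "15.65 %" "2.92 %" "5.22 %" "8.93 %"

# remove the space
house_df_v2$price_percentage <- gsub(" ", "", house_df_v2$price_percentage)

# adjusted data
r$> house_df_v2$price_percentage[1:5]
[1] "1.75%" "15.65%" "2.92%" "5.22%" "8.93%"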
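As a self-contained illustration of the same cleanup, the sketch below runs gsub() on a small character vector. The vector `pct` is a made-up sample, and `fixed = TRUE` is an optional choice here that treats the pattern as a literal string rather than a regular expression.

```r
# Sample percentage strings with a stray space (illustrative values)
pct <- c("1.75 %", "15.65 %", "2.92 %")

# Remove every space; fixed = TRUE matches the pattern literally
pct_clean <- gsub(" ", "", pct, fixed = TRUE)
print(pct_clean)

# If numbers are needed later, strip the percent sign and convert
pct_num <- as.numeric(gsub("%", "", pct_clean, fixed = TRUE))
print(pct_num)
```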

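recode() : Recodes a variable into categories according to rules you specify. It shares its name with dplyr's recode() but works differently: dplyr's recode(), simply put, replaces individual values, whereas the recode() in the car package can operate on numeric ranges. In the previous report it was used to map ranges to the labels we wanted, and it can also control the output type and the order of the levels, which I find very handy. Because both packages export recode(), the one loaded later masks the other, so loading order matters.

Package : car

Basic syntax : recode(var, recodes, as.factor, as.numeric = TRUE, levels)

When to use : Classifying numeric values.

Example :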

level_df$sqft_level <- recode( # car package
level_df$sqft_level, "lo:710 = 'small'; 711:1600 = 'median'; 1601:2300 = 'large'; 2301:40000 = 'huge'",
as.factor = TRUE,
levels = c("small", "median", "large", "huge")
)

# output
r$> level_df$sqft_level[1:10]
[1] median median median median median median median median median median
Levels: small median large huge
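For readers who do not have the car package installed, a similar range classification can be sketched with base R's cut(). This is a swapped-in alternative, not the method used in the report, and the `sqft` vector is made up to hit each boundary.

```r
# Illustrative areas that sit on each boundary of the level table
sqft <- c(150, 710, 711, 1600, 1601, 2300, 2301, 40000)

# cut() uses right-closed intervals by default:
# (0, 710], (710, 1600], (1600, 2300], (2300, 40000]
sqft_level <- cut(
    sqft,
    breaks = c(0, 710, 1600, 2300, 40000),
    labels = c("small", "median", "large", "huge")
)

print(sqft_level)
print(table(sqft_level))
```

Because the intervals are contiguous, a non-integer area such as 710.5 also receives a level (here "median").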

Wednesday, June 15, 2022

Nashik Apartment Price Analysis - R

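I recently spent a lot of time on a Python kivy app, which delayed my data analysis output. This post came about when I spotted a new data set on Kaggle. Its original purpose is house price prediction, but understanding the overall picture first makes the later prediction work go more smoothly, so this post focuses on an overall analysis.

Frankly, housing in Taiwan is already expensive. While writing this Nashik analysis I looked up India's economy, population, and per capita figures, and differences in policy and culture have produced a huge gap between rich and poor. The 2021 report puts per capita income at 2,190.901 USD. At an exchange rate of 30 that is about 65,727.03 TWD per year, or about 5,477.252 TWD per month. In this data set, Nashik prices (excluding rows with NA) are: min 39,000 TWD, 1st Qu. 897,000 TWD, median 1,365,000 TWD, mean 1,712,040 TWD, 3rd Qu. 1,950,000 TWD, max 27,300,000 TWD. One can imagine that most residents lead busy, hard-working lives, and prices in the capital or in popular areas would be even higher.

The full Rmarkdown report is again stored in the cloud for anyone who needs it, and the handy functions from this analysis will be written up as syntax notes as usual.

My Github >> click here

To view the full Rmarkdown report, download the PDF from Google Drive : click here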

---------------------------------------------------------------------------------------------

Nashik Apartment Price Analysis

About Nashik

========================================================================================================

Nashik is the fourth largest city in Maharashtra, about 200 kilometers from both Mumbai and Pune. It is gaining attention as a holiday hotspot and as a place to invest in retirement homes. With property prices soaring in Pune and Mumbai, the city's social infrastructure is gradually improving as its economy develops and more people choose the area as a permanent residence.

Main Objectives

========================================================================================================

Explore the data and analyze it to find the trend in house selling prices.
Define a cost-performance ratio from the selling price, EMI, and area obtained, in order to identify the best investment properties.

Seven Analysis Phases and Processes

========================================================================================================

Data inspection.
Problem analysis.
Data cleaning.
Data consolidation.
Trend analysis.
Visualization.
Conclusion.

Phase One : Data Inspection

Data set reference : https://www.kaggle.com/datasets/rushikeshdane20/nashik-apartment-price-prediction
Source : Kaggle; the file is a trusted public data set.
Data set author : Rushikesh Dane 20.
Data storage : MySQL, Kaggle.
Code location : Kaggle, Github.
Coding language : R.
IDE : RStudio, VScode.

- Import the R packages and the data set.

# Import packages
library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --

## v ggplot2 3.3.5     v purrr   0.3.4
## v tibble  3.1.6     v dplyr   1.0.7
## v tidyr   1.1.4     v stringr 1.4.0
## v readr   2.1.1     v forcats 0.5.1

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(dplyr)
library(tidyr)
library(ggplot2)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

# import data set
setwd("D:\\Github_version_file_R\\data_set\\marchine_learning_data\\Nashik_apartment_price_prediction")

nashik_house <- read.csv("final_data.csv")

- Check the contents of the original data set.

# check data set
str(nashik_house)

## 'data.frame':    5496 obs. of  12 variables:
##  $ X              : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ address        : chr  "Sheetal Vihar, Bhagwant Nagar, Dr.Homi Bhabha Nagar,Nashik" "Samraat Dream Citi, Samta Nagar, Nashik" "Suryaprakash Apartment,Nashik Road, Nashik" "Adishvar Residency,Nashik Road, Nashik" ...
##  $ owners         : chr  "Mahendra Kotwal" "Jiten Dadarkar" "Pankaj" "Saurav" ...
##  $ housetype      : chr  "Apartment" "Apartment" "Apartment" "Apartment" ...
##  $ house_condition: chr  "old" "old" "old" "old" ...
##  $ BHK            : num  3 2 2 2 2 2 2 3 2 2 ...
##  $ price          : num  75 41 53.4 55 27 ...
##  $ per_month_emi  : num  39.7 21.7 28.2 29.1 14.3 ...
##  $ total_sqft     : num  1550 1000 970 1000 853 ...
##  $ cordinates     : chr  "Sheetal Vihar" "Samraat Dream Citi" "Surya Prakash" "Nashik Road, Vadner Dumala, Nashik, Maharashtra, 422401" ...
##  $ latitude       : num  20 20 20 19.9 20 ...
##  $ longitude      : num  73.8 73.8 73.8 73.8 73.8 ...

dim(nashik_house)

## [1] 5496   12

colnames(nashik_house)

##  [1] "X"               "address"         "owners"          "housetype"      
##  [5] "house_condition" "BHK"             "price"           "per_month_emi"  
##  [9] "total_sqft"      "cordinates"      "latitude"        "longitude"

# check fixed values
table(nashik_house$housetype)

## 
##         Apartment Independent house 
##              4323              1173

table(nashik_house$house_condition)

## 
##  new  old 
## 1846 3650

table(nashik_house$BHK)

## 
##    1    2  2.5    3  3.5    4    5    6    7    8   10 
## 1559 2487    2 1017    1  205   33   14    8    3    4

any(is.na(nashik_house))

## [1] TRUE

The original data set has 12 columns and 5,496 rows.
There are 4,323 apartment records and 1,173 independent house records.
There are 1,846 new houses and 3,650 old (second-hand) houses.
Room counts range from 1 BHK to 10 BHK (including two 2.5 BHK records and one 3.5 BHK record).
The data set contains NA values.

Phase Two : Problem Analysis

1. Identify stakeholders and teams

Primary stakeholders : all owner-occupiers and investors who want to buy a home.

2. Issues that need resolving

The data contains many NA values.
Some room counts include half rooms (0.5), which is not reasonable.
Prices are given in Lakh, which is not a common international currency unit.

Phase Three : Data Cleaning

- Convert the currency units to INR and USD, and work on a copy of the data set so that mistakes do not alter the original.

- 1 Lakh = 100,000 INR, 1K = 1,000, 1 INR = 0.013 USD

house_df <- nashik_house

# create cols: 1 Lakh = 100,000 INR (Rs), EMI is given in thousands (K)
house_df <- house_df %>%
    mutate(
        price_INR = price * 100000,
        month_emi = per_month_emi * 1000
    )

# add the USD price and the price per square foot: 1 INR = 0.013 USD
house_df <- house_df %>%
    mutate(
        price_USD = price_INR * 0.013,
        price_INR_sqft = price_INR / total_sqft
    )
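The unit conversions can be checked on a few values with plain vector arithmetic. The sketch below is illustrative only: the three prices are sample values in Lakh, and the constant names `lakh_to_inr` and `inr_to_usd` are hypothetical helpers that hold the rates stated above.

```r
# Conversion rates as stated in this report
lakh_to_inr <- 100000  # 1 Lakh = 100,000 INR
inr_to_usd <- 0.013    # 1 INR = 0.013 USD

# Sample prices in Lakh (illustrative)
price_lakh <- c(75, 41, 53.4)

# Lakh -> INR -> USD
price_inr <- price_lakh * lakh_to_inr
price_usd <- price_inr * inr_to_usd

print(data.frame(price_lakh, price_inr, price_usd))
```

A price of 75 Lakh therefore corresponds to 7,500,000 INR, or 97,500 USD at this rate.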

- Convert the room count to character type to make later grouping easier.

# change BHK type to character
house_df$BHK <- as.character(house_df$BHK)

- Remove the invalid room counts 2.5 and 3.5.

# remove incorrect number of rooms
house_df <- house_df[!(house_df$BHK == 2.5 | house_df$BHK == 3.5), ]

- Each selling price is set by the seller, so missing prices are not filled in with approximate values, which could stray far from the real price. Rows that contain NA are removed instead.

# remove null data
house_df <- house_df[complete.cases(house_df), ]

- Remove unused columns.

# remove col
house_df <- house_df %>%
    select(-c("X", "latitude", "longitude"))

- Check the data after cleaning.

# check NA
any(is.na(house_df))

## [1] FALSE

# check data set
dim(house_df)

## [1] 3871   13

# check new fixed values
table(house_df$housetype)

## 
##         Apartment Independent house 
##              2814              1057

table(house_df$house_condition)

## 
##  new  old 
## 1769 2102

table(house_df$BHK)

## 
##    1   10    2    3    4    5    6    7    8 
##  919    4 1898  830  167   30   14    7    2

# check values range
range(house_df$total_sqft)

## [1]   150 40000

range(house_df$price)

## [1]   1 700

range(house_df$price_INR)

## [1] 1e+05 7e+07

range(house_df$per_month_emi)

## [1]   1.05 529.00

range(house_df$month_emi)

## [1]   1050 529000

Initial data exploration

After cleaning, the data set has 13 columns and 3,871 rows.
Selling prices range from 100,000 INR to 70,000,000 INR, a huge gap.
Area (square feet) and monthly EMI show similarly wide ranges.

Phase Four : Data Consolidation

- Aggregate counts, means, and standard deviations, and compute the percentage share of each value.

house_df_v2 <- house_df %>%
    group_by(BHK) %>%
    summarise(
        total = n(),
        avg_price_INR = mean(price_INR),
        sd_price_INR = sd(price_INR),
        avg_sqft = round(mean(total_sqft), digits = 2),
        avg_emi_INR = round(mean(month_emi), digits = 2)
    ) %>%
    mutate(
        price_percentage = paste(round(avg_price_INR / sum(avg_price_INR) * 100, digits = 2), "%"),
        total_percentage = paste(round(total / sum(total) * 100, digits = 2), "%"),
        sqft_percentage = paste(round((avg_sqft / sum(avg_sqft)) * 100, digits = 2), "%"),
        emi_percentage = paste(round((avg_emi_INR / sum(avg_emi_INR)) * 100, digits = 2), "%")
    )

- Remove the space between the number and the percent sign.

# remove percentage symbol spaces
house_df_v2$price_percentage <- gsub(" ", "", house_df_v2$price_percentage)
house_df_v2$total_percentage <- gsub(" ", "", house_df_v2$total_percentage)
house_df_v2$sqft_percentage <- gsub(" ", "", house_df_v2$sqft_percentage)
house_df_v2$emi_percentage <- gsub(" ", "", house_df_v2$emi_percentage)
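As a side note, the space comes from paste(), whose default separator is a single space. A minimal sketch of an alternative, assuming the label only needs a number followed by a percent sign, is to build it with paste0() so no cleanup is needed. The `share` values are made up.

```r
# Illustrative percentage values
share <- c(1.754, 15.648, 2.92)

# paste() inserts a space between its arguments by default
with_space <- paste(round(share, digits = 2), "%")

# paste0() concatenates without a separator
no_space <- paste0(round(share, digits = 2), "%")

print(with_space)
print(no_space)

# Both routes end in the same labels
print(identical(no_space, gsub(" ", "", with_space)))
```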

- Reorder the room counts to make later charts easier to plot.

# reorder rooms type
house_df_v2$BHK <- factor(house_df_v2$BHK, levels = c("1", "2", "3", "4", "5", "6", "7", "8", "10"))
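The reason for the explicit ordering is that BHK was converted to character earlier, and character values sort alphabetically, which puts "10" right after "1" (as seen in the table() output above). A small sketch with a made-up vector `bhk`:

```r
# Room counts stored as character (illustrative subset)
bhk <- c("1", "10", "2", "3", "8")

# Character sorting places "10" before "2"
print(sort(bhk))

# A factor with explicit levels sorts in the order given
bhk_f <- factor(bhk, levels = c("1", "2", "3", "8", "10"))
print(sort(bhk_f))
```

Plots and tables that respect factor levels will then show 10 BHK last.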

- Aggregate by house type: total count, total price, and the percentage of each.

# house type summary
type_price <- house_df %>%
    group_by(housetype) %>%
    summarise(
        total = n(),
        price_INR_sum = sum(price_INR)
    ) %>%
    mutate(
        price_percentage = paste(round(price_INR_sum / sum(price_INR_sum) * 100, digits = 2), "%"),
        total_percentage = paste(round(total / sum(total) * 100, digits = 2), "%")
    )

# remove percentage symbol spaces
type_price$price_percentage <- gsub(" ", "", type_price$price_percentage)
type_price$total_percentage <- gsub(" ", "", type_price$total_percentage)

- Aggregate by condition (new or old): total count, total price, and the percentage of each.

# old & new condition summary
condition_df <- house_df %>%
    group_by(house_condition) %>%
    summarise(
        total = n(),
        price_INR_sum = sum(price_INR)
    ) %>%
    mutate(
        price_percentage = paste(round(price_INR_sum / sum(price_INR_sum) * 100, digits = 2), "%"),
        total_percentage = paste(round(total / sum(total) * 100, digits = 2), "%")
    )

# remove percentage symbol spaces
condition_df$price_percentage <- gsub(" ", "", condition_df$price_percentage)
condition_df$total_percentage <- gsub(" ", "", condition_df$total_percentage)

EMI and Area Level Adjustment

- According to data published by the International Monetary Fund, India's per capita income in 2021 was 2,190.901 USD, which converts to about 168,538 INR per year, or about 14,044.83 INR per month. Measured against per capita income, house prices are very high. On that basis, EMI and area are graded into levels to find properties suitable for investment or owner-occupation under these constraints and to establish a cost-performance ratio for each property.

EMI and House Area Level

- Taking the average monthly income of 14,044.83 INR as the baseline, 30% of it is about 4,213.449 INR. The following EMI levels are derived in the same way:

1 ~ 1400INR >> expected (0% ~ 10%)

1401 ~ 5600INR >> affordable (10% ~ 40%)

5601 ~ 11200INR >> Investable (40% ~ 80%)

11201 ~ 100000INR >> costly (far beyond the per capita range)

100001 ~ 529000INR >> inflated (far beyond the per capita range)
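The percentage thresholds can be reproduced with a short calculation. This is only a sketch of the arithmetic: the variable names are hypothetical, and the level table's cut points (1400, 5600, 11200) appear to be these results rounded down to convenient figures.

```r
# Average monthly income in INR, as stated in this report
monthly_income <- 14044.83

# Shares of income used for the EMI levels, plus the 30% reference point
share <- c(expected = 0.10, reference = 0.30, affordable = 0.40, investable = 0.80)

# EMI amount that corresponds to each share
emi_threshold <- monthly_income * share
print(round(emi_threshold, 2))
```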

- Area is in square feet. One Taiwanese ping is about 35.583 Sqft, and a common home layout is roughly 20 to 45 ping, or about 711 ~ 1600 Sqft. The following area levels are based on that standard:

1 ~ 710Sqft >> small

711 ~ 1600Sqft >> median

1601 ~ 2300Sqft >> large

2301 ~ 40000Sqft >> huge
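The ping-to-square-foot conversion behind the "median" range can be sketched as follows. The helper names `ping_to_sqft` and `sqft_to_ping` are hypothetical, and the factor 35.583 is the one quoted above.

```r
# Conversion factor quoted in this report: 1 ping is about 35.583 Sqft
sqft_per_ping <- 35.583

# Hypothetical helpers for converting in both directions
ping_to_sqft <- function(ping) ping * sqft_per_ping
sqft_to_ping <- function(sqft) sqft / sqft_per_ping

# A common layout of 20 to 45 ping, expressed in square feet
common_range_sqft <- ping_to_sqft(c(20, 45))
print(common_range_sqft)

# A 1,000 Sqft listing expressed in ping
print(sqft_to_ping(1000))
```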

- Create the EMI and Sqft levels for each property.

level_df <- house_df

level_df <- level_df %>%
    mutate(
        price_INR_ratio = price_INR,
        sqft_level = total_sqft,
        emi_level = month_emi
    )


# classification sqft
level_df$sqft_level <- recode( # car package
    level_df$sqft_level, "lo:710 = 'small'; 711:1600 = 'median'; 1601:2300 = 'large'; 2301:40000 = 'huge'",
    as.factor = TRUE,
    levels = c("small", "median", "large", "huge")
)


# classification emi
level_df$emi_level <- recode(
    level_df$emi_level,
    "lo:1400 = 'expected'; 1401:5600 = 'affordable'; 5601:11200 = 'Investable'; 11201:100000 = 'costly'; 100001:529000 = 'inflated'",
    as.factor = TRUE,
    levels = c("expected", "affordable", "Investable", "costly", "inflated")
)

- Find properties that are still affordable at the per capita income level.

suitable_object <- level_df %>%
    select(address, owners, housetype, house_condition, BHK, month_emi, price_INR, emi_level, sqft_level) %>% 
    filter(emi_level == "affordable", sqft_level == "median", price_INR < 1000000) %>% 
    view()

suitable_object

##                                          address        owners
## 764                 Dhanlaxmi, Ghatkopar, Nashik  Awani Kakkad
## 4594                            Gangapur, Nashik    SUNIL KALE
## 4595                        Konark Nagar, Nashik BHAVESH GURAV
## 5178 lahvit gav patill galli,Nashik Road, Nashik Pankaj Dhumal
## 5179                       Prabhat Nagar, Nashik      Ravindra
##              housetype house_condition BHK month_emi price_INR  emi_level
## 764          Apartment             old   2      1590    300000 affordable
## 4594 Independent house             new   2      2120    400000 affordable
## 4595 Independent house             new   2      3100    585000 affordable
## 5178 Independent house             old   2      5220    985000 affordable
## 5179 Independent house             old   1      4770    900000 affordable
##      sqft_level
## 764      median
## 4594     median
## 4595     median
## 5178     median
## 5179     median

Phase Five : Trend Analysis

Based on the data cleaning and aggregation above, the listings are cross-analyzed on five dimensions: house type, selling price, EMI, area, and condition (new or old).

1. By listing count, the three most common room types are 2 BHK > 1 BHK > 3 BHK, which account for 49.03%, 23.74%, and 21.44% of all listings.

2. One- to three-room homes show a few unusually high prices, but overall their prices fluctuate less than those of other room counts. From four rooms upward, prices fluctuate noticeably and have not stabilized.

3. By house type, apartments are the most common listing at 72.7% of the total, yet their combined selling price exceeds that of independent houses by only 17.34%, which shows that independent houses are relatively expensive locally.

4. There is no obvious gap between new and old houses. They make up 45.7% and 54.3% of listings and 46.38% and 53.62% of the total selling price, so condition does not visibly affect the total sales amount.

5. Seven- and eight-room homes have the largest area shares, 23.6% and 26.6%, far above ten-room homes at 11.48% and five- and six-room homes at 10.1% and 10.2%. In terms of monthly installment pressure, three-, four-, five-, and six-room homes each account for more than 10%. Five-room homes carry the heaviest repayment pressure at 19.26%, with a monthly EMI above 50,000 INR.

6. Setting per capita income aside, the EMI-versus-area trend peaks where the EMI is between 20,000 INR and 30,000 INR. This range offers the best cost-performance ratio: the most usable area for the lowest payment pressure.
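The listing shares quoted in point 1 can be reproduced from the cleaned BHK counts shown in the table() output of Phase Three. The sketch below hard-codes those counts for illustration; the vector name `bhk_counts` is hypothetical.

```r
# BHK counts after cleaning, copied from the table() output above
bhk_counts <- c(
    "1" = 919, "2" = 1898, "3" = 830, "4" = 167, "5" = 30,
    "6" = 14, "7" = 7, "8" = 2, "10" = 4
)

# Share of each room count in percent
bhk_share <- round(bhk_counts / sum(bhk_counts) * 100, digits = 2)

# Three most common room counts
top3 <- sort(bhk_share, decreasing = TRUE)[1:3]
print(sum(bhk_counts))
print(top3)
```

The counts add up to the 3,871 rows of the cleaned data set, and the top three shares are 49.03%, 23.74%, and 21.44% for 2, 1, and 3 BHK.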

Phase Six : Visualization

- Price range for each room count.

# Rooms type & Price relation
price_max_min <- ggplot(
    data = house_df,
    mapping = aes(x = reorder(BHK, price), y = price, fill = BHK)
) +
    geom_boxplot() +
    scale_y_continuous(
        name = "Price Unit : Lakh",
        breaks = seq(0, max(house_df$price), by = 100)
    ) +
    scale_x_discrete(name = "Rooms Type") +
    labs(
        title = "Price range by room count",
        caption = "1 Lakh = 100,000 INR"
    ) +  # add note
    guides(fill = guide_legend(title = "Rooms Type"))

price_max_min
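# A note on axis breaks: the breaks argument takes a vector of tick positions.
# The sketch below (with an assumed maximum of 700 Lakh, the top of the price
# range found earlier) contrasts c(), which only combines the three numbers it
# is given, with seq(), which generates an evenly spaced grid.

```r
# Assumed maximum price in Lakh (top of the range reported earlier)
max_price <- 700

# c() returns exactly the three values it is given
breaks_c <- c(0, max_price, 100)

# seq() returns a regular grid from 0 to the maximum in steps of 100
breaks_seq <- seq(0, max_price, by = 100)

print(breaks_c)
print(breaks_seq)
```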

- Share of each room count among the listings on the market.

# Rooms type & Total number
price_bhk <- ggplot(
    data = house_df_v2,
    mapping = aes(x = reorder(BHK, total), y = total, fill = BHK)
) +
    geom_col() +
    scale_fill_brewer(palette = "Set1") +
    geom_text(
        aes(label = total_percentage),
        vjust = 0.5, # adjust position
        hjust = 0.5
    ) +
    labs(
        title = "Room types on the market",
        subtitle = "Total count and share of each room count"
    ) +
    xlab("Rooms Type") +
    scale_y_continuous(
        name = "Total Number",
        breaks = seq(0, max(house_df_v2$total), by = 150)
    ) +
    coord_flip() +  # graph flip
    guides(fill = guide_legend(title = "Rooms Type"))

price_bhk

- Monthly EMI by room count.

# EMI & BHK
bhk_emi <- ggplot(
    data = house_df_v2,
    mapping = aes(x = reorder(BHK, -avg_emi_INR), y = avg_emi_INR, fill = BHK)
) +
    geom_col() +
    scale_fill_brewer(palette = "Set3") +
    geom_text(
        aes(label = emi_percentage),
        vjust = -0.3
    ) +
    labs(
        title = "Does a higher payment scale with the number of rooms?",
        subtitle = "Room count vs monthly EMI",
        caption = "Price Unit : INR"
    ) +
    xlab("Rooms type") +
    ylab("AVG EMI") +
    guides(fill = guide_legend(title = "Rooms Type"))

bhk_emi

- Area by room count.

# SQFT & BHK
bhk_sqft <- ggplot(
    data = house_df_v2,
    mapping = aes(x = reorder(BHK, -avg_sqft), y = avg_sqft, fill = BHK)
) +
    geom_col() +
    scale_fill_brewer(palette = "Paired") +
    geom_text(
        aes(label = sqft_percentage),
        vjust = -0.3
    ) +
    labs(
        title = "Do more rooms mean a larger area?",
        subtitle = "Room count vs area",
        caption = "Area Unit : Square Feet"
    ) +
    xlab("Rooms Type") +
    ylab("AVG Sqft") + 
    guides(fill = guide_legend(title = "Rooms Type"))

bhk_sqft

- Ratio of independent houses to apartments.

# Apartment & Independent percentage
# note: the slices are price sums while the labels are listing-count percentages
pie(
    type_price$price_INR_sum,
    labels = type_price$total_percentage,
    col = c("slateblue3", "tan2"),
    main = "Different House Type Percentage"
)
# add notes
legend(
    "topright",
    legend = c("Apartment", "Independent"),
    title = "House Type",
    fill = c("slateblue3", "tan2"),
    cex = 1.2
)

- Ratio of new to old houses.

# New & Old house percentage
pie(
    condition_df$price_INR_sum,
    labels = condition_df$price_percentage,
    col = c("#d425ae", "#5bb91d"),
    main = "New vs Old House Percentage"
)

legend(
    "topright",
    legend = c("New House", "Old House"),
    title = "House Condition",
    fill = c("#d425ae", "#5bb91d"),
    cex = 1.2 # text size
)

- Trend of monthly EMI versus area.

# EMI vs SQFT
emi_sqft <- ggplot(
    data = house_df_v2,
    mapping = aes(x = avg_emi_INR, y = avg_sqft)) +
    geom_point(
        alpha = 0.6, # transparency
        size = 2.0) +
    geom_smooth(
        method = "loess",
        aes(x = avg_emi_INR, y = avg_sqft)) +
    labs(
        title = "Does a higher monthly EMI buy a larger area?",
        subtitle = "Average EMI vs average area",
        caption = "EMI Unit = INR") +
    xlab("AVG EMI") +
    scale_y_continuous(
        name = "AVG Square Feet",
        breaks = c(0, max(house_df_v2$avg_sqft)))

emi_sqft

## `geom_smooth()` using formula 'y ~ x'

- Distribution of properties by area level and EMI level.

# Sqft and EMI Level distribution
emi_sqft_level <- ggplot(
    data = level_df,
    mapping = aes(
        x = emi_level,
        fill = sqft_level
    )) +
    geom_bar() +
    guides(fill = guide_legend(title = "Sqft Level")) + 
    xlab("EMI Level") + 
    ylab("Total") + 
    labs(title = "Distribution after classification", subtitle = "EMI & Sqft")

emi_sqft_level

- Distribution of properties by area level and room count.

# Sqft Level and BHK distribution
sqft_bhk_level <- ggplot(
    data = level_df,
    mapping = aes(
        x = sqft_level,
        fill = BHK
    )) +
    geom_bar() + 
    guides(fill = guide_legend(title = "Rooms Type")) + 
    xlab("Sqft Level") + 
    ylab("Total") + 
    labs(title = "Area distribution by room count", subtitle = "BHK & Sqft Level")

sqft_bhk_level

Phase Seven : Conclusion

Conclusion after analysis

India's population ranks among the top three in the world and its economy continues to grow. However, low per capita income and a wide gap between rich and poor mean that not everyone can enjoy a normal standard of living, and housing prices are one part of that. In this data set, house prices range from 1,300 USD (about 39,000 TWD) to 910,000 USD (about 27,300,000 TWD). This gap is not something that can be closed in a short time. It needs adjustments in national policy and many other areas. For a buyer whose income sits right at the per capita level, the levels calculated in this report can be used to find suitable properties.

Sources

Per capita income : International Monetary Fund

Demographic and economic profile : World Bank data (WorldBank.org)

========================================================================================================

I'm Rex

Thank you for reading this data analysis report on Nashik apartment prices. If you find any shortcomings, please feel free to discuss them with me.
