Google Data Analytics Capstone

In this project I will be taking a look at the pernicious effects of the HDB’s 99-year leasehold by quantifying the rate of intrinsic value deterioration, not accounting for inflation or market conditions.

Contents

  1. Data Preparation
  2. Analysis
  3. Key Findings
  4. Limitations

1. Data Preperation

The data set used in this project can be found at https://data.gov.sg/dataset/resale-flat-prices
This data is based on the date of registration for resale transactions between Jan 2012 and Jan 2023. Firstly since the data from the source splits the years up in 3 different files, we will have to append them together with power query.

Fig.1 Appending all 3 files together

The next step is to calculate the Remaining Lease of each unit at the time of transaction. Which is given by the equation: \[RL = 99 - (Y_{1} - Y_{0}) \]

This is done with the following formula in Microsoft Excel:

=DATE(99-(YEAR([@month])-[@[lease_commence_date]]),MONTH([@month]),DAY([@month]))

Now that we have prepared our data, let’s get familiar with its contents before we begin analysis.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
dataset <- read.csv("Book1.csv")
glimpse(dataset)
## Rows: 234,765
## Columns: 11
## $ month               <chr> "03-2012", "03-2012", "03-2012", "03-2012", "03-20…
## $ town                <chr> "ANG MO KIO", "ANG MO KIO", "ANG MO KIO", "ANG MO …
## $ flat_type           <chr> "2 ROOM", "2 ROOM", "3 ROOM", "3 ROOM", "3 ROOM", …
## $ block               <chr> "172", "510", "610", "474", "604", "154", "110", "…
## $ street_name         <chr> "ANG MO KIO AVE 4", "ANG MO KIO AVE 8", "ANG MO KI…
## $ storey_range        <chr> "06 TO 10", "01 TO 05", "06 TO 10", "01 TO 05", "0…
## $ floor_area_sqm      <int> 45, 44, 68, 67, 67, 68, 67, 67, 67, 67, 68, 67, 68…
## $ flat_model          <chr> "Improved", "Improved", "New Generation", "New Gen…
## $ lease_commence_date <int> 1986, 1980, 1980, 1984, 1980, 1981, 1978, 1979, 19…
## $ remaining_lease     <int> 73, 67, 67, 71, 67, 68, 65, 66, 66, 72, 68, 67, 67…
## $ resale_price        <int> 250000, 265000, 315000, 320000, 321000, 321000, 32…
unique(dataset$town)
##  [1] "ANG MO KIO"      "BEDOK"           "BISHAN"          "BUKIT BATOK"    
##  [5] "BUKIT MERAH"     "BUKIT PANJANG"   "BUKIT TIMAH"     "CENTRAL AREA"   
##  [9] "CHOA CHU KANG"   "CLEMENTI"        "GEYLANG"         "HOUGANG"        
## [13] "JURONG EAST"     "JURONG WEST"     "KALLANG/WHAMPOA" "MARINE PARADE"  
## [17] "PASIR RIS"       "PUNGGOL"         "QUEENSTOWN"      "SEMBAWANG"      
## [21] "SENGKANG"        "SERANGOON"       "TAMPINES"        "TOA PAYOH"      
## [25] "WOODLANDS"       "YISHUN"

The output above shows the 11 columns that we’re working with and the 234,765 transactions that make up this data set. Each one of these transactions come from one of the 26 housing estates as seen above. 


2. Analysis

One might be tempted to assume a linear rate of deterioration in property value, trending toward zero. However that is not the case, the typical HDB’s value follows the proverbial Bala’s Curve (Fig.2). The property value is a concave downward graph that exponentially decreases it value the lower the tenure left. Fig.2 Bala’s Curve

To visualize this pernicious effect; Consider a 3-Room flat in Pasir Ris estate using our data let’s find out the average deterioration of value by creating a linear model, based on the covariance between resale price and the remaining lease. The next step is to import the “broom” package to convert our model into tidy tibbles in order to extract our estimate.

The code below does just that and outputs a summary of the regression model, results to note are the Estimate (4449.6) and the Adjusted-R squared (0.6088).

price_model <-dataset %>%
  filter(town == "PASIR RIS", flat_type == "3 ROOM") %>% 
  lm(resale_price ~ remaining_lease , data =.)
summary(price_model)
## 
## Call:
## lm(formula = resale_price ~ remaining_lease, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -107820  -42656   -9518   26079  144243 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      16395.6    25130.6   0.652    0.515    
## remaining_lease   4449.6      305.1  14.583   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53940 on 135 degrees of freedom
## Multiple R-squared:  0.6117, Adjusted R-squared:  0.6088 
## F-statistic: 212.7 on 1 and 135 DF,  p-value: < 2.2e-16
library(broom)
price_model %>% 
  tidy() %>% 
  filter(term == "remaining_lease") %>% 
  pull(estimate) -> slope

Now that we have the estimate variable (slope) we can pass this into our line plot as an annotation.

# Format result output as integers 
options(scipen=100)
# Visualizing the model
dataset %>% 
  ggplot(aes(x = factor(remaining_lease), y = resale_price, group = 1)) +
  geom_line(stat = 'summary', fun = 'mean') +
  geom_smooth(method = 'lm')+
  labs(title = "Relationship between resale prices and lease tenure")+
  annotate("text", x =20, y= 600000 ,label= (paste0("Depreciation ==", slope)), parse = TRUE )


Now we will plot this for each estate individually with the facet_wrap function.

dataset %>% 
  filter(flat_type == '4 ROOM') %>% 
  ggplot(aes(x = factor(remaining_lease), y = resale_price, group = 1)) +
  geom_line(stat = 'summary', fun = 'mean') +
  geom_smooth(method = 'lm')+ 
  labs(title = "Tenure deterioration rate for 4-Room flats by estate")+
  facet_wrap(~town)

3. Key Findings

These different deterioration profile should be noted by potential homeowners and must be considered in their decision making process. Homeowners are generally aware of the estate premium they pay for a more accessible location but far too often forget about the long-term value deterioration that comes with these ‘premium’ estates come with.

4. Limitations

This is a gross oversimplification and overlooks the many complexities of real estate (i.e. Interest Rates, Inflation, Market Sentiment). The scope of data only tracks transactions between 2012 - 2023 Jan, a relatively prosperous period for the housing market, and is devoid of the many fluctuations in the preceding decades (e.g. 2008 Recession, Asian Finical Crisis). This case study extrapolates a linear model from a typically non-linear projection,