Bidding War on Housing Market? Let’s Use R For Exploratory Data Analysis

Photo by Nick Brunner on Unsplash
Photo by Nick Brunner on Unsplash

Exploratory Data Analysis (EDA) is a methodology in data science as the initial approach to gain insights by visualizing and summarizing data. As this article was written in August 2023, the 30-year fixed mortgage is 7.48%. Will this make housing prices go down further? We will use some exploratory data analysis technics to find the answer using our data from Zillow. 

A modern programming language with a robust data analytics ecosystem is best for efficient EDA. Python and R are popular options due to their ease of use and variety of data visualization libraries. I will use R and ggplot2 in this tutorial. 

Exploratory Data Analysis Process

Hadley Wickham’s 2nd edition R for Data Science has a chapter on exploratory data analysis, and he defines EDA as an iterative process and further breaks it into three steps:

  1. Generate questions about your data.
  2. Search for answers by visualizing, transforming, and modeling your data.
  3. Use what you learn to refine your questions and/or generate new questions.

EDA is a philosophy that any data professional should keep curiosity when exploring the data, gain understanding, and be able to prove the hypothesis from the data. 

Modern exploratory data analysis usually gives data in a 2D tabular format. They are easy to read and write, but there are better ways to visualize and summarize. 

To mitigate this issue, we have two weapons: data visualization and data aggregation. I will share steps for each of them on how to conduct EDA with Zillow’s housing data. 

Read raw Zillow Data and Load DataFrame

Zillow is the largest public real estate website for listing houses. The data is available to the public, and you can build a web scrapper to grab each individual listing. 

However, web scraping only provides a current snapshot of the listings, while retrieving historical data takes a lot of work. Fortunately, Zillow offers a time series view of multiple cities on their research website that is more convenient than web scraping. That’s what we will be using throughout the entire tutorial. 

You can download the data under Housing Data from Zillow. We will go to a section for “LIST AND SALE PRICES,” choose the median Sale-to-List Ratio, and then save the CSV file to a desired location. We can later use R to build a dataframe.

Now, let’s open up the RStuio and put down the following.

df = read.csv("~/Downloads/Metro_median_sale_to_list_uc_sfrcondo_sm_month.csv")
df |> head()

read.csv is a built-in function to read a CSV file into dataframe. There are various options to decide how to read a CSV file. Most of the time, the default will work as expected. Here’s what the output will look like:

RegionID SizeRank      RegionName RegionType StateName X2018.03.31 X2018.04.30 X2018.05.31 X2018.06.30 X2018.07.31
1 102001 0 United States country 0.9852031 0.9879375 0.9907951 0.9926868 0.9929328
2 394913 1 New York, NY msa NY 0.9782334 0.9786714 0.9797537 0.9816898 0.9833501
3 753899 2 Los Angeles, CA msa CA 0.9990979 1.0000000 1.0000000 1.0000000 1.0000000
4 394463 3 Chicago, IL msa IL 0.9739560 0.9768094 0.9796032 0.9810988 0.9813994
5 394514 4 Dallas, TX msa TX 0.9917648 0.9959621 0.9986395 1.0000000 0.9989899
6 394692 5 Houston, TX msa TX 0.9770657 0.9790790 0.9807165 0.9819711 0.9818681

The dataset is sorted by population. It tracks the sale-to-price ratio, which indicates whether the final sell price is higher or lower than the listing price since March 2018.

Sale-to-list Ratio Trends Visualization

Usually, the clean-up is the first step to bringing the data into a clean stage. However, data clean-up analysis is an iterative process. As we gain more understanding of the data, we may need a second or third round of data cleaning. 

Before analyzing the real estate market, choosing a region is important. San Jose, my current city, will be my focus. The analysis will begin by filtering location and examining sale-to-list ratio trends.

The sale-to-list ratio is not an indicator of housing price increase, but it can be used to measure market interest and the possibility of a bidding war. 

					df <- df |> filter(RegionName == "San Jose, CA") |> pivot_longer(
  cols = starts_with("X"),
  names_to = "date",
  values_to = "sale_to_list_ratio",
  values_drop_na = TRUE
ggplot(data = df, aes(x=date, y=sale_to_list_ratio)) +
  geom_point() + 
  geom_line(group = 1)
Initial sale-to-list ratio trends visualization | Image By Author
Initial sale-to-list ratio trends visualization | Image By Author

Reasons behind the housing bidding war

The visualization above clearly shows exactly how crazy the market was during 2017–2018; this aligned with my observation of the housing market in the Bay Area. Then, the sale-to-list ratio suddenly dropped to less than 1 toward the end of 2018. Then, it became stable in 2019–2020. 

With low-interest rates, limited inventory, and work-from-home needs, the bidding war resumed in 2021 until the stock market bear market at the end of 2022. 

What is the reason for the sale-to-list being over 1 again in the last 3 years? One hypothesis I had earlier is the stock market. We saw the AI concepts pushing the stock market to an all-time high. People in the Bay Area compensate on the stock market more than other cities. We need to plot the stock market price in the same plot to prove that.

We can download the QQQ monthly close price from Yahoo, then use a similar way to read into R and normalize the price to a similar sale-to-list ratio. The stock price is not the main reason for the 2017–2018 increase, but it seems we have a stronger correlation since 2020, but not from 2017–2018.

sale-to-list ratio trends and stock market | Image by Author
sale-to-list ratio trends and stock market | Image by Author

We can further check data starting from 2019; the correlation is 0.596. However, if we include 2018, it is only 0.0045. You can iterate on this process by continuing to pull the data and explore more interesting findings.

Final Thought

Exploratory Data Analysis (EDA) is essential for gaining insights into data. A solid process and various tools can greatly aid data professionals, including data engineers, data scientists, and data analysts. EDA forms the foundation for any type of data work and this article aims to provide insights for starting EDA on your next project.

About Me

I hope my stories are helpful to you. 

For data engineering post, you can also subscribe to my new articles or becomes a referred Medium member that also gets full access to stories on Medium.

In case of questions/comments, do not hesitate to write in the comments of this story or reach me directly through Linkedin or Twitter.

More Articles

Photo by saeed karimi on Unsplash

5 Hidden Apache Spark Facts That Fewer People Talk About

I want to share 5 hidden facts about Apache Spark that I learned throughout my career. Those can be helpful to you to save you ...
Read More →
Photo by Katerina Pavlyuchkova on Unsplash

Think In SQL – Window Function Demisify

Understanding window function is critical for anyone that writes SQL daily. In this story, let's think in SQL and demystify the window function with examples ...
Read More →
Second Iteration: Interactivity with User Click | Image By Author

How to Engage with Users By Storytelling: Show Data Analytics in R and Shiny

Using R and Shiny, we can build an app where the end users can interact with the data analysis we have done. I will show ...
Read More →

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top
Share via
Copy link