Exploratory Data Analysis (EDA) is a methodology in data science as the initial approach to gain insights by visualizing and summarizing data. As this article was written in August 2023, the 30-year fixed mortgage is 7.48%. Will this make housing prices go down further? We will use some exploratory data analysis technics to find the answer using our data from Zillow.
A modern programming language with a robust data analytics ecosystem is best for efficient EDA. Python and R are popular options due to their ease of use and variety of data visualization libraries. I will use R and ggplot2 in this tutorial.
Exploratory Data Analysis Process
Hadley Wickham’s 2nd edition R for Data Science has a chapter on exploratory data analysis, and he defines EDA as an iterative process and further breaks it into three steps:
- Generate questions about your data.
- Search for answers by visualizing, transforming, and modeling your data.
- Use what you learn to refine your questions and/or generate new questions.
EDA is a philosophy that any data professional should keep curiosity when exploring the data, gain understanding, and be able to prove the hypothesis from the data.
Modern exploratory data analysis usually gives data in a 2D tabular format. They are easy to read and write, but there are better ways to visualize and summarize.
To mitigate this issue, we have two weapons: data visualization and data aggregation. I will share steps for each of them on how to conduct EDA with Zillow’s housing data.
Read raw Zillow Data and Load DataFrame
Zillow is the largest public real estate website for listing houses. The data is available to the public, and you can build a web scrapper to grab each individual listing.
However, web scraping only provides a current snapshot of the listings, while retrieving historical data takes a lot of work. Fortunately, Zillow offers a time series view of multiple cities on their research website that is more convenient than web scraping. That’s what we will be using throughout the entire tutorial.
You can download the data under Housing Data from Zillow. We will go to a section for “LIST AND SALE PRICES,” choose the median Sale-to-List Ratio, and then save the CSV file to a desired location. We can later use R to build a dataframe.
Now, let’s open up the RStuio and put down the following.
library(tidyverse) df = read.csv("~/Downloads/Metro_median_sale_to_list_uc_sfrcondo_sm_month.csv") df |> head()
read.csv is a built-in function to read a CSV file into dataframe. There are various options to decide how to read a CSV file. Most of the time, the default will work as expected. Here’s what the output will look like:
RegionID SizeRank RegionName RegionType StateName X2018.03.31 X2018.04.30 X2018.05.31 X2018.06.30 X2018.07.31
1 102001 0 United States country 0.9852031 0.9879375 0.9907951 0.9926868 0.9929328
2 394913 1 New York, NY msa NY 0.9782334 0.9786714 0.9797537 0.9816898 0.9833501
3 753899 2 Los Angeles, CA msa CA 0.9990979 1.0000000 1.0000000 1.0000000 1.0000000
4 394463 3 Chicago, IL msa IL 0.9739560 0.9768094 0.9796032 0.9810988 0.9813994
5 394514 4 Dallas, TX msa TX 0.9917648 0.9959621 0.9986395 1.0000000 0.9989899
6 394692 5 Houston, TX msa TX 0.9770657 0.9790790 0.9807165 0.9819711 0.9818681
The dataset is sorted by population. It tracks the sale-to-price ratio, which indicates whether the final sell price is higher or lower than the listing price since March 2018.
Sale-to-list Ratio Trends Visualization
Usually, the clean-up is the first step to bringing the data into a clean stage. However, data clean-up analysis is an iterative process. As we gain more understanding of the data, we may need a second or third round of data cleaning.
Before analyzing the real estate market, choosing a region is important. San Jose, my current city, will be my focus. The analysis will begin by filtering location and examining sale-to-list ratio trends.
The sale-to-list ratio is not an indicator of housing price increase, but it can be used to measure market interest and the possibility of a bidding war.
df <- df |> filter(RegionName == "San Jose, CA") |> pivot_longer( cols = starts_with("X"), names_to = "date", values_to = "sale_to_list_ratio", values_drop_na = TRUE ) ggplot(data = df, aes(x=date, y=sale_to_list_ratio)) + geom_point() + geom_line(group = 1)
Reasons behind the housing bidding war
The visualization above clearly shows exactly how crazy the market was during 2017–2018; this aligned with my observation of the housing market in the Bay Area. Then, the sale-to-list ratio suddenly dropped to less than 1 toward the end of 2018. Then, it became stable in 2019–2020.
With low-interest rates, limited inventory, and work-from-home needs, the bidding war resumed in 2021 until the stock market bear market at the end of 2022.
What is the reason for the sale-to-list being over 1 again in the last 3 years? One hypothesis I had earlier is the stock market. We saw the AI concepts pushing the stock market to an all-time high. People in the Bay Area compensate on the stock market more than other cities. We need to plot the stock market price in the same plot to prove that.
We can download the QQQ monthly close price from Yahoo, then use a similar way to read into R and normalize the price to a similar sale-to-list ratio. The stock price is not the main reason for the 2017–2018 increase, but it seems we have a stronger correlation since 2020, but not from 2017–2018.
We can further check data starting from 2019; the correlation is 0.596. However, if we include 2018, it is only 0.0045. You can iterate on this process by continuing to pull the data and explore more interesting findings.
Exploratory Data Analysis (EDA) is essential for gaining insights into data. A solid process and various tools can greatly aid data professionals, including data engineers, data scientists, and data analysts. EDA forms the foundation for any type of data work and this article aims to provide insights for starting EDA on your next project.
I hope my stories are helpful to you.