Overview:
Housing prices in the United States (US) continue to increase as incomes rise, unemployment drops, and industries grow. Our team selected this topic in order to predict how housing prices will change over the years as we decide where we want to relocate long-term.
Objective:
By analyzing housing market data and trends between 2015-2019, the Housing Price Prediction Tool will predict whether the housing market will rise or drop in the 50 largest cities in the US. For example, someone who works in the Technology sector will be able to compare the income, housing price, and population demographics of San Francisco, Austin, and Seattle while they are applying for jobs. This could help them better understand similarities and differences between different cities and aid their decision making process.
Background: This was the final project in my Data Analytics Boot Camp. The goal of the project was to implement many of the skills we learned.
Detailed descriptions of our data analysis can be found in our presentation.
Here are the housing price trends of New York (top) and Los Angeles (bottom), after we cleaned null values from our data. We found that housing prices in Los Angeles rose in a more linear and predictable fashion than those in New York, which were more sporadic.
Kaggle: Zillow US House Price Data
Census: US City and Town Population Totals: 2010-2019
Bureau of Labor Statistics: Unemployment Rates by City
For our database, we used PostgreSQL, managed through pgAdmin, and hosted our raw data in an AWS S3 bucket. This enables anyone with the access credentials to work with the project data. The image below shows the tables of data uploaded into the Postgres database. The entity relationship diagram (ERD) made joining tables with SQL easier and was a helpful reference while importing data into the database. Three main tables were used to build and train the machine learning model.
The most common and obvious connection between all of our datasets is the State column.
The first step was to check which data types were inside the CSV file containing our data for each city. We found that our dataset had city name, state, county, and the average sales price for all home types in each city, with monthly time steps from 2006 to 2020.
The next step was to check for duplicates and null values in the DataFrame we created. We kept the first of each set of duplicates and dropped all rows (cities) with more than 10% null values. This left a little over 17,000 cities with data from 2016-2020.
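These cleaning steps can be sketched in pandas. The city names, prices, and 12-month window below are made up for illustration; the real dataset had monthly columns spanning 2006 to 2020:

```python
import pandas as pd

# Toy stand-in for the Zillow CSV: one row per city, monthly price columns
# (prices in $1,000s; all values here are illustrative).
months = [f"2016-{m:02d}" for m in range(1, 13)]
rows = [
    ["New York", "NY"] + [500 + 5 * i for i in range(12)],
    ["New York", "NY"] + [500 + 5 * i for i in range(12)],   # exact duplicate
    ["Smallville", "KS"] + [None, None, None] + [95 + i for i in range(9)],
    ["Austin", "TX"] + [None] + [300 + 2 * i for i in range(11)],
]
df = pd.DataFrame(rows, columns=["city", "state"] + months)

# 1. Inspect the column data types.
print(df.dtypes)

# 2. Keep the first of each set of duplicate rows.
df = df.drop_duplicates(keep="first")

# 3. Drop rows (cities) where more than 10% of values are null.
null_frac = df.isnull().mean(axis=1)
df = df[null_frac <= 0.10].reset_index(drop=True)

print(df["city"].tolist())  # Smallville (3 of 14 values null) is dropped
```

Austin survives the 10% cutoff with a single missing month, while Smallville, missing three months, does not.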
After our preliminary processing, we were able to perform an initial unsupervised clustering. We obtained the following 3D principal component analysis (PCA) cluster plot:
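This step can be sketched with scikit-learn, assuming K-means clustering on PCA-reduced features (the exact cluster count and clustering algorithm are assumptions here, and the data is synthetic):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned city-by-month price matrix.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 48))  # 200 cities x 48 monthly prices

# Reduce the scaled features to 3 principal components for a 3D plot.
X_scaled = StandardScaler().fit_transform(X)
pcs = PCA(n_components=3).fit_transform(X_scaled)

# Cluster the cities in PCA space (4 clusters chosen arbitrarily here).
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(pcs)

# pcs[:, 0], pcs[:, 1], pcs[:, 2] plus labels feed the 3D scatter plot.
print(pcs.shape)
```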
For the remaining null values, we decided to use a KNN (k-nearest neighbors) imputer, as a simple imputer would have filled gaps with the column-wide mean or median housing price. For housing data with large variances between big cities like New York and small towns, we believed that nearest-neighbor values would not skew the data as much as the median of the whole column.
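A minimal sketch of this using scikit-learn's `KNNImputer` on a toy price matrix (the values and `n_neighbors=2` setting are illustrative, not necessarily what we used):

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy price matrix ($1,000s): rows are cities, columns are months.
X = np.array([
    [500.0, 505.0, 510.0],   # big city A
    [495.0, 500.0, 505.0],   # big city B
    [498.0, 503.0, np.nan],  # big city C, missing a month
    [ 90.0,  91.0,  92.0],   # small town A
    [ 88.0,  89.0,  90.0],   # small town B
])

# Fill each gap from the 2 most similar rows, so the big city's missing
# month is filled from big cities, not pulled down by the small towns.
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)

print(X_filled[2, 2])  # mean of the two nearest big-city values: 507.5
```

For comparison, a simple median imputation on that column would have filled the gap with 298.5, halfway between the big cities and the small towns.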
With the 4 years of monthly time-step data for the remaining 17,000 cities, the categorical feature for each city's state was ordinal-encoded, then one-hot-encoded, and finally added to the DataFrame as features alongside the rest of the time-series data. This brought the total number of columns from 177 to 224.
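The two-step encoding can be sketched with scikit-learn (the frame below is a toy example with two states and two months; the real frame had 50 states and the full time series):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy frame: time-series price columns plus the categorical "state" column.
df = pd.DataFrame({
    "2016-01": [500.0, 300.0, 95.0],
    "2016-02": [505.0, 302.0, 96.0],
    "state":   ["NY", "TX", "TX"],
})

# Step 1: ordinal-encode the state labels to integers (NY -> 0, TX -> 1).
codes = LabelEncoder().fit_transform(df["state"])

# Step 2: one-hot-encode the integer codes into indicator columns.
onehot = OneHotEncoder().fit_transform(codes.reshape(-1, 1)).toarray()
state_cols = pd.DataFrame(onehot, columns=["state_NY", "state_TX"])

# Step 3: replace the raw state column with the indicator columns.
encoded = pd.concat([df.drop(columns="state"), state_cols], axis=1)
print(encoded.columns.tolist())
```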
For our final linear regression model, we used an 80/20 train/test split to achieve our results. The splits we tried with other methods are shown in the table below.
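A minimal sketch of the split and regression with scikit-learn, run here on synthetic data (the real model was trained on the 224-column city DataFrame described above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in: predict the next month's price from 12 prior months.
rng = np.random.default_rng(0)
X = rng.uniform(100, 900, size=(1000, 12))            # 12 monthly features
y = X[:, -1] * 1.01 + rng.normal(0, 0.1, size=1000)   # next month ~ last month

# 80/20 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42
)

# Fit the linear regression and score it with RMSE on the held-out 20%.
model = LinearRegression().fit(X_train, y_train)
rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
print(f"RMSE: {rmse:.3f}")
```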
Explanation of Model Choice (Including Limitations & Benefits):
Here are the models we tried, along with the results we got:
We used Tableau to create and host our dashboard, which connects directly to our Postgres database hosted on AWS.
After completing the project and viewing the predictions, we can see that not all housing prices will increase in the next year. The machine learning model we selected achieved an RMSE of less than $200, which offers a strong prediction from the data provided. If we look at a city like Honolulu, for example, we can see that other factors may indicate a housing market decline. The unemployment rate dropped from 2018 to 2019, but the percentage decrease was much smaller than in years past, which can indicate that unemployment will begin to level out or increase. That, in turn, can impact the housing market as fewer people are able to purchase homes. New York shows a similar scenario. We also noticed that some cities' housing prices are not increasing at as high a rate as in years past; Boston, for instance, is beginning to level out.
In conclusion, the data points we used can be correlated with increases or decreases in the housing market. We also believe there are many other data points we should examine to get a better picture: for example, viewing by zip code instead of city, or looking at political party majority in the area, weather, and so on.
One major area where we could have improved our project is taking more time to discover additional datasets and factors that may influence housing prices. There are likely many variables we could not readily find data on, and that would probably be the best place to improve our project.