The goal of this project is to practice my SQL and Python skills using a real-world dataset. As a car enthusiast and someone who has spent a lot of time on Craigslist buying and selling bicycles, I designed this project around a Kaggle dataset of automotive classified listings from Craigslist.
Follow along in the Jupyter Notebook embedded below!
Through the ETL process, the dataset was reduced from 423,857 listings to 137,212 (-68%). The listings removed were outliers, rows with null values, and rows with messy inputs.
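To make the pruning step concrete, here is a minimal sketch in plain Python of the kind of filtering described above. The column names (`price`, `year`, `manufacturer`), the price bounds, and the sample rows are all assumptions for illustration. The actual rules used in the notebook may differ.

```python
# Hypothetical required columns; the real dataset's cleaning rules may differ.
REQUIRED = ("price", "year", "manufacturer")


def clean_listings(rows, min_price=500, max_price=150_000):
    """Keep rows that have every required field and a plausible price."""
    kept = []
    for row in rows:
        # Drop listings with null or empty required fields.
        if any(row.get(col) in (None, "") for col in REQUIRED):
            continue
        # Drop price outliers (these bounds are illustrative assumptions).
        if not (min_price <= row["price"] <= max_price):
            continue
        kept.append(row)
    return kept


def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


# Made-up sample rows, not taken from the dataset.
sample = [
    {"price": 8500, "year": 2012, "manufacturer": "honda"},
    {"price": 1, "year": 2015, "manufacturer": "ford"},     # outlier price
    {"price": 12000, "year": None, "manufacturer": "bmw"},  # missing year
]

cleaned = clean_listings(sample)
print(len(cleaned))  # 1

# The reduction quoted above: 423,857 listings down to 137,212.
print(round(percent_change(423_857, 137_212), 1))  # -67.6
```

The last line shows where the -68% figure comes from: the exact change is about -67.6%, which rounds to -68%.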
In the end, the dataset was loaded into a SQLite database. The database has two tables: one with the 137,212 cleaned listings, and the other with location data, which kept its original number of rows. The listings table houses most of the relevant information we expect to use.
The location table includes all the location information for these listings.
Here is a Sankey chart depicting the pruning process we went through with the data:
The primary key that links these tables is the ID column:
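As a rough illustration of how the two tables link up, the sketch below builds a tiny in-memory SQLite database and joins on the ID column. The table names (`listings`, `locations`), every column other than `id`, and the sample rows are assumptions for illustration, not the project's actual schema.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: only the shared `id` key comes from the write-up.
conn.executescript(
    """
    CREATE TABLE listings (
        id INTEGER PRIMARY KEY,
        price INTEGER,
        manufacturer TEXT
    );
    CREATE TABLE locations (
        id INTEGER PRIMARY KEY,
        city TEXT,
        state TEXT
    );
    """
)

# Made-up rows. The locations table has one extra row (id 3) to mimic
# a location record whose listing was dropped during cleaning.
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?)",
    [(1, 8500, "honda"), (2, 12000, "bmw")],
)
conn.executemany(
    "INSERT INTO locations VALUES (?, ?, ?)",
    [(1, "denver", "co"), (2, "austin", "tx"), (3, "boise", "id")],
)

# An inner join on id returns only listings that survived cleaning.
rows = conn.execute(
    """
    SELECT l.id, l.manufacturer, l.price, loc.city, loc.state
    FROM listings AS l
    JOIN locations AS loc ON loc.id = l.id
    ORDER BY l.id
    """
).fetchall()

for row in rows:
    print(row)

conn.close()
```

Because the join is an inner join, the location row with no matching cleaned listing is left out of the result.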
For the next part of this project, I used this database to create a dashboard to visualize the data.