This project was designed to practice the Extract, Transform, and Load (ETL) process on three data sources to generate an SQL database of movies. The movie information and links came from Wikipedia articles (in JSON format), and the review information came from Kaggle (in CSV format).
Once the data was imported into Pandas DataFrames, it was cleaned to remove extraneous columns, standardize inconsistent formats, and convert data types where needed.
The data was then loaded into an SQL database using SQLAlchemy.
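The load step can be sketched with SQLAlchemy and pandas' `to_sql`. This is a minimal sketch, not the project's exact code: the connection string, table name, and sample columns here are placeholders (an in-memory SQLite database stands in for the real one).

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; swap in the real database URL.
engine = create_engine("sqlite://")

# Stand-in for the cleaned movies DataFrame.
movies_df = pd.DataFrame({"imdb_id": ["tt0094675"], "title": ["Ariel"]})

# Write the DataFrame to a SQL table, replacing any existing table.
movies_df.to_sql("movies", engine, if_exists="replace", index=False)

# Read it back to confirm the load.
loaded = pd.read_sql("SELECT * FROM movies", engine)
```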
A Python function was created to take in the three input files and perform the ETL process. This file is saved as challenge.py in the GitHub repository. In doing this, we were asked to document our assumptions for the transformations we performed to clean up the data.
Note: View the file titled Movies_Challenge-TEST in the repository to see the Jupyter Notebook file housing the code.
We should filter to movies that contain both an IMDb link and a listed director. This removes data points that lack the IMDb information we need to link our multiple data sources.
We should remove listings with a "No. of Episodes" field. We assume that these entries refer to television shows. While few data points had this issue, it allowed us to remove data that is clearly not related to a movie.
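These two filters can be sketched with a list comprehension over the raw records. The sample records and the exact key names ("imdb_link", "Directed by", "No. of episodes") are assumptions about the Wikipedia JSON, not guaranteed field names:

```python
# Hypothetical sample of the raw Wikipedia records (real data has many more keys).
wiki_movies = [
    {"title": "Movie A", "imdb_link": "https://www.imdb.com/title/tt0000001/",
     "Directed by": "Jane Doe"},
    {"title": "Show B", "imdb_link": "https://www.imdb.com/title/tt0000002/",
     "Directed by": "John Roe", "No. of episodes": 10},
    {"title": "Movie C", "Directed by": "Jane Doe"},  # no IMDb link
]

# Keep entries that have an IMDb link and a director, and drop entries with an
# episode count (which we assume indicates a television show).
movies_only = [
    m for m in wiki_movies
    if "imdb_link" in m
    and ("Directed by" in m or "Director" in m)
    and "No. of episodes" not in m
]
```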
Alternate titles can be combined into one list. This reduces the number of columns without losing information related to the movie.
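One way to collapse the alternate-title columns is to pop each one into a single field on the record. The key names below ("Also known as", "Japanese", "Hepburn", ...) are illustrative guesses at Wikipedia infobox labels, not a definitive list:

```python
# Hypothetical Wikipedia record with several alternate-title keys.
movie = {
    "title": "Movie A",
    "Japanese": "タイトル",
    "Hepburn": "Taitoru",
    "Also known as": "Movie Alpha",
}

# Assumed set of alternate-title labels to consolidate.
alt_title_keys = ["Also known as", "Japanese", "Hepburn", "Literally", "Mandarin"]

# Pop each alternate-title column into a single dict stored under one key.
alt_titles = {}
for key in alt_title_keys:
    if key in movie:
        alt_titles[key] = movie.pop(key)
if alt_titles:
    movie["alt_titles"] = alt_titles
```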
There are likely duplicate entries for some of the movies. Given the size of our source data, it is not unreasonable to assume that some movies are duplicated. Removing these duplicates pares down the size of our dataset while improving the quality of the information we work with.
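De-duplication can be sketched with pandas' `drop_duplicates`, keyed on the IMDb ID (assuming an `imdb_id` column has been extracted; the sample frame is hypothetical):

```python
import pandas as pd

# Hypothetical frame with a repeated IMDb ID.
movies_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000001"],
    "title": ["Movie A", "Movie B", "Movie A"],
})

# Drop rows that share an IMDb ID, keeping the first occurrence.
movies_df = movies_df.drop_duplicates(subset="imdb_id").reset_index(drop=True)
```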
There are likely many columns with null data. Since some entries may have null values for many of the columns due to the quality of data uploaded by contributors, we determined that we would check for columns that were mostly (>90%) empty and remove them from our dataset.
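The 90% cutoff can be sketched as a column filter on null counts. The sample columns are made up; only the thresholding logic is the point:

```python
import pandas as pd

# Hypothetical frame: "obscure_field" is entirely null and should be dropped.
movies_df = pd.DataFrame({
    "title": list("ABCDEFGHIJ"),
    "budget": [1, 2, None, 4, 5, 6, 7, 8, 9, 10],
    "obscure_field": [None] * 10,
})

# Keep only columns where fewer than 90% of the values are null.
keep_cols = [c for c in movies_df.columns
             if movies_df[c].isnull().sum() < len(movies_df) * 0.9]
movies_df = movies_df[keep_cols]
```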
Some columns may have data represented in different formats. We used regular expressions to clean up these inconsistencies. Some examples are listed below:
6a) There are two main forms of numeric formats for box office and budget values. Ex: $123.4 Million vs. $123,400,000. However, within these two formats, there are many potential variations.
6b) There are four main forms of dates:
- Full month name, one- to two-digit day, four-digit year (e.g., January 1, 2000)
- Four-digit year, two-digit month, two-digit day, with any separator (e.g., 2000-01-01)
- Full month name, four-digit year (e.g., January 2000)
- Four-digit year (e.g., 2000)
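The patterns for 6a and 6b can be sketched as below. These regexes and the `parse_dollars` helper are illustrative reconstructions, not the project's exact code; the `milli?on`/`billi?on` alternations are an assumption to tolerate common misspellings:

```python
import re

# 6a) Two dominant money formats.
form_one = r"\$\s*\d+\.?\d*\s*[mb]illi?on"  # e.g. "$123.4 million"
form_two = r"\$\s*\d{1,3}(?:[,\.]\d{3})+"   # e.g. "$123,400,000"

def parse_dollars(s):
    """Convert a matched dollar string to a float, or None if unrecognized."""
    if re.match(r"\$\s*\d+\.?\d*\s*milli?on", s, flags=re.IGNORECASE):
        return float(re.sub(r"\$|\s|[a-zA-Z]", "", s)) * 1e6
    if re.match(r"\$\s*\d+\.?\d*\s*billi?on", s, flags=re.IGNORECASE):
        return float(re.sub(r"\$|\s|[a-zA-Z]", "", s)) * 1e9
    if re.match(form_two, s):
        return float(re.sub(r"\$|,", "", s))
    return None

# 6b) The four date forms.
date_form_one = r"[A-Z][a-z]+\s\d{1,2},\s\d{4}"  # January 1, 2000
date_form_two = r"\d{4}.[0-1]\d.[0-3]\d"         # 2000-01-01 (any separator)
date_form_three = r"[A-Z][a-z]+\s\d{4}"          # January 2000
date_form_four = r"\d{4}"                        # 2000
```

In the actual cleaning code, patterns like these would be applied with pandas string methods (e.g. `str.extract`) across the box office, budget, and release date columns.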
There are some redundant data columns between our datasets. We will decide to keep some while dropping others, as follows:
| Wiki Column | Kaggle Column | Decision |
|---|---|---|
| title_wiki | title_kaggle | Drop Wiki - Kaggle data is better, with no missing values |
| running_time | runtime | Keep Kaggle data and fill in zeros with Wiki data |
| budget_wiki | budget_kaggle | Keep Kaggle data and fill in zeros with Wiki data |
| box_office | revenue | Keep Kaggle data and fill in zeros with Wiki data |
| release_date_wiki | release_date_kaggle | Drop Wiki - Kaggle data is better, with no missing values |
| language | original_language | Drop Wiki data |
| Production company(s) | production_companies | Drop Wiki data - Kaggle is more consistent |
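The "keep Kaggle and fill in zeros with Wiki data" rule can be sketched with a small helper. The function name and sample columns are illustrative; the real merged frame has many more columns:

```python
import pandas as pd

# Hypothetical merged frame with redundant columns from each source.
movies_df = pd.DataFrame({
    "runtime": [0, 120.0],          # Kaggle column, with a missing value coded as 0
    "running_time": [95.0, 118.0],  # Wikipedia column
})

def fill_missing_kaggle_data(df, kaggle_column, wiki_column):
    """Fill zero values in the Kaggle column from the Wiki column, then drop the Wiki column."""
    df[kaggle_column] = df.apply(
        lambda row: row[wiki_column] if row[kaggle_column] == 0 else row[kaggle_column],
        axis=1,
    )
    df.drop(columns=wiki_column, inplace=True)

fill_missing_kaggle_data(movies_df, "runtime", "running_time")
```

The same helper would be applied to the budget and box office column pairs in the table above.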