This project was designed to practice the Extract, Transform, and Load (ETL) process on three data sources to generate an SQL database of movies. The movie information and links came from Wikipedia articles (in JSON format), and the review information came from Kaggle (in CSV format).
Once the data was imported into Pandas DataFrames, it was cleaned to remove extraneous columns, standardize inconsistent formats, and convert data types where needed.
The data was then loaded into an SQL database using SQLAlchemy.
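The load step can be sketched with SQLAlchemy and pandas' `to_sql`. This is a minimal sketch, not the project's exact code: the connection string, table name, and sample columns here are placeholders (an in-memory SQLite database stands in for the real one).

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; swap in the real database URL.
engine = create_engine("sqlite://")

# Stand-in for the cleaned movies DataFrame.
movies_df = pd.DataFrame({"imdb_id": ["tt0094675"], "title": ["Ariel"]})

# Write the DataFrame to a SQL table, replacing any existing table.
movies_df.to_sql("movies", engine, if_exists="replace", index=False)

# Read it back to confirm the load.
loaded = pd.read_sql("SELECT * FROM movies", engine)
```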
A Python function was created to take in the three input files and perform the ETL process. This file is saved as challenge.py in the GitHub repository. In doing this, we were asked to document our assumptions for the transformations we performed to clean up the data.
Note: View the file titled Movies_Challenge-TEST in the repository to see the Jupyter Notebook file housing the code.
We should filter to movies that contain both an IMDb link and a listed director. This removes data points that lack the IMDb information we need to link our multiple data sources.
We should remove listings with a "No. of Episodes" field. We assume that these entries refer to television shows. While few data points had this issue, it allowed us to remove data that is clearly not related to a movie.
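These two filters can be sketched with a list comprehension over the raw records. The sample records and the exact key names ("imdb_link", "Directed by", "No. of episodes") are assumptions about the Wikipedia JSON, not guaranteed field names:

```python
# Hypothetical sample of the raw Wikipedia records (real data has many more keys).
wiki_movies = [
    {"title": "Movie A", "imdb_link": "https://www.imdb.com/title/tt0000001/",
     "Directed by": "Jane Doe"},
    {"title": "Show B", "imdb_link": "https://www.imdb.com/title/tt0000002/",
     "Directed by": "John Roe", "No. of episodes": 10},
    {"title": "Movie C", "Directed by": "Jane Doe"},  # no IMDb link
]

# Keep entries that have an IMDb link and a director, and drop entries with an
# episode count (which we assume indicates a television show).
movies_only = [
    m for m in wiki_movies
    if "imdb_link" in m
    and ("Directed by" in m or "Director" in m)
    and "No. of episodes" not in m
]
```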
Alternate titles can be combined into one list. This reduces the number of columns without losing information related to the movie.
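One way to collapse the alternate-title columns is to pop each one into a single field on the record. The key names below ("Also known as", "Japanese", "Hepburn", ...) are illustrative guesses at Wikipedia infobox labels, not a definitive list:

```python
# Hypothetical Wikipedia record with several alternate-title keys.
movie = {
    "title": "Movie A",
    "Japanese": "タイトル",
    "Hepburn": "Taitoru",
    "Also known as": "Movie Alpha",
}

# Assumed set of alternate-title labels to consolidate.
alt_title_keys = ["Also known as", "Japanese", "Hepburn", "Literally", "Mandarin"]

# Pop each alternate-title column into a single dict stored under one key.
alt_titles = {}
for key in alt_title_keys:
    if key in movie:
        alt_titles[key] = movie.pop(key)
if alt_titles:
    movie["alt_titles"] = alt_titles
```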
There are likely duplicate entries for some of the movies. Given the size of our source data, it is not unreasonable to assume that some movies are duplicated. Removing these duplicates pares down the size of our dataset while improving the quality of the information we work with.
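De-duplication can be sketched with pandas' `drop_duplicates`, keyed on the IMDb ID (assuming an `imdb_id` column has been extracted; the sample frame is hypothetical):

```python
import pandas as pd

# Hypothetical frame with a repeated IMDb ID.
movies_df = pd.DataFrame({
    "imdb_id": ["tt0000001", "tt0000002", "tt0000001"],
    "title": ["Movie A", "Movie B", "Movie A"],
})

# Drop rows that share an IMDb ID, keeping the first occurrence.
movies_df = movies_df.drop_duplicates(subset="imdb_id").reset_index(drop=True)
```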
There are likely many columns with null data. Since some entries may have null values for many of the columns due to the quality of data uploaded by contributors, we determined that we would check for columns that were mostly (>90%) empty and remove them from our dataset.
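The 90% cutoff can be sketched as a column filter on null counts. The sample columns are made up; only the thresholding logic is the point:

```python
import pandas as pd

# Hypothetical frame: "obscure_field" is entirely null and should be dropped.
movies_df = pd.DataFrame({
    "title": list("ABCDEFGHIJ"),
    "budget": [1, 2, None, 4, 5, 6, 7, 8, 9, 10],
    "obscure_field": [None] * 10,
})

# Keep only columns where fewer than 90% of the values are null.
keep_cols = [c for c in movies_df.columns
             if movies_df[c].isnull().sum() < len(movies_df) * 0.9]
movies_df = movies_df[keep_cols]
```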
Some columns may have data represented in different formats. We used regular expressions to clean up these inconsistencies. Some examples are listed below:
6a) There are two main forms of numeric formats for box office and budget values. Ex: $123.4 Million vs. $123,400,000. However, within these two formats, there are many potential variations.
6b) There are four main forms of dates:
- Full month name, one- to two-digit day, four-digit year (e.g., January 1, 2000)
- Four-digit year, two-digit month, two-digit day, with any separator (e.g., 2000-01-01)
- Full month name, four-digit year (e.g., January 2000)
- Four-digit year (e.g., 2000)
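The patterns for 6a and 6b can be sketched as below. These regexes and the `parse_dollars` helper are illustrative reconstructions, not the project's exact code; the `milli?on`/`billi?on` alternations are an assumption to tolerate common misspellings:

```python
import re

# 6a) Two dominant money formats.
form_one = r"\$\s*\d+\.?\d*\s*[mb]illi?on"  # e.g. "$123.4 million"
form_two = r"\$\s*\d{1,3}(?:[,\.]\d{3})+"   # e.g. "$123,400,000"

def parse_dollars(s):
    """Convert a matched dollar string to a float, or None if unrecognized."""
    if re.match(r"\$\s*\d+\.?\d*\s*milli?on", s, flags=re.IGNORECASE):
        return float(re.sub(r"\$|\s|[a-zA-Z]", "", s)) * 1e6
    if re.match(r"\$\s*\d+\.?\d*\s*billi?on", s, flags=re.IGNORECASE):
        return float(re.sub(r"\$|\s|[a-zA-Z]", "", s)) * 1e9
    if re.match(form_two, s):
        return float(re.sub(r"\$|,", "", s))
    return None

# 6b) The four date forms.
date_form_one = r"[A-Z][a-z]+\s\d{1,2},\s\d{4}"  # January 1, 2000
date_form_two = r"\d{4}.[0-1]\d.[0-3]\d"         # 2000-01-01 (any separator)
date_form_three = r"[A-Z][a-z]+\s\d{4}"          # January 2000
date_form_four = r"\d{4}"                        # 2000
```

In the actual cleaning code, patterns like these would be applied with pandas string methods (e.g. `str.extract`) across the box office, budget, and release date columns.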
There are some redundant data columns between our datasets. We will decide to keep some while dropping others, as follows:
| Wiki Column | Kaggle Column | Decision |
|---|---|---|
| title_wiki | title_kaggle | Drop Wiki - Kaggle data is better, with no missing values |
| running_time | runtime | Keep Kaggle data and fill in zeros with Wiki data |
| budget_wiki | budget_kaggle | Keep Kaggle data and fill in zeros with Wiki data |
| box_office | revenue | Keep Kaggle data and fill in zeros with Wiki data |
| release_date_wiki | release_date_kaggle | Drop Wiki - Kaggle data is better, with no missing values |
| language | original_language | Drop Wiki data |
| Production company(s) | production_companies | Drop Wiki data - Kaggle is more consistent |
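The "keep Kaggle and fill in zeros with Wiki data" rule can be sketched with a small helper. The function name and sample columns are illustrative; the real merged frame has many more columns:

```python
import pandas as pd

# Hypothetical merged frame with redundant columns from each source.
movies_df = pd.DataFrame({
    "runtime": [0, 120.0],          # Kaggle column, with a missing value coded as 0
    "running_time": [95.0, 118.0],  # Wikipedia column
})

def fill_missing_kaggle_data(df, kaggle_column, wiki_column):
    """Fill zero values in the Kaggle column from the Wiki column, then drop the Wiki column."""
    df[kaggle_column] = df.apply(
        lambda row: row[wiki_column] if row[kaggle_column] == 0 else row[kaggle_column],
        axis=1,
    )
    df.drop(columns=wiki_column, inplace=True)

fill_missing_kaggle_data(movies_df, "runtime", "running_time")
```

The same helper would be applied to the budget and box office column pairs in the table above.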