Analyzing

CAMERA REVIEWS

Big Data, PySpark, AWS, ETL


GitHub Repository


Objective


This project examines a large dataset of Amazon.com camera reviews to determine whether reviews written as part of a paid campaign are biased or trustworthy. The data was loaded into an AWS RDS instance, and statistical analysis was run on it using PySpark.

The ETL process ran in the cloud using Spark. First, tables were created in an RDS database. The data was then extracted from an S3 bucket and loaded into a DataFrame, where it was transformed to fit the desired schema. Finally, the transformed DataFrames were loaded into the corresponding tables on the RDS instance.
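The extract, transform, and load steps described above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the S3 path, the column names (`review_id`, `customer_id`, `product_id`, `star_rating`, `paid`), the table name `review_table`, and the choice of PostgreSQL as the RDS engine are all assumptions made for the example. Running it also assumes the Spark session has the S3 connector and a PostgreSQL JDBC driver available.

```python
from typing import Dict, Tuple


def jdbc_settings(host: str, port: int, database: str,
                  user: str, password: str) -> Tuple[str, Dict[str, str]]:
    """Build the JDBC URL and connection properties for the RDS instance.

    Assumes a PostgreSQL engine; the source does not say which engine was used.
    """
    url = f"jdbc:postgresql://{host}:{port}/{database}"
    props = {"user": user, "password": password,
             "driver": "org.postgresql.Driver"}
    return url, props


def run_etl(source_path: str, jdbc_url: str, props: Dict[str, str]) -> None:
    """Extract a TSV file, reshape it, and append it to an existing table."""
    # Imported here so the helper above can be used without Spark installed.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col

    spark = SparkSession.builder.appName("CameraReviewsETL").getOrCreate()

    # Extract: read the raw tab-separated file into a DataFrame.
    raw = spark.read.csv(source_path, sep="\t", header=True, inferSchema=True)

    # Transform: keep only the columns of the target table and fix the types.
    # All column names here are hypothetical.
    reviews = raw.select(
        "review_id",
        "customer_id",
        "product_id",
        col("star_rating").cast("int").alias("star_rating"),
        "paid",
    ).dropna(subset=["review_id", "star_rating"])

    # Load: append the rows to a table that was created beforehand in RDS.
    reviews.write.jdbc(url=jdbc_url, table="review_table",
                       mode="append", properties=props)


if __name__ == "__main__":
    # Placeholder values only; replace with real connection details.
    url, props = jdbc_settings("example-host", 5432, "example_db",
                               "example_user", "example_password")
    run_etl("s3a://example-bucket/camera_reviews.tsv.gz", url, props)
```

Using `mode="append"` matches the described workflow, in which the tables already exist before the DataFrames are written.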

Once the ETL process was completed, statistical analysis was run using PySpark to determine whether the paid reviews were unbiased.

Follow along with the notebook below to see the analysis!
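One plausible way to run such a comparison is to group the reviews by paid status and compare rating summaries between the two groups. The sketch below is an assumption about the approach, not the notebook's actual analysis: the column names `paid` and `star_rating` are hypothetical, and the share of five-star reviews is just one possible indicator of bias.

```python
def five_star_share(five_star: int, total: int) -> float:
    """Percentage of five-star reviews; returns 0.0 for an empty group."""
    return 100.0 * five_star / total if total else 0.0


def summarize_by_paid_status(reviews_df):
    """Return one row per paid status with counts and rating summaries.

    Expects a PySpark DataFrame with hypothetical columns
    `paid` and `star_rating`.
    """
    # Imported here so the plain-Python helper works without Spark installed.
    from pyspark.sql import functions as F

    five_star_flag = F.when(F.col("star_rating") == 5, 1).otherwise(0)

    return (
        reviews_df.groupBy("paid")
        .agg(
            F.count("*").alias("total_reviews"),
            F.sum(five_star_flag).alias("five_star_reviews"),
            F.avg("star_rating").alias("mean_rating"),
        )
        .withColumn(
            "five_star_pct",
            F.round(100 * F.col("five_star_reviews") / F.col("total_reviews"), 2),
        )
    )
```

If the five-star percentage and mean rating are similar for paid and unpaid reviews, that would be consistent with the paid reviews being unbiased; a formal significance test would be needed to say more. The `five_star_share` helper does the same arithmetic in plain Python and can be used to spot-check individual rows of the result.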



Notebook




Contact Me