Context:

This project is a practical exercise completed after a data analysis course, intended to demonstrate and refine the essential skills of the field.

The project encompasses key aspects of the data analysis process, including data preparation, data cleaning, analysis, and visualisation.

Project overview and objectives:

This project analyses data from a movie database covering a period of thirty years. The primary goal is to clean the dataset in order to extract meaningful insights, in particular possible correlations between variables.

Software used:


  • Pandas
  • NumPy
  • Seaborn
  • Matplotlib
  • Jupyter Notebook

Data sources:

Link to source

Tasks performed:

  • Clarifying questions:
    1. What are the best tools and methods for cleaning the dataset?
    2. What possible biases should be avoided?
    3. Which factors should be included in the analysis, and which should be excluded, to ensure the accuracy and relevance of the findings?

Answering these clarifying questions helps focus the project and tailor the analysis to the most valuable and relevant insights.

  • Cleaning data:

Checking for missing data, handling null entries, correcting data types, and removing duplicates.
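
The cleaning steps listed above could look roughly like the following in pandas. This is only a minimal sketch: the sample rows and the column names (`name`, `budget`, `gross`, `votes`) are assumptions made for illustration, not the project's actual data or notebook code.

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the movie dataset.
# Column names and values are assumptions, not the real data.
raw = pd.DataFrame(
    {
        "name": ["Film A", "Film B", "Film B", "Film C", "Film D"],
        "budget": [1000000.0, 2500000.0, 2500000.0, np.nan, 400000.0],
        "gross": [3000000.0, 9000000.0, 9000000.0, 1200000.0, np.nan],
        "votes": [12000.0, 45000.0, 45000.0, 8000.0, 1500.0],
    }
)


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four cleaning steps: missing data, nulls, dtypes, duplicates."""
    # 1. Check for missing data: share of null values per column.
    print(df.isna().mean())

    # 2. Deal with null entries: here we simply drop incomplete rows.
    #    (Imputing values would be another option.)
    cleaned = df.dropna(subset=["budget", "gross", "votes"])

    # 3. Fix data types: these columns hold whole numbers stored as floats.
    cleaned = cleaned.astype({"budget": "int64", "gross": "int64", "votes": "int64"})

    # 4. Remove exact duplicate rows and rebuild a clean index.
    return cleaned.drop_duplicates().reset_index(drop=True)


movies = clean(raw)
print(movies.dtypes)
```

Dropping incomplete rows is the simplest choice, but it can introduce bias if the missing values are not random, which ties back to the clarifying questions above.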

  • Looking at correlations:
    1. Trying several visualisations, including correlation tables, a correlation matrix, and a heatmap
    2. Observations and conclusions
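
As an illustration of the correlation step, the sketch below builds a Pearson correlation matrix with pandas and defines a helper to draw it as a Seaborn heatmap. The column names, the made-up numbers, and the helper name `plot_heatmap` are assumptions for the example only. They are not results from the project.

```python
import pandas as pd

# Hypothetical cleaned sample; values are made up for illustration
# and do not come from the project's dataset.
movies = pd.DataFrame(
    {
        "budget": [10, 20, 30, 40, 50],
        "gross": [25, 45, 70, 85, 110],
        "votes": [5, 9, 16, 18, 24],
        "runtime": [120, 95, 130, 90, 110],
    }
)

# Pairwise Pearson correlation between all numeric columns.
corr = movies.select_dtypes(include="number").corr(method="pearson")
print(corr.round(2))


def plot_heatmap(corr_matrix: pd.DataFrame) -> None:
    """Draw the correlation matrix as an annotated heatmap (hypothetical helper)."""
    # Imported here so the numeric part above runs without plotting libraries.
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Fix the colour scale to [-1, 1] so colours are comparable across plots.
    sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.title("Correlation matrix of numeric features")
    plt.tight_layout()
    plt.show()
```

Calling `plot_heatmap(corr)` in a notebook cell would display the heatmap. A single pair can also be inspected with a scatter plot, for example `sns.regplot(x="budget", y="gross", data=movies)`.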

Conclusion:

Using Python's libraries, it was possible to identify some possible correlations within the movie dataset. We found that votes and gross were correlated, as were budget and gross. Conversely, we found that other factors were not correlated.

Link to the project file