Course Title: AGEC211: Statistical Methods
Instructor: Christopher Llones
Assignment: Netflix Dataset Analysis in R
Due Date: 9 October 2025

Objective

This assignment will assess your ability to apply R programming skills—specifically using the dplyr package and the pipe operator (%>%)—to explore and analyze a real-world dataset. You will work with the Netflix Movies & TV Shows dataset to answer questions using code.

Instructions

  • Use R and the dplyr package to answer each question.

  • Submit your R script file (.R) with your code and outputs.

  • Use the pipe operator (%>%) for all data manipulations.

  • You may use additional packages like tidyr or stringr if needed.

  • Ensure your code is clean, commented, and reproducible.
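
As a rough illustration of the piped, commented style the instructions above ask for, here is a minimal sketch on a small made-up table. The object name toy and its values are hypothetical stand-ins, not the assignment dataset; only the type and release_year column names are borrowed from the questions.

```r
library(dplyr)

# Hypothetical toy data standing in for the real dataset.
toy <- tibble(
  title        = c("A", "B", "C", "D"),
  type         = c("Movie", "TV Show", "Movie", "Movie"),
  release_year = c(2019, 2020, 2020, 2021)
)

# Each step passes its result to the next through %>%.
movies_2020 <- toy %>%
  filter(type == "Movie", release_year == 2020) %>%  # keep matching rows
  select(title, release_year)                        # keep two columns

movies_2020
```

Reading the chain from top to bottom gives the order in which the operations are applied, which is why one verb per line is usually easier to review than nested function calls.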

Dataset and files
  1. Access the dataset and R script template from the agec211-assignment1 folder.

  2. Upload your completed R script file (.R) by the due date using this link: Submission Link.

Questions

Part 1: Data exploration

  1. How many rows and columns are in the dataset?

  2. List all unique types of content (e.g., Movie, TV Show).

  3. How many titles were released in 2020?
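
The kind of exploration asked for in Part 1 can be sketched on a made-up table as follows. The toy object and its values are assumptions for illustration only, so the numbers will not match the real dataset.

```r
library(dplyr)

# Hypothetical toy data; not the assignment dataset.
toy <- tibble(
  title        = c("A", "B", "C"),
  type         = c("Movie", "TV Show", "Movie"),
  release_year = c(2020, 2020, 2021)
)

# dim() returns the number of rows, then the number of columns.
toy_dim <- dim(toy)

# distinct() keeps one row per unique value of the chosen column.
content_types <- toy %>%
  distinct(type)

# Counting rows that satisfy a condition: filter first, then nrow().
n_2021 <- toy %>%
  filter(release_year == 2021) %>%
  nrow()
```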

Part 2: Filtering and summarising

  1. Filter the dataset to show only TV Shows released in India. How many are there?

  2. Find the top 5 most common ratings.

  3. Which year had the most titles added to Netflix?

Part 3: Grouping and aggregation

  1. Group the data by type and count how many entries each type has.

  2. Group the data by release_year and summarize the number of titles released per year.

  3. Which country has produced the most content on Netflix?
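
Grouping and counting, as in the questions above, might look roughly like this on a made-up table. The toy object, its values, and the n_titles column name are hypothetical; the real data may need extra cleaning first (for example, missing values).

```r
library(dplyr)

# Hypothetical toy data; not the assignment dataset.
toy <- tibble(
  type         = c("Movie", "TV Show", "Movie", "Movie", "TV Show"),
  release_year = c(2019, 2020, 2020, 2021, 2021)
)

# Explicit form: group, count the rows per group, then sort.
by_type <- toy %>%
  group_by(type) %>%
  summarise(n_titles = n()) %>%
  arrange(desc(n_titles))

# count() is a shorthand for the same group_by() + summarise(n()) pattern.
by_type_short <- toy %>%
  count(type, sort = TRUE, name = "n_titles")
```

Either form is acceptable dplyr; the explicit version makes it easier to add further summaries (such as a mean) in the same summarise() call.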

Part 4: Advanced filtering

  1. Filter the dataset to show all Movies with a duration longer than 100 minutes.

  2. Find all titles directed by ‘Steven Spielberg’.

  3. List all titles whose genre contains ‘Documentary’.
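
Filtering on text columns usually needs pattern matching rather than ==. The sketch below assumes, purely for illustration, that duration is stored as text such as "95 min" and that a genre column holds comma-separated labels. The toy object, the genre and minutes column names, and all values are hypothetical, so check how the real columns are named and stored before adapting it.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data; column contents are assumptions.
toy <- tibble(
  title    = c("A", "B", "C"),
  duration = c("95 min", "120 min", "2 Seasons"),
  genre    = c("Dramas, Documentaries", "Comedies", "Docuseries")
)

# Keep rows measured in minutes, pull out the number, then compare.
long_titles <- toy %>%
  filter(str_detect(duration, "min")) %>%
  mutate(minutes = as.numeric(str_extract(duration, "\\d+"))) %>%
  filter(minutes > 100)

# str_detect() matches a pattern anywhere in the string.
# The stem "Documentar" matches both "Documentary" and "Documentaries".
docs <- toy %>%
  filter(str_detect(genre, "Documentar"))
```

Note that searching for the exact word "Documentary" would miss a label spelled "Documentaries", so inspect the distinct values of the column before choosing a pattern.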

Bonus Challenge

  1. Create a new column that extracts the number of seasons for TV Shows. Then, find the average number of seasons.

  2. Which actor appears most frequently across all titles?
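
When one cell lists several names, a common approach is to split it into one row per name before counting. The sketch below assumes a cast column with comma-separated names; the toy object, the cast column name, and the names themselves are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; the cast column and names are assumptions.
toy <- tibble(
  title = c("A", "B", "C"),
  cast  = c("Ann Lee, Bo Cruz", "Bo Cruz", NA)
)

cast_counts <- toy %>%
  filter(!is.na(cast)) %>%                    # drop titles with no cast listed
  separate_rows(cast, sep = ",\\s*") %>%      # one row per name
  count(cast, sort = TRUE)                    # most frequent name first
```

Dropping missing values before splitting keeps NA from being counted as if it were a name.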

Grading rubric

Each criterion is scored as Excellent (5 pts), Good (4 pts), Fair (2-3 pts), or Needs improvement (0-1 pt).

Code accuracy
  • Excellent: All answers are correct and match expected outputs.
  • Good: Most answers are correct with minor errors.
  • Fair: Several answers are incorrect or incomplete.
  • Needs improvement: Many answers are missing or incorrect.

Use of dplyr functions
  • Excellent: Consistently uses appropriate dplyr verbs (filter, mutate, summarise, etc.).
  • Good: Uses dplyr functions correctly in most cases.
  • Fair: Uses some dplyr functions but inconsistently or incorrectly.
  • Needs improvement: Rarely uses dplyr or misuses functions.

Pipe operator usage (%>%)
  • Excellent: Pipe operator is used fluently and correctly throughout.
  • Good: Mostly correct usage with occasional syntax issues.
  • Fair: Used sporadically or with frequent errors.
  • Needs improvement: Not used or used incorrectly.

Data manipulation and filtering
  • Excellent: Demonstrates strong understanding of filtering, grouping, and summarizing.
  • Good: Shows good grasp with minor gaps.
  • Fair: Basic filtering and grouping attempted but lacks depth.
  • Needs improvement: Little to no meaningful data manipulation.

Insight and interpretation
  • Excellent: Provides thoughtful insights or observations where applicable.
  • Good: Some interpretation is present.
  • Fair: Minimal interpretation or unclear reasoning.
  • Needs improvement: No interpretation or irrelevant commentary.

Bonus Challenge (Q13–Q14)
  • Excellent: Completed with correct logic and creative approach.
  • Good: Attempted with mostly correct logic.
  • Fair: Attempted but contains errors or lacks clarity.
  • Needs improvement: Not attempted or incorrect.

Reproducibility
  • Excellent: Code runs without errors and produces expected results.
  • Good: Minor issues but generally reproducible.
  • Fair: Some errors prevent full reproducibility.
  • Needs improvement: Code fails to run or produces major errors.