Part 1 Data cleaning

1.1 Importing the data

First, let us load the data. A quick way to do is simply click the raw data folder from the files section at the bottom right and then import. Another way is by code shows as below:

library(readr)
anime <- read_csv("raw data/anime.csv")
#Quick check to see if the data have been successfully and correctly loaded. 
head(anime, row = 5)

## # A tibble: 6 x 18
##   title mediaType   eps duration ongoing startYr finishYr sznOfRelease
##   <chr> <chr>     <dbl>    <dbl> <lgl>     <dbl>    <dbl> <chr>       
## 1 Full… TV           64       NA FALSE      2009     2010 Spring      
## 2 your… Movie         1      107 FALSE      2016     2016 <NA>        
## 3 A Si… Movie         1      130 FALSE      2016     2016 <NA>        
## 4 Haik… TV           10       NA FALSE      2016     2016 Fall        
## 5 Atta… TV           10       NA FALSE      2019     2019 Spring      
## 6 Demo… TV           26       NA FALSE      2019     2019 Spring      
## # … with 10 more variables: description <chr>, studios <chr>, tags <chr>,
## #   contentWarn <chr>, watched <dbl>, watching <dbl>, wantWatch <dbl>,
## #   dropped <dbl>, rating <dbl>, votes <dbl>

For the meaning of each variable, please look at my codebook.txt

1.2 Cleaning the data

library(tidyverse)
library(ggplot2)

#change all [] from the raw data to NA
is.na(anime) <- anime == "[]"

#set new data frame to keep the original without contaminated
anime1 <- anime

#To find out the staring and ending year of the data
summary(anime$startYr)

#removing special character
gsub("'", "", anime1$studios)
gsub("\\[", "", anime1$studios)
gsub("\\]", "", anime1$studios)

#Forming a table of the studios and the amount of their production
studio <- aggregate(anime$studios, by = list(anime$studios), FUN = length)
#Rename the column
colnames(studio) <- c("studios", "production")

#Set all Na to 0, becasue the missing data is unpredictable
anime1$rating[is.na(anime1$rating)] <- 0
anime1$studios[is.na(anime1$studios)] <- 0

#Create overall average rating of each studio
studiorate <- aggregate(anime1[, 17], list(anime1$studios), mean)
colnames(studiorate) <- c("studios", "rating")

#merge two table 
studioall <- merge(studio,studiorate, by = "studios", all = TRUE)

#Only select the top 50 studios based on the amount of total production
library(data.table)
#need this package to use piping
library (dplyr)
studiotop50 <- data.table(studioall, key = "production")
studiotop50 <- studiotop50 %>% 
#arrange the table to the decreasing trend
  arrange(desc(production)) %>% 
  slice (1:50)

After achieving a table only contains top 50 studios with their total numbers of production, plus the overall average rating from 1958 to 2020, a graph could be plotted to visualise which is the most prolific studio and its rating.