Part 1 Data cleaning
1.1 Importing the data
First, let us load the data. A quick way to do this is to click the raw data folder in the Files section at the bottom right and import the file from there. Another way is with code, as shown below:
library(readr)
anime <- read_csv("raw data/anime.csv")
#Quick check to see if the data have been successfully and correctly loaded.
#head() takes the number of rows through n (there is no row argument); 6 is the default.
head(anime, n = 6)
## # A tibble: 6 x 18
## title mediaType eps duration ongoing startYr finishYr sznOfRelease
## <chr> <chr> <dbl> <dbl> <lgl> <dbl> <dbl> <chr>
## 1 Full… TV 64 NA FALSE 2009 2010 Spring
## 2 your… Movie 1 107 FALSE 2016 2016 <NA>
## 3 A Si… Movie 1 130 FALSE 2016 2016 <NA>
## 4 Haik… TV 10 NA FALSE 2016 2016 Fall
## 5 Atta… TV 10 NA FALSE 2019 2019 Spring
## 6 Demo… TV 26 NA FALSE 2019 2019 Spring
## # … with 10 more variables: description <chr>, studios <chr>, tags <chr>,
## # contentWarn <chr>, watched <dbl>, watching <dbl>, wantWatch <dbl>,
## # dropped <dbl>, rating <dbl>, votes <dbl>
For the meaning of each variable, please see my codebook.txt.
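Some fields in this data set hold the placeholder "[]" instead of a real value, and section 1.2 converts these to NA after loading. As a hedged alternative, the placeholder can be declared as missing at import time. The sketch below uses base R's read.csv() with its na.strings argument on a small made-up CSV (the titles, studios and ratings are invented for illustration and are not taken from anime.csv); read_csv() from readr accepts a comparable na argument.

```r
# Toy CSV standing in for "raw data/anime.csv"; every value here is made up.
csv_text <- "title,studios,rating
Show A,['Studio X'],4.5
Show B,[],3.9
Show C,['Studio Y'],"

# Treat empty strings, "NA" and the "[]" placeholder as missing values on import.
toy <- read.csv(
  text = csv_text,
  na.strings = c("", "NA", "[]"),
  stringsAsFactors = FALSE
)

# Count the missing values per column to confirm the placeholder was caught.
print(colSums(is.na(toy)))
```

Handling the placeholder at import keeps the later cleaning step shorter, but converting it afterwards, as done in section 1.2, gives the same NA values.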
1.2 Cleaning the data
library(tidyverse)
library(ggplot2)
#Change all [] from the raw data to NA
is.na(anime) <- anime == "[]"
#Work on a copy so that the original data frame is not contaminated
anime1 <- anime
#To find out the starting and ending year of the data
summary(anime$startYr)
#Remove the special characters [, ] and ' (the result must be assigned back to take effect)
anime1$studios <- gsub("\\[|\\]|'", "", anime1$studios)
#Form a table of the studios and the amount of their production
#(uses the cleaned names in anime1 so that they match the rating table below)
studio <- aggregate(anime1$studios, by = list(anime1$studios), FUN = length)
#Rename the columns
colnames(studio) <- c("studios", "production")
#Set all NA to 0, because the missing data is unpredictable
anime1$rating[is.na(anime1$rating)] <- 0
anime1$studios[is.na(anime1$studios)] <- 0
#Create the overall average rating of each studio (column 17 is rating)
studiorate <- aggregate(anime1[, 17], list(anime1$studios), mean)
colnames(studiorate) <- c("studios", "rating")
#Merge the two tables
studioall <- merge(studio, studiorate, by = "studios", all = TRUE)
#Only select the top 50 studios based on the amount of total production
library(data.table)
#need this package to use piping
library(dplyr)
studiotop50 <- data.table(studioall, key = "production")
#Arrange the table in decreasing order of production and keep the first 50 rows
studiotop50 <- studiotop50 %>%
  arrange(desc(production)) %>%
  slice(1:50)
With a table containing only the top 50 studios, their total number of productions, and their overall average rating from 1958 to 2020, a graph can be plotted to visualise which studio is the most prolific and how it is rated.
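To make the cleaning steps easier to check by hand, here is a minimal base R sketch of the same sequence (mark "[]" as NA, strip the special characters, count titles per studio, average the ratings, merge, keep the top studios) on a made-up data frame. The studio names, ratings and the cut-off of 2 are assumptions chosen for illustration and do not come from the anime data.

```r
# Toy data standing in for the anime table; every value here is made up.
toy <- data.frame(
  studios = c("['Studio X']", "['Studio X']", "['Studio X']",
              "['Studio Y']", "['Studio Y']", "['Studio Z']", "[]"),
  rating  = c(4.0, 3.0, 5.0, 4.5, NA, 2.0, 3.5),
  stringsAsFactors = FALSE
)

# Mark the empty placeholder as missing, then strip brackets and quotes.
toy$studios[toy$studios == "[]"] <- NA
toy$studios <- gsub("\\[|\\]|'", "", toy$studios)

# Number of titles per studio; aggregate() omits rows whose group is NA.
production <- aggregate(toy$studios, by = list(toy$studios), FUN = length)
colnames(production) <- c("studios", "production")

# Follow the section's choice of replacing missing ratings with 0 before averaging.
toy$rating[is.na(toy$rating)] <- 0
avg <- aggregate(toy$rating, by = list(toy$studios), FUN = mean)
colnames(avg) <- c("studios", "rating")

# Combine both tables and keep the 2 most prolific studios.
both <- merge(production, avg, by = "studios")
top <- head(both[order(-both$production), ], 2)
print(top)
```

In this toy example Studio Y's unrated title is counted as 0, which pulls its average down to 2.25. Passing na.rm = TRUE to mean() instead of replacing NA with 0 would ignore unrated titles; which of the two is appropriate depends on the analysis.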