R For Data Analysis: How to Find the Perfect Cocomelon Video for Your Kids

Cocomelon - Nursery Rhymes is the world’s second-largest YouTube channel (155M+ subscribers). It is so popular that it is hard for toddlers and parents to avoid, and I enjoy watching Cocomelon together with my son.

After watching Cocomelon videos for a month, I noticed that YouTube keeps recommending the same ones. Videos like “Wheels on the Bus” and “Bath Song” are popular and fun to watch, but they were published years ago, and kids get bored watching them over and over. As a father, I want to show my son more recent, good-quality videos from the Cocomelon channel. As a data professional, I also want to explore the data behind the world’s second-largest YouTube channel and see what insights it holds.

A YouTube channel’s video list gives users only two sort options: recently uploaded (ordered by time) and popular (ordered by views). I could go to the recently uploaded tab and click through the videos one after another. However, the Cocomelon channel has 800+ videos, so that would be time-consuming.

The good thing is that I am an engineer and know how to build something with data. So I started writing code to gather the data, clean it up, visualize it, and draw insights from it. In this post I share my journey of using R for data analysis: building an end-to-end solution, from scratch, for exploring trending Cocomelon videos.

Note: the example code is written in R and the channel is Cocomelon, but both are just my preferences. You could also use Python or Rust with their data analysis tools, and the way I get data from YouTube applies to other channels as well.

How To Get YouTube Data Using R

The data source is always the starting point of any data project. It took me several attempts to arrive at my final solution.

I first searched Google for “Youtube views stats for Cocomelon”. The results show some statistics about the channel, but none cover detailed data for each video. Those sites are also flooded with ads, so scraping them might be challenging.

Then I looked at the public datasets on Kaggle, where CC0 datasets like Trending YouTube Video Statistics could be a good option. However, after exploring that dataset, I found two issues:

It doesn’t contain Cocomelon.
The content was retrieved years ago, so it lacks the newer videos I wanted to find.

My only remaining option was to pull the most up-to-date data directly from YouTube. There are two ways to do that:

Web scraping: I could set up a crawler or reuse an existing project from GitHub. My concern is that if the crawler is too aggressive, YouTube might block my account. Crawling also isn’t very efficient when there are many videos to pull.
YouTube API: I finally landed on this solution. It is efficient and provides basic statistics for each video, such as the number of views and likes. We can use this information to build our data analysis project.

Loading YouTube Data into an R Data Frame

Get YouTube API Key To Pull Data

A YouTube API key grants you permission to pull data from YouTube. First go to https://console.cloud.google.com/apis, then choose “create credentials” and select the API key option. The default key isn’t restricted; you can restrict it so that it works only with the YouTube API.
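
A minimal sketch of keeping the key out of the script, assuming it is stored in an environment variable. The variable name YOUTUBE_API_KEY and the helper get_api_key() are made up for this example and are not part of any YouTube or Google tooling.

```r
# Read the API key from an environment variable instead of hard-coding it.
# YOUTUBE_API_KEY is a hypothetical variable name chosen for this sketch.
get_api_key <- function(var = "YOUTUBE_API_KEY") {
  key <- Sys.getenv(var, unset = "")
  if (!nzchar(key)) {
    stop("Set the ", var, " environment variable first")
  }
  key
}
```

With the variable set (for example in ~/.Renviron, which R reads at startup), `key <- get_api_key()` could stand in for a hard-coded key string.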

Get YouTube Channel Playlist in R

Once you have the API key, refer to the YouTube Data API documentation for the data it supports. To try out the API interactively, we can use a tool like Postman or paste the full URL directly into a browser.

For example, say we’d like to pull the channel information for Cocomelon. I couldn’t find the channel ID by inspecting the channel’s URL, but I found it through a Google search:

https://www.youtube.com/channel/UCbCmjCuTUZos6Inko4u57UQ

Now we can use the channel ID to construct the GET request, filling in the API key in the key field:

https://www.googleapis.com/youtube/v3/channels?part=snippet,contentDetails,statistics&id=UCbCmjCuTUZos6Inko4u57UQ&key=

From the returned JSON, the most crucial piece of information is the uploads playlist ID, which leads us to all the videos.

"contentDetails": {
  "relatedPlaylists": {
    "likes": "",
    "uploads": "UUbCmjCuTUZos6Inko4u57UQ"
  }
}
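
As a small offline sketch, the uploads playlist ID can be pulled out of the response with jsonlite. The sample string below is a trimmed-down stand-in for the real response: the contentDetails fragment is the one shown above, wrapped in the items array that the channels endpoint returns.

```r
library(jsonlite)

# A trimmed-down sample of the channels response.
sample_json <- '{
  "items": [
    {
      "contentDetails": {
        "relatedPlaylists": {
          "likes": "",
          "uploads": "UUbCmjCuTUZos6Inko4u57UQ"
        }
      }
    }
  ]
}'

# simplifyVector = FALSE keeps the nested lists as-is, so the path
# mirrors the JSON structure exactly.
channel <- fromJSON(sample_json, simplifyVector = FALSE)
uploads_id <- channel$items[[1]]$contentDetails$relatedPlaylists$uploads
print(uploads_id)
```

The same path should work on the parsed result of the live GET call, as long as the request includes part=contentDetails.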

Because playlistItems results are paginated, with at most 50 items per page, it takes several calls to reach the end of the list. We need to pass the current nextPageToken to retrieve the next page, and repeat until no next page token is returned. We can put everything together in R:

library(shiny)
library(vroom)
library(dplyr)
library(tidyverse)
library(httr)
library(jsonlite)
library(ggplot2)
library(ggthemes)
library(stringr)

key <- "to_be_replace"
playlist_url <-
  paste0(
    "https://www.googleapis.com/youtube/v3/playlistItems?part=snippet,contentDetails,status&maxResults=50&playlistId=UUbCmjCuTUZos6Inko4u57UQ&key=",
    key
  )

api_result <- GET(playlist_url)
json_result <- content(api_result, "text", encoding = "UTF-8")
videos.json <- fromJSON(json_result)
videos.json$nextPageToken
# totalResults is nested under pageInfo in the response
videos.json$pageInfo$totalResults

pages <- list(videos.json$items)
counter <- 1  # number of pages fetched so far

while (!is.null(videos.json$nextPageToken)) {
  next_url <-
    paste0(playlist_url, "&pageToken=", videos.json$nextPageToken)
  counter <- counter + 1
  message("Retrieving page ", counter)
  api_result <- GET(next_url)
  json_result <- content(api_result, "text", encoding = "UTF-8")
  videos.json <- fromJSON(json_result)
  # Append after the first page instead of overwriting it
  pages[[counter]] <- videos.json$items
}
## Combine all the dataframe into one
all_videos <- rbind_pages(pages)
## Get a list of video
videos <- all_videos$contentDetails$videoId

all_videos should give us all the fields for the video. All we care about at this stage is the videoId so we can fetch detailed information on each video.
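
The paging logic can also be factored into a small helper and checked without calling the API. This is only a sketch: fetch_all_pages(), fake_pages, and fake_fetch() are hypothetical names, and the fake fetcher stands in for the GET plus fromJSON step.

```r
# Generic "follow the nextPageToken" loop. fetch_page is a hypothetical
# function that takes a page token (or NULL for the first page) and
# returns a list with `items` and, optionally, `nextPageToken`.
fetch_all_pages <- function(fetch_page) {
  pages <- list()
  token <- NULL
  repeat {
    page <- fetch_page(token)
    pages[[length(pages) + 1]] <- page$items
    token <- page$nextPageToken
    if (is.null(token)) break
  }
  pages
}

# Offline stand-in for the API: three pages keyed by token.
fake_pages <- list(
  first = list(items = c("a", "b"), nextPageToken = "p2"),
  p2    = list(items = c("c", "d"), nextPageToken = "p3"),
  p3    = list(items = "e")
)
fake_fetch <- function(token) {
  if (is.null(token)) fake_pages$first else fake_pages[[token]]
}

all_items <- unlist(fetch_all_pages(fake_fetch))
print(all_items)
```

Appending with `length(pages) + 1` avoids keeping a separate counter in sync with the list.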

Iterate the Video List and Fetch Data For Each Video In R

Once all the videos are stored in a vector, we can replicate a similar process as we did for the playlist. It will be much easier this time since we don’t have to handle the pagination.

At this stage, what matters is which fields we will eventually pull from the videos API call. I selected the ones needed for the later data analysis and visualization. To avoid pulling this data again, it’s better to persist it to a CSV file, so we don’t have to run the API calls multiple times.

videos_df <- data.frame()
video_url <-
  paste0(
    "https://www.googleapis.com/youtube/v3/videos?part=contentDetails,id,liveStreamingDetails,localizations,player,recordingDetails,snippet,statistics,status,topicDetails&key=",
    key
  )

for (v in videos) {
  a_video_url <- paste0(video_url, "&id=", v)
  print(v)
  print(a_video_url)
  api_result <- GET(a_video_url)
  json_result <- content(api_result, "text", encoding = "UTF-8")
  videos.json <- fromJSON(json_result, flatten = TRUE)
  # colnames(videos.json$items)
  video_row <- videos.json$items %>%
    select(
      snippet.title,
      snippet.publishedAt,
      snippet.channelTitle,
      snippet.thumbnails.default.url,
      player.embedHtml,
      contentDetails.duration,
      statistics.viewCount,
      statistics.commentCount,
      statistics.likeCount,
      statistics.favoriteCount,
      snippet.tags
    )
  videos_df <- rbind(videos_df, video_row)
}

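# snippet.tags is a list column, which write.csv cannot serialize; collapse it to text first
videos_df$snippet.tags <- vapply(videos_df$snippet.tags, paste, character(1), collapse = "|")
write.csv(videos_df, "~/cocomelon.csv", row.names = TRUE)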
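
The loop above issues one request per video. To my knowledge the videos endpoint also accepts a comma-separated list of IDs in the id parameter (up to 50 per call), so the IDs could be batched to cut the number of requests. A minimal sketch of the batching step only, with batch_ids() as a made-up helper name and toy IDs:

```r
# Split a vector of video IDs into comma-separated batches.
batch_ids <- function(ids, size = 50) {
  groups <- split(ids, ceiling(seq_along(ids) / size))
  vapply(groups, paste, character(1), collapse = ",", USE.NAMES = FALSE)
}

# Toy IDs for illustration; real ones would come from `videos`.
toy_ids <- paste0("id", 1:120)
batches <- batch_ids(toy_ids)
length(batches)
```

Each element of `batches` could then be appended after "&id=" in place of a single video ID, and the returned items data frame would hold one row per video in the batch.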

Explore the Cocomelon YouTube Video Data in R

The data is now ready for the next stage: exploring the Cocomelon YouTube videos. It’s time to perform some cleanup and create visualizations to show the findings.

The API returns every value as a character string, which doesn’t sort correctly later on, so we need to convert some columns to numeric or date types.

videos_df <- videos_df %>% transform(
  statistics.viewCount = as.numeric(statistics.viewCount),
  statistics.likeCount = as.numeric(statistics.likeCount),
  statistics.favoriteCount = as.numeric(statistics.favoriteCount),
  snippet.publishedAt = as.Date(snippet.publishedAt)
)
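
A short illustration of why the conversion matters, using three view counts from this dataset as strings. The timestamp at the end is a made-up example value in the ISO 8601 style the API uses.

```r
# Values as they arrive from the API: everything is a string.
view_chr <- c("6053444903", "977383", "4989894294")

# Character sorting compares digit by digit, so "977383" wrongly comes first.
sort(view_chr, decreasing = TRUE)

# Numeric sorting gives the expected order. as.numeric() is used rather than
# as.integer() because these counts exceed R's 32-bit integer range.
view_num <- as.numeric(view_chr)
sort(view_num, decreasing = TRUE)

# as.Date() keeps only the date part of an ISO 8601 timestamp.
# The timestamp below is a made-up example value.
published <- as.Date("2018-05-02T07:00:00Z")
published
```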

What are the top 5 most viewed Cocomelon videos?

This part is straightforward. We select the fields we are interested in, then sort the videos in descending order by statistics.viewCount.

videos_df %>%
  select(snippet.title, statistics.viewCount) %>% 
  arrange(desc(statistics.viewCount)) %>% head(5)

# Output:
#                                                    snippet.title statistics.viewCount
#1               Bath Song | CoComelon Nursery Rhymes & Kids Songs           6053444903
#2       Wheels on the Bus | CoComelon Nursery Rhymes & Kids Songs           4989894294
#3     Baa Baa Black Sheep | CoComelon Nursery Rhymes & Kids Songs           3532531580
#4 Yes Yes Vegetables Song | CoComelon Nursery Rhymes & Kids Songs           2906268556
#5 Yes Yes Playground Song | CoComelon Nursery Rhymes & Kids Songs           2820997030

If you have watched Cocomelon videos before, it’s not surprising to see “Bath Song,” “Wheels on the Bus,” and “Baa Baa Black Sheep” in the top 3. The result matches the Cocomelon popular tab on YouTube. Also, “Bath Song” has been played over 20% more times than the second video, “Wheels on the Bus.” My guess is that many toddlers struggle with bath time, and watching this video shows them how to take a bath and helps calm them down.
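
The “20%+” figure can be checked directly from the two view counts in the output above:

```r
bath_song <- 6053444903
wheels_on_the_bus <- 4989894294

# Percentage by which the top video leads the runner-up
pct_more <- (bath_song / wheels_on_the_bus - 1) * 100
round(pct_more, 1)
```

This comes out at roughly 21%.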

We can also create a bar chart of the top 5 videos:

# Keep only the top 5 videos by view count for the chart
chart_df <- videos_df %>%
  arrange(desc(statistics.viewCount)) %>% head(5)

ggplot(data = chart_df, mapping = aes(x = reorder(snippet.title, statistics.viewCount), y = statistics.viewCount)) +
  geom_bar(stat = "identity", fill = "lightgreen") +
  scale_x_discrete(labels = function(x) str_wrap(x, width = 16)) +
  theme_minimal()

What’s the correlation between views and likes?

Are the number of views and the number of likes correlated? Is a video with more views more likely to get more thumbs up (likes)?

We can use the data to check. First, we rescale viewCount to millions and likeCount to tens of thousands so the values fit better on the chart. Second, we compute the number of days since each video was published, to see when the popular videos were created.

chart_df <- videos_df %>%
  mutate(
    views = statistics.viewCount / 1000000,
    likes = statistics.likeCount / 10000,
    number_days_since_publish = as.numeric(Sys.Date() - snippet.publishedAt)
  )

ggplot(data = chart_df, mapping = aes(x = views, y = likes)) +
  geom_point() +
  geom_smooth(method = lm) + 
  theme_minimal()

cor(chart_df$views, chart_df$likes, method = "pearson")
## 0.9867712

The correlation coefficient is 0.987, which is a very strong positive correlation: the more views a video has, the more thumbs up it is likely to have. It’s also fascinating that only six videos have more than 2B views: parents and kids enjoy those six videos and potentially watch them many times.
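
One caveat worth a quick sketch: if any video were missing a like count, as.numeric() would produce NA for it and cor() would return NA for the whole column. The toy numbers below are invented to show the behavior and the use argument that works around it.

```r
# Made-up values; one like count is missing.
views <- c(10, 20, 30, 40)
likes <- c(1.1, 2.0, NA, 4.2)

# A single NA makes the default result NA.
cor(views, likes, method = "pearson")

# use = "complete.obs" drops the incomplete pair before computing.
cor(views, likes, method = "pearson", use = "complete.obs")
```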

We can further plot the popular videos against their age and find that the most popular ones are 1500–2000 days old, which means they were published around 2018 or 2019.
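
A minimal sketch of such a plot, assuming a data frame with the number_days_since_publish and views columns computed earlier. The toy_df values are invented placeholders so the snippet runs on its own; with the real data, chart_df would take its place.

```r
library(ggplot2)

# Toy stand-in for chart_df; the numbers are placeholders, not real data.
toy_df <- data.frame(
  number_days_since_publish = c(300, 900, 1600, 1800, 1950),
  views = c(50, 400, 6000, 5000, 3500)
)

# Video age on the x axis, views (in millions) on the y axis
age_plot <- ggplot(toy_df, aes(x = number_days_since_publish, y = views)) +
  geom_point() +
  theme_minimal()
print(age_plot)
```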

How to Check the New Trending Cocomelon Video?

The most popular videos are easy to retrieve. However, videos published 4 or 5 years ago can still look like they are trending simply because they collect so many views every day.

How about finding new Cocomelon videos whose views are growing fast? The YouTube API only returns the view count as of now, so we need to pull the data from the API more than once, a few days apart, and store each snapshot.
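
One way to keep the snapshots apart is to put the pull date in the file name. The helper below is a hypothetical sketch; its zero-padded naming scheme is an assumption and differs slightly from the file names used in the code that follows.

```r
# Build a dated file name for one snapshot. snapshot_path() is a made-up
# helper, and the zero-padded date format is an assumption of this sketch.
snapshot_path <- function(date = Sys.Date(), dir = "~") {
  file.path(dir, paste0("cocomelon_", format(date, "%Y_%m_%d"), ".csv"))
}

snapshot_path(as.Date("2023-03-02"))
```

Calling `write.csv(videos_df, snapshot_path())` after each pull would then leave one file per day.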

				
df1 <- read_csv("~/cocomelon_2023_2_28.csv")
df2 <- read_csv("~/cocomelon_2023_3_2.csv")

df1<- df1 %>% transform(
  statistics.viewCount = as.numeric(statistics.viewCount)
)

df2<- df2 %>% transform(
  statistics.viewCount = as.numeric(statistics.viewCount),
  snippet.publishedAt = as.Date(snippet.publishedAt)
)

df1 <- df1 %>% select(snippet.title,
                      statistics.viewCount)
df2 <- df2 %>% select(snippet.title,
                      snippet.publishedAt,
                      statistics.viewCount)

# Join data by snippet.title
joined_df <- inner_join(df1, df2, by = 'snippet.title')
joined_df <- joined_df %>%
  mutate(
    view_delta = statistics.viewCount.y - statistics.viewCount.x,
    number_days_since_publish = as.numeric(Sys.Date() - snippet.publishedAt)
  )

# Recent Video uploaded within 200 days and top 5 of them by view delta
chart_df <- joined_df %>%
  filter(number_days_since_publish<=200) %>% 
  select(snippet.title, view_delta) %>%
  arrange(desc(view_delta)) %>% head(5)

ggplot(data = chart_df,
       mapping = aes(
         x = reorder(snippet.title, view_delta),
         y = view_delta
       )) +
  geom_bar(stat = "identity", fill = "lightblue") +
  scale_x_discrete(
    labels = function(x)
      str_wrap(x, width = 16)
  ) +
  theme_minimal()

# Output
#                                                                 snippet.title view_delta
#1 🔴 CoComelon Songs Live 24/7 -  Bath Song + More Nursery Rhymes & Kids Songs    2074257
#2                  Yes Yes Fruits Song | CoComelon Nursery Rhymes & Kids Songs    1709434
#3                        Airplane Song | CoComelon Nursery Rhymes & Kids Songs     977383
#4                    Bingo's Bath Song | CoComelon Nursery Rhymes & Kids Songs     951159
#5    Fire Truck Song - Trucks For Kids | CoComelon Nursery Rhymes & Kids Songs     703467
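
The `.x` and `.y` suffixes in the code above come from the join: both snapshots have a statistics.viewCount column, so dplyr renames them to keep the two apart. A toy sketch with made-up numbers shows this, along with a uniqueness check on the titles, since a duplicated title would multiply rows in the join.

```r
library(dplyr)

# Made-up snapshots for illustration only.
snap1 <- data.frame(snippet.title = c("Song A", "Song B"),
                    statistics.viewCount = c(100, 500))
snap2 <- data.frame(snippet.title = c("Song A", "Song B", "Song C"),
                    statistics.viewCount = c(180, 520, 40))

# Titles are used as the join key, so check they are unique first.
stopifnot(!any(duplicated(snap1$snippet.title)),
          !any(duplicated(snap2$snippet.title)))

# inner_join keeps only titles present in both snapshots ("Song C" is dropped).
deltas <- inner_join(snap1, snap2, by = "snippet.title") %>%
  mutate(view_delta = statistics.viewCount.y - statistics.viewCount.x) %>%
  arrange(desc(view_delta))
deltas
```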

The top trending video is 🔴 CoComelon Songs Live 24/7. A live stream like this lets parents keep the videos rotating automatically without having to switch between them. The remaining videos are individual songs that look like good recommendations.

Final Thoughts

There are many videos for kids to watch on YouTube. Cocomelon alone has hundreds, and I want to show my kid the good ones within the limited time he is allowed to watch each day. Finding those trending videos is a fascinating exploration for a data professional.

I hope my post is helpful to you. As the next step, I will continue my journey in R and use Shiny to build an interactive application for users.

