Scraping YouTube Data Using Python

Using Python to Fetch YouTube Data

Web scraping is one of the most popular and valuable techniques data scientists and enthusiasts use to gather large amounts of data. While learning or experimenting with data, you need plenty of it to draw meaningful conclusions. Web scraping lets us fetch data from websites, social media platforms, e-commerce stores, and other applications.

With web scraping, we can obtain data from the applications mentioned above for analysis, personal projects, and more. While scraping data from a site, we must be mindful of ethical limits; the data you scrape, and what you do with it, must not harm the source. Used responsibly, web scraping is a potent tool for importing data.

Some sites allow web scraping, and some do not, for obvious reasons: the data on a site can be misused if hackers or trespassers get hold of it. Before scraping a site's data, make sure the site allows scraping.

YouTube is one of the most used applications after Google, and we all know how much time we spend on it every day. YouTube has a highly curated home page that satisfies almost everyone, and it is used for education, entertainment, travel, and more.

Have you ever wondered whether you can analyze the stats of your favorite YouTube channel? What if you could visualize the number of views the channel gets and the popularity of its videos? That is possible with Python. But what if you could scrape YouTube data yourself?

We are going to do just that in a few moments. So whether you are here for your next big data science project or as a beginner, stick around till the end, and you will have a great project! However, if you are scraping bulk data, there is a high chance that YouTube will block your IP; in that case, you can look at the Bright Data Scraping Browser to bulk scrape YouTube data.

This article focuses on three topics: scraping YouTube data using Python, analyzing it, and visualizing it.

The Google Developer Console

To start off, you need a Google account for this step. So if you do not have one yet, please create one and continue.

The Google developer console is part of the Google Cloud console. It lets developers manage Google resources and provides APIs for Google Maps, YouTube, and other services. Why do we need this console? We are scraping data from YouTube, and to do that, we need to enable certain permissions and fetch an API key.

Let us see how we can obtain the key.

In your browser, search for Google developer console and click the first link. You might see a pop-up showing terms and conditions; check the box and click AGREE AND CONTINUE. Next, you need to create a project in the console. Refer to the image below to create a project.

Create First Project

Creating your project might take some time. When it's done, click on your project and navigate to Credentials under APIs and services on the left side, as shown below.

Credentials

Under '+ CREATE CREDENTIALS', select API key and copy the generated key.

API Key

Save the key somewhere because we will need it in later steps.

You also need to enable the API services for the YouTube Data API.

To do so, navigate to the Library section as shown below.

Navigating To API Library
Selecting YouTube Data API v3
Enable The API

The first step is completed!

YouTube API

The YouTube API documentation is the official reference for integrating YouTube into our project environment. It contains all the resources and code snippets for the scraping we are going to perform.

The YouTube API has code samples for each activity, which can be tested in our environment. The code samples are written in multiple languages, and we can select any language we prefer.

So let’s get started!

Search for YouTube API in your browser and click the first link. Before we get started with the resources, let us check the requirements for using the API.

You can find the requirements for Python by following the below navigation steps.

Add YouTube functionality to your site > Quickstarts > Python

As per the documentation, the API is compatible only with Python 2.7, or 3.5 and above.
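If you are not sure which version you are running, you can check from Python itself (a quick sanity check, nothing YouTube-specific):

import sys

# The YouTube API client requires Python 2.7 or 3.5+
print(sys.version_info)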

There are also a few limitations to the API key we generated in the first step. Let us take a look at them.

Limitations to the API Key

The key we generated earlier has a default allocation of 10,000 quota units per day. Once that quota is exhausted, the key cannot be used again on the same day. Let us take a look at some common operations and their quota costs.

A read operation, such as retrieving channel, video, or playlist data, usually costs 1 unit.

A write operation, such as creating, updating, or deleting a resource, can cost 50 units.

A typical search operation can take up to 100 units.

And a video upload costs 1,600 units.
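To put these costs in perspective, here is a quick back-of-the-envelope calculation of what a single day's default quota buys:

# Rough daily capacity under the default 10,000-unit quota
DAILY_QUOTA = 10000
print(DAILY_QUOTA // 1)     # about 10,000 read operations
print(DAILY_QUOTA // 100)   # about 100 search operations
print(DAILY_QUOTA // 1600)  # about 6 video uploads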

Bright Data Scraping Browser

Since the YouTube API limit is not suitable for enterprises, we recommend Bright Data’s Scraping Browser. It is a powerful tool designed for scraping web data. This browser provides easy access to target websites and allows you to interact with their HTML code to extract relevant data. Unlike other automated browsers, Scraping Browser is the only browser with built-in website unblocking capabilities, including CAPTCHA solving, browser fingerprinting, and automatic retries. This means you can save time and resources while bypassing the toughest website blocks and outsmarting any bot-detection software.

Additionally, Scraping Browser is compatible with Puppeteer and Playwright APIs, making it seamless to migrate and use the remote browser. You can easily fetch any number of browser sessions, interact with them, and retrieve data from websites that require interactions. With Scraping Browser, you can grow your data scraping projects with as many browsers as you need without worrying about developing and maintaining complicated infrastructure in-house.

Scraping YouTube Data

Now, before we start, there are two packages we need to install from the YouTube API documentation.

Installing Packages

We need to install the two packages marked in red. The commands in the screenshot work in a terminal or virtual environment, but if you are using a notebook like Colab, Jupyter, or JupyterLab, prefix the commands with an exclamation mark so they run as shell commands:

!pip3 install --upgrade google-api-python-client
!pip3 install --upgrade google-auth-oauthlib google-auth-httplib2

The first command installs the API client used to access Google's API services, and the second installs the libraries used to authorize the YouTube user/owner.

Remember the API key you saved in the first step? Copy that key into one of the cells of the notebook.
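For example, assign it to a variable; the value below is just a placeholder for the key you generated:

# Placeholder: paste the key you generated in the Google developer console
apikey = "YOUR_API_KEY"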

Let us install and import all the necessary libraries for scraping, analyzing, and visualizing.

# Install two more packages (the ! prefix runs shell commands in a notebook)
!pip3 install isodate
!pip3 install wordcloud

from googleapiclient.discovery import build
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from google.oauth2.credentials import Credentials
from google_auth_oauthlib.flow import InstalledAppFlow
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

The build function in the first import is used to create an API client that lets us interact with Google's services. We will use it to build a client that takes our API key and gives us access to YouTube's services.

The pandas library is used to create a data frame out of the data we scrape, while seaborn and Matplotlib are used for data visualization.

The next two imports, Credentials and InstalledAppFlow, support the credentials we use.

The isodate package is used to convert the ISO 8601 duration format returned by the YouTube API into something we can understand.

Finally, the wordcloud module is used to display the text beautifully, and the stopwords corpus from the nltk library is used to remove stop words from the data.

Scraping the Channel Stats

Before we move on with scraping, we need to build an API client. Follow the code below to build the client.

youtube = build('youtube', 'v3', developerKey=apikey)

Here, youtube is the name of the API, and v3 is the version. The apikey variable holds the key generated earlier.

The channel we are going to scrape data from is freeCodeCamp.org, a popular channel trusted by many students and developers.

We are going to store the ID of this channel in a variable called cids. The channel ID can be obtained by visiting the YouTube channel and copying the ID from the URL in the address bar (for example, https://www.youtube.com/channel/UC8butISFwT-Wl7EV0hUK0BQ). You can use any other channel if you like.

cids=['UC8butISFwT-Wl7EV0hUK0BQ']

Now, we don't need to write the scraping code from scratch. All the examples are given in the API documentation, and we will modify them to suit our use case.

In the YouTube API documentation, navigate to the Search for Content tab on the home page, and then to the Channels list, as shown below.

ChannelID

In the use cases table, click the symbol next to list (by channel ID). You will land on a code page; select Python as the language.

Code For ChannelID

The code we use is shown below.

ChannelID Code

We will modify this code a bit and create a function to scrape the data.

def chstats(youtube, cids):
    # Request the snippet, contentDetails, and statistics parts for each channel
    request = youtube.channels().list(
        part="snippet,contentDetails,statistics",
        id=','.join(cids))
    response = request.execute()
    channels_data = []
    for item in response['items']:
        data = {
            'ChannelName': item['snippet']['title'],
            'Subscribers': item['statistics']['subscriberCount'],
            'Views': item['statistics']['viewCount'],
            'TotalVideos': item['statistics']['videoCount'],
            'PlaylistID': item['contentDetails']['relatedPlaylists']['uploads']
        }
        channels_data.append(data)
    df = pd.DataFrame(channels_data)
    return df

chstats(youtube, cids)

First, we define a function called chstats that fetches the channel data. We create an empty list called channels_data to store the details, and the response object holds everything the API returns. The fields we need go into a dictionary called data: the channel name, the number of subscribers, the total views, the total number of videos, and the uploads playlist ID.

Each dictionary is appended to the list we created. The list is then used to render a data frame, because that's what we planned to do right from the start.

Learn how to create a data frame from a dictionary.

The function is then called to print the data frame.

Related: How to return a reshaped data frame?

Scraping The Channel Stats

And just like that, we have a data frame that stores all the channel details.

We will use the PlaylistID obtained from the previous step to conduct an in-depth analysis.

Scraping the Video IDs of the Channel

We can get the ID of each video on the channel. Just like in the example above, navigate to PlaylistItems in the documentation to get the code.

PlaylistID Code

# Define the playlist ID for the channel
PlaylistID = "UU8butISFwT-Wl7EV0hUK0BQ"
# Define a function to retrieve the video IDs for a given playlist
def vdids(youtube, playlist_id):
    videoids = []
    request = youtube.playlistItems().list(
        part="snippet,contentDetails",
        playlistId=playlist_id,
        maxResults=50)
    response = request.execute()
    for item in response['items']:
        videoids.append(item['contentDetails']['videoId'])
    return videoids

# Call the vdids function to retrieve the video IDs for the given playlist
video_ids = vdids(youtube, PlaylistID)
print(video_ids)

In this code, we get the IDs of the first 50 videos on the channel. The maximum value of maxResults is 50, so if we want more IDs, we need to page through the results, as sketched below.
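Here is a minimal sketch of what that pagination could look like, assuming the same youtube client; the client's list_next method builds the request for the next page and returns None once every page has been fetched:

# Fetch every video ID in a playlist by paging through the results
def all_video_ids(youtube, playlist_id):
    videoids = []
    request = youtube.playlistItems().list(
        part="contentDetails",
        playlistId=playlist_id,
        maxResults=50)
    while request is not None:
        response = request.execute()
        for item in response['items']:
            videoids.append(item['contentDetails']['videoId'])
        # list_next returns None when there is no nextPageToken
        request = youtube.playlistItems().list_next(request, response)
    return videoids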

Scraping The VideoIDs

Using the video IDs obtained in the above step, we can create a data frame of all the required video details.

def getvid_details(youtube, video_ids):
    all_info = []
    # The API accepts at most 50 IDs per request, so process the list in batches
    for i in range(0, len(video_ids), 50):
        request = youtube.videos().list(
            part="snippet,contentDetails,statistics",
            id=','.join(video_ids[i:i+50])
        )
        response = request.execute()
        for video in response['items']:
            # The fields to keep from each part of the response. Note that the
            # API spells the statistic 'favoriteCount', so the 'favouriteCount'
            # key below is never found and always ends up as None.
            keepstats = {
                'snippet': ['channelTitle', 'title', 'description', 'tags', 'publishedAt'],
                'statistics': ['viewCount', 'likeCount', 'favouriteCount', 'commentCount'],
                'contentDetails': ['duration', 'definition', 'caption']
            }
            video_info = {}
            video_info['video_id'] = video['id']
            for k in keepstats.keys():
                for v in keepstats[k]:
                    try:
                        video_info[v] = video[k][v]
                    except KeyError:
                        video_info[v] = None
            all_info.append(video_info)
    return pd.DataFrame(all_info)

The getvid_details function is used to obtain details about each video, such as the title, the description, the tags, various counts, and the duration.

We create an empty dictionary called video_info to store the details. Some videos may not have tags, so to avoid an error in those cases, we use a try-except block.

The list of dictionaries is then converted to a data frame.

vdf=getvid_details(youtube,video_ids)
vdf.head()

The data frame is called vdf. We are printing the first 5 entries of the data frame using the head function.

Data Frame

And that is how we scrape YouTube data with the help of Python. Now that we have the final data frame, let us perform some data preprocessing and visualize it.

Data Preprocessing

Let us take the data frame and analyze it.

vdf.isnull().any()

The isnull() function checks whether any of the columns in the data frame contain null values.

Checking If The Data Frame Contains Null Values

As you can see from the output, the tags and favouriteCount fields have null values. You can either drop them or ignore them for now.

Let us check the data types of the fields in the data frame.

vdf.dtypes
Data Types Of The Objects

As you can see, the data types of the viewCount, likeCount, and commentCount fields are objects. But they should be numeric, don’t you think?

Let us change the data types of these fields.

num_cols = ['viewCount', 'likeCount', 'favouriteCount', 'commentCount']
vdf[num_cols] = vdf[num_cols].apply(pd.to_numeric, errors='coerce', axis=1)

This code snippet will convert the data types of the fields from object to numeric.

If you take a look at the duration field of the data frame vdf, you will find it in the ISO 8601 duration format, such as PT1H13M52S. But we need it in a time format, so let us change this field too.

import isodate

# parse_duration turns an ISO 8601 string such as 'PT1H13M52S' into a timedelta
vdf['durationSec'] = vdf['duration'].apply(lambda x: isodate.parse_duration(x))
vdf[['duration', 'durationSec']]

And the output is given below.

Duration In Hours
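If you would rather work with a plain number of seconds, matching the durationSec name, an optional follow-up step could convert the parsed timedeltas:

# Optional: express each duration as a number of seconds
vdf['durationSec'] = pd.to_timedelta(vdf['durationSec']).dt.total_seconds()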

That concludes the preprocessing step. You can even add a few other steps and analyze them.

Visualizing the Scraped Data

Let us start with a simple visualization: plotting the number of views against the number of likes to check whether people actually watch the videos or just like them and leave. If the relationship is linear, we can say that everyone who watches a video also likes it.

# Plot a random sample of 25 videos so the scatter plot stays readable
sample_df = vdf.sample(n=25, random_state=42)
plt.scatter(sample_df['viewCount'], sample_df['likeCount'])
plt.xticks(rotation=90)
plt.xlabel('Views')
plt.ylabel('Likes')
plt.title('Views vs Likes')
plt.show()

We take a sample data frame because the original has 50 rows, and plotting all 50 entries gets messy as the values may overlap. The x-axis tick labels are rotated so they do not overlap each other.

Views Vs Likes

Now let us check the popularity of the videos by the titles.

# Horizontal bar plot of the 11 most viewed videos
ax = sns.barplot(x='viewCount', y='title',
                 data=vdf.sort_values('viewCount', ascending=False)[0:11])

In this code, we check the videos' popularity by their titles, looking at the 11 most viewed videos. The titles would normally be assigned to the x-axis, but since they are long and hard to fit in the frame, we switched the axes.

Popularity Of Videos (Highest To Lowest)

As you can see, the video titled Basic Course for Beginners takes first place!

Now, let us look at the most frequently used words in the titles of the freeCodeCamp channel and generate a word cloud.

Learn how to create a Word Cloud using Python.

We use the nltk library to remove stop words and the wordcloud library to generate the cloud.

# Remove stop words from each title
stop_words = set(stopwords.words('english'))
vdf['title_no_stopwords'] = vdf['title'].apply(
    lambda x: [item for item in str(x).split() if item not in stop_words])

# Flatten the remaining words into one comma-separated string
all_words = list([a for b in vdf['title_no_stopwords'].tolist() for a in b])
all_words_str = ','.join(all_words)

def plotcloud(wordcloud):
    plt.figure(figsize=(30, 20))
    plt.imshow(wordcloud)
    plt.axis("off")

wordcloud = WordCloud(width=2000, height=1000, random_state=1,
                      background_color='gray', colormap='magma',
                      collocations=False).generate(all_words_str)
plotcloud(wordcloud)

First, we remove the stop words from the titles. Stop words are frequently used filler words such as the, it, a, so, and what.

Next, we create a function to plot the word cloud, specifying a gray background color and the dimensions of the cloud.

WordCloud

Of course! freeCodeCamp is all about tutorials, so the word Tutorial takes up most of the cloud. The next most frequent word is Course.

Conclusion

We have come to the end of this article. To recapitulate, we briefly discussed web scraping and how it can be a useful tool for ethically fetching loads of data. We must be careful with the sites we choose and the data we scrape; the scraping we perform should never harm the data owner.

We often find ourselves glued to YouTube, scrolling through videos. We can use scraping to fetch data from YouTube, with the help of a developer key generated in the Google Developer Console and the code samples available in the YouTube API documentation.

The YouTube API has dedicated documentation that allows us to incorporate YouTube data into our projects.

We can scrape the data by using the code snippets from the API docs. We started by fetching the basic details of the channel, like the subscriber count and the channel name, and then dug deeper using this data as a stepping stone.

We created a data frame from the scraped data and performed some preprocessing: we checked whether the fields have any null values, inspected their data types, and changed some of them.

Coming to visualization, we plotted a scatter plot of the number of views against the number of likes for a sample of the channel's videos.

Then, we checked the popularity of the videos based on their titles. Finally, we created a word cloud of the most frequently used words in the channel's titles.

Of course, you can perform a deeper analysis of the data. You can even use the developer key we obtained to perform a search based on a specific keyword; be careful not to exhaust your key while doing so! Have fun coding!
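As a parting sketch, here is what such a keyword search could look like with the same youtube client; the keyword is a made-up example, and remember that each search call costs 100 quota units:

# Search for videos matching a keyword (100 quota units per call)
request = youtube.search().list(
    part="snippet",
    q="python tutorial",  # example keyword
    type="video",
    maxResults=10)
response = request.execute()
for item in response['items']:
    print(item['snippet']['title'])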

References

You can find the page of the channel used in this article here.

The YouTube API documentation can be found here.

Bright Data Scraping Browser