Google BigQuery is a serverless data warehouse fully managed by Google. It can store and serve huge datasets and makes querying them straightforward.
We can access Google BigQuery from the Google Cloud Console, which hosts many useful APIs and other services for building, deploying, and scaling our applications. One such useful service of the Google Cloud Console is the API for scraping YouTube data.
Refer to this article to learn how to use the Google Cloud Console for scraping YouTube data.
With Google BigQuery, you can do everything at scale: query files stored in your Google Drive, create your own datasets and projects, and use the existing, regularly updated public datasets for querying. Since it uses ANSI SQL internally, it supports querying both inside and outside the platform.
That brings us to the main question: can we load the data from this platform into a data frame? We sure can! The pandas library has methods for nearly any data interchange, and it has a dedicated method for reading data from BigQuery.
Learn more about Pandas Library here
Before we move on to querying the data into a data frame, there are some prerequisites to performing this task.
Prerequisites to Use Google BigQuery
Before we continue, please ensure that you have a Google account to use.
Open your browser and visit the Google Cloud Console. You will see an icon saying Create Project. We need a project in order to access the datasets on the BigQuery platform.
Give the project a suitable name so that it will be easy to locate, and select it from the project drop-down. Make sure to copy the Project ID of your current project and store it somewhere securely. After you create a project, we need to create a service account in it. In the left navigation menu, look for IAM & Admin.
You will find the Create Service Account icon at the top of the page. Create a service account.
After you click on create you will be directed to name the account. Give a suitable name and a description. For the permissions, select BigQuery Data Viewer and BigQuery User.
After you create a service account, you will see the details of the account on the page. Click on the three dots and navigate to Manage keys. In the ADD KEY drop-down, select Create new key.
Make sure the Key type is JSON. A JSON file will then be downloaded to your system. Save it in a secure place because we will need it later.
That's all for the prerequisites. Make sure to save your Project ID and the JSON key.
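For reference, the downloaded key is a plain JSON file with a few well-known fields. The sketch below uses placeholder values (not a real key) and a small helper, written for this article, to sanity-check that a parsed key file looks like a service-account key:

```python
# Placeholder contents of a service-account key file (NOT a real key).
key_contents = {
    "type": "service_account",
    "project_id": "my-project-id",
    "private_key_id": "abc123",
    "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
    "client_email": "my-sa@my-project-id.iam.gserviceaccount.com",
}

def looks_like_service_account_key(d):
    """Sanity-check that a parsed key file has the fields we rely on."""
    required = {"type", "project_id", "private_key", "client_email"}
    return d.get("type") == "service_account" and required <= d.keys()

print(looks_like_service_account_key(key_contents))  # True
```

If this check fails on a file you downloaded, you most likely exported the wrong kind of key.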
Let us see how we can access the data sets and tables from Google BigQuery.
On the welcome page of the Cloud Console (make sure you are in the project you created), navigate to the BigQuery section from the navigation menu. After you enter the BigQuery section, click on View Dataset. Repeat the step on the product details page.
After you click on View Dataset, you will see a new dataset loading on the left side of the page. This is a public collection containing thousands of datasets with many tables inside them. You can select any dataset and a table from it.
You can select any dataset you like. After you decide which dataset you want to use, click on the drop-down arrow to choose a table. Note down the table ID, as we need it for querying.
Exploring the read_gbq Method
The read_gbq method allows us to connect to Google BigQuery and query its tables. Let us see the method and its arguments.
pandas.read_gbq(query, project_id=None, index_col=None, col_order=None, reauth=False, auth_local_webserver=True, dialect=None, location=None, configuration=None, credentials=None, use_bqstorage_api=None, max_results=None, progress_bar_type=None)
Let us see the important parameters of this method.
query – This argument stores the query you write for loading the data.
project_id – This argument stores the ID of the project you use for querying.
dialect – Determines the SQL dialect used for querying. The options are 'legacy' and 'standard' (the default).
credentials – This argument stores the Google authorization credentials.
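Putting these parameters together, a minimal call looks like the sketch below. The project ID is a placeholder, and the table is one of BigQuery's public datasets; since running it for real requires valid credentials, the call itself is commented out:

```python
import pandas as pd

# Placeholder project ID -- replace with your own.
projid = "my-project-id"

# A query against one of BigQuery's public datasets.
query = """
SELECT name, SUM(number) AS total
FROM `bigquery-public-data.usa_names.usa_1910_2013`
GROUP BY name
ORDER BY total DESC
LIMIT 5
"""

# With valid credentials, this returns a DataFrame of the five most
# common names in the dataset:
# df = pd.read_gbq(query, project_id=projid, dialect="standard")
```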
How to Query a Table and Return a Data Frame?
There is one package we need to install before starting.
pip install pandas-gbq
Since we are using Google Colaboratory, we don't need to install the package manually; it is already installed. If you are using other notebooks or environments, install it first.
Let us import the libraries and packages.
import pandas as pd
from google.cloud import bigquery
from google.oauth2 import service_account
The pandas library is imported to create a data frame, and the bigquery package is imported from google.cloud. We also import the service_account module to authorize the key.
Store the path of the json file you downloaded earlier in a variable called credspath.
credspath = 'path_to_jsonfile'
Next, we are going to authorize the credentials and store them in a variable called creds.
creds = service_account.Credentials.from_service_account_file(credspath)
Let us bring the project id to our environment.
projid = 'your_project_id'
Let us start querying now!
We are going to see two examples. In the first example, we are going to load the entire table into a data frame. In the second example, we are going to select only a few rows from the data frame on a condition.
Loading an Entire Table Into A Data Frame
The dataset we use for this example is a genomics dataset. It was released in 2016 and contains information about cannabis species.
This dataset has column names like end, among others.
The query to load the entire data is:
query = """
SELECT *
FROM `bigquery-public-data.genomics_cannabis.MNPR01_201703`
LIMIT 100
"""
The variable query stores the query we write to load the data.
SELECT * is a statement used to select all the rows of a table.
The FROM clause refers to the table we are accessing the data from.
bigquery-public-data is the project that hosts the public datasets, genomics_cannabis is the dataset, and MNPR01_201703 is the table.
LIMIT 100 makes sure that at most 100 rows are returned.
df = pd.read_gbq(query, project_id=projid, credentials=creds)
print(df)
The query we just wrote, the project ID, and the credentials are passed as arguments to the read_gbq method, which returns a data frame called df.
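Whatever the query, read_gbq returns an ordinary pandas DataFrame, so the usual inspection tools apply. A quick sketch, using a small hand-built frame as a stand-in for the query result:

```python
import pandas as pd

# A tiny stand-in for the DataFrame that pd.read_gbq returns.
df = pd.DataFrame({"name": ["a", "b", "c"], "value": [10, 20, 30]})

print(df.shape)           # (3, 2): three rows, two columns
print(df.head(2))         # first two rows
print(df["value"].sum())  # 60
```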
Querying a Column Based on a Condition
The dataset we used for this example is an analysis of COVID cases in India. It has columns like state, confirmed, deaths, and so on.
We are going to display the states whose confirmed cases are between 45 and 176235.
query = """
SELECT state
FROM `bigquery-public-data.dataflix_covid.india_covid`
WHERE confirmed BETWEEN 45 AND 176235
LIMIT 100
"""
We are selecting the state column from the india_covid table of the dataflix_covid dataset. The WHERE clause selects the states whose confirmed cases are between 45 and 176235, and the number of rows is limited to 100.
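For intuition, the WHERE ... BETWEEN filter behaves like pandas' Series.between on a local frame. A small sketch with made-up figures (not the real dataset):

```python
import pandas as pd

# Made-up figures, only to illustrate the filter -- not real data.
df = pd.DataFrame({
    "state": ["A", "B", "C", "D"],
    "confirmed": [10, 45, 90000, 200000],
})

# Equivalent of: SELECT state ... WHERE confirmed BETWEEN 45 AND 176235
# (between() is inclusive on both ends, matching SQL's BETWEEN).
result = df.loc[df["confirmed"].between(45, 176235), "state"]
print(result.tolist())  # ['B', 'C']
```

Pushing the filter into the SQL query, as the article does, is still preferable for large tables: BigQuery then returns only the matching rows instead of the whole table.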
To conclude the tutorial, we have learned about the diverse uses of Google BigQuery. It is generally used to store and manipulate data with the help of querying.
We have seen a step-by-step approach to creating a service account, accessing the credentials, and viewing the datasets.
We have looked at two ways of loading BigQuery data into a data frame. The first loads an entire table into a data frame. The second queries a column of the table based on a certain condition.
Learn more about the method from the pandas official documentation.