Python Beautiful Soup for Easy Web Scraping


Hello, readers! In this article, we will focus on the Python Beautiful Soup module for web scraping in detail.

So, let us get started! 🙂



Web Scraping using Beautiful Soup – Crisp Overview

These days, with data science and machine learning taking precedence in the IT industry, data has gained a lot of importance.

For any specific domain or topic, there are many ways to fetch data and analyze it. Often, the data we need is spread across various websites, so we collect it from those pages and then analyze it to draw out insights.

This need gave rise to the concept of web scraping.

With web scraping, we can search through web pages, collect the necessary data from them, and store it in a customized format with ease. That is why we call it scraping data from the web.

Having understood what scraping is, let us now move on to Beautiful Soup as a module for web scraping in Python.


Python Beautiful Soup module for Web Scraping

The concept of web scraping is not as straightforward as it sounds.

First, when we wish to scrape data from a website, we need to write a script that requests the data from the web server hosting the page.

Next, with this script, we download the content of the webpage onto our workstation.

Finally, we can filter the downloaded content based on HTML tags so that only the specific information we need is extracted.

Beautiful Soup is a third-party Python library (installed as the beautifulsoup4 package and imported from bs4) that offers various functions to scrape data from web pages with ease. With Beautiful Soup, we can easily parse HTML and XML documents and pull the required data out of them.
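To make the tag-based filtering step concrete, here is a minimal sketch that parses a small HTML string and picks out elements by tag name. The HTML snippet and the variable names are made up for illustration, and the built-in html.parser is assumed so that no extra parser package is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration; it does not come from a real site.
html_doc = """
<html><body>
  <h1>Sample page</h1>
  <p class="intro">First paragraph.</p>
  <p>Second paragraph.</p>
</body></html>
"""

# "html.parser" ships with Python, so no additional parser is required.
soup = BeautifulSoup(html_doc, "html.parser")

# find() returns the first matching tag, find_all() returns every match.
heading = soup.find("h1").get_text()
paragraphs = [p.get_text() for p in soup.find_all("p")]

# Tags can also be filtered by attributes such as the CSS class.
intro = soup.find("p", class_="intro").get_text()

print(heading)
print(paragraphs)
print(intro)
```

No network request is involved here, so the snippet is a convenient way to try out selectors before pointing them at a real page.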


Scrape Google Search results with Beautiful Soup

To begin with, we will use Beautiful Soup to scrape the result titles returned when the word science is searched on Google.

First, we need to import BeautifulSoup into the Python environment, along with the requests library that downloads the page.

from bs4 import BeautifulSoup
import requests

Now, we provide the URL of the web page to be searched. We append the word science to the URL as the search query so that we get the links for results relevant to science.
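Plain string concatenation works for a single word such as science, but a query containing spaces or special characters has to be encoded first. A possible sketch using the standard library is shown below; the helper name build_search_url is hypothetical and not part of any library.

```python
from urllib.parse import urlencode


def build_search_url(query, base="https://google.com/search"):
    """Return the search URL with the query safely encoded (hypothetical helper)."""
    # urlencode escapes spaces and special characters in the query value.
    return base + "?" + urlencode({"q": query})


url = build_search_url("science")
spaced = build_search_url("data science")

print(url)
print(spaced)
```

For the single word science, this produces the same URL as the concatenation used in this article.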

Further, we set a User-Agent header, which tells the server which browser and operating system the request claims to come from. We keep a few such strings in a tuple so that one of them can be picked for each request.

A = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
       )
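One way to turn such a tuple into a headers dictionary is sketched below. The names USER_AGENTS and pick_headers are assumptions made for this illustration; random.choice is used as a shorter equivalent of indexing with random.randrange.

```python
import random

# Same User-Agent strings as in the tuple above.
USER_AGENTS = (
    "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
)


def pick_headers(agents=USER_AGENTS):
    """Build a headers dict with one randomly chosen User-Agent (hypothetical helper)."""
    return {"user-agent": random.choice(agents)}


headers = pick_headers()
print(headers)
```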

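Now, we send a GET request to the URL so that the HTML content of the search results is downloaded. Note that the headers must be passed with the headers keyword, because the second positional argument of requests.get is treated as query parameters.

requests.get(url, headers=headers)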
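To see what such a request would look like before sending it, requests lets us build a prepared request without touching the network. The sketch below is only an illustration: it passes the search term through params instead of concatenating it into the URL, and the variable names are assumptions.

```python
import requests

url = "https://google.com/search"
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36"
}

# Build the request object and prepare it; nothing is sent over the network here.
prepared = requests.Request(
    "GET", url, params={"q": "science"}, headers=headers
).prepare()

print(prepared.method)
print(prepared.url)
print(prepared.headers["user-agent"])
```

Calling requests.get(url, params={"q": "science"}, headers=headers) would send the equivalent request.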

Finally, we parse the downloaded HTML content and extract the text of every h3 heading tag, which is where the result titles are placed.

Example:

import random

import requests
from bs4 import BeautifulSoup

# Search term to look up
text = 'science'
url = 'https://google.com/search?q=' + text

# Pool of browser User-Agent strings to choose from
A1 = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
       )

# Pick one User-Agent at random for this request
Agent1 = random.choice(A1)

headers = {'user-agent': Agent1}
requ = requests.get(url, headers=headers)

# Parse the returned HTML (the 'lxml' parser needs the lxml package installed)
soup_obj = BeautifulSoup(requ.text, 'lxml')

# Print the text of every <h3> heading, followed by a separator line
for x in soup_obj.find_all('h3'):
    print(x.text)
    print('#######')

Output:

Science
#######
American Association for the Advancement of Science (Nonprofit organization)
#######
Science (Peer-reviewed journal)
#######
Science | AAAS
#######
Science
#######
Science - Wikipedia
#######
ScienceDirect.com | Science, health and medical journals
#######
science | Definition, Disciplines, & Facts | Britannica
#######
Science News | The latest news from all areas of science
#######
Science - Home | Facebook
#######
Science Magazine - YouTube
#######
Department Of Science & Technology 
#######
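Live search results change over time, so the output above will differ from run to run. To experiment with the same h3 loop offline, the following sketch runs it on a fixed snippet and also collects the link wrapped around each heading. The markup is invented for illustration and is only an assumption about how a results page might be structured; a real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only, not taken from a real results page.
sample_html = """
<div><a href="https://example.org/a"><h3>First result</h3></a></div>
<div><a href="https://example.org/b"><h3>Second result</h3></a></div>
<div><h3>Heading without a link</h3></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

results = []
for h3 in soup.find_all("h3"):
    # find_parent returns the nearest enclosing <a>, or None if there is none.
    link = h3.find_parent("a")
    results.append((h3.get_text(strip=True), link["href"] if link else None))

for title, href in results:
    print(title, href)
```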

Conclusion

With this, we have come to the end of this topic. Feel free to comment below in case you have any questions.

For more such posts related to Python programming, stay tuned with us.

Till then, Happy Learning!! 🙂