Downloading All Images From a Website using Python

Downloading all the images from a webpage one by one is a tedious, time-consuming process. A short Python script makes the job much easier. A website’s images are usually in .jpg, .jpeg, or .png format, which is useful to know when writing the code. An image on a webpage is usually embedded with the HTML <img> tag. The tools used in the code are BeautifulSoup and the requests library from Python. BeautifulSoup is a well-known web scraping library that parses the HTML content of a webpage and extracts content from it based on HTML tags. The requests library is used to fetch the HTML text of the webpage, also known as its source code.

The website to be used in the code is a blog post on Yoast. The post teaches readers how to use images to improve a page’s SEO. It has been selected because it contains a lot of images.

Here is a screenshot of the website:

Screenshot Of Website To Be Used 2

The code will first get the HTML page using the requests module. BeautifulSoup will extract all the details from the <img> tags. Using the get() method, the source of the images will be stored in the list. The content of these images will be extracted using BeautifulSoup and it would be written to an image file using File Handling in Python.

The requests module and BeautifulSoup can be installed with pip from the command line:

pip install requests
pip install beautifulsoup4

Extracting Content From the Website

The code will first extract the raw HTML content of the website. The <img> tags will then be stored in a list using BeautifulSoup; each entry contains the entire <img> tag along with its attributes and their values. The HTML extraction is done using the content attribute of the requests response. The following code demonstrates this:

from bs4 import BeautifulSoup
import requests
import os

os.mkdir('images')  # folder where the downloaded images will be saved
images = []
url = 'https://yoast.com/using-images-in-your-blog-post/'
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36', 'Referer': 'https://www.google.com/', 'Sec-Fetch-Site': 'same-origin', 'Sec-Fetch-Mode': 'navigate', 'Sec-Fetch-User': '?1', 'Sec-Fetch-Dest': 'document', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8'}
cont = requests.get(url, headers=headers).content

It should be noted that a headers argument was also passed to the get() method of requests. This makes the request look like it comes from a regular browser, reducing the chance that the server blocks it as a bot and denies access to the webpage HTML. The os module is also imported, and its mkdir() function creates a folder named ‘images’, in which all the downloaded images will be saved. The content attribute of the requests response provides the entire webpage HTML as raw bytes, which is passed as input to BeautifulSoup as shown below:

soup = BeautifulSoup(cont, 'html.parser')
imgall = soup.find_all('img')

It should be noted that the BeautifulSoup function has been given two inputs: the first is the webpage HTML discussed earlier, and the second is the name of the HTML parser. Here ‘html.parser’, Python’s built-in parser, is used; BeautifulSoup also supports other parsers such as lxml and html5lib. The imgall variable is a list of all the <img> tags found by the find_all() function of BeautifulSoup. The source attributes of each tag in imgall must still be extracted to get the actual image links.
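To see how find_all() behaves, here is a minimal, self-contained sketch that parses a small HTML snippet. The snippet, tag attributes, and values below are invented for illustration and are not taken from the Yoast page:

```python
from bs4 import BeautifulSoup

# A tiny invented HTML snippet, for illustration only
html = '<div><img src="a.png"><img data-src="b.jpg" alt="logo"></div>'
soup = BeautifulSoup(html, 'html.parser')
imgs = soup.find_all('img')   # list of all <img> tags in the snippet

print(len(imgs))              # 2
print(imgs[0]['src'])         # a.png
print(imgs[1].attrs)          # {'data-src': 'b.jpg', 'alt': 'logo'}
```

The attrs property of a tag exposes all of its HTML attributes as a dictionary, which is exactly what the source-extraction step later in the article relies on.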

Also Read: Python Beautiful Soup for Easy Web Scraping

Extracting and Saving Images From the Webpage

After getting all the <img> tags from the HTML, the next step is to search each <img> tag for a source link to the image. The image may be hosted on a free image repository such as Pixabay or Unsplash, or uploaded to the webpage’s own server; on a WordPress-managed page, WordPress hosts the images uploaded by its users. All such images can be downloaded using Python. Source extraction is done as shown below:

for img in imgall:
    try:
        imgsrc = img['data-srcset']
    except KeyError:
        try:
            imgsrc = img['data-src']
        except KeyError:
            try:
                imgsrc = img['data-fallback-src']
            except KeyError:
                try:
                    imgsrc = img['src']
                except KeyError:
                    continue  # no source attribute found, skip this tag
    images.append(imgsrc)

It should be noted that the above code snippet uses nested try-except blocks to find the source link of the image. The source link is usually stored in one of four source attributes of the <img> tag:

  • data-srcset
  • data-src
  • data-fallback-src
  • src

Each of these four attributes of every <img> tag in the imgall list is checked for a source link. If the link is found in the data-srcset attribute, there is no need to check data-src, and so on. This creates a priority order that selects the cleanest and most direct source link available. If none of the attributes is present, the tag is skipped. The source link is appended to the images list, which was created in the first code snippet.
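The same priority order can also be expressed as a flat loop over the candidate attributes. This is not the article’s code, just an equivalent sketch of the lookup logic, operating on a plain dictionary of attributes:

```python
# Candidate attributes in priority order, mirroring the nested try-except
CANDIDATES = ('data-srcset', 'data-src', 'data-fallback-src', 'src')

def find_src(attrs):
    """Return the highest-priority source attribute present, or None."""
    for key in CANDIDATES:
        if key in attrs:
            return attrs[key]
    return None  # no source attribute: the caller should skip this tag

print(find_src({'src': 'a.jpg', 'data-src': 'b.jpg'}))  # b.jpg (data-src wins)
print(find_src({'alt': 'decorative'}))                  # None (tag skipped)
```

Both forms do the same thing; the loop form is simply easier to extend if a site uses additional lazy-loading attributes.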

Also Read: Python Exception Handling – Try, Except, Finally

The image itself can be downloaded as binary data using the requests module. The content attribute of the response to requests.get(), called with an image URL, provides the binary contents of the image. These contents can be written to a file to produce the downloaded image, as shown in the code snippet below:

imgsdownloaded = 0
imgsnotdownloaded = 0
for image in images:
    if '.svg' in image:
        imgsnotdownloaded += 1  # .svg images are skipped
    else:
        r = requests.get(image).content
        filename = 'images/image' + str(imgsdownloaded) + '.jpg'
        with open(filename, 'wb+') as f:
            f.write(r)
        imgsdownloaded += 1
print(f'{imgsdownloaded} Images Downloaded')
print(f'{imgsnotdownloaded} Images Failed to Download')

The imgsdownloaded variable also serves to name the downloaded images. The code takes every link in the images list, fetches its binary content, and writes it to a file. The files are stored in the newly created images folder, each named sequentially, so the first image is image0.jpg, the second is image1.jpg, and so on. Note that the images are not actually converted: every file is simply saved with a .jpg extension, whatever its original format. The one format that is deliberately skipped is .svg, since it is an XML-based vector format rather than binary pixel data.

The binary content of the image is written using the ‘wb+’ mode in Python file handling, which opens the file for binary writing (and reading). At the end, the program prints the number of images downloaded and the number that were skipped. The following output can be seen after running the code:
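The write-then-read behaviour of the ‘wb+’ mode can be demonstrated with an in-memory byte string instead of a downloaded image. The file name and the sample bytes below are invented for the example:

```python
import os
import tempfile

# The 8-byte PNG file signature, used here only as sample binary data
data = b'\x89PNG\r\n\x1a\n'
path = os.path.join(tempfile.gettempdir(), 'demo_image.bin')  # hypothetical file

with open(path, 'wb+') as f:
    f.write(data)      # write binary content, just like f.write(r) above
    f.seek(0)          # 'wb+' also allows reading back after a seek
    roundtrip = f.read()

print(roundtrip == data)  # True: the bytes on disk match what was written
os.remove(path)           # clean up the demo file
```

Plain ‘wb’ would also work for the download script, since it only ever writes; ‘wb+’ additionally permits reading the file back within the same open() block.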

Output In Coding Terminal

The column on the left shows the freshly generated images in the images folder. The terminal shows the code’s output: 54 images downloaded successfully, while two failed.

Output In Computer Directory

Two of the extracted images are attached below so that readers can see examples of the output; both can also be seen on the original website.

Output Image Of Code
Another Output Image Of Code

Complete Code For Downloading All Images From a Website

The entire code is given below for reference:

from bs4 import BeautifulSoup
import requests
import os

os.mkdir('images')  # folder where the downloaded images will be saved
images = []
url = 'https://yoast.com/using-images-in-your-blog-post/'
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36', 'Referer': 'https://www.google.com/', 'Sec-Fetch-Site': 'same-origin', 'Sec-Fetch-Mode': 'navigate', 'Sec-Fetch-User': '?1', 'Sec-Fetch-Dest': 'document', 'Accept-Encoding': 'gzip, deflate', 'Accept-Language': 'en-GB,en-US;q=0.9,en;q=0.8'}
cont = requests.get(url, headers=headers).content
soup = BeautifulSoup(cont, 'html.parser')
imgall = soup.find_all('img')
for img in imgall:
    try:
        imgsrc = img['data-srcset']
    except KeyError:
        try:
            imgsrc = img['data-src']
        except KeyError:
            try:
                imgsrc = img['data-fallback-src']
            except KeyError:
                try:
                    imgsrc = img['src']
                except KeyError:
                    continue  # no source attribute found, skip this tag
    images.append(imgsrc)
imgsdownloaded = 0
imgsnotdownloaded = 0
for image in images:
    if '.svg' in image:
        imgsnotdownloaded += 1  # .svg images are skipped
    else:
        r = requests.get(image).content
        filename = 'images/image' + str(imgsdownloaded) + '.jpg'
        with open(filename, 'wb+') as f:
            f.write(r)
        imgsdownloaded += 1
print(f'{imgsdownloaded} Images Downloaded')
print(f'{imgsnotdownloaded} Images Failed to Download')

Conclusion

BeautifulSoup and requests are among the strongest weapons in a web scraper’s arsenal. With web scraping, downloading all the images from a website becomes an easy task: a tedious chore is reduced to a few lines of code. It is far easier to download all the images from a webpage using Python than to do it manually.