How to read HTML from a URL in Python 3?

Python is one of the most widely used programming languages today. It is popular not only because it is beginner-friendly, with simple syntax that is relatively easy to read and understand, but also because it can be applied to many different kinds of problems.

It comes with a number of packages and libraries that handle everything from simple problem-solving to working with datasets, machine learning, and beyond.

Methods to Read HTML from a URL in Python

In this post, we look at how to use Python 3 to read the HTML code of a web page whose URL is provided. We will walk through two approaches to this problem, each using a different library.

Approach 1: Using the urllib package in Python 3

The urllib package consists of a number of modules for working with URLs. Here we use urllib.request, which provides classes and functions for opening and reading URLs, mostly over HTTP.

Make sure to import the urllib.request module before starting. With this module, the HTML can be read in a single line of code: the URL is passed as a string to urllib.request.urlopen(), which opens it, and the .read() method of the returned response reads the HTML.

You can learn more about the Python urllib package.

# Step 1: Import the package
import urllib.request

# Step 2: Assign the URL to a variable
url = "http://www.python.org"

# Step 3: Open the URL, then read the HTML
text = urllib.request.urlopen(url).read()

# Step 4: Print the HTML code
print(text)

# Print the data type of the output (bytes)
print(type(text))

OUTPUT

Figure: Using urllib, Example 1

Observe that the output is the HTML code of the provided URL. Notice, however, that line breaks appear as the literal escape sequence “\n” rather than as actual new lines, and that the data type of the output is “bytes”. This is not the most readable way of viewing the HTML code. To make the output more readable, with line breaks and indentation displayed as they appear in the page source, we can convert the “bytes” output returned by .read() into a string by decoding it.
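The difference between the two types can be seen without any network access. The following is a minimal sketch that uses a small hand-written bytes literal (an assumption standing in for a downloaded page) to show that printing bytes displays escape sequences, while printing the decoded string displays real line breaks.

```python
# A tiny stand-in for a downloaded page (hypothetical content, not a real response)
raw = b"<html>\n  <body>Hello</body>\n</html>"

# Printing a bytes object shows its repr, so each newline appears as the characters \n
print(raw)

# Decoding turns the bytes into a str; UTF-8 is assumed here
text = raw.decode("utf-8")

# Printing a str renders the newline characters as actual line breaks
print(text)

# Show the two data types side by side
print(type(raw), type(text))
```

The same decode step, applied to the page fetched with urllib.request, looks like this: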

# open the URL; this returns a response object
text = urllib.request.urlopen("http://www.python.org")
text_bytes = text.read()

# decode the bytes object into a string
text_str = text_bytes.decode("utf8")

# printing the HTML code as string datatype
print(text_str)

OUTPUT

Figure: Using urllib, Example 2

To learn more about urllib.request, its use cases, syntax, parameters, etc., see the official Python documentation for the module.
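The example above hard-codes UTF-8, which may not match every page. As a hedged sketch, the encoding can instead be taken from the Content-Type header of the response when the server declares one. The function names decode_body and read_html, the 10-second timeout, and the UTF-8 fallback are assumptions made for illustration and are not part of the original example.

```python
import urllib.request
from email.message import Message


def decode_body(raw: bytes, headers: Message, default: str = "utf-8") -> str:
    """Decode raw bytes using the charset declared in the headers, if any."""
    # get_content_charset() returns None when no charset is declared,
    # in which case we fall back to the assumed default
    charset = headers.get_content_charset() or default
    # errors="replace" substitutes undecodable bytes instead of raising
    return raw.decode(charset, errors="replace")


def read_html(url: str, timeout: float = 10.0) -> str:
    """Open the URL, read the bytes, and decode them to a string."""
    # The with-statement closes the connection once the body has been read
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return decode_body(response.read(), response.headers)
```

With these helpers, calling read_html("http://www.python.org") would return the page as a string. Splitting the decoding into its own function keeps that step usable on bytes obtained from any source.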

Approach 2: Using the requests package in Python 3

Requests is an HTTP library for the Python programming language. The goal of the package is to make HTTP requests simpler and more accessible. To read the HTML for the provided URL, we first send a request using the requests.get() function of this module. The returned object is of type ‘requests.models.Response’, so we obtain the HTML as a string through its .text attribute.

Make sure to install the requests package before proceeding further with implementation.

# To install the package (the leading "!" is notebook syntax, e.g. Jupyter;
# in a terminal, run: pip install requests)
!pip install requests

import requests

url = 'http://www.python.org'

# send a GET request; this returns a Response object
x = requests.get(url)

# get the response body as a string
text = x.text

print(text)

OUTPUT

To read more about the requests package, see the official Requests documentation.
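A slightly more defensive variant of the requests example is sketched below. It assumes you want the call to fail loudly on HTTP error status codes and to give up after a fixed time. The helper names html_from_response and fetch_html and the 10-second timeout are hypothetical choices for illustration. The requests features used (the timeout argument, raise_for_status(), and the .text attribute) are part of the library.

```python
import requests


def html_from_response(response: requests.Response) -> str:
    """Return the decoded body, raising requests.HTTPError on 4xx/5xx responses."""
    response.raise_for_status()
    # .text decodes the body using the encoding stored in response.encoding
    return response.text


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Send a GET request and return the HTML as a string."""
    # Without a timeout, requests may wait indefinitely for a slow server
    response = requests.get(url, timeout=timeout)
    return html_from_response(response)
```

Calling fetch_html("http://www.python.org") would then return the same kind of string as x.text in the example above, but a 404 or 500 response raises an exception instead of silently returning an error page.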

Conclusion

Python has a wide range of applications and can be used for opening and reading files of many forms. In this article, we studied two standard ways of reading the HTML code of a web page whose URL is provided: the urllib package and the requests package in Python 3.
