Python is one of the most widely used programming languages today. It is popular not only because it is user-friendly, with a simple syntax that is relatively easy to read and understand, but also because it can address many different kinds of problems.
It comes with a large number of packages and libraries that can handle anything from simple problem-solving to working with datasets, machine learning, and beyond.
Methods to Read HTML from a URL in Python
In this blog post, we look at how to use Python 3 to read the HTML code of a web page whose URL is provided. We will go through two different approaches to this problem, each based on a different library.
Approach 1: Using the urllib package in Python3
The urllib package consists of a number of modules for working with URLs. Today, we're going to use the urllib.request module, which includes classes and functions that help with opening and reading URLs, mostly HTTP.
Make sure to import the urllib.request module before starting the implementation. With this module, the HTML can be read in a single line of code: the URL is passed as a string to urllib.request.urlopen(), which opens it, and the .read() method of the returned response reads the HTML.
You can learn more about the Python urllib package.
    # Step 1: Import the module
    import urllib.request

    # Step 2: Assign the URL to a variable
    url = "http://www.python.org"

    # Step 3: Open the URL, then read the HTML
    text = urllib.request.urlopen(url).read()

    # Step 4: Print the HTML code
    print(text)

    # Print the data type of the output
    print(type(text))
Observe that the output is the HTML code of the provided URL. Notice, however, that line breaks appear as the escape sequence "\n" instead of as actual new lines, because the data type of the output is "bytes". This is not the most convenient way of viewing the HTML code. To make the output more readable, with line breaks and indentation displayed where they belong, we can convert the "bytes" output returned by urllib.request into the string data type by decoding it.
    import urllib.request

    # Open the URL and read the response body
    response = urllib.request.urlopen("http://www.python.org")
    text_bytes = response.read()

    # Convert the bytes object to the string data type
    text_str = text_bytes.decode("utf8")

    # Print the HTML code as a string
    print(text_str)
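The difference between the two outputs can also be seen without any network access. The short sketch below uses a hard-coded bytes value (the sample markup is made up for illustration and is not taken from python.org) and prints it before and after decoding.

```python
# A made-up HTML snippet standing in for the bytes returned by .read()
sample_bytes = b"<html>\n  <body>Hello</body>\n</html>"

# Printing bytes shows the b'...' wrapper and the escape sequence \n as text
print(sample_bytes)

# Decoding yields a str, so each \n is printed as a real line break
sample_str = sample_bytes.decode("utf-8")
print(sample_str)

# Show the two data types side by side
print(type(sample_bytes).__name__, type(sample_str).__name__)
```

The first print puts everything on one line, while the second spreads the markup over three lines, which is the same effect you get when decoding the real page.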
To learn more about urllib.request, its use cases, syntax, parameters, and so on, see the official Python documentation.
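Not every page is encoded as UTF-8, so hard-coding "utf8" may fail on some sites. The following is a minimal sketch of one way to handle that, not the only way: it reads the charset from the response's Content-Type header and only falls back to UTF-8 when none is declared. The helper name read_html and the fallback_encoding parameter are hypothetical, chosen for this example.

```python
import urllib.request


def read_html(url, fallback_encoding="utf-8"):
    """Return the HTML at `url` as a str (read_html is a hypothetical helper)."""
    # The with-statement closes the response once the body has been read
    with urllib.request.urlopen(url) as response:
        raw = response.read()
        # Charset declared in the Content-Type header, or None if absent
        charset = response.headers.get_content_charset()
    # Assumption: fall back to UTF-8 when the server declares no charset
    return raw.decode(charset or fallback_encoding)


# Usage (requires network access):
# print(read_html("http://www.python.org"))
```

Wrapping the call in a with-statement is a design choice rather than a requirement; it just makes sure the connection is released even if decoding fails later.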
Approach 2: Using the requests package in Python3
Requests is an HTTP library for the Python programming language. The package aims to make HTTP requests simpler and more accessible. To read the HTML of the provided URL, we first send a request using the
requests.get() function of this module. The data type of the returned object is 'requests.models.Response', so we convert it into a string data type by reading its .text attribute.
Make sure to install the requests package before proceeding further with the implementation.
    # To install the package, run in a terminal: pip install requests
    # (in a Jupyter notebook: !pip install requests)
    import requests

    url = 'http://www.python.org'

    # Send a GET request
    x = requests.get(url)

    # Get the response body as a string
    text = x.text
    print(text)
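In practice, a request can hang or come back with an error page instead of the HTML you expected. As a hedged sketch of how the example above could be made a little more defensive, the helper below adds a timeout and checks the status code before returning the text. The name fetch_html and the 10-second default are assumptions made for this illustration.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the HTML at `url` as a str (fetch_html is a hypothetical helper)."""
    # Without a timeout, requests.get() can wait indefinitely for a slow server
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of returning an error page
    response.raise_for_status()
    # .text decodes the body using the encoding requests determined for the response
    return response.text


# Usage (requires network access):
# print(fetch_html("http://www.python.org"))
```

Whether to raise on error responses or handle them some other way depends on your program; raise_for_status() is simply the shortest way to avoid treating an error page as valid HTML.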
Python has a wide range of applications and can be used for opening and reading files of many kinds. In this article, we studied two common ways of reading the HTML code of a web page whose URL is provided: the urllib package from the standard library and the third-party requests package in Python 3.
Python Documentation – https://docs.python.org/3/howto/urllib2.html