How to read HTML tables using Python?

Pandas Read Html Cover Image

In today’s world, data is produced in abundance every day. The internet is a great source of getting data relevant to our needs. In order to make this easier, we can use web scraping which is the process of collecting data from the web.

Web scraping is a popular technique for obtaining data from sites that don’t provide direct API. Moreover, Python has made it easier to perform web scraping with the help of modules like BeautifulSoup. 

Pandas is a popular library of Python used for handling data. The read_html() function helps you to read HTML tables on web pages in the form of a list of DataFrame objects. That is, if a web page has multiple pages, they will be read as a list of DataFrame objects. 

In this article, you will learn how to read HTML tables from a string, a URL, a file, and typecasting tables using the Pandas read_html() function.


Prerequisites for using read_html()

You need to have Python and Pandas installed on your computer and your favorite IDE set up to start coding. After this, install the lxml module before using the read_html() function. It is an external module and is not included in the Python distribution. You can install it using the command:

pip install lxml

Extracting tables from strings of HTML using Python

In the following code, a variable named html_table contains a string representing an HTML table.

html_table = """
<table>
  <thead>
    <tr>
      <th>ID</th>
      <th>Name</th>
      <th>Designation</th>
      <th>Salary</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>5</td>
      <td>Jenny</td>
      <td>Branch Manager</td>
      <td>50000</td>
    </tr>
    <tr>
      <td>1</td>
      <td>Alice</td>
      <td>Salesperson</td>
      <td>38000</td>
    </tr>
    <tr>
      <td>4</td>
      <td>Thomas</td>
      <td>CEO</td>
      <td>85000</td>
    </tr>
    <tr>
      <td>8</td>
      <td>Bob</td>
      <td>Software Developer</td>
      <td>67000</td>
    </tr>
  </tbody>
</table>
"""

You can read this table using Pandas in the following way:

import pandas as pd
df = pd.read_html(html_table)
df
Read Html String Example

Notice that the output is in the form of a list containing only one table here. It can be viewed as a data frame by accessing the specific index in the list.

Read Html String Df

You can check the data type of each column as:

df[0].dtypes
ID              int64
Name           object
Designation    object
Salary          int64
dtype: object

Extracting table from a URL

The read_html() function accepts URLs and reads HTML tables on that web page.

import pandas as pd
url = 'https://www.infoplease.com/geography/largest-countries-world-area'
df = pd.read_html(url)
df[0]
Read Html Url Df

You can also check how many tables are present on the web page by computing the length of the list.

len(df)
1

Here, the above URL contained only one table.

Note that, the content on the web page might change over time and you may get a different result if your run the same code.


Extracting tables from files

The read_html() function also accepts files and reads the tables in them.

In the following code, the file ‘table.txt’ contains the HTML code for two tables like this:

Tables In File
Tables In File
import pandas as pd
file_path = 'table.txt'
with open(file_path, 'r') as f:
    df = pd.read_html(f.read())
df

Output:

[   ID    Name         Designation  Salary
 0   5   Jenny      Branch Manager   50000
 1   1   Alice         Salesperson   38000
 2   4  Thomas                 CEO   85000
 3   8     Bob  Software Developer   67000,
    Roll No.  Name  Marks
 0        10  John     23
 1         5  Mary     40
 2        18   Max     44]

The output list has two tables. You can view each one of them by using indexing.

Read Html File Df 1
Read Html File Df 2

Typecasting table columns with converters

Sometimes, you might want the data types of some columns from the table to be of a specific type. In such cases, you can typecast them using the read_html() function. Recall that in the above example, the data type of ‘Salary’ was ‘int64’.

import pandas as pd
df  = pd.read_html(html_table, converters={
    'Salary': float
})
df
Read Html Typecast Df

If you check the data types now, you will see the data type of ‘Salary’ to be float.

Read Html Typecast Datatype

Summary

The Pandas read_html() function is a quick and efficient way to read HTML tables from various sources. It is very helpful while performing web scraping. One of the limitations of read_html() is that it cannot read tables that are loaded with JavaScript. Also, some websites do not allow the scraping of data, so in that case this function is of no use.


Reference