Scrape ArXiv Latest Papers using Python

Hello, fellow researcher! You probably know how tedious it is to search for and collect arXiv papers by hand. Guess what? We can automate the task with Python and get the relevant papers quickly and easily.

In this tutorial, we will write a short Python script that fetches the research papers we need in minutes, using just a few lines of code. What are we waiting for? Let’s get started!

Code to Scrape ArXiv Latest Papers

The very first step in any program is to install and import all the necessary modules/libraries into our program.

To scrape arXiv research papers, we need to install the arxiv Python library, a wrapper around the arXiv API. We can do that with the pip command below.

pip install arxiv

Next, let’s import the two modules the program needs: pandas and arxiv. We use pandas to store the final dataset as a dataframe. We also ask the user for the topic to search for, using the built-in input function.

import pandas as pd
import arxiv

topic = input("Enter the topic you need to search for : ")

Now that the necessary libraries are installed and imported and we have the topic to research, we can use arxiv.Search to request the papers along with all their details.

search = arxiv.Search(
  query = topic,
  max_results = 300,
  sort_by = arxiv.SortCriterion.SubmittedDate,
  sort_order = arxiv.SortOrder.Descending
)

The function will take a number of parameters. Let us understand the ones we have used in the code above.

query is the topic to search for.

max_results is the maximum number of results to fetch. The arXiv API itself defaults to 10 results and allows at most 30,000 per query; the default used by the arxiv library can differ between versions, so it is safest to set it explicitly.

sort_by is the criterion used to sort the output (SubmittedDate, LastUpdatedDate, or Relevance).

sort_order is the order in which the sorted papers are returned (Ascending or Descending).
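The query string is not limited to a bare topic. The arXiv API also understands field prefixes such as ti: (title), abs: (abstract), au: (author), cat: (category), and all: (every field), which can be combined with AND, OR, and ANDNOT. The sketch below shows one way to build such a string; build_query is a hypothetical helper written for this tutorial, not part of the arxiv library.

```python
def build_query(topic, field="all", category=None):
    """Build an arXiv API query string.

    Hypothetical helper for illustration; not part of the arxiv library.
    """
    topic = topic.strip()
    # Quote multi-word topics so they are matched as a phrase.
    term = f'"{topic}"' if " " in topic else topic
    query = f"{field}:{term}"
    if category:
        # Restrict the search to one arXiv category, e.g. cs.LG.
        query += f" AND cat:{category}"
    return query


print(build_query("graph neural networks", field="ti", category="cs.LG"))
```

The returned string can be passed as the query argument of arxiv.Search in place of the raw topic.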

On its own, this code does not display any papers or information. For that we need a loop. We go through all the papers returned (at most 300) and save some information about each one in a list, which is later transferred to a dataframe using the pandas library.

We can gather the following information about a paper: its id, title, summary, authors, URL, and categories. The code below stores the title, publication date, id, summary, and PDF URL.
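If the authors and categories are needed as well, each result can be flattened into a dictionary first. The sketch below assumes result objects expose authors (objects with a name attribute) and categories (a list of strings), as arxiv.Result does; result_to_row is a hypothetical helper, and the sample object holds placeholder values only so the example runs without a network call.

```python
from types import SimpleNamespace


def result_to_row(result):
    """Flatten one result object into a dict of plain values (hypothetical helper)."""
    return {
        "Title": result.title,
        "Date": result.published,
        "Id": result.entry_id,
        "Summary": result.summary,
        "URL": result.pdf_url,
        # Join the author names into one comma-separated string.
        "Authors": ", ".join(author.name for author in result.authors),
        # Categories are already strings, e.g. "cs.LG".
        "Categories": ", ".join(result.categories),
    }


# Stand-in for an arxiv.Result; every value here is a placeholder, not real data.
sample = SimpleNamespace(
    title="Example title",
    published="2000-01-01",
    entry_id="http://arxiv.org/abs/0000.00000v1",
    summary="Example summary.",
    pdf_url="http://arxiv.org/pdf/0000.00000v1",
    authors=[SimpleNamespace(name="A. Author"), SimpleNamespace(name="B. Author")],
    categories=["cs.LG", "stat.ML"],
)

row = result_to_row(sample)
print(row["Authors"])
print(row["Categories"])
```

A list of such dictionaries can be passed directly to pd.DataFrame, which takes the column names from the keys.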

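# Run the search through a client (recent releases of arxiv deprecate search.results())
client = arxiv.Client()

all_data = []
for result in client.results(search):
  # One row per paper: title, publication date, id, summary, PDF link
  all_data.append([
    result.title,
    result.published,
    result.entry_id,
    result.summary,
    result.pdf_url,
  ])

column_names = ['Title','Date','Id','Summary','URL']
df = pd.DataFrame(all_data, columns=column_names)

print("Number of papers extracted : ",df.shape[0])
# print() so the preview also shows when run as a script, not only in a notebook
print(df.head())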

After the code snippet is executed, the result is a dataframe holding the data for up to 300 research papers (fewer if the topic has fewer matches).
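Summaries returned by arXiv often contain hard line breaks, which can make the dataframe awkward to read or export. A minimal sketch of one way to tidy them is shown below; clean_text is a hypothetical helper, and the sample string is a placeholder rather than a real abstract.

```python
def clean_text(text):
    """Collapse runs of whitespace, including newlines, into single spaces."""
    return " ".join(text.split())


# Placeholder text that mimics the line breaks found in abstracts.
raw_summary = "We study a placeholder\nproblem and report\n  placeholder findings."
print(clean_text(raw_summary))
```

With the dataframe from this tutorial, the same function could be applied to the whole column with df["Summary"] = df["Summary"].map(clean_text).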

The Complete Code to Scrape ArXiv Latest Papers using Python

Let’s have a look at the complete code for the scraper below.

import pandas as pd
import arxiv

topic = input("Enter the topic you need to search for : ")

search = arxiv.Search(
  query = topic,
  max_results = 300,
  sort_by = arxiv.SortCriterion.SubmittedDate,
  sort_order = arxiv.SortOrder.Descending
)

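# Run the search through a client (recent releases of arxiv deprecate search.results())
client = arxiv.Client()

all_data = []
for result in client.results(search):
  # One row per paper: title, publication date, id, summary, PDF link
  all_data.append([
    result.title,
    result.published,
    result.entry_id,
    result.summary,
    result.pdf_url,
  ])

column_names = ['Title','Date','Id','Summary','URL']
df = pd.DataFrame(all_data, columns=column_names)

print("Number of papers extracted : ",df.shape[0])
# print() so the preview also shows when run as a script, not only in a notebook
print(df.head())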

Let’s have a look at another output for the same scraper we just developed.