What does axis in Pandas mean?

If you have worked with Pandas, you have probably seen axis as an argument to many functions. What is this axis? What does it mean? Why is it so important? Let’s dive deep into Pandas and understand what this axis is and how to use it.

What is Pandas?

Pandas is one of the most popular Python libraries and a staple of data science work. It is typically used to work with CSV files, Excel files, or SQL databases, and it makes datasets and other structured data easy to handle. Pandas stores structured data in a special data structure known as a dataframe. Once a dataset is loaded into a dataframe, the functions available in Pandas make it easy to manipulate. Together with other data science libraries like Matplotlib and NumPy, Pandas makes Python a very strong choice for data science work.

Related: Learn Pandas in depth.

What is a dataframe?

A dataframe is the Pandas data structure used to store structured data. It is a table made up of rows and columns. You can think of it as a Python dictionary of key: list pairs, where the key is the column name and the list holds the values of that column. Once we have a dataframe, we can run many mathematical operations on its columns of values and even plot graphs using libraries like matplotlib.
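To make the dictionary picture concrete, here is a minimal sketch that builds a small dataframe directly from a dictionary. The three rows are sample values chosen for illustration and are not the contents of the example.csv file used later.

```python
import pandas as pd

# Sample values for illustration only; not the article's example.csv.
data = {
    "Programming language": ["C", "Python", "Java"],
    "Foundation year": [1972, 1991, 1995],
}

# Each dictionary key becomes a column name, each list becomes that column's values.
df = pd.DataFrame(data)

print(df.shape)                        # (number of rows, number of columns)
print(list(df.columns))                # column names taken from the keys
print(df["Foundation year"].tolist())  # values of one column
```

Each key turns into a column and each list position turns into a row, which is why all the lists need to have the same length.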

How to create a dataframe?

import pandas as pd
df = pd.read_csv("example.csv")
df
Code And Output To Create A Dataframe Using Pandas

In the above block of code, we first imported the pandas library under the alias pd. Then we used the read_csv function of the pandas library to read the “example.csv” file and stored the result in the df variable. The read_csv function returns a dataframe built from the CSV file. If you have an Excel file instead, you can use the read_excel function, which works similarly and creates a dataframe from an Excel file. Finally, when we evaluate the df variable on the last line of a notebook cell, the dataframe is displayed as output.
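If you do not have a CSV file at hand, the same call can be tried on in-memory text. The sketch below is only a stand-in for example.csv: the rows and the “Designed by” column are assumptions made up for illustration, and io.StringIO is used so that no file is needed.

```python
import io

import pandas as pd

# Stand-in for "example.csv"; the rows and the "Designed by" column
# are assumptions for illustration, not the article's actual file.
csv_text = """Programming language,Designed by,Foundation year
C,Dennis Ritchie,1972
Python,Guido van Rossum,1991
Java,James Gosling,1995
"""

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(type(df).__name__)  # the result is a DataFrame
print(df.dtypes)          # text columns and one integer column
```

The first line of the text is used as the header, so the column names come straight from it.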

Hooray! We just created a dataframe. Wasn’t that easy?

Dataframe

Okay, so now that we have created a dataframe, let’s try some Pandas functions on it. Let’s start by finding the average of the foundation years.

Pandas’ Mean function

df.mean(axis=0,numeric_only=True)
Code And Output For Pandas Mean Function

When we call the mean function on the dataframe, it returns a Series that holds the mean of each numeric column, followed by the datatype of the result. We got the mean only for the Foundation year column, which is 1985.7, because we set the numeric_only parameter to True. The rest of the columns hold string values, so no mean is computed for them. We also passed a value for another argument, the “axis”, and set it to 0.
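Here is a small sketch of the same call on a made-up dataframe, to show what the returned object looks like. The rows are sample values for illustration, so the mean differs from the one in the article’s output.

```python
import pandas as pd

# Sample values for illustration only; not the article's example.csv.
df = pd.DataFrame({
    "Programming language": ["C", "Python", "Java"],
    "Foundation year": [1972, 1991, 1995],
})

# numeric_only=True skips the string column instead of trying to average it.
col_means = df.mean(axis=0, numeric_only=True)

print(type(col_means).__name__)  # Series
print(col_means.index.tolist())  # ['Foundation year']
print(col_means.dtype)           # float64
```

The result is indexed by column name, so a single mean can be read with col_means["Foundation year"].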

Okay so let’s move to the main question, what is this axis?

What exactly is the axis?

The axis specifies the dimension of the dataframe along which we want to apply the function. A dataframe is simply a table, so it has 2 dimensions: rows and columns. That means the axis parameter specifies whether we want one aggregated result per column or one per row.

An axis value of 0 means the function runs down the rows of each column, so we get one result per column. An axis value of 1 means the function runs across the columns of each row, so we get one result per row. As we specified the axis value as 0, we got one mean for each numeric column. If we set its value to 1, we will get one mean for each row.
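The difference is easier to see on a table with more than one numeric column. The sketch below uses a toy table whose column names, row labels, and values are all invented for illustration.

```python
import pandas as pd

# Toy table; names and values are invented for illustration.
scores = pd.DataFrame(
    {"a": [1, 2, 3], "b": [10, 20, 30]},
    index=["r0", "r1", "r2"],
)

# axis=0: go down the rows, producing one value per column.
down_rows = scores.mean(axis=0)

# axis=1: go across the columns, producing one value per row.
across_cols = scores.mean(axis=1)

print(down_rows)    # indexed by column name: a, b
print(across_cols)  # indexed by row label: r0, r1, r2
```

A handy check is the index of the result: with axis=0 it holds the column names, and with axis=1 it holds the row labels.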

Let’s set the axis as 1 and try to find the mean again.

Finding the mean of the rows of the dataframe

df.mean(axis=1,numeric_only=True)
Code And Output For Finding Mean Of The Rows Of The Dataframe

When we pass 1 as the axis value, we calculate the mean of every row. The output is a set of row_index: mean pairs, followed by the datatype, which is float64. The mean of the row at index 0 is 1972.0 because the only numeric value in that row is the foundation year, 1972. So it is just 1972/1, which gives the floating-point value 1972.0. The same goes for all the other rows.
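The same effect can be reproduced on a small made-up dataframe with a single numeric column. The rows below are sample values for illustration, not the article’s file.

```python
import pandas as pd

# Sample values for illustration only; not the article's example.csv.
df = pd.DataFrame({
    "Programming language": ["C", "Python", "Java"],
    "Foundation year": [1972, 1991, 1995],
})

# With one numeric column, each row mean is that single value as a float.
row_means = df.mean(axis=1, numeric_only=True)

print(row_means.index.tolist())  # row labels 0, 1, 2
print(row_means.tolist())        # the foundation years as floats
```

With more numeric columns, each row mean would average all numeric values found in that row.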

Also, the axis parameter of mean has a default value of 0, so we can leave it out. In that case, the mean is calculated for every column.
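A quick way to convince yourself of the default is to compare the results. The sketch below uses made-up sample rows and also shows the string alias "index", which Pandas accepts in place of 0.

```python
import pandas as pd

# Sample values for illustration only; not the article's example.csv.
df = pd.DataFrame({
    "Programming language": ["C", "Python", "Java"],
    "Foundation year": [1972, 1991, 1995],
})

default_mean = df.mean(numeric_only=True)               # axis left out
explicit_mean = df.mean(axis=0, numeric_only=True)      # axis given as 0
named_mean = df.mean(axis="index", numeric_only=True)   # string alias for 0

print(default_mean.equals(explicit_mean))
print(default_mean.equals(named_mean))
```

Likewise, axis="columns" can be written instead of axis=1 if you find the names easier to read than the numbers.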

The axis parameter is available in most aggregate functions of Pandas, such as sum, min, max, and median, and it means the same thing in each of them. Once you understand this axis parameter, you can move on to other similar parameters and work your way through the rest of the Pandas library. At first, they all look hard to understand, so we keep pushing them away. But once we start to understand them, they turn out to be very simple.
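As a rough illustration that the idea carries over to other aggregate functions, the sketch below applies sum, max, and min to a toy table. The column names and values are invented for illustration.

```python
import pandas as pd

# Toy table; names and values are invented for illustration.
table = pd.DataFrame({"x": [4, 8], "y": [6, 2]})

col_sums = table.sum(axis=0)  # one sum per column
row_sums = table.sum(axis=1)  # one sum per row
col_max = table.max(axis=0)   # largest value in each column
row_min = table.min(axis=1)   # smallest value in each row

print(col_sums.tolist())
print(row_sums.tolist())
print(col_max.tolist())
print(row_min.tolist())
```

In every call, axis=0 gives a result indexed by column name and axis=1 gives a result indexed by row label, exactly as with mean.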

Once you get the hang of it, you can start with graph plotting libraries like Matplotlib, seaborn, or even Plotly, which is a high-level graph plotting library. Let’s get a small overview of matplotlib and see how axes come into play when plotting graphs.

What is Matplotlib?

Matplotlib is another important library for data science in Python. It is used to create good-quality static, animated, and interactive visualizations. It is generally used with other data science libraries like Pandas and NumPy. It can export visuals in many formats, such as PNG, SVG, and PDF, or render them directly in Jupyter and other notebooks.

Related: Learn Matplotlib in depth.

Enough with theory, let’s try to make a bar graph based on the dataframe we created before.

Plotting a bar graph using Matplotlib.pyplot

import matplotlib.pyplot as plt

# Setting the axes
x = df['Programming language']
y = df['Foundation year']

# Create a bar chart
plt.bar(x, y)

# Add labels and title
plt.xlabel('Programming language')
plt.ylabel('Year of invention')
plt.title('Programming language-Year of invention graph')

# Show the chart
plt.show()

In the above code, we first imported the pyplot module from the matplotlib library as plt, which is used to plot graphs. Then we set the Programming language column as the x values and the Foundation year column as the y values. After that, we simply called the bar function of pyplot to create the bar graph. We labelled the x-axis as Programming language and the y-axis as Year of invention, and gave the graph the title Programming language-Year of invention graph. Finally, we called the show function to display the graph in the output.

Code For Plotting A Graph Using A Dataframe

Let’s see what the graph looks like.

Graph Of Invention Year Of Some Popular Programming Languages

It is a classic bar graph. The axes are as we specified, and the bars give an idea of the year each programming language was invented. Here, we plotted the graph from the columns of the dataframe. In a similar way, we can also plot a graph from its rows.
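One possible reading of plotting from rows is to pick a single row and draw its values across the columns. The sketch below does that on a toy table. The table name, the year columns, and the numbers are all invented for illustration, and the Agg backend is selected only so that the code runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; remove this line to view the chart
import matplotlib.pyplot as plt
import pandas as pd

# Toy table; the year columns and values are invented for illustration.
survey = pd.DataFrame(
    {"2021": [2, 3], "2022": [4, 5], "2023": [6, 8]},
    index=["C", "Python"],
)

# Select one row: its index now holds the column names of the table.
row = survey.loc["Python"]

fig, ax = plt.subplots()
bars = ax.bar(list(row.index), row.values)  # one bar per column of that row
ax.set_xlabel("Year")
ax.set_ylabel("Value")
ax.set_title("Values of the Python row")

plt.close(fig)  # in an interactive session, call plt.show() instead
```

Here the columns of the table end up on the x-axis, whereas in the earlier graph each bar came from a different row of the dataframe.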

Conclusion

Pandas, matplotlib, and the other libraries mentioned here are very important for data science. It’s essential to know these libraries properly if you want to become a data scientist or are working on a data science project. Small details like the axis parameter may confuse you at some point, but have faith in yourself and keep learning. If you keep trying, it will all become clear in no time.
