Pandas DataFrames are a thing of beauty. DataFrames in Python make handling data very user friendly.
You can import large datasets using Pandas and then manipulate them effectively. You can easily import CSV data into a Pandas DataFrame.
But what are DataFrames in Python, and how do you use them?
A DataFrame is a 2-dimensional labeled data structure with columns that can be of different types.
You can use DataFrames for various kinds of analysis.
Often the dataset is too big to look at all at once. Instead, we want to see a summary of the DataFrame.
A summary can include the first five rows of the dataset and a quick statistical summary of the data. We can also get information about the types of the columns in the dataset.
In this tutorial we will learn how to display such summaries for a DataFrame in Python.
We will be using the California Housing dataset as the sample dataset for this tutorial.
1. Import the Dataset in a Pandas DataFrame
Let’s start by importing the dataset into a Pandas DataFrame, using the following lines:
import pandas as pd
housing = pd.read_csv('path_to_dataset')
This will store the dataset as a DataFrame in the variable ‘housing’.
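Since 'path_to_dataset' is only a placeholder, here is a minimal sketch that reads a tiny CSV from an in-memory string instead of a file. The column names and values are made up for illustration and are not the real California Housing data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file; the values are invented for illustration.
csv_text = """longitude,latitude,median_income,ocean_proximity
-122.23,37.88,8.3252,NEAR BAY
-122.22,37.86,8.3014,NEAR BAY
-118.30,34.05,,INLAND
"""

# pd.read_csv accepts a file path or any file-like object.
housing = pd.read_csv(io.StringIO(csv_text))

print(type(housing))   # the result is a pandas DataFrame
print(housing.shape)   # (number of rows, number of columns)
```

The empty field in the last row is read as a missing value (NaN), which becomes relevant when counting entries later on.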
Now we can look at the different types of data summaries that are available to us in Pandas.
2. Get the first 5 rows
After importing a dataset for the first time, it is common for data scientists to have a look at the first five rows of the DataFrame. It gives a rough idea of what the data looks like.
To output the first five rows of the DataFrame, call the .head() method on it: housing.head()
Running this line displays the first five rows of the dataset together with the column labels.
The complete code for displaying the first five rows of the DataFrame is given below.
import pandas as pd
housing = pd.read_csv('path_to_dataset')
housing.head()
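The .head() method also accepts an optional row count, and .tail() is its counterpart for the end of the DataFrame. The sketch below uses a small made-up DataFrame (the column names and values are assumptions for illustration) so it runs without the dataset file.

```python
import pandas as pd

# Made-up toy data standing in for the housing dataset.
housing = pd.DataFrame({
    "housing_median_age": [41, 21, 52, 52, 52, 52, 52, 42],
    "median_income": [8.33, 8.30, 7.26, 5.64, 3.85, 4.04, 3.66, 3.12],
})

first_five = housing.head()     # with no argument, head() returns 5 rows
first_three = housing.head(3)   # pass a number to get a different row count
last_two = housing.tail(2)      # tail() returns rows from the end instead

print(first_five)
```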
3. Get a statistical summary
To get a statistical summary of your DataFrame you can use the .describe() method provided by Pandas.
The line of code to display the statistical summary is: housing.describe()
Running this line of code returns a table with one column of statistics for each numeric column of the dataset.
The complete code is as follows:
import pandas as pd
housing = pd.read_csv('path_to_dataset')
housing.describe()
The output displays the count, mean, standard deviation, minimum, maximum, and the 25th, 50th, and 75th percentiles of each numeric column. You can use the same code for all the examples below, replacing only the method name as mentioned for each example.
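As a rough illustration of what .describe() returns, the sketch below runs it on a small made-up DataFrame (column names and values are assumptions, not the real dataset). The result is itself a DataFrame, so individual statistics can be looked up by row and column label.

```python
import pandas as pd

# Made-up toy data: two numeric columns and one text column.
housing = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274, 1627],
    "median_income": [8.0, 8.0, 7.0, 5.0, 4.0],
    "ocean_proximity": ["NEAR BAY"] * 5,
})

# By default, describe() summarizes only the numeric columns.
summary = housing.describe()
print(summary)

# Look up a single statistic by row label and column label.
mean_income = summary.loc["mean", "median_income"]
print(mean_income)
```

Note that the text column is left out of the default summary, which is why only two columns appear in the output.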
4. Get a quick description of the data
To get a quick description of the type of data in the table you can use the .info() method provided by Pandas.
You can use the following line of code to get the description: housing.info()
The output is a short text report printed for the whole DataFrame.
The output contains a row for each column of the dataset. For each column label you get the count of non-null entries and the data type of the column.
Knowing the data type of the columns in your dataset allows you to make better judgements when it comes to using the data to train models.
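One detail worth knowing is that .info() prints its report rather than returning it. The sketch below, on a made-up DataFrame with assumed column names, shows the plain call and also captures the report as a string through the buf parameter. The dtypes attribute is an alternative when only the column types are needed.

```python
import io

import pandas as pd

# Made-up toy data with one missing value in the numeric column.
housing = pd.DataFrame({
    "total_bedrooms": [129.0, None, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "INLAND", "INLAND"],
})

# info() prints the report directly and returns None.
housing.info()

# To keep the report as text, pass a writable buffer via `buf`.
buffer = io.StringIO()
housing.info(buf=buffer)
report = buffer.getvalue()

# dtypes gives just the data type of each column as a Series.
print(housing.dtypes)
```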
5. Get count for each column
You can directly get the count of non-null entries in each column using the .count() method in Pandas.
You can use this method as shown in the following line of code: housing.count()
The output lists each column label next to its number of non-null entries.
Displaying the count for each column can tell you about any missing entries in your data. Subsequently, you can plan your data cleaning strategy.
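To make the link between counts and missing entries concrete, here is a minimal sketch on made-up data (the column names and values are assumptions). Subtracting the per-column count from the total number of rows gives the number of missing entries in each column.

```python
import pandas as pd

# Made-up toy data: total_bedrooms has two missing values.
housing = pd.DataFrame({
    "total_rooms": [880, 7099, 1467, 1274],
    "total_bedrooms": [129.0, None, 190.0, None],
})

counts = housing.count()          # non-null entries per column
missing = len(housing) - counts   # total rows minus non-null entries

print(counts)
print(missing)
```

A column whose count is lower than the number of rows is a candidate for cleaning, for example by dropping or filling the missing rows.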
6. Get a histogram for each column in your dataset
Pandas allows you to display a histogram for every numeric column in just one line of code.
To display the histograms use the following line of code: housing.hist()
Running this line draws one histogram per numeric column, using Matplotlib for the plotting.
Data scientists often use histograms to form a better understanding of the data.
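The sketch below shows one way to produce the histograms from a plain script, assuming Matplotlib is installed. The data, the bin count, the figure size, and the output file name are all made up for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is needed

import pandas as pd

# Made-up toy data: two numeric columns and one text column.
housing = pd.DataFrame({
    "housing_median_age": [41, 21, 52, 52, 52, 52, 52, 42],
    "median_income": [8.33, 8.30, 7.26, 5.64, 3.85, 4.04, 3.66, 3.12],
    "ocean_proximity": ["NEAR BAY"] * 8,
})

# hist() draws one subplot per numeric column and returns an array of Axes.
axes = housing.hist(bins=5, figsize=(8, 4))

# Save the figure to a file (hypothetical file name).
figure = axes.flatten()[0].get_figure()
figure.savefig("housing_histograms.png")
```

In a notebook the plots appear inline, so the backend line and the savefig call are not needed there.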
This tutorial was about the different types of quick summaries that you can get for a DataFrame in Python. Hope you had fun learning with us!