In this tutorial, we will go over several ways to subset a data frame. If you import data into Python with pandas, you will be working with data frames. A DataFrame is a two-dimensional data structure in which data is arranged in a table of rows and columns.
Subsetting a data frame is the process of selecting a set of desired rows and columns from the data frame.
You can select:
- all rows and limited columns
- all columns and limited rows
- limited rows and limited columns.
Subsetting a data frame is important because it lets you work with only the part of the data you need. This comes in handy when you want to reduce the number of columns (features) or rows in your data frame.
Let’s start with importing a dataset to work on.
Importing the Data to Build the Dataframe
In this tutorial we are using the California Housing dataset.
We will load the data into a data frame using pandas.
import pandas as pd
housing = pd.read_csv("/sample_data/california_housing.csv")
housing.head()
Our CSV file is now stored in the housing variable as a pandas data frame.
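If you don't have the CSV file at hand, you can still follow along. Here is a minimal sketch that reads a small in-memory CSV instead. The rows are made up for illustration, and the column names are an assumption about what the California Housing file contains.

```python
import io

import pandas as pd

# Made-up rows standing in for the real file; the column names are
# assumed, not taken from the actual dataset.
csv_text = """longitude,latitude,population,households,median_income
-120.1,35.2,1200,400,2.5
-121.3,36.4,450,150,3.1
-119.8,34.9,800,300,4.0
-122.0,37.5,300,100,5.2
"""

# read_csv accepts a file path or any file-like object.
housing = pd.read_csv(io.StringIO(csv_text))

print(housing.shape)             # (number of rows, number of columns)
print(housing.columns.tolist())  # column labels
print(housing.head())            # first rows of the data frame
```

Checking the shape and the column labels right after loading is a quick way to confirm that the data was read the way you expected.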
Select a Subset of a Dataframe using the Indexing Operator
The indexing operator is just a fancy name for square brackets. You can select columns, rows, or a combination of both using just the square brackets. Let’s see this in action.
1. Selecting Only Columns
To select a single column using the indexing operator, put the column label inside the square brackets, for example housing['population'].
This selects the column labeled 'population' and returns all of its row values.
You can also select multiple columns by passing a list of labels to the indexing operator.
housing[['population', 'households']]
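One detail worth knowing is that single brackets and double brackets return different types. The sketch below uses a small made-up data frame (the values are hypothetical, not from the real dataset) to show the difference.

```python
import pandas as pd

# Hypothetical values used only for illustration.
housing = pd.DataFrame({
    "population": [1200, 450, 800, 300, 950],
    "households": [400, 150, 300, 100, 320],
    "median_income": [2.5, 3.1, 4.0, 5.2, 1.8],
})

single = housing["population"]                  # one label -> Series
as_frame = housing[["population"]]              # list with one label -> DataFrame
two_cols = housing[["population", "households"]]  # list of labels -> DataFrame

print(type(single).__name__)
print(type(as_frame).__name__)
print(two_cols.columns.tolist())
```

If a later step expects a data frame, use the double brackets even when you only need one column.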
To subset a data frame and store the result, use the following lines of code:
housing_subset = housing[['population', 'households']]
housing_subset.head()
This creates a separate data frame as a subset of the original one.
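If you plan to modify the subset later, adding an explicit .copy() makes it clear that the subset is independent of the original. Here is a small sketch with made-up values.

```python
import pandas as pd

# Hypothetical values used only for illustration.
housing = pd.DataFrame({
    "population": [1200, 450, 800, 300, 950],
    "households": [400, 150, 300, 100, 320],
    "median_income": [2.5, 3.1, 4.0, 5.2, 1.8],
})

# .copy() gives the subset its own data, so edits stay local to it.
housing_subset = housing[["population", "households"]].copy()
housing_subset["population"] = housing_subset["population"] * 2

print(housing_subset["population"].tolist())  # doubled values
print(housing["population"].tolist())         # original values, unchanged
```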
2. Selecting Rows
You can use the indexing operator to select specific rows based on certain conditions.
For example, to select rows with a population greater than 500, you can use the following lines of code.
population_500 = housing[housing['population'] > 500]
population_500
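It helps to see what happens inside the brackets. The comparison produces a Series of True/False values (a boolean mask), and the indexing operator keeps the rows where the mask is True. The sketch below uses made-up values and also shows how two conditions can be combined with & and parentheses.

```python
import pandas as pd

# Hypothetical values used only for illustration.
housing = pd.DataFrame({
    "population": [1200, 450, 800, 300, 950],
    "households": [400, 150, 300, 100, 320],
    "median_income": [2.5, 3.1, 4.0, 5.2, 1.8],
})

# The comparison returns one True/False value per row.
mask = housing["population"] > 500
print(mask.tolist())

# Rows where the mask is True are kept.
population_500 = housing[mask]

# Combine conditions with & (and) or | (or); each condition needs
# its own parentheses.
both = housing[(housing["population"] > 500) & (housing["households"] < 350)]
print(both)
```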
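You can also further subset a data frame. For example, let’s filter rows from the housing_subset data frame that we created above. The condition is still built from the original housing data frame, which works because both data frames share the same row labels.
population_500 = housing_subset[housing['population'] > 500]
population_500
Note that the two outputs above have the same number of rows, as they should, since the same condition is applied in both cases.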
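We can check this claim directly. The sketch below, again on made-up values, filters both the full data frame and the column subset and compares the results. It also shows that a mask built from the subset itself selects the same rows.

```python
import pandas as pd

# Hypothetical values used only for illustration.
housing = pd.DataFrame({
    "population": [1200, 450, 800, 300, 950],
    "households": [400, 150, 300, 100, 320],
    "median_income": [2.5, 3.1, 4.0, 5.2, 1.8],
})
housing_subset = housing[["population", "households"]]

from_full = housing[housing["population"] > 500]
from_subset = housing_subset[housing["population"] > 500]

# Same rows are selected; only the number of columns differs.
print(len(from_full), len(from_subset))
print(from_full.shape, from_subset.shape)

# A mask built from the subset gives the same result.
same = housing_subset[housing_subset["population"] > 500]
print(same.equals(from_subset))
```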
Subset a Dataframe using Python .loc()
The .loc indexer is an effective way to select rows and columns from the data frame. It can also be used to select rows and columns simultaneously. Although it is often written as .loc(), it is an indexer rather than a function, so it is used with square brackets instead of parentheses.
An important thing to remember is that .loc works on the labels of rows and columns. After this, we will look at .iloc, which works on the integer positions of rows and columns.
1. Selecting Rows with loc()
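To select a single row using .loc, pass the row label inside the square brackets, for example housing.loc[1].
To select multiple rows, pass a list of row labels, for example housing.loc[[1, 5, 7]].
You can also slice the rows between a starting label and an ending label, for example housing.loc[1:7]. Unlike ordinary Python slices, a .loc slice includes the ending label.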
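The three forms of row selection with .loc can be sketched as follows. The data frame holds made-up values and uses the default row labels 0 to 4.

```python
import pandas as pd

# Hypothetical values used only for illustration.
housing = pd.DataFrame({
    "population": [1200, 450, 800, 300, 950],
    "households": [400, 150, 300, 100, 320],
    "median_income": [2.5, 3.1, 4.0, 5.2, 1.8],
})

one_row = housing.loc[1]            # single label -> Series
some_rows = housing.loc[[0, 2, 4]]  # list of labels -> DataFrame
row_slice = housing.loc[1:3]        # label slice, end label included

print(one_row)
print(some_rows.index.tolist())
print(row_slice.index.tolist())
```

Notice that the slice 1:3 returns three rows (labels 1, 2 and 3), because .loc includes both ends of the slice.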
2. Selecting rows and columns
To select specific rows and specific columns out of the data frame, pass the row labels and the column labels to .loc separated by a comma, for example housing.loc[1:7, ['population', 'households']].
This selects the rows labeled 1 to 7 and the columns labeled 'population' and 'households'.
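Here is a short sketch of selecting rows and columns together with .loc, using made-up values. The second selection is an extra illustration that .loc also accepts a boolean condition in the row position.

```python
import pandas as pd

# Hypothetical values used only for illustration.
housing = pd.DataFrame({
    "population": [1200, 450, 800, 300, 950],
    "households": [400, 150, 300, 100, 320],
    "median_income": [2.5, 3.1, 4.0, 5.2, 1.8],
})

# Row labels first, column labels second, separated by a comma.
subset = housing.loc[1:3, ["population", "households"]]
print(subset)

# A boolean condition can take the place of the row labels.
filtered = housing.loc[housing["population"] > 500,
                       ["population", "median_income"]]
print(filtered)
```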
Subset a Dataframe using Python iloc()
.iloc is short for integer location. Like .loc, it is an indexer used with square brackets rather than a function. It works entirely on integer positions for both rows and columns.
To select a subset of rows and columns using .iloc, use the following line of code:
housing.iloc[[2,3,6], [3, 5]]
This line of code selects the rows at positions 2, 3 and 6 along with the columns at positions 3 and 5. Positions are counted from zero, so position 2 is the third row.
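The difference between labels and positions is easiest to see when the row labels do not start at zero. In the sketch below, the data frame has made-up values and a hypothetical index running from 100 to 104.

```python
import pandas as pd

# Hypothetical values and row labels used only for illustration.
housing = pd.DataFrame(
    {
        "population": [1200, 450, 800, 300, 950],
        "households": [400, 150, 300, 100, 320],
        "median_income": [2.5, 3.1, 4.0, 5.2, 1.8],
    },
    index=[100, 101, 102, 103, 104],
)

# .iloc counts positions from zero, regardless of the labels.
by_position = housing.iloc[[0, 2], [0, 1]]

# .loc needs the actual labels to select the same cells.
by_label = housing.loc[[100, 102], ["population", "households"]]

print(by_position)
print(by_position.equals(by_label))
```

Here housing.loc[0] would raise a KeyError because no row is labeled 0, while housing.iloc[0] returns the first row.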
Using .iloc saves you from writing out the complete labels of rows and columns.
You can also use .iloc to select rows or columns individually, just as with .loc, by replacing the labels with integer positions.
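Selecting only rows or only columns with .iloc can be sketched like this, on made-up values. Keep in mind that .iloc slices follow the usual Python rule and leave out the end position.

```python
import pandas as pd

# Hypothetical values used only for illustration.
housing = pd.DataFrame({
    "population": [1200, 450, 800, 300, 950],
    "households": [400, 150, 300, 100, 320],
    "median_income": [2.5, 3.1, 4.0, 5.2, 1.8],
})

first_row = housing.iloc[0]           # single position -> Series
first_two_rows = housing.iloc[0:2]    # positions 0 and 1, end excluded
second_col = housing.iloc[:, 1]       # all rows, column at position 1
last_two_cols = housing.iloc[:, -2:]  # negative positions count from the end

print(first_two_rows.index.tolist())
print(second_col.name)
print(last_two_cols.columns.tolist())
```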
This tutorial was about subsetting a data frame in Python using square brackets, .loc and .iloc. We learned how to import a dataset into a data frame and then how to filter rows and columns from it.