Python Data Science Questions

Data Structure In Python

In this section, we will discuss what Python is, its history and origins, its current version, and Python salaries and job roles in 2022, and then we will jump into important Python programming questions.

Python has become one of the most popular programming languages in the world in recent years. It runs on a wide range of devices and platforms. Thanks to its large ecosystem of libraries, it can be used by developers and non-developers alike.

Python is a programming language regularly used to build websites and software, automate tasks, and conduct data analysis. It is a general-purpose language, which means it can be used to create a variety of different programs and isn’t specialized for any specific problem. This versatility, alongside its beginner-friendliness, has made it one of the most-used programming languages today. Several industry surveys ranked Python among the most in-demand languages in 2022.

Python was developed by Guido van Rossum in the late 1980s at Centrum Wiskunde & Informatica (CWI), the national research institute for mathematics and computer science in the Netherlands. It was conceived as a successor to the ABC programming language, with exception handling and the ability to interface with the Amoeba operating system.

At the time of writing, Python 3.10.7 is the latest release of the Python programming language, and the 3.10 series includes many new features and optimizations.

Top Python Jobs in 2022 with salaries

  • Artificial Intelligence (AI) Specialist | $135,238
  • Solutions Architect | $120,756
  • Machine Learning Engineer | $112,343
  • Analytics Manager | $99,121
  • Data Scientist | $97,004
  • Data Engineer | $92,999
  • Software Engineer | $88,280
  • Backend Developer | $87,009
  • Computer Scientist | $81,812
  • Front End Developer | $76,289

Theoretical Python Data Science Questions

1. Which library do we use for Data manipulation?

Pandas is a very popular Python library, widely used in data science along with NumPy and Matplotlib. It has an active community with 1,000+ contributors and is heavily used for data analysis and cleaning.

2. Write the top 5 libraries in Python for Data Science.

The top 5 Python libraries widely used in data science projects are:

  • TensorFlow
  • Pandas
  • NumPy
  • Matplotlib
  • SciPy

3. What is the difference between series and vectors?

  • A vector (for example, a one-dimensional NumPy array) is indexed only by position: 0, 1, …, (n-1).
  • A Series is a single column of data whose index labels can be customized, so each value can be looked up by label as well as by position. E.g.: cust_ID, cust_name, and total_sales would each be a separate Series. A Series can be created from a list, an array, or a dictionary.
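
The difference between positional and labeled access can be sketched as follows. This is a minimal illustration, assuming NumPy and pandas are installed; the customer IDs and sales figures are made up.

```python
import numpy as np
import pandas as pd

# A NumPy vector is addressed only by position: 0, 1, ..., n-1.
vector = np.array([250.0, 120.5, 310.0])
print(vector[1])

# A Series holds the same values but attaches a label to each one.
# The customer IDs are hypothetical, chosen only for illustration.
total_sales = pd.Series(
    [250.0, 120.5, 310.0],
    index=["C001", "C002", "C003"],
    name="total_sales",
)
print(total_sales["C002"])   # lookup by label
print(total_sales.iloc[1])   # lookup by position still works

# A Series can also be built from a dictionary; the keys become the index.
from_dict = pd.Series({"C001": 250.0, "C002": 120.5})
print(list(from_dict.index))
```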

4. Differentiate between data frames and matrices.

Data Frames

  • A data frame is a collection of series that share a common index.
  • It can hold multiple series, each of which may have a different data type.
  • For example, employee data has various columns such as emp_id, emp_name, age, gender, and department. Each column is a series with its own data type.

Matrices

  • A matrix in NumPy is constructed from multiple vectors.
  • It can only hold one data type in the entire two-dimensional structure.
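
A short sketch of the contrast, assuming NumPy and pandas are installed. The employee records are invented for the example.

```python
import numpy as np
import pandas as pd

# Each DataFrame column is a Series and keeps its own data type.
# The employee values below are hypothetical.
employees = pd.DataFrame({
    "emp_id": [101, 102, 103],
    "emp_name": ["Asha", "Ben", "Chen"],
    "age": [29, 41, 35],
})
print(employees.dtypes)

# A NumPy 2-D array has a single data type for every element:
# mixing integers and a float makes the whole array floating point.
matrix = np.array([[1, 2.5], [3, 4]])
print(matrix.dtype)
```

Printing `employees.dtypes` should show an integer type for `emp_id` and `age` and a text type for `emp_name`, while `matrix.dtype` reports one type for the whole array.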

5. Explain the use of Pandas Dataframe groupby.

groupby() groups rows that share the same value in a column and then applies an aggregation function to each group. Example: df.groupby('department')['salary'].mean() returns the mean salary for each department.
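
Here is a small runnable sketch of grouping and aggregating, assuming pandas is installed. The `department` and `salary` columns and their values are illustrative only.

```python
import pandas as pd

# Hypothetical employee data for the example.
df = pd.DataFrame({
    "department": ["HR", "IT", "IT", "HR", "Sales"],
    "salary": [40000, 60000, 70000, 50000, 45000],
})

# Split the rows by department, then average the salary within each group.
mean_salary = df.groupby("department")["salary"].mean()
print(mean_salary)
```

Selecting the `salary` column before calling `mean()` keeps the aggregation away from non-numeric columns, which recent pandas versions refuse to average.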

6. Name some Python libraries that can be used for visualization.

Matplotlib is the standard data visualization library and is very useful for generating two-dimensional graphs, e.g. histograms, pie charts, bar and column graphs, and scatter plots. Many libraries are built on top of Matplotlib and use its functions in the backend. It is also widely used to create the axes and the layout of a visualization.

Seaborn is a data visualization library based on Matplotlib. It works well with NumPy and Pandas, and it provides a high-level interface for drawing attractive and informative statistical graphics.

7. What is a scatter plot?

It is a two-dimensional data visualization that shows the relationship between observations of two different variables. One variable is plotted along the x-axis, and the other along the y-axis.

8. What is the difference between regplot(), lmplot() and residplot()?

  • regplot() plots the data together with a linear regression model fit. There are several mutually exclusive options for estimating the regression model.
  • lmplot() plots the data and the regression model fits across a FacetGrid. It is designed as a practical interface for fitting regression models across conditional subsets of a dataset and is more computationally intensive. lmplot() combines regplot() and FacetGrid.
  • residplot() regresses y on x and then plots the residuals (the differences between the observed and the fitted values) as a scatter plot, which helps to check whether a linear model suits the data.

9. Define a heatmap.

A heatmap is a type of data visualisation that makes use of colour to depict how a value changes depending on the values of two other variables. For example, you could use a heatmap to understand how air temperature varies according to the time of day across a set of cities.

10. Why use Python over other languages?

Python is a widely used, flexible, and all-purpose programming language. Because it is clear and simple to learn, it is great as a first language. It is also a useful language to have in any programmer’s toolkit because it can be used for everything from web development to software development to scientific applications.

11. What is the enumerate function in Python?

Python’s enumerate() adds a counter to an iterable and returns it as an enumerate object. This object can be used directly in for loops or converted into a list of tuples with list().
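
Both uses can be sketched in a few lines. The list of fruits is just sample data.

```python
fruits = ["apple", "banana", "cherry"]

# Used directly in a for loop: each item arrives with its counter.
for index, fruit in enumerate(fruits):
    print(index, fruit)

# Converted to a list of tuples; start=1 makes the counter begin at 1.
numbered = list(enumerate(fruits, start=1))
print(numbered)

# Output of the last print: [(1, 'apple'), (2, 'banana'), (3, 'cherry')]
```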

12. What is the math behind the absolute value of a complex number?

If z = a + ib, then the absolute value is calculated as sqrt(a^2 + b^2), that is, the distance of z from the origin in the complex plane.
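
The formula can be checked against Python's built-in abs(), as in this brief sketch. The value 3 + 4j is chosen only because it gives a round result.

```python
import math

z = 3 + 4j  # a = 3, b = 4

# Built-in absolute value of a complex number.
print(abs(z))

# The same value computed from the formula sqrt(a^2 + b^2).
manual = math.sqrt(z.real ** 2 + z.imag ** 2)
print(manual)

# Output: 5.0 in both cases
```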

13. What are the top libraries available in Python for text mining?

  • Natural Language Toolkit (NLTK)
  • Gensim
  • CoreNLP
  • spaCy
  • TextBlob
  • Pattern
  • PyNLPl

14. How is Pandas used in data analysis?

Pandas makes it very convenient to load, process, and analyze tabular data using SQL-like queries. Working in conjunction with Matplotlib and Seaborn, Pandas offers a variety of options for visual analysis of tabular data. The main data structures in Pandas are implemented as the Series and DataFrame classes.

15. Name the top 5 Python IDEs and code editors.

  • PyCharm
  • Sublime Text
  • Thonny
  • Visual Studio Code
  • Jupyter Notebook

16. What are Keywords in Python?

Keywords are reserved words that have a specific meaning to the Python interpreter. They define the syntax and structure of the language, such as conditions, loops, and function definitions. A keyword cannot be used as the name of a variable or function. Python 3.7 and later have the 35 keywords listed below (earlier Python 3 versions had 33, before async and await were added):

or        and       not       if        elif
else      for       while     break     def
as        lambda    pass      return    True
False     with      try       assert    class
continue  del       except    finally   from
global    import    in        is        None
nonlocal  raise     yield     async     await
Python Keywords
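
The exact list depends on the interpreter version, so one way to check it is to ask Python itself through the standard keyword module. A minimal sketch:

```python
import keyword

# The full list of keywords for the running interpreter.
print(keyword.kwlist)
print(len(keyword.kwlist))

# iskeyword() tests a single name; keywords are case-sensitive.
print(keyword.iskeyword("lambda"))  # True
print(keyword.iskeyword("true"))    # False, only "True" is a keyword
```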

Data Science: Coding Questions

1. Write a program to predict the output type in Python

# Defining the variable
x = 'z'
print(type(x))

# Output: <class 'str'>

2. Write a Python program that prints a table of 13 using a while loop.

i = 1
while i <= 10:
    # Print one row of the table, e.g. "13 x 1 = 13"
    print(f"13 x {i} = {13 * i}")
    i += 1

3. How can we access a CSV file in Python?

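  • Method 1: the built-in csv module
import csv

# newline="" lets the csv module handle line endings itself
with open("bwq.csv", "r", newline="") as file:
    csv_reader = csv.reader(file)
    for row in csv_reader:
        print(row)
  • Method 2: pandas
import pandas as pd

data_bwq = pd.read_csv("bwq.csv")
print(data_bwq.head())  # show the first five rows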
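
When the first row of the file holds column names, csv.DictReader may be more convenient, since each row then comes back as a dictionary keyed by column name. The sketch below uses an in-memory string in place of a real file so that it runs anywhere; the column names and values are invented.

```python
import csv
import io

# Stand-in for a CSV file on disk; the contents are hypothetical.
sample = io.StringIO("site,ph\nA,7.1\nB,6.8\n")

# The first row is used as the header; every other row becomes a dict.
rows = list(csv.DictReader(sample))
print(rows[0]["site"])

# Values are read as strings, so convert them before doing arithmetic.
ph_values = [float(row["ph"]) for row in rows]
print(ph_values)
```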

4. Generate random numbers in Python.

# Generate a random integer between 0 and 22 (both ends included)
import random
n = random.randint(0,22)
print(n)
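
The random module has a few other helpers that often come up alongside randint(). A short sketch follows; the seed value and the sample lists are arbitrary choices for the example.

```python
import random

random.seed(42)  # fixing the seed makes the sequence repeatable

n = random.randint(0, 22)                 # integer, both ends included
x = random.random()                       # float in the range [0.0, 1.0)
pick = random.choice(["red", "green", "blue"])  # one item from a sequence
sample = random.sample(range(100), k=5)   # five distinct values

print(n, x, pick, sample)
```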

5. Check whether an element is in a sequence or not.

42 in [2, 39, 42]

# Output: True

6. Show the difference between the extend and append methods.

  • append: Adds its argument as a single element at the end of the list.
a = [1, 2, 3]
a.append([4, 5])
print(a)

# Output: [1, 2, 3, [4, 5]]
  • extend: Extends the list by appending each element of the iterable.
a = [1, 2, 3]
a.extend([4, 5])
print(a)

# Output: [1, 2, 3, 4, 5]

7. Print all the multiples of 10 up to 100.

multiples = []
for i in range(10, 101):
    # Keep only the numbers that divide evenly by 10
    if i % 10 == 0:
        multiples.append(i)
print(multiples)

# Output: [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]

8. Fix ModuleNotFoundError and ImportError in Python.

  • First, make sure you are using absolute imports
  • Second, export the project’s root directory to PYTHONPATH

Most modern Python IDEs do this automatically. If yours does not, it most likely has an option for defining PYTHONPATH for your application (PyCharm does, at least). If you are running your Python application in another environment such as Docker, Vagrant, or a virtual environment, you can run the command below in your shell:

export PYTHONPATH="${PYTHONPATH}:/path/to/your/project/"
# On Windows (Command Prompt), use this instead:
set PYTHONPATH=%PYTHONPATH%;C:\path\to\your\project\
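
For a quick script, a similar effect can be had from inside Python by adding the project root to sys.path before the failing import. This is only a sketch of that workaround, not a replacement for setting PYTHONPATH; the path is the same placeholder as in the commands above.

```python
import sys
from pathlib import Path

# Placeholder project root; replace it with the real location.
project_root = Path("/path/to/your/project")
root_str = str(project_root)

# Put the root first so its packages are found before any others.
if root_str not in sys.path:
    sys.path.insert(0, root_str)

print(sys.path[0])
```

Absolute imports of the project's packages placed after these lines should then resolve against that directory.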

9. Write methods to separate all files with a specific extension (.csv, .txt) in a directory using Python.

  • Method 1
import os

directory = "mypath/path"  # folder to search, including its subfolders
for root, dirs, files in os.walk(directory):
    for file in files:
        # endswith() also accepts a tuple, e.g. (".csv", ".txt")
        if file.endswith(".txt"):
            print(file)
  • Method 2
import os

path = "mypath/path"
files = os.listdir(path)  # entries directly inside path only
files_txt = [i for i in files if i.endswith(".txt")]
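
A third option is the pathlib module, which compares file suffixes directly. The sketch below builds a temporary directory with a few empty files so that it can run anywhere. The helper name `files_with_extensions` and the file names are made up for this example.

```python
import tempfile
from pathlib import Path


def files_with_extensions(directory, extensions):
    """Return sorted names of files in `directory` whose suffix is in `extensions`."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        p.name
        for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() in wanted
    )


# Create a throwaway directory with hypothetical files to search.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("a.csv", "b.txt", "c.json"):
        (Path(tmp) / name).touch()
    matches = files_with_extensions(tmp, {".csv", ".txt"})

print(matches)

# Output: ['a.csv', 'b.txt']
```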

Conclusion

These are some of the most frequently asked questions in data science interviews. There are numerous other examples, but a fundamental knowledge of Python is the basic requirement when facing a data science interview. Knowing how to consult documentation is also a key skill for building a working knowledge of the many libraries used in the data science field.