Introduction

There are four main Python libraries that you need to know: NumPy, Pandas, Matplotlib and Scikit-Learn.

NumPy

The Python built-in list type does not allow for efficient array manipulation. The NumPy package is concerned with the manipulation of multi-dimensional arrays. NumPy is the foundation of almost all the other packages covering the data science aspects of Python. From a data science perspective, collections of data such as documents, images and sound can be represented as arrays of numbers. Hence, the first step in analysing data is to transform it into an array of numbers. NumPy functions are used to transform and manipulate data as numbers, especially before the model-building stage, but also throughout the overall data science process.
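
As a minimal sketch of the idea of treating data as an array of numbers, the example below uses a made-up 2x3 grayscale "image" (the pixel values are illustrative, not from any real dataset). It shows element-wise arithmetic without an explicit loop and reshaping the grid into a flat vector of the kind a model might consume.

```python
import numpy as np

# A hypothetical 2x3 grayscale "image": each number is a pixel intensity (0-255).
image = np.array([[0, 128, 255],
                  [64, 192, 32]], dtype=np.uint8)

# Vectorised arithmetic is applied to every element at once, with no Python loop.
scaled = image / 255.0  # rescale intensities to the range [0, 1]

# Reshaping turns the 2-D grid into a flat, one-dimensional feature vector.
flat = scaled.reshape(-1)

print(image.shape)   # (2, 3)
print(flat.shape)    # (6,)
print(scaled.max())  # 1.0
```

The same operations written with nested Python lists would need explicit loops, which is the inefficiency the paragraph above refers to.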

Pandas

The Pandas library in Python provides two main data structures: the Series and the DataFrame. A Pandas Series is a one-dimensional array of indexed data which can be created from a list or array. A Pandas DataFrame is essentially a two-dimensional array with attached row and column labels. A DataFrame is roughly equivalent to a table in SQL or a spreadsheet. Through the Pandas library, Python implements a number of powerful data operations similar to those of database frameworks and spreadsheets. While NumPy's ndarray data structure provides features for numerical computing tasks, it does not provide the flexibility that we see in table structures (such as attaching labels to data, working with missing data, etc.). The Pandas library thus provides features for data manipulation tasks.
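
The following short sketch illustrates the two structures and the two table-like features mentioned above: labels and missing data. The labels, column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical row labels for the example.
labels = ["apple", "banana", "cherry"]

# A Series: one-dimensional data with an attached index.
prices = pd.Series([1.5, 2.0, np.nan], index=labels)

# A DataFrame: a table with labelled rows and labelled columns.
df = pd.DataFrame(
    {"price": [1.5, 2.0, np.nan], "stock": [10, 0, 25]},
    index=labels,
)

# Label-based access: look a value up by row label and column name.
banana_price = df.loc["banana", "price"]

# Missing data: mean() skips NaN by default, and fillna() replaces NaN values.
filled = df["price"].fillna(df["price"].mean())

print(banana_price)      # 2.0
print(filled["cherry"])  # 1.75
```

A plain ndarray could hold the same numbers, but the row and column names and the NaN-aware operations would have to be tracked by hand.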

Matplotlib

The Matplotlib library, which is built on NumPy, is used for data visualization in Python. Matplotlib works with multiple operating systems and graphics backends.
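
A minimal plotting sketch is shown below. It assumes the non-interactive "Agg" backend so that the figure is written to a file rather than shown in a window; the output file name sine.png is an arbitrary choice for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend: renders to files, needs no display

import matplotlib.pyplot as plt
import numpy as np

# Data to plot, generated with NumPy.
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x)

# Create a figure with a single set of axes and draw one line on it.
fig, ax = plt.subplots()
ax.plot(x, y, label="sin(x)")
ax.set_xlabel("x")
ax.set_ylabel("sin(x)")
ax.legend()

# Write the figure to disk (the file name is illustrative).
fig.savefig("sine.png")
```

On a desktop session the `matplotlib.use("Agg")` line can be dropped and `plt.show()` used instead; the plotting calls stay the same regardless of the backend.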

Scikit-Learn

The Scikit-Learn package provides efficient implementations of a number of common machine learning algorithms. It also includes modules for cross-validation, grid search and feature engineering.
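
The sketch below touches each of the pieces named above on the small Iris dataset bundled with Scikit-Learn. The choice of logistic regression, of feature scaling as the preprocessing step, and of the candidate values for `C` are assumptions made for the example, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Load features X and class labels y from the bundled Iris dataset.
X, y = load_iris(return_X_y=True)

# Preprocessing (feature scaling) chained with a classifier.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Cross-validation: fit and score the pipeline on 5 different train/test splits.
scores = cross_val_score(model, X, y, cv=5)

# Grid search: try each candidate regularisation strength C with 5-fold CV.
# make_pipeline names each step after its lowercased class name.
param_grid = {"logisticregression__C": [0.1, 1.0, 10.0]}
search = GridSearchCV(model, param_grid, cv=5)
search.fit(X, y)

print(scores.mean())        # mean accuracy across the 5 folds
print(search.best_params_)  # the value of C that scored best
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no information from the held-out fold leaks into preprocessing.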