Saturday, 24 November 2018

Data Analysis and Data Manipulation: Intro to Numpy and Pandas Modules in Python

To understand this article you don't need to be an expert Python user, but it will be helpful if you are familiar with Python. Data analysis, data manipulation and data visualization are the basic building blocks of Machine Learning.

1. Installation of Numpy and Pandas
2. Difference Between installing and importing modules
3. Excel Vs Pandas
4. Array Vs Numpy
5. Pandas
a. Creating a Data frame and Viewing Data
b. Converting a Column into a list
c. Manipulating CSV text
d. Converting a DataFrame to a CSV
e. Renaming the Column name
f. Manipulations on a Series
1. Creating a series
2. Selecting a series
3. Removing series from a data frame
4. Sorting a pandas Series
g. String Methods
h. Exploring the series


1. Installation of Numpy and Pandas
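Both libraries are commonly installed from a terminal with `pip install numpy pandas` (this assumes pip is available; distributions such as Anaconda usually ship them already). A quick check from Python that the installation worked could look like this:

```python
# Confirm that both packages can be imported and report their versions.
import numpy as np
import pandas as pd

numpy_version = np.__version__
pandas_version = pd.__version__

print("Numpy version:", numpy_version)
print("Pandas version:", pandas_version)
```

If either import fails with ModuleNotFoundError, the package is not installed for the Python interpreter being used.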


2. Difference between installing and importing modules:
It's like the difference between:
a. Uploading a photo to the internet
b. Linking the photo URL inside an HTML page
Installing puts the code somewhere Python expects those kinds of things to be, and the import statement says "go look there for something named numpy now, and make its contents available to me for use".
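The distinction can be sketched in a few lines. The snippet below assumes numpy is already installed; it first asks Python where the installed package lives, and only then imports it:

```python
import importlib.util

# Installing placed the package files on disk. find_spec reports where
# Python would load "numpy" from, without importing it yet.
spec = importlib.util.find_spec("numpy")
print("numpy is installed at:", spec.origin)

# Importing loads that code and binds it to a name in this program.
import numpy as np

arr = np.array([1, 2, 3])
print(arr.sum())  # 6
```

Installation happens once per environment, while the import statement is needed in every script that uses the module.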

3. Excel Vs Pandas:
If we already know Python, why do we need to learn the pandas module separately? Excel is everywhere in the business world and has an excellent graphical user interface, but it is not good at dealing with lots of data. In Machine Learning, most of the datasets we work with are what are called data frames. We come across this term very often, but we might not know what it really is. A DataFrame is essentially a headless version of a spreadsheet like Excel. Moreover, pandas has tons of functionality, has a lot of support from the community, and is open source.
Representing Data: Excel Vs Pandas

Excel: Data is represented in an Excel sheet with columns, rows, and cells. We can have different sheets for different datasets.
Pandas: A “table” of data is stored in a DataFrame. We can create a DataFrame from scratch, or more commonly, import the data from a CSV file.

And finally, why pandas over Excel? Pandas is a high-performance, highly efficient, high-level data analysis library that makes life easier.

4. Array Vs Numpy
Numpy is the core library for scientific computing in Python. It provides a high-performance multidimensional array object and tools for working with these arrays. A Numpy array is very similar to a list in Python. Why do we need Numpy when we have lists?
It is memory efficient, fast and convenient compared to the native Python list.
a. Fast and convenient:
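A minimal sketch of the comparison, assuming the task is doubling every element of a sequence of 100,000 integers (the size and the operation are chosen here only for illustration):

```python
import timeit
import numpy as np

n = 100_000
py_list = list(range(n))
np_array = np.arange(n)

# Convenience: doubling every element.
doubled_list = [x * 2 for x in py_list]  # explicit loop over the list
doubled_array = np_array * 2             # one vectorized expression

# Speed: time 20 repetitions of each approach (results vary by machine).
list_time = timeit.timeit(lambda: [x * 2 for x in py_list], number=20)
array_time = timeit.timeit(lambda: np_array * 2, number=20)
print(f"list: {list_time:.4f}s, numpy: {array_time:.4f}s")
```

On typical machines the Numpy version is noticeably faster, because the loop runs in compiled code rather than in the Python interpreter.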
b. Memory:
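One way to compare memory use, sketched here for 1,000 integers. The measurement for the list is an approximation that adds the size of the list object to the sizes of the integer objects it points to:

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
np_array = np.arange(n, dtype=np.int64)

# A list stores pointers to separate int objects, so count both.
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

# A Numpy array stores the raw values in one contiguous buffer:
# 1000 elements * 8 bytes each.
array_bytes = np_array.nbytes

print("list bytes:", list_bytes)
print("array bytes:", array_bytes)
```

The exact figure for the list depends on the Python version, but it is several times larger than the array's buffer.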
In both respects, we can clearly observe the advantage of using a Numpy array instead of a list.

5. Pandas:


A. Creating a Data frame and Viewing Data:

1. Creating a Data frame:
A DataFrame is tabular data with rows and columns.
DataFrame Constructor:
pandas.DataFrame(data, index, columns, dtype)
Parameters:
1. Data:
It can be a numpy ndarray (structured or homogeneous), a dict, or another DataFrame. We mostly use a dictionary because it maps onto a data frame with little effort: there is no need to specify the column names separately, since the keys of the dict are taken as the column names by default.
 data = {'Interest': ['Python', 'C', 'C++'], 'Names': ['Sita', 'John', 'Raju']}

2. Index:
It is the label given to each row.
 index = ['student1', 'student2', 'student3']
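Putting the two pieces together, a data frame can be built from the dictionary and the index shown above:

```python
import pandas as pd

data = {'Interest': ['Python', 'C', 'C++'], 'Names': ['Sita', 'John', 'Raju']}
index = ['student1', 'student2', 'student3']

# Dict keys become column names; index supplies the row labels.
df = pd.DataFrame(data, index=index)
print(df)
```

The printed table has one row per student and the two columns Interest and Names.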




2. Viewing Data:

Head:
head() with no arguments returns the first five rows of the data frame. head(N) returns the first N rows.

Tail:
tail() with no arguments returns the last five rows of the data frame. tail(N) returns the last N rows.

Index:
It gives the label of each row.

Columns:
It gives the list of column names in a data frame.
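The four ways of viewing data described above can be sketched as follows. The seven-row data frame is made up for illustration, so that head() and tail() return different rows:

```python
import pandas as pd

# Hypothetical data with seven rows.
df = pd.DataFrame({
    'Names': ['Sita', 'John', 'Raju', 'Anil', 'Mary', 'Ravi', 'Lata'],
    'Score': [81, 67, 92, 74, 88, 59, 95],
})

first_five = df.head()   # first five rows
first_two = df.head(2)   # first two rows
last_five = df.tail()    # last five rows
last_two = df.tail(2)    # last two rows

print(df.index)             # row labels (a default integer range here)
print(df.columns.tolist())  # ['Names', 'Score']
```

Note that index and columns are attributes, not methods, so they are used without parentheses.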








B. Converting a Column into a list:
When you pull a column out of a Pandas DataFrame you get a Pandas Series, on which you can call .tolist() to turn it into a Python list.
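A short sketch, reusing the student data from the earlier section:

```python
import pandas as pd

df = pd.DataFrame({'Interest': ['Python', 'C', 'C++'],
                   'Names': ['Sita', 'John', 'Raju']})

names_series = df['Names']          # a pandas Series
names_list = names_series.tolist()  # a plain Python list
print(names_list)  # ['Sita', 'John', 'Raju']
```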


What if we want to convert two or more DataFrame columns into a list?
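One possible approach, sketched below, is to select several columns at once and convert the underlying values, which gives one inner list per row. A per-column alternative is also shown; which shape is wanted depends on the use case.

```python
import pandas as pd

df = pd.DataFrame({'Interest': ['Python', 'C', 'C++'],
                   'Names': ['Sita', 'John', 'Raju']})

# Selecting with a list of names returns a DataFrame; .values is a
# 2-D array, and .tolist() turns it into a list of rows.
rows = df[['Interest', 'Names']].values.tolist()
print(rows)  # [['Python', 'Sita'], ['C', 'John'], ['C++', 'Raju']]

# Alternatively, build one list per column.
per_column = {col: df[col].tolist() for col in ['Interest', 'Names']}
print(per_column['Names'])  # ['Sita', 'John', 'Raju']
```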



C. Manipulating CSV text:
This section covers input/output with Pandas, beginning with a realistic use case: downloading a dataset in CSV format from the GitHub Page.
We're going to manually download the CSV file, for learning purposes, since not every data source you find is going to have a nice and neat module for extracting the datasets.

CSV stands for Comma Separated Values. A CSV file looks like a garden-variety spreadsheet but has a .csv extension.
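Reading CSV data into a data frame is done with read_csv. The sketch below uses a small made-up CSV string in place of the downloaded file so that it runs on its own; with a real download, the file path would be passed instead.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a downloaded file.
csv_text = """Names,Interest,Score
Sita,Python,81
John,C,67
Raju,C++,92
"""

# read_csv accepts a file path or any file-like object; with a real
# file this would be pd.read_csv('path/to/file.csv').
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())
```

The first line of the CSV is used as the column names by default.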



D. Converting a DataFrame to a CSV:
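A data frame is written out with to_csv. In this sketch the file name students.csv and the temporary directory are placeholders chosen for illustration:

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'Names': ['Sita', 'John', 'Raju'],
                   'Interest': ['Python', 'C', 'C++']})

# Write to a temporary directory; the file name is a placeholder.
out_dir = tempfile.mkdtemp()
csv_path = os.path.join(out_dir, 'students.csv')

# index=False leaves the row labels out of the file.
df.to_csv(csv_path, index=False)

# Read the file back as plain text to inspect what was written.
with open(csv_path) as f:
    csv_contents = f.read()
print(csv_contents)
```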

And here comes our newly created CSV file


E. Renaming the Column name:
 DataFrame.rename(columns=mapping, inplace=True)
Here mapping is a dict that maps old column names to new ones.
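A small sketch of both forms of the call. The new column names Student and Language are invented for the example:

```python
import pandas as pd

df = pd.DataFrame({'Interest': ['Python', 'C', 'C++'],
                   'Names': ['Sita', 'John', 'Raju']})

# Without inplace, rename returns a new data frame and leaves df alone.
renamed = df.rename(columns={'Names': 'Student'})
print(renamed.columns.tolist())  # ['Interest', 'Student']

# With inplace=True, df itself is modified and nothing is returned.
df.rename(columns={'Interest': 'Language'}, inplace=True)
print(df.columns.tolist())  # ['Language', 'Names']
```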


F. Manipulations on Series:
A Series is a one-dimensional labeled array capable of holding data of any type (integer, string, float, Python objects, etc.).

 pandas.Series(data, index, dtype)
1. Creating a series:
A Series can be created by passing a list of values, letting pandas create a default integer index.
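For example, with made-up values, and with an explicit index shown for comparison:

```python
import pandas as pd

# Default integer index: 0, 1, 2.
s = pd.Series([10, 20, 30])
print(s.index.tolist())  # [0, 1, 2]

# The index parameter replaces the default labels.
labeled = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(labeled['b'])  # 20
```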

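2. Selecting a series:

A column of a data frame can be selected as a Series with bracket notation, passing the column name as a string. This notation is case-sensitive. It can be done in dot notation too.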
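Both notations can be sketched on the student data used earlier:

```python
import pandas as pd

df = pd.DataFrame({'Interest': ['Python', 'C', 'C++'],
                   'Names': ['Sita', 'John', 'Raju']})

names = df['Names']    # bracket notation
same_names = df.Names  # dot notation, same result

# Column names are case-sensitive: df['names'] would raise a KeyError.
print('names' in df.columns)  # False
```

Dot notation only works when the column name is a valid Python identifier and does not clash with an existing DataFrame attribute, so bracket notation is the safer default.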


3. Removing series from a data frame:
 DataFrame.drop(labels, axis, index, columns, inplace)

Here axis specifies whether labels refers to rows (0) or columns (1). If inplace is True, the data frame is modified in place instead of a new one being returned.
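A sketch of dropping a column, dropping a row, and dropping in place, again on the student data:

```python
import pandas as pd

df = pd.DataFrame({'Interest': ['Python', 'C', 'C++'],
                   'Names': ['Sita', 'John', 'Raju']},
                  index=['student1', 'student2', 'student3'])

# axis=1 drops a column; equivalent to df.drop(columns='Interest').
without_column = df.drop('Interest', axis=1)

# axis=0 drops a row by its label.
without_row = df.drop('student2', axis=0)

# inplace=True modifies df itself and returns None.
df.drop(columns='Names', inplace=True)
print(df.columns.tolist())  # ['Interest']
```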

4. Sorting a pandas Series:

Sorting (for example with sort_values()) returns a new sorted object and doesn't affect the underlying data. But if we want to change the order in the main data frame, we can pass inplace=True.
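A minimal sketch using sort_values on a made-up Series; the same inplace parameter exists on DataFrame.sort_values:

```python
import pandas as pd

s = pd.Series([30, 10, 20], index=['a', 'b', 'c'])

ascending = s.sort_values()                  # new Series, smallest first
descending = s.sort_values(ascending=False)  # new Series, largest first

# The original Series is still in its original order at this point.
original_order = s.tolist()
print(original_order)  # [30, 10, 20]

# inplace=True reorders s itself.
s.sort_values(inplace=True)
print(s.tolist())  # [10, 20, 30]
```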
G. String Methods:
Pandas includes powerful string manipulation capabilities that we can easily apply to any Series of strings through the .str accessor.
1. UpperCase & LowerCase:
Convert strings in the Series/Index to uppercase or lowercase.
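For example, on the names used earlier:

```python
import pandas as pd

names = pd.Series(['Sita', 'John', 'Raju'])

upper = names.str.upper()
lower = names.str.lower()

print(upper.tolist())  # ['SITA', 'JOHN', 'RAJU']
print(lower.tolist())  # ['sita', 'john', 'raju']
```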


2. Count:
Count occurrences of the pattern in each string of the Series/Index.
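A short sketch with made-up words. Note that the pattern is treated as a regular expression, so special characters such as + need escaping:

```python
import pandas as pd

words = pd.Series(['banana', 'apple', 'cherry'])

# Number of times the letter 'a' appears in each string.
a_counts = words.str.count('a')
print(a_counts.tolist())  # [3, 1, 0]
```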
3. Length:
Compute the length of each string in the Series/Index.
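For example, with the same kind of made-up words:

```python
import pandas as pd

words = pd.Series(['banana', 'apple', 'cherry'])

lengths = words.str.len()
print(lengths.tolist())  # [6, 5, 6]
```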
4. Split:
Split strings around given separator/delimiter.
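A sketch using hypothetical full names separated by a space. By default each string becomes a list; with expand=True the pieces are spread across the columns of a new data frame:

```python
import pandas as pd

# Hypothetical full names, invented for this example.
full_names = pd.Series(['Sita Rao', 'John Smith'])

# Each element becomes a list of pieces.
parts = full_names.str.split(' ')
print(parts.tolist())  # [['Sita', 'Rao'], ['John', 'Smith']]

# expand=True returns a DataFrame with one column per piece.
expanded = full_names.str.split(' ', expand=True)
print(expanded[0].tolist())  # ['Sita', 'John']
```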


H. Exploring the series:
1. Describe:
Generates descriptive statistics that summarize the data frame.
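The statistics returned depend on the data type. A sketch on two made-up Series, one numeric and one of strings:

```python
import pandas as pd

scores = pd.Series([81, 67, 92, 74])
interests = pd.Series(['Python', 'C', 'Python'])

# Numeric data: count, mean, std, min, quartiles, max.
numeric_summary = scores.describe()
print(numeric_summary['mean'])  # 78.5

# String data: count, unique, top (most common value), freq.
text_summary = interests.describe()
print(text_summary['top'])  # Python
```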


2. Value_counts:
Returns a Series containing counts of unique values. The resulting object is in descending order, so the first element is the most frequently occurring value.
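For example, on a made-up Series of interests:

```python
import pandas as pd

interests = pd.Series(['Python', 'C', 'Python', 'C++', 'Python', 'C'])

counts = interests.value_counts()
print(counts.index[0])  # Python
print(counts.tolist())  # [3, 2, 1]
```

The unique values become the index of the result, and the counts are its values.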

There are many more string methods; they are listed in the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/api.html#string-handling