Pandas - Python for Data Science

Pandas Series, DataFrame and multiple operations on 2D data

Table of Contents

In this Pandas tutorial, we will cover the concepts below.

  • Prerequisites, Introduction
  • A). Import required packages:
  • B). Series in Pandas:
  • C). DataFrame in Pandas:
  • D). DataFrame - General functions:
  • E). DataFrame Columns Manipulation - Select, Create, Rename, Drop:
  • F). DataFrame Rows Manipulation - Select, Create, Rename, Drop:
  • G). Missing data Manipulations:
  • H). Group by in Dataframes:
  • I). Combine multiple dataframes - Merge, Join, Combine:
  • Conclusion

Prerequisites:

Pandas concepts are easy to learn and apply in real-world applications.
Ex: If you understand a high school marks (grade) card, with subjects listed alongside the marks scored in each subject, that mental model is enough to digest all of the Pandas concepts in this tutorial.


Introduction:

  • Pandas is used for processing (loading, manipulating, preparing, modeling, and analyzing) the given data. Pandas is built on top of the NumPy package, so NumPy is required to work with Pandas.
  • Pandas has 2 data structures for processing the data:
    1. Series : a one-dimensional labeled array capable of storing various data types.
    2. DataFrame : a two-dimensional array with labeled axes (rows and columns).

A). Import required packages:

import numpy as np
import pandas as pd

Explanation: The import keyword imports the required package into our code, and the as keyword gives the package an alias. numpy is the numerical Python package used to create numerical arrays in this tutorial.
Example: pandas is the package and pd is the alias (short name) for pandas.


B). Series in Pandas:

  • Series is a one-dimensional array capable of storing various data types (integers, strings, floating-point numbers, Python objects).
  • The row labels of a Series are called the index.
  • A Series cannot contain multiple columns; it has exactly one column.
  • Let's explore some examples of Series in Pandas.

B1). Create Series from Python Dictionary

dict1 = {'p':111, 'q':222, 'r':333, 's':np.nan, 't':555}
s = pd.Series(dict1)
print(s)

Explanation: Here p, q, r, s, t are the index labels. pd.Series() is used to create a Pandas Series. Note: the S in Series() is capital.
Output: B1.output.PNG
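In case the screenshot does not render, the printed output should look like this (all values are upcast to float64 because the Series contains a NaN):

p    111.0
q    222.0
r    333.0
s      NaN
t    555.0
dtype: float64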

B2). Series from Scalar value

s = pd.Series(125, index=['i','j','k','l'])
print(s)

Explanation: Here all the index labels 'i','j','k','l' get the same value, 125.
Output: B2.output.PNG

B3). Series from Numpy array

s = pd.Series(np.random.randn(5), index=['a','b','c','d','e'])
print(s)

Explanation: np.random.randn(5) creates 5 random numbers drawn from the standard normal distribution.
Output:

B3.PNG

B4). Series Functionalities

dict1 = {'p':111, 'q':222, 'r':333, 's':np.nan, 't':555}
s = pd.Series(dict1)
print("Positional:\n s.iloc[1]:", s.iloc[1])
print("s['r']:", s['r'])

print("\n##############################")
print("Filters:s[s > 200]:\n", s[s > 200])

print("\n##############################")
print("Select Multiple indexes:s[0,2,4]:\n", s[[0,2,4]])

print("\n##############################")
print("Check DType:", s.dtype)

Explanation: We can apply positional indexing, filters, selection of particular indexes, and a data-type check on the Series. s.iloc[1] prints the value at position 1 (positions start at 0), while s['r'] selects by index label. s[s > 200] filters for the values that are greater than 200. s.iloc[[0,2,4]] selects multiple positions in a single step. s.dtype checks the Series data type.
Output:

B4.PNG


C). DataFrame in Pandas:

  • DataFrame is a two-dimensional array with labeled axes (rows and columns).
  • A DataFrame is like a structured table or an Excel sheet.
  • Let's explore some examples of DataFrames in Pandas.

C1). Dataframe from Python Dictionary

dict1 = {"ID":[101,102,103,104,105], "Name":['AAA','BBB','CCC','DDD','EEE']}
df = pd.DataFrame(dict1)
df

Explanation: Here ID and Name are the columns, and the index on the left side is auto-generated.
Output:

C1.PNG

C2). Dataframe from Numpy n-d array:

a = np.array(np.random.rand(10,5))
df = pd.DataFrame(a, index = np.arange(2000,2010), columns = ['India', 'USA', 'China', 'Japan', 'Italy'])
df

Explanation: a supplies the tabular data, np.arange() supplies the index labels, and columns supplies the column headers.
Output:

C2.PNG

C3). DataFrame from List:

l1 = [2,3,4,5,6,7]
df = pd.DataFrame(l1, index = ['a','b','c','d','e','f'], columns = ['ID_NUM'])
df

Explanation: l1 supplies the data; 'a','b','c','d','e','f' are the index labels and ID_NUM is the column header.
Output: C3.PNG

C4). DataFrame from CSV file:

df = pd.read_csv("Pandas_Blog.csv")
df

Explanation: The read_csv function reads CSV files from the local machine. The entire data in the CSV file becomes accessible through the Pandas dataframe df. You can download this sample CSV file from here.
Output: C4.PNG
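If you don't have the sample file handy, here is a minimal self-contained sketch (the file name sample.csv, its columns, and the name df_csv are made up for illustration):

# write a tiny stand-in CSV, then read it back
with open("sample.csv", "w") as f:
    f.write("ID,Name\n101,AAA\n102,BBB\n103,CCC\n")

df_csv = pd.read_csv("sample.csv")
df_csv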


D). DataFrame - General functions:

Most of the time, as a Data Analyst or Data Scientist, we will be dealing with DataFrames very frequently. Let's explore some of the basic functionality applied at the DataFrame level. Before that, let's define a sample dataframe to reuse for the rest of the tutorial, so the same ideas can be applied to any dataframe we come across.

D1). Sample DataFrame for rest of the tutorials.

a = np.array(np.random.rand(10,5))
df = pd.DataFrame(a, index = np.arange(2000,2010), columns = ['India', 'USA', 'China', 'Japan', 'Italy'])
df

Explanation: We have already seen this example above. Note that you may get different values in the table because we are using a random function here; the values will change from system to system.
Output:

D1.PNG

D2). info() function:

df.info()

Explanation: The info() function gives overall information about the dataframe: the number of columns, the number of rows (records), the column names with their data types, and whether each column contains null or non-null values.
With info() we get a complete high-level understanding of the dataframe.
Output:

D2.PNG

D3). describe() function:

df.describe()

Explanation: The describe() function shows statistical information about all numerical columns. It displays the count, mean, std, min, 25%, 50%, 75%, and max of each numerical column.
Output:

D3.PNG

D4). count() function:

df.count()

Explanation: The count() function shows how many non-null records (non-missing, proper values) exist in each column. Here every column has 10 records without any missing data.
Output: D4.PNG

D5). columns attribute:

df.columns

Explanation: The columns attribute prints the list of columns in the dataframe. Note: attributes are accessed without ().
Output: D5.PNG

D6). index attribute:

df.index

Explanation: The index attribute prints the row index of the dataframe, here the integer labels 2000 through 2009.
Output: D6.PNG

D7). shape attribute:

df.shape

Explanation: The shape attribute returns the number of rows and columns of the dataframe, like the dimensions of a matrix. Ex: (rows, columns) = (10, 5).
Output: D7.PNG

D8). dtypes attribute:

df.dtypes

Explanation: The dtypes attribute prints each column and its corresponding data type side by side.
Output: D8.PNG

D9). head() function:

df.head(3)

Explanation: head() function will print the first or top n records in the dataframe. Here n=3. First 3 rows with all the columns will be shown.
Output: D9.PNG

D10). tail() function :

df.tail(3)

Explanation: tail() function will print the last or bottom n records in the dataframe. Here n=3. Last 3 rows with all the columns will be shown.
Output: D10.PNG

D11). sample() function :

df.sample(3)

Explanation: sample() will print n random rows from the dataframe.
Output: D11.PNG


E). DataFrame Columns Manipulation - Select, Create, Rename, Drop:

In a Pandas dataframe we mainly have rows and columns. This section starts with column manipulations: how to select, create, rename, and drop particular columns in a dataframe. In general, while dealing with dataframe rows and columns we need to specify the axis value as 0 or 1:
axis = 0 for rows , axis = 1 for columns in the dataframe.
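As a quick sketch of the axis convention, here is sum() on a tiny throwaway dataframe (df_axis is just an illustrative name):

df_axis = pd.DataFrame(np.ones((3, 2)), columns=['x', 'y'])
print(df_axis.sum(axis=0))   # collapses the rows: one total per column
print(df_axis.sum(axis=1))   # collapses the columns: one total per row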

E1). Sample dataframe for this Columns Manipulation section:

a = np.array(np.random.rand(10,5))
df = pd.DataFrame(a, index = np.arange(2000,2010), columns = ['India', 'USA', 'China', 'Japan', 'Italy'])
df

Explanation: This is the same sample dataframe we built earlier. Note that you may get different values in the table because of the random function; the values will change from system to system.
Output: E1.PNG

E2). Selecting single column from dataframe:

df['India']

Explanation: From the dataframe df we select the single column India. We can cross-check this output with our main dataframe in the E1 section.
Output: E2.PNG

E3). Check the datatype of particular column:

type(df['India'])

Explanation: df['India'] selects the India column. The type() function reports its Python type: a single dataframe column is a Pandas Series.
Output: E3.PNG

E4). Selecting multiple columns from dataframe:

df[['India', 'USA']]

Explanation: From the dataframe df we select multiple columns, India and USA. We have to pass the required columns as a list, like ['India', 'USA']. We can cross-check this output with our main dataframe in the E1 section.
Output: E4.PNG

E5). Create new column to the dataframe:

df['IND_USA'] = df['India'] + df['USA']
df

Explanation: Here we create a new column called IND_USA; on the right-hand side we assign its values. We add the values of each row from India and USA and assign the result to the IND_USA column.
Ex: For the year 2000, India = 0.688980 and USA = 0.296874, so IND_USA = 0.985854.
Output: E5.PNG

E6). Rename the dataframe column:

df = df.rename(columns={'IND_USA':'IND_puls_USA'})
df

Explanation: The rename() function helps us rename a given column. Inside the function we pass the columns keyword with a key-value pair, where key = existing column name and value = new column name.
Ex: Here IND_USA is the existing column name and IND_puls_USA is the new column name.
Output: E6.PNG

E7). Dropping existing column:

df = df.drop('IND_puls_USA', axis=1)
df

Explanation: The drop() function removes (deletes) an existing column. We pass the column name IND_puls_USA, and axis=1 tells Pandas to remove a column, not a row.
Note: If we skip axis=1 and a row index happens to be named IND_puls_USA, that row would be deleted instead; axis=1 ensures the column named IND_puls_USA is the one removed. After deletion we cannot get the data back, so be cautious while using the drop() function. After dropping the IND_puls_USA column, our dataframe looks like below.
Output: E7.PNG
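As an aside, drop() also accepts inplace=True to modify the dataframe directly instead of returning a copy. A sketch for reference (in the flow above the column has already been dropped, so this line assumes it still exists):

df.drop('IND_puls_USA', axis=1, inplace=True)   # same effect, no reassignment needed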


F). DataFrame Rows Manipulation - Select, Create, Rename, Drop:

In this section we will learn about row manipulations: how to select, create, rename, and drop particular rows in a dataframe. As before, axis = 0 refers to rows and axis = 1 refers to columns.
Accessing rows from a dataframe can be done in two ways, using the loc[] and iloc[] indexers:

  • loc[] - used when we know the index label for a particular row.
  • iloc[] - used when we don't know the index label, but we know the integer position of the row.

F1). Sample dataframe for reference in this section:

a = np.array(np.random.rand(10,5))
df = pd.DataFrame(a, index = np.arange(2000,2010), columns = ['India', 'USA', 'China', 'Japan', 'Italy'])
df

Output: F1.PNG

df.index

Explanation: Here we can observe that loc[] works on the index labels (2000, 2001, ..., 2009), while iloc[] works on the integer positions (0, 1, ..., 9).
Output: F1_2.PNG

F2). Select single row using loc[] :

df.loc[2005]

or 

df.iloc[5]

Explanation: From the dataframe we select the row whose index label is 2005. This is equivalent to selecting position 5 with iloc[5]. Because we know the label 2005 exists, we can use loc[2005] directly; if we only know the position, we can use iloc[5] instead.
Output: F2.PNG

F3). Selecting multiple rows using loc[]:

df.loc[[2005, 2007]]

Explanation: For multiple-row selection we pass a list of the required index labels to loc[]. By default we get all the columns for the selected rows.
Output: F3.PNG

F4). Selecting few rows with few columns using loc[]:

df.loc[[2005, 2007], ['India', 'USA']]

Explanation: Here we provide two lists to loc[]: the first is the list of index labels we want, and the second is the list of column names we want to print. The two lists are separated by a comma.
Output: F4.PNG

F5). Select single row using iloc[] :

df.iloc[0]

Explanation: Here we select the row at position 0 (index label 2000) with all columns.
Output: F5.PNG

F6). Selecting multiple rows using iloc[]:

df.iloc[[0,2]]

Explanation: Here we pass a list of the positions we want to display. This prints the rows at positions 0 and 2 with all columns.
Output: F6.PNG

F7). Selecting few rows and few columns using iloc[]:

df.iloc[[0, 2], [0,1]]

Explanation: Here, using iloc[], we print the rows at positions 0 and 2 and the columns at positions 0 and 1 from the dataframe df.
Output: F7.PNG

F8). Create a new row using loc[]:

df.loc[2010] = np.random.rand(5)
df

Explanation: The index label 2010 does not exist yet; this creates a new row labeled 2010 with 5 random numbers.
Output: F8.PNG

F9). Create a new row using iloc[]:

df.iloc[10] = np.random.rand(5)
df

Explanation: Here we overwrite the 2010 row with 5 new random numbers, this time addressing it by position 10 via iloc[]. Note that the row must already exist (iloc[] cannot append new rows), so this works only because the previous example created the 2010 row. The values at position 10 will now differ from the previous example.
Output: F9.PNG

F10). Drop the row in dataframe:

df = df.drop(2010)
df

Explanation: Here we haven't specified axis=0; by default drop() removes rows. The 2010 record is no longer visible.
Output: F10.PNG

F11). Rename row index name in dataframe:

df = df.rename(index={2009:20009})
df

#renamed back to normal with below commented lines.
#df = df.rename(index={20009:2009})

Explanation: Here we rename the index label 2009 to 20009 with the rename() function. For simplicity, I've reverted this change (see the commented lines) for the upcoming sections.
Output: F11.PNG

F12). Conditional selection of dataframe:

df>0.3

Explanation: Conditional selection prints a boolean matrix of the dataframe for the given condition. Here df>0.3 marks every value that satisfies the condition as True and every other cell as False. This matrix can be used in many places to filter data with conditions.
Output: F12.PNG

F13). Conditional selection-2 of dataframe:

df[df>0.3]

Explanation: Here we pass the conditional selection to the dataframe, so only the True values are shown and the remaining values are printed as NaN, i.e. Not a Number. First df>0.3 is evaluated, then the resulting boolean matrix is applied to the dataframe df.
Output: F13.PNG

F14). Conditional selection-3 of dataframe:

df[df['India']>0.3]

Explanation: Here we say: wherever the India column has a value greater than 0.3, print those entire rows with all columns.
Ex: At index 2006 the India column has 0.027133, which does not satisfy the condition, so the 2006 row is not printed in the output.
Output: F14.PNG

F15). Conditional selection-4 of dataframe:

df[df['India']>0.3][['India', 'USA']]

Explanation: On top of the above example, here we select only a few columns by providing the list of columns we need at the end, as [['India', 'USA']].
Output: F15.PNG
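Multiple conditions can also be combined with & (and) and | (or); each condition must be wrapped in parentheses. A small sketch (not part of the original screenshots):

df[(df['India'] > 0.3) & (df['USA'] > 0.3)]   # rows where both conditions hold
df[(df['India'] > 0.3) | (df['USA'] > 0.3)]   # rows where either condition holds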


G). Missing data Manipulations:

Sometimes the data we are dealing with has missing values. Missing data is represented in Pandas as NaN, i.e. Not a Number.
Ex: Say we are collecting user information for a social media platform. Some users will not provide their address or other personal information that we treat as optional fields. These missing fields can be considered missing data.
NaN is a NumPy float object, while None is a Python NoneType object. NumPy objects perform better than generic Python objects, so Pandas mainly uses NaN over None to improve performance. NaN comes from the NumPy package (np.nan) and is widely used in NumPy arrays, Pandas Series, and DataFrames.
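A small sketch illustrating why NaN behaves differently from None:

print(type(np.nan))       # <class 'float'> - NaN is a float, so numeric arrays stay fast
print(np.nan == np.nan)   # False - NaN never compares equal to itself; use isnull() to detect it
print(type(None))         # <class 'NoneType'>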

  • isnull(): generates a boolean mask indicating missing values.
  • notnull(): generates a boolean mask indicating proper (non-missing) values.
  • fillna(): replaces NaN with a scalar value or text.
  • dropna(): drops a row or column if it has missing data. A quick sketch of all four follows below.
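Here is a minimal sketch of all four functions on a small Series, before we move on to the dataframe examples:

s = pd.Series([1.0, np.nan, 3.0])
print(s.isnull())      # True where the value is missing
print(s.notnull())     # True where the value is present
print(s.fillna(0))     # replace NaN with 0
print(s.dropna())      # drop the missing entry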

G1). Sample dataframe with missing data :

df = pd.DataFrame(np.random.rand(5,4), index =[2010, 2013, 2015, 2016, 2020], columns = 'A B C D'.split())
df = df.reindex(np.arange(2010,2021))
df['C'] = np.random.rand(11)
df.loc[2014] = np.nan
df.loc[2019] = np.nan
df['E'] = np.random.rand(11)
df['F'] = np.nan
df

Explanation: This is the sample dataframe for this section.
Step 1: Create a dataframe with 5 rows (2010, 2013, 2015, 2016, 2020) and 4 columns (A, B, C, D) filled with random numbers. Step 2: Reindex to the labels 2010 through 2020; the newly created rows hold missing data by default. Step 3: Overwrite column C with random numbers. Steps 4, 5: Replace the 2014 and 2019 rows entirely with NaN values. Steps 6, 7: Create new columns E and F and assign values (F is all NaN).
Output: G1.PNG

G2). isnull() function to get boolean matrix:

df.isnull()

Explanation: The isnull() function provides a boolean matrix of the given dataframe. It checks each cell and prints True where it finds NaN, and False otherwise.
Ex: The entire F column is NaN, so in the boolean matrix the F column contains only True values.
Output: G2.PNG

G3). notnull() function to get boolean matrix:

df.notnull()

Explanation: The notnull() function is the inverse of isnull(); it also provides a boolean matrix of the given dataframe. It checks each cell and prints False where it finds NaN, and True otherwise.
Ex: The entire F column is NaN, so in the boolean matrix the F column contains only False values.
Output: G3.PNG

G4). fillna(value, method=[ffill,bfill], axis=[0,1]) function-1:

df.fillna(0)

Explanation: The fillna() function replaces missing data with a given value. We also have control over how to fill via method='ffill' or method='bfill', where ffill means forward fill and bfill means backward fill.
The axis parameter selects vertical (axis=0) or horizontal (axis=1) filling for ffill/bfill. In this example we simply assign 0 to the missing data.
Output: G4.PNG

G5). fillna(value, method=[ffill,bfill], axis=[0,1]) function-2:

df.fillna("Missing")

Explanation: With fillna() we fill all missing values with the text "Missing".
Output: G5.PNG

G6). fillna(value, method=[ffill,bfill], axis=[0,1]) function-3: (forward filling vertically):

df.fillna(method = 'ffill')

#forward filling vertically

Explanation: Here we have not provided the value or axis parameters. By default axis=0, which means the filling happens vertically, and ffill means forward filling is applied.
Ex: In column A, indexes 2011 and 2012 are missing, so the 2010 data is copied forward vertically. Similarly, 2013 data is copied to 2014, and 2016 data is copied to 2017, 2018, and 2019 across all the columns.
In column F all rows are missing, so no data can be carried forward to fill the gaps: the first row (2010) in column F is itself missing.
Output: G6.PNG

G7). fillna(value, method=[ffill,bfill], axis=[0,1]) function-4: (forward filling horizontally):

df.fillna(method = 'ffill', axis=1)

# forward filling horizontally

Explanation: axis=1 means the forward fill takes place horizontally.
Ex: In the 2011 row, column D is filled with column C's data. Similarly, column F is filled with column E's data. Columns A and B in the 2011 row remain missing because there are no prior columns to forward-fill from.
Notice that columns E and F end up with the same data, because every row of column E is copied into column F.
Output: G7.PNG

G8). fillna(value, method=[ffill,bfill], axis=[0,1]) function-5: (backward filling vertically):

df.fillna(method = 'bfill')

# backward filling vertically

Explanation: Here we have not provided the value or axis parameters. By default axis=0, meaning vertical filling, and bfill means backward filling is applied.
Ex: In column A, indexes 2011 and 2012 are missing, so the 2013 data is copied backward vertically. Similarly, 2015 data is copied to 2014, and 2020 data is copied to 2017, 2018, and 2019 across all the columns.
In column F all rows are missing, so no data can be carried backward to fill the gaps: the last row (2020) in column F is itself missing.
Output: G8.PNG

G9). fillna(value, method=[ffill,bfill], axis=[0,1]) function-6: (backward filling horizontally):

df.fillna(method = 'bfill', axis=1)

# backward filling horizontally

Explanation: axis=1 means the backward fill takes place horizontally.
Ex: In the 2011 row, column D is filled with column E's data. Similarly, columns A and B are filled with column C's data. Column F stays missing because no column exists to the right of F.
Output: G9.PNG

G10). Replace missing data for all columns:

df.fillna(df.mean())

Explanation: Within fillna() we use the mean() function to find the mean (average) of each column and fill that column's missing values with it.
Ex: Take the mean of column A over its non-missing values; that same mean is copied into every missing cell of column A, and the same repeats for each column. Column F remains unchanged because all of its records are missing, so its mean is undefined.
Note: Instead of mean(), we can use any aggregation function such as sum(), min(), max(), prod(), or std(). This task is very important in real-world projects, as real data usually comes with missing values that need to be replaced with sensible ones.
Output: G10b.PNG G10.PNG
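For instance, filling with the column medians instead is a one-line change (a sketch; any of the aggregations listed above works the same way):

df.fillna(df.median())   # each column's median replaces that column's missing cells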

G11). Replace missing data only for few columns:

df.fillna(df.mean()['B':'C'])

Explanation: Here we fill the mean value only for columns B and C; df.mean()['B':'C'] slices the column means by label. The rest of the columns keep their missing data.
Output: G11.PNG

G12). dropna(axis=[0,1], how=[any,all], thresh) Ex-1:

df

Explanation: This shows the sample dataframe with missing data again. The dropna() function is useful when we want to remove missing data in terms of rows or columns, and it gives us a lot of control.
The axis parameter defaults to axis=0, which removes rows; axis=1 removes columns.
The how parameter defaults to how='any', which means a row or column is dropped if any one value is missing; how='all' means a row or column is dropped only if all of its values are missing.
The thresh parameter sets thresh=k, where k is a number <= the number of rows or columns: a row or column is kept in the output only if it has at least k non-missing values. Don't worry about the theory; we will walk through examples.
Output: G12.PNG

G13). dropna(axis=[0,1], how=[any,all], thresh) Ex-2:

df.dropna()
# by default this function looks like df.dropna(axis=0, how='any')
df.dropna(how='any')
df.dropna(axis=0)
df.dropna(axis=0, how='any')

Explanation: By default dropna() carries the parameters axis=0 and how='any'. axis=0 means row-wise, and how='any' means a row is dropped if any one of its values is missing. In our example every row has at least 1 missing value (recall that the entire F column is missing), so we get an empty dataframe as below.
Output: G13.PNG

G14). dropna(axis=[0,1], how=[any,all], thresh) Ex-3:

df.dropna(axis=1)
# by default this function looks like df.dropna(axis=1, how='any')
df.dropna(axis=1, how='any')

Explanation: With axis=1, dropna() still carries how='any' by default. axis=1 means column-wise, so a column is dropped if any one of its values is missing. In our example only column E has no missing values; every other column has at least 1 missing cell, so only E survives.
Output: G14.PNG

G15). dropna(axis=[0,1], how=[any,all], thresh) Ex-4:

df.dropna(how='all')
#by default this function looks like df.dropna(how='all', axis=0)
df.dropna(how='all', axis=0)

Explanation: how='all' means a row is eliminated from the output only if all of its cells are missing; axis=0 is added by default. In our example no row is missing across all columns (even 2014 and 2019 later received values in column E), so we get our original dataframe back as output.
Output: G15.PNG

G16). dropna(axis=[0,1], how=[any,all], thresh) Ex-5:

df.dropna(how='all', axis=1)

Explanation: axis=1 means we consider columns, and how='all' means a column is eliminated only if all of its cells are missing. In our example column F has all missing data, so it is eliminated.
Output: G16.PNG

G17). dropna(axis=[0,1], how=[any,all], thresh) Ex-6:

df.dropna(thresh=2, axis=0)

Explanation: axis=0 means row-wise, and thresh=2 means a row is kept in the output only if it has at least 2 non-missing (proper) values; rows with fewer are eliminated. The 2014 and 2019 rows have only 1 non-missing value each, so they are eliminated.
Output: G17.PNG

G18). dropna(axis=[0,1], how=[any,all], thresh) Ex-7:

df.dropna(thresh=3, axis=0)

Explanation: axis=0 means row-wise, and thresh=3 means a row is kept only if it has at least 3 non-missing (proper) values. The 2011, 2012, 2014, 2017, 2018, and 2019 rows have fewer than 3 non-missing values, so they are eliminated.
Output: G18.PNG

G19). dropna(axis=[0,1], how=[any,all], thresh) Ex-8:

df.dropna(thresh=4, axis=1)

Explanation: axis=1 means column-wise, and thresh=4 means a column is kept only if it has at least 4 non-missing (proper) values. In column F every cell is missing, so it is eliminated.
Output: G19.PNG

G20). dropna(axis=[0,1], how=[any,all], thresh) Ex-9:

df.dropna(thresh=6, axis=1)

Explanation: axis=1 means column-wise, and thresh=6 means a column is kept only if it has at least 6 non-missing (proper) values. Columns A, B, D, and F have fewer than 6 non-missing cells, so they are eliminated.
Output: G20.PNG


H). Group by in Dataframes:

The group-by operation groups the dataframe's rows by the values of a column; on top of those groups we can apply filters or aggregation functions to get more insight into the data.

H1). Sample Data:

dict1 = {"Company":['Google','Microsoft','FB','Google','Microsoft','FB'], 
         "Employe":['AA','BB','CC','DD','EE','FF'], 
         "Sales":[100,200,140,160,150,180]}
df = pd.DataFrame(dict1)
df

Explanation: A Python dictionary with 3 keys (Company, Employe, Sales), each holding a list of 6 values, is used to create the Pandas dataframe.
Output: H1.PNG

H2). Create groups from Dataframe:

grp_company = df.groupby("Company")
grp_company.groups

Explanation: From the dataframe df we group the data by the Company column with df.groupby("Company"). Then grp_company.groups prints each group along with the dataframe indexes that belong to it.
Output: H2.PNG
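To pull out the rows of a single group, get_group() can be used (a small extra not shown in the screenshots):

print(grp_company.get_group('Google'))   # only the rows where Company == 'Google'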

H3). Checking statistical information:

grp_company.describe()

Explanation: grp_company contains the details of each group, and with the describe() function we get the statistical information for each group.
Output: H3.PNG

H4). Transpose of statistical matrix:

grp_company.describe().transpose()

Explanation: Using the transpose() function we flip the matrix so that rows become columns and vice versa.
Output: H4.PNG

H5). Aggregation functions:

grp_company.sum()

Explanation: On top of each group we can apply aggregation functions like min(), max(), sum(), and mean() to get more insight into the data.
Output: H5.PNG
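Several aggregations can also be computed in one shot with agg(), reusing the grp_company object from H2 (a sketch):

print(grp_company['Sales'].agg(['min', 'max', 'sum', 'mean']))   # one row per company, one column per aggregate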


I). Combine multiple dataframes:

Pandas offers 3 ways to combine multiple dataframes, so that we can view the data from several dataframes as a single dataframe while controlling how they are combined. Let's explore them one by one.

  • merge() - If we are aware of SQL joins and want to perform SQL-like joins, this function helps us. Most of the time we will use merge() while working with real-time data, as it offers the most flexibility to control the combining operation.
  • join() - This function acts like the left join of merge(). We don't specify on what basis the join should take place; by default it joins dataframes based on their indexes.
  • concat() - This function is a little different from merge() and join(): it simply stacks 2 or more dataframes, either row-wise (vertically) or column-wise (horizontally). A quick runnable sketch of all three follows below.
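Here is a compact, self-contained sketch contrasting the three calls on two tiny dataframes (the names left, right, and key are made up for illustration):

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'c'], 'y': [3, 4]})

print(pd.merge(left, right, how='inner', on='key'))          # SQL-style inner join on the key column
print(left.set_index('key').join(right.set_index('key')))    # left join on the index
print(pd.concat([left, right]))                              # row-wise stacking; missing columns become NaN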

I1). Sample Dataframes:

#df1
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3','A4', 'A5'],
                        'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B999'],
                        'C': ['C0', 'C1', 'C2', 'C3', 'C4', 'C5'],
                        'D': ['D0', 'D1', 'D2', 'D3', 'D4', 'D5']},
                        index=[0, 1, 2, 3, 4, 5])

#df2
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7'],
                        'E': ['E4', 'E5', 'E6', 'E7']},
                         index=[4, 5, 6, 7]) 

#df3
df3 = pd.DataFrame({'A': ['A7', 'A8', 'A9', 'A10', 'A11'],
                        'B': ['B7', 'B8', 'B9', 'B10', 'B11'],
                        'C': ['C7', 'C8', 'C9', 'C10', 'C11'],
                        'D': ['D7', 'D8', 'D9', 'D10', 'D11']},
                        index=[7, 8, 9, 10, 11])

Explanation: We have created 3 dataframes (df1, df2, df3) with explicit index values. These dataframes will be used in the merge() examples below.
Output: I1.PNG

The types of joins we are going to discuss with the merge() function are as follows: I1b.png

I2). merge(df1, df2, how='inner') Ex-1 (you can skip this example):

df1_inner_df2 = pd.merge(df1, df2, how='inner')
df1_inner_df2

# df1_merge_df2 = pd.merge(df1, df2)

Explanation: In real-world problems we mainly use merge() over join() and concat(), because merge() can perform SQL-like join operations.
In merge(), how='inner' is the default, and if the on= parameter is omitted the join keys default to the columns common to both dataframes. An inner join prints only the records that match in both dataframes on the join keys. If we mention the join condition explicitly with on=, the inner join uses those columns; otherwise it uses all the shared columns.
Ex: In this example we have not provided the on= parameter, so the inner join is performed on the shared columns A, B, C, D of df1 and df2. Given the data, only the rows at indexes 4 and 5 can possibly match; let's compare them.
df1 index 4 = [A4, B4, C4, D4] and df2 index 4 = [A4, B4, C4, D4, E4]: all the shared columns match, so this row appears in the output.
df1 index 5 = [A5, B999, C5, D5] and df2 index 5 = [A5, B5, C5, D5, E5]: column B differs (B999 != B5), so this row is not printed.
Output: I2.PNG

I3). merge(df1, df2, how='inner') Ex-2:

df1_inner_df2 = pd.merge(df1, df2, how='inner', on=['A', 'C', 'D'])
df1_inner_df2

# SQL Query for above python code:
SELECT df1.*, df2.* 
FROM df1 
INNER JOIN df2 
ON (df1.A=df2.A AND df1.C=df2.C AND df1.D=df2.D)

Explanation: Here we again perform an inner join with how='inner', but additionally we pass on=['A', 'C', 'D'] so the join is based only on the A, C, D columns. Any rows in df1 and df2 with the same A, C, D values are printed.
Ex: For indexes 4 and 5 the A, C, D columns match in both dataframes, so both rows are printed. Looking at the column names we see B_x and B_y: B_x comes from df1 and B_y from df2. Pandas automatically appends the _x and _y suffixes to columns that exist in both dataframes but are not part of the join keys, to keep the output unambiguous.
Output: I3.PNG

I4). merge(df1, df2, how='left', on=['A', 'D']):

df1_left_df2 = pd.merge(df1, df2, how='left', on=['A', 'D'])
df1_left_df2

# SQL Query for above python code:
SELECT df1.*, df2.* 
FROM df1 
LEFT JOIN df2 
ON (df1.A=df2.A AND df1.D=df2.D)

Explanation: We perform a left join between df1 and df2 with the A and D columns as the join condition. A left join keeps all rows from the left table (df1); matching rows get proper values from the right table (df2), while the right-table columns of non-matching rows are filled with NaN, i.e. Not a Number. To avoid ambiguity between df1 and df2 column names, _x is appended to df1 column names and _y to df2 column names.
Note: Columns A and D keep their names because they are the join keys, and column E also keeps its name because it exists only in df2 and is not ambiguous. The output holds all 6 rows from the left df, of which 2 have matches in the right df. Total rows = 6.
Output: I4.PNG

I5). merge(df1, df2, how='right', on=['A', 'D']):

df1_right_df2 = pd.merge(df1, df2, how='right', on=['A', 'D'])
df1_right_df2

# SQL Query for above python code:
SELECT df1.*, df2.* 
FROM df1 
RIGHT JOIN df2 
ON (df1.A=df2.A AND df1.D=df2.D)

Explanation: We perform a right join between df1 and df2 with the A and D columns as the join condition. A right join keeps all rows from the right table (df2); matching rows get proper values from the left table (df1), while the left-table columns of non-matching rows are filled with NaN, i.e. Not a Number. As before, _x is appended to ambiguous df1 column names and _y to df2 column names. The output holds all 4 rows from the right df, of which 2 have matches in the left df. Total rows = 4.
Output: I5.PNG

I6). merge(df1, df2, how='outer', on=['A', 'D']):

df1_outer_df2 = pd.merge(df1, df2, how='outer', on=['A', 'D'])
df1_outer_df2

# SQL Query for above python code:
SELECT df1.*, df2.* 
FROM df1 
FULL OUTER JOIN df2 
ON (df1.A=df2.A AND df1.D=df2.D)

Explanation: We perform an outer join between df1 and df2 with the A and D columns as the join condition. An outer join displays all rows from the left table (df1) and all rows from the right table (df2); the non-matching cells from either side are displayed as NaN, i.e. Not a Number. The output has (leftDF[6] - matching[2]) + matching[2] + (rightDF[4] - matching[2]) = 4 + 2 + 2 = 8 total rows.
Output: I6.PNG

I7). join() function:

# df1 declaration
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                      index=['K0', 'K1', 'K2']) 

# df2 declaration
df2 = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                    'D': ['D0', 'D2', 'D3']},
                      index=['K0', 'K2', 'K3'])

#Join statement
df1_join_df2 = df1.join(df2)
df1_join_df2

Explanation: By default the join() function acts as a SQL-style left join. We don't specify what to join on; it joins on the indexes. (In real-world applications merge() is used more frequently.) In the output, all rows from the left table (df1) appear: matching rows from the right table (df2) carry proper values, while the non-matching rows from the right table are filled with NaN.
Output: I7.PNG

I8). concat([df1, df2, df3]):

Let's reuse the earlier dataframes in this example, declared again below.

#df1
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3','A4', 'A5'],
                        'B': ['B0', 'B1', 'B2', 'B3', 'B4', 'B999'],
                        'C': ['C0', 'C1', 'C2', 'C3', 'C4', 'C5'],
                        'D': ['D0', 'D1', 'D2', 'D3', 'D4', 'D5']},
                        index=[0, 1, 2, 3, 4, 5])

#df2
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                        'B': ['B4', 'B5', 'B6', 'B7'],
                        'C': ['C4', 'C5', 'C6', 'C7'],
                        'D': ['D4', 'D5', 'D6', 'D7'],
                        'E': ['E4', 'E5', 'E6', 'E7']},
                         index=[4, 5, 6, 7]) 

#df3
df3 = pd.DataFrame({'A': ['A7', 'A8', 'A9', 'A10', 'A11'],
                        'B': ['B7', 'B8', 'B9', 'B10', 'B11'],
                        'C': ['C7', 'C8', 'C9', 'C10', 'C11'],
                        'D': ['D7', 'D8', 'D9', 'D10', 'D11']},
                        index=[7, 8, 9, 10, 11])
df1_df2_df3_ver_concat = pd.concat([df1, df2, df3])
df1_df2_df3_ver_concat

# df1_df2_df3_ver_concat = pd.concat([df1, df2, df3], axis=0)

Explanation: The concat() function takes an axis parameter (0 or 1). With axis=0 the given dataframes (df1, df2, df3) are stacked vertically as below; axis=0 is the default. In the output, cells for columns that are missing from a given dataframe (like E in df1 and df3) are filled with NaN while combining.
Output: I8.PNG

I9). concat([df1, df2, df3], axis=1):

df1_df2_df3_hor_concat = pd.concat([df1, df2, df3], axis=1)
df1_df2_df3_hor_concat

Explanation: With axis=1 the given dataframes (df1, df2, df3) are combined horizontally (side by side) as below. In the output, cells for rows that are missing from a given dataframe are filled with NaN while combining.
Output: I9.PNG

Conclusion:

I hope you have learned Pandas concepts with simple examples.
Happy Learning...!!
