Users Online
· Members Online: 0
· Total Members: 188
· Newest Member: meenachowdary055
Forum Threads
Latest Articles
Articles Hierarchy
Pandas Tutorial: Dataframe, Date Range, Slice, Groupby, Import(read_csv)
Pandas Tutorial: Dataframe, Date Range, Slice, Groupby, Import(read_csv)
What is Pandas?
Pandas is an opensource library that allows to you perform data manipulation in Python. Pandas library is built on top of Numpy, meaning Pandas needs Numpy to operate. Pandas provide an easy way to create, manipulate and wrangle the data. Pandas is also an elegant solution for time series data.
In this tutorial, you will learn:
- What is Pandas?
- Why use Pandas?
- How to install Pandas?
- What is a data frame?
- What is a Series?
- Create Data frame
- Range Data
- Inspecting data
- Slice data
- Drop a column
- Concatenation
- Import CSV
- Groupby
Why use Pandas?
Data scientists use Pandas for its following advantages:
- Easily handles missing data
- It uses Series for one-dimensional data structure and DataFrame for multi-dimensional data structure
- It provides an efficient way to slice the data
- It provides a flexible way to merge, concatenate or reshape the data
- It includes a powerful time series tool to work with
In a nutshell, Pandas is a useful library in data analysis. It can be used to perform data manipulation and analysis. Pandas provide powerful and easy-to-use data structures, as well as the means to quickly perform operations on these structures.
How to install Pandas?
To install Pandas library, please refer our tutorial How to install TensorFlow. Pandas is installed by default. In remote case, pandas not installed-
You can install Pandas using:
- Anaconda: conda install -c anaconda pandas
- In Jupyter Notebook :
import sys !conda install --yes --prefix {sys.prefix} pandas
What is a data frame?
A data frame is a two-dimensional array, with labeled axes (rows and columns). A data frame is a standard way to store data.
Data frame is well-known by statistician and other data practitioners. A data frame is a tabular data, with rows to store the information and columns to name the information. For instance, the price can be the name of a column and 2,3,4 the price values.
Below a picture of a Pandas data frame:
What is a Series?
A series is a one-dimensional data structure. It can have any data structure like integer, float, and string. It is useful when you want to perform computation or return a one-dimensional array. A series, by definition, cannot have multiple columns. For the latter case, please use the data frame structure.
Series has one parameters:
- Data: can be a list, dictionary or scalar value
pd.Series([1., 2., 3.])
0 1.0 1 2.0 2 3.0 dtype: float64
You can add the index with index. It helps to name the rows. The length should be equal to the size of the column
pd.Series([1., 2., 3.], index=['a', 'b', 'c'])
Below, you create a Pandas series with a missing value for the third rows. Note, missing values in Python are noted "NaN." You can use numpy to create missing value: np.nan artificially
pd.Series([1,2,np.nan])
Output
0 1.0 1 2.0 2 NaN dtype: float64
Create Data frame
You can convert a numpy array to a pandas data frame with pd.Data frame(). The opposite is also possible. To convert a pandas Data Frame to an array, you can use np.array()
## Numpy to pandas import numpy as np h = [[1,2],[3,4]] df_h = pd.DataFrame(h) print('Data Frame:', df_h) ## Pandas to numpy df_h_n = np.array(df_h) print('Numpy array:', df_h_n) Data Frame: 0 1 0 1 2 1 3 4 Numpy array: [[1 2] [3 4]]
You can also use a dictionary to create a Pandas dataframe.
dic = {'Name': ["John", "Smith"], 'Age': [30, 40]} pd.DataFrame(data=dic)
Age | Name | |
0 | 30 | John |
1 | 40 | Smith |
Range Data
Pandas have a convenient API to create a range of date
pd.data_range(date,period,frequency):
- The first parameter is the starting date
- The second parameter is the number of periods (optional if the end date is specified)
- The last parameter is the frequency: day: 'D,' month: 'M' and year: 'Y.'
## Create date # Days dates_d = pd.date_range('20300101', periods=6, freq='D') print('Day:', dates_d)
Output
Day: DatetimeIndex(['2030-01-01', '2030-01-02', '2030-01-03', '2030-01-04', '2030-01-05', '2030-01-06'], dtype='datetime64[ns]', freq='D')
# Months dates_m = pd.date_range('20300101', periods=6, freq='M') print('Month:', dates_m)
Output
Month: DatetimeIndex(['2030-01-31', '2030-02-28', '2030-03-31', '2030-04-30','2030-05-31', '2030-06-30'], dtype='datetime64[ns]', freq='M')
Inspecting data
You can check the head or tail of the dataset with head(), or tail() preceded by the name of the panda's data frame
Step 1) Create a random sequence with numpy. The sequence has 4 columns and 6 rows
random = np.random.randn(6,4)
Step 2) Then you create a data frame using pandas.
Use dates_m as an index for the data frame. It means each row will be given a "name" or an index, corresponding to a date.
Finally, you give a name to the 4 columns with the argument columns
# Create data with date df = pd.DataFrame(random, index=dates_m, columns=list('ABCD'))
Step 3) Using head function
df.head(3)
A | B | C | D | |
2030-01-31 | 1.139433 | 1.318510 | -0.181334 | 1.615822 |
2030-02-28 | -0.081995 | -0.063582 | 0.857751 | -0.527374 |
2030-03-31 | -0.519179 | 0.080984 | -1.454334 | 1.314947 |
Step 4) Using tail function
df.tail(3)
A | B | C | D | |
2030-04-30 | -0.685448 | -0.011736 | 0.622172 | 0.104993 |
2030-05-31 | -0.935888 | -0.731787 | -0.558729 | 0.768774 |
2030-06-30 | 1.096981 | 0.949180 | -0.196901 | -0.471556 |
Step 5) An excellent practice to get a clue about the data is to use describe(). It provides the counts, mean, std, min, max and percentile of the dataset.
df.describe()
A | B | C | D | |
count | 6.000000 | 6.000000 | 6.000000 | 6.000000 |
mean | 0.002317 | 0.256928 | -0.151896 | 0.467601 |
std | 0.908145 | 0.746939 | 0.834664 | 0.908910 |
min | -0.935888 | -0.731787 | -1.454334 | -0.527374 |
25% | -0.643880 | -0.050621 | -0.468272 | -0.327419 |
50% | -0.300587 | 0.034624 | -0.189118 | 0.436883 |
75% | 0.802237 | 0.732131 | 0.421296 | 1.178404 |
max | 1.139433 | 1.318510 | 0.857751 | 1.615822 |
Slice data
The last point of this tutorial is about how to slice a pandas data frame.
You can use the column name to extract data in a particular column.
## Slice ### Using name df['A'] 2030-01-31 -0.168655 2030-02-28 0.689585 2030-03-31 0.767534 2030-04-30 0.557299 2030-05-31 -1.547836 2030-06-30 0.511551 Freq: M, Name: A, dtype: float64
To select multiple columns, you need to use two times the bracket, [[..,..]]
The first pair of bracket means you want to select columns, the second pairs of bracket tells what columns you want to return.
df[['A', 'B']].
A | B | |
2030-01-31 | -0.168655 | 0.587590 |
2030-02-28 | 0.689585 | 0.998266 |
2030-03-31 | 0.767534 | -0.940617 |
2030-04-30 | 0.557299 | 0.507350 |
2030-05-31 | -1.547836 | 1.276558 |
2030-06-30 | 0.511551 | 1.572085 |
You can slice the rows with :
The code below returns the first three rows
### using a slice for row df[0:3]
A | B | C | D | |
2030-01-31 | -0.168655 | 0.587590 | 0.572301 | -0.031827 |
2030-02-28 | 0.689585 | 0.998266 | 1.164690 | 0.475975 |
2030-03-31 | 0.767534 | -0.940617 | 0.227255 | -0.341532 |
The loc function is used to select columns by names. As usual, the values before the coma stand for the rows and after refer to the column. You need to use the brackets to select more than one column.
## Multi col df.loc[:,['A','B']]
A | B | |
2030-01-31 | -0.168655 | 0.587590 |
2030-02-28 | 0.689585 | 0.998266 |
2030-03-31 | 0.767534 | -0.940617 |
2030-04-30 | 0.557299 | 0.507350 |
2030-05-31 | -1.547836 | 1.276558 |
2030-06-30 | 0.511551 | 1.572085 |
There is another method to select multiple rows and columns in Pandas. You can use iloc[]. This method uses the index instead of the columns name. The code below returns the same data frame as above
df.iloc[:, :2]
A | B | |
2030-01-31 | -0.168655 | 0.587590 |
2030-02-28 | 0.689585 | 0.998266 |
2030-03-31 | 0.767534 | -0.940617 |
2030-04-30 | 0.557299 | 0.507350 |
2030-05-31 | -1.547836 | 1.276558 |
2030-06-30 | 0.511551 | 1.572085 |
Drop a column
You can drop columns using pd.drop()
df.drop(columns=['A', 'C'])
B | D | |
2030-01-31 | 0.587590 | -0.031827 |
2030-02-28 | 0.998266 | 0.475975 |
2030-03-31 | -0.940617 | -0.341532 |
2030-04-30 | 0.507350 | -0.296035 |
2030-05-31 | 1.276558 | 0.523017 |
2030-06-30 | 1.572085 | -0.594772 |
Concatenation
You can concatenate two DataFrame in Pandas. You can use pd.concat()
First of all, you need to create two DataFrames. So far so good, you are already familiar with dataframe creation
import numpy as np df1 = pd.DataFrame({'name': ['John', 'Smith','Paul'], 'Age': ['25', '30', '50']}, index=[0, 1, 2]) df2 = pd.DataFrame({'name': ['Adam', 'Smith' ], 'Age': ['26', '11']}, index=[3, 4])
Finally, you concatenate the two DataFrame
df_concat = pd.concat([df1,df2]) df_concat
Age | name | |
0 | 25 | John |
1 | 30 | Smith |
2 | 50 | Paul |
3 | 26 | Adam |
4 | 11 | Smith |
Drop_duplicates
If a dataset can contain duplicates information use, `drop_duplicates` is an easy to exclude duplicate rows. You can see that `df_concat` has a duplicate observation, `Smith` appears twice in the column `name.`
df_concat.drop_duplicates('name')
Age | name | |
0 | 25 | John |
1 | 30 | Smith |
2 | 50 | Paul |
3 | 26 | Adam |
Sort values
You can sort value with sort_values
df_concat.sort_values('Age')
Age | name | |
4 | 11 | Smith |
0 | 25 | John |
3 | 26 | Adam |
1 | 30 | Smith |
2 | 50 | Paul |
Rename: change of index
You can use rename to rename a column in Pandas. The first value is the current column name and the second value is the new column name.
df_concat.rename(columns={"name": "Surname", "Age": "Age_ppl"})
Age_ppl | Surname | |
0 | 25 | John |
1 | 30 | Smith |
2 | 50 | Paul |
3 | 26 | Adam |
4 | 11 | Smith |
Import CSV
During the TensorFlow tutorial, you will use the adult dataset. It is often used with classification task. It is available in this URL https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data The data is stored in a CSV format. This dataset includes eights categorical variables:
This dataset includes eights categorical variables:
- workclass
- education
- marital
- occupation
- relationship
- race
- sex
- native_country
moreover, six continuous variables:
- age
- fnlwgt
- education_num
- capital_gain
- capital_loss
hours_week
To import a CSV dataset, you can use the object pd.read_csv(). The basic argument inside is:
Syntax:
pandas.read_csv(filepath_or_buffer,sep=', ',`names=None`,`index_col=None`,`skipinitialspace=False`)
- filepath_or_buffer: Path or URL with the data
- sep=', ': Define the delimiter to use
- `names=None`: Name the columns. If the dataset has ten columns, you need to pass ten names
- `index_col=None`: If yes, the first column is used as a row index
- `skipinitialspace=False`: Skip spaces after delimiter.
For more information about readcsv(), please check the official documentation
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html.
Consider the following Example
## Import csv import pandas as pd ## Define path data COLUMNS = ['age','workclass', 'fnlwgt', 'education', 'education_num', 'marital', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_week', 'native_country', 'label'] PATH = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data" df_train = pd.read_csv(PATH, skipinitialspace=True, names = COLUMNS, index_col=False) df_train.shape
Output:
(32561, 15)
Groupby
An easy way to see the data is to use the groupby method. This method can help you to summarize the data by group. Below is a list of methods available with groupby:
- count: count
- min: min
- max: max
- mean: mean
- median: median
- standard deviation: sdt
- etc
Inside groupby(), you can use the column you want to apply the method.
Let's have a look at a single grouping with the adult dataset. You will get the mean of all the continuous variables by type of revenue, i.e., above 50k or below 50k
df_train.groupby(['label']).mean()
age | fnlwgt | education_num | capital_gain | capital_loss | hours_week | |
label | ||||||
<=50K | 36.783738 | 190340.86517 | 9.595065 | 148.752468 | 53.142921 | 38.840210 |
>50K | 44.249841 | 188005.00000 | 11.611657 | 4006.142456 | 195.001530 | 45.473026 |
You can get the minimum of age by type of household
df_train.groupby(['label'])['age'].min()
label <=50K 17 >50K 19 Name: age, dtype: int64
You can also group by multiple columns. For instance, you can get the maximum capital gain according to the household type and marital status.
df_train.groupby(['label', 'marital'])['capital_gain'].max() label marital <=50K Divorced 34095 Married-AF-spouse 2653 Married-civ-spouse 41310 Married-spouse-absent 6849 Never-married 34095 Separated 7443 Widowed 6849 >50K Divorced 99999 Married-AF-spouse 7298 Married-civ-spouse 99999 Married-spouse-absent 99999 Never-married 99999 Separated 99999 Widowed 99999 Name: capital_gain, dtype: int64
You can create a plot following groupby. One way to do it is to use a plot after the grouping.
To create a more excellent plot, you will use unstack() after mean() so that you have the same multilevel index, or you join the values by revenue lower than 50k and above 50k. In this case, the plot will have two groups instead of 14 (2*7).
If you use Jupyter Notebook, make sure to add % matplotlib inline, otherwise, no plot will be displayed
% matplotlib inline df_plot = df_train.groupby(['label', 'marital'])['capital_gain'].mean().unstack() df_plot
Summary
Below is a summary of the most useful method for data science with Pandas
import data | read_csv |
create series | Series |
Create Dataframe | DataFrame |
Create date range | date_range |
return head | head |
return tail | tail |
Describe | describe |
slice using name | dataname['columnname'] |
Slice using rows | data_name[0:5] |