Import CSV Data using Pandas.read_csv()

Details: Last Updated: 14 February 2021

In this tutorial, you will learn:

Import CSV
Groupby

Import CSV

During the TensorFlow tutorial, you will use the adult dataset. It is often used with classification task. It is available in this URL https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data

The data is stored in a CSV format. This dataset includes eights categorical variables:

This dataset includes eights categorical variables:

workclass
education
marital
occupation
relationship
race
sex
native_country

moreover, six continuous variables:

age
fnlwgt
education_num
capital_gain
capital_loss

hours_week

To import a CSV dataset, you can use the object pd.read_csv(). The basic argument inside is:

Syntax:

pandas.read_csv(filepath_or_buffer,sep=', ',`names=None`,`index_col=None`,`skipinitialspace=False`)

filepath_or_buffer: Path or URL with the data
sep=', ': Define the delimiter to use
`names=None`: Name the columns. If the dataset has ten columns, you need to pass ten names
`index_col=None`: If yes, the first column is used as a row index
`skipinitialspace=False`: Skip spaces after delimiter.

For more information about readcsv(), please check the official documentation

https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html.

Consider the following Example

## Import csv
import pandas as pd
## Define path data
COLUMNS = ['age','workclass', 'fnlwgt', 'education', 'education_num', 'marital',
           'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
           'hours_week', 'native_country', 'label']
PATH = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df_train = pd.read_csv(PATH,
                       skipinitialspace=True,
                       names = COLUMNS,
                       index_col=False)
df_train.shape

Output:

(32561, 15)

Groupby

An easy way to see the data is to use the groupby method. This method can help you to summarize the data by group. Below is a list of methods available with groupby:

count: count
min: min
max: max
mean: mean
median: median
standard deviation: sdt
etc

Inside groupby(), you can use the column you want to apply the method.

Let's have a look at a single grouping with the adult dataset. You will get the mean of all the continuous variables by type of revenue, i.e., above 50k or below 50k

df_train.groupby(['label']).mean()

	age	fnlwgt	education_num	capital_gain	capital_loss	hours_week
label
<=50K	36.783738	190340.86517	9.595065	148.752468	53.142921	38.840210
>50K	44.249841	188005.00000	11.611657	4006.142456	195.001530	45.473026

You can get the minimum of age by type of household

df_train.groupby(['label'])['age'].min()

label
<=50K    17
>50K     19
Name: age, dtype: int64

You can also group by multiple columns. For instance, you can get the maximum capital gain according to the household type and marital status.

df_train.groupby(['label', 'marital'])['capital_gain'].max()				
label  marital              
<=50K  Divorced                 34095
       Married-AF-spouse         2653
       Married-civ-spouse       41310
       Married-spouse-absent     6849
       Never-married            34095
       Separated                 7443
       Widowed                   6849
>50K   Divorced                 99999
       Married-AF-spouse         7298
       Married-civ-spouse       99999
       Married-spouse-absent    99999
       Never-married            99999
       Separated                99999
       Widowed                  99999
Name: capital_gain, dtype: int64

You can create a plot following groupby. One way to do it is to use a plot after the grouping.

To create a more excellent plot, you will use unstack() after mean() so that you have the same multilevel index, or you join the values by revenue lower than 50k and above 50k. In this case, the plot will have two groups instead of 14 (2*7).

If you use Jupyter Notebook, make sure to add %matplotlib inline, otherwise, no plot will be displayed

%matplotlib inline
df_plot = df_train.groupby(['label', 'marital'])['capital_gain'].mean().unstack()
df_plot

Data Warehousing

What is OLTP? Definition, Architecture, Example

What is OLTP? OLTP is an operational system that supports transaction-oriented applications in a...

Data Warehousing

Difference between Data Mining and Data Warehouse

What is Data warehouse? A data warehouse is a technique for collecting and managing data from...

Tableau

Tableau vs Power BI: Key Differences

What is Tableau? Tableau is a powerful and fastest-growing data visualization tool used in the...

Data Warehousing

18 BEST Reporting Tools in 2021

Reporting tools are software that provides reporting, decision making, and business intelligence...

Data Warehousing

Star and Snowflake Schema in Data Warehouse with Examples

What is Multidimensional schema? Multidimensional Schema is especially designed to model data...

Data Warehousing

Data Lake vs Data Warehouse: What's the Difference?

What is Data Warehouse? A data warehouse is a blend of technologies and components which allows the...