Data Warehousing
What is OLTP? Definition, Architecture, Example
What is OLTP? OLTP is an operational system that supports transaction-oriented applications in a...
In this tutorial, you will learn:
During the TensorFlow tutorial, you will use the adult dataset. It is often used with classification task. It is available in this URL https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
The data is stored in a CSV format. This dataset includes eights categorical variables:
This dataset includes eights categorical variables:
moreover, six continuous variables:
hours_week
To import a CSV dataset, you can use the object pd.read_csv(). The basic argument inside is:
Syntax:
pandas.read_csv(filepath_or_buffer,sep=', ',`names=None`,`index_col=None`,`skipinitialspace=False`)
For more information about readcsv(), please check the official documentation
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html.
Consider the following Example
## Import csv
import pandas as pd
## Define path data
COLUMNS = ['age','workclass', 'fnlwgt', 'education', 'education_num', 'marital',
'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
'hours_week', 'native_country', 'label']
PATH = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df_train = pd.read_csv(PATH,
skipinitialspace=True,
names = COLUMNS,
index_col=False)
df_train.shape
Output:
(32561, 15)
An easy way to see the data is to use the groupby method. This method can help you to summarize the data by group. Below is a list of methods available with groupby:
Inside groupby(), you can use the column you want to apply the method.
Let's have a look at a single grouping with the adult dataset. You will get the mean of all the continuous variables by type of revenue, i.e., above 50k or below 50k
df_train.groupby(['label']).mean()
| age | fnlwgt | education_num | capital_gain | capital_loss | hours_week | |
| label | ||||||
| <=50K | 36.783738 | 190340.86517 | 9.595065 | 148.752468 | 53.142921 | 38.840210 |
| >50K | 44.249841 | 188005.00000 | 11.611657 | 4006.142456 | 195.001530 | 45.473026 |
You can get the minimum of age by type of household
df_train.groupby(['label'])['age'].min()
label <=50K 17 >50K 19 Name: age, dtype: int64
You can also group by multiple columns. For instance, you can get the maximum capital gain according to the household type and marital status.
df_train.groupby(['label', 'marital'])['capital_gain'].max()
label marital
<=50K Divorced 34095
Married-AF-spouse 2653
Married-civ-spouse 41310
Married-spouse-absent 6849
Never-married 34095
Separated 7443
Widowed 6849
>50K Divorced 99999
Married-AF-spouse 7298
Married-civ-spouse 99999
Married-spouse-absent 99999
Never-married 99999
Separated 99999
Widowed 99999
Name: capital_gain, dtype: int64
You can create a plot following groupby. One way to do it is to use a plot after the grouping.
To create a more excellent plot, you will use unstack() after mean() so that you have the same multilevel index, or you join the values by revenue lower than 50k and above 50k. In this case, the plot will have two groups instead of 14 (2*7).
If you use Jupyter Notebook, make sure to add %matplotlib inline, otherwise, no plot will be displayed
%matplotlib inline df_plot = df_train.groupby(['label', 'marital'])['capital_gain'].mean().unstack() df_plot
What is OLTP? OLTP is an operational system that supports transaction-oriented applications in a...
What is Data warehouse? A data warehouse is a technique for collecting and managing data from...
What is Tableau? Tableau is a powerful and fastest-growing data visualization tool used in the...
Reporting tools are software that provides reporting, decision making, and business intelligence...
What is Multidimensional schema? Multidimensional Schema is especially designed to model data...
What is Data Warehouse? A data warehouse is a blend of technologies and components which allows the...