Graphs are the third part of the data analysis process. The first part is about data extraction, and the second part deals with cleaning and manipulating the data. Finally, the data scientist may need to communicate the results graphically.
The job of the data scientist can be reviewed in the following picture
This part of the tutorial focuses on how to make graphs/charts with R.
In this tutorial, you are going to use the ggplot2 package. This package is built upon the consistent underlying framework described in the book The Grammar of Graphics (Wilkinson, 2005). ggplot2 is very flexible and incorporates many themes and plot specifications at a high level of abstraction. However, with ggplot2 alone you can't plot 3-dimensional graphics or create interactive graphics.
In ggplot2, a graph is composed of the following arguments:
You will learn how to control those arguments in the tutorial.
The basic syntax of ggplot2 is:
ggplot(data, mapping = aes()) +
  geometric object
arguments:
- data: Dataset used to plot the graph
- mapping: Controls the x and y-axis
- geometric object: The type of plot you want to show. The most common objects are:
  - Point: `geom_point()`
  - Bar: `geom_bar()`
  - Line: `geom_line()`
  - Histogram: `geom_histogram()`
Let's see how ggplot works with the mtcars dataset. You start by plotting a scatterplot of the mpg variable and drat variable.
library(ggplot2)
ggplot(mtcars, aes(x = drat, y = mpg)) +
  geom_point()
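As a minimal sketch of how the geometric object decides the plot type, the same mtcars data can be drawn as a histogram by swapping geom_point() for geom_histogram(). The object name p_hist and the choice of bins = 10 are illustrative assumptions, not part of the original example.

```r
library(ggplot2)

# Same dataset, different geometric object: a histogram only needs an x aesthetic
p_hist <- ggplot(mtcars, aes(x = mpg)) +
  geom_histogram(bins = 10)  # bins = 10 is an arbitrary choice for illustration
p_hist
```

Only the geom and the mapping change; the ggplot() call keeps the same structure.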
Sometimes, it can be interesting to distinguish the values by a group of data (i.e. factor level data).
ggplot(mtcars, aes(x = mpg, y = drat)) +
  geom_point(aes(color = factor(gear)))
Rescaling the data is a big part of the data scientist's job. Data rarely comes in a nice bell shape. One way to make your data less sensitive to outliers is to rescale it, here with a log transformation.
ggplot(mtcars, aes(x = log(mpg), y = log(drat))) +
  geom_point(aes(color = factor(gear)))
Note that any other transformation can be applied, such as standardization or normalization.
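The note on other transformations can be sketched as follows, using base R's scale() to standardize each variable (subtract the mean, then divide by the standard deviation). The names mtcars_std, mpg_std and drat_std are hypothetical and introduced only for this example.

```r
library(ggplot2)

# Standardize mpg and drat: (value - mean) / standard deviation
mtcars_std <- mtcars
mtcars_std$mpg_std  <- as.numeric(scale(mtcars$mpg))
mtcars_std$drat_std <- as.numeric(scale(mtcars$drat))

# Same scatter plot as before, on the standardized variables
ggplot(mtcars_std, aes(x = mpg_std, y = drat_std)) +
  geom_point(aes(color = factor(gear)))
```

After standardization both variables have mean 0 and standard deviation 1, so the axes are expressed in standard deviations rather than in the original units.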
You can add another level of information to the graph. You can plot the fitted value of a linear regression.
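my_graph <- ggplot(mtcars, aes(x = log(mpg), y = log(drat))) +
  geom_point(aes(color = factor(gear))) +
  # Add the fitted line of a linear regression, without the confidence band
  stat_smooth(method = "lm",
              col = "#C42126",
              se = FALSE,
              linewidth = 1)  # use size = 1 on ggplot2 versions before 3.4.0
my_graph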
Note that other smoothing methods are available, such as "loess" or "glm".
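As a sketch of an alternative smoother, method = "loess" fits a local regression instead of a straight line. The object name my_graph_loess is an assumption for this example; the rest mirrors the linear fit above.

```r
library(ggplot2)

# Same scatter plot, with a loess (local regression) curve instead of a linear fit
my_graph_loess <- ggplot(mtcars, aes(x = log(mpg), y = log(drat))) +
  geom_point(aes(color = factor(gear))) +
  stat_smooth(method = "loess",
              formula = y ~ x,
              col = "#C42126",
              se = FALSE)
my_graph_loess
```

A loess curve follows local patterns in the data, which can be useful when the relationship is not well described by a straight line.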
So far, we haven't added any information to the graphs. Graphs need to be informative. The reader should see the story behind the data analysis just by looking at the graph, without referring to additional documentation. Hence, graphs need good labels. You can add labels with the labs() function.
The basic syntax for labs() is:
labs(title = "Hello gtupapers")
arguments:
- title: Controls the title. It is also possible to change or add:
- subtitle: Adds a subtitle below the title
- caption: Adds a caption below the graph
- x: Renames the x-axis
- y: Renames the y-axis
Example:
labs(title = "Hello gtupapers", subtitle = "My first plot")
One mandatory piece of information to add is obviously a title.
my_graph +
  labs(
    title = "Plot of miles per gallon and drat, in log"
  )
A dynamic title is helpful to add more precise information in the title.
You can use the paste() function to print static text and dynamic text. The basic syntax of paste() is:
paste("This is a text", A)
arguments:
- " ": Text inside the quotation marks is the static text
- A: Displays the value stored in the variable A
- Note that you can add as many static texts and variables as you want. You need to separate them with a comma
Example:
A <- 2010
paste("The first year is", A)
Output:
## [1] "The first year is 2010"
B <- 2018
paste("The first year is", A, "and the last year is", B)
Output:
## [1] "The first year is 2010 and the last year is 2018"
You can add dynamic text to the title of the graph, namely the average of mpg.
mean_mpg <- mean(mtcars$mpg)
my_graph +
  labs(
    title = paste("Plot of miles per gallon and drat, in log. Average mpg is", mean_mpg)
  )
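The raw mean carries many decimals, which makes for a long title. A small sketch, assuming two decimals are enough, rounds the value before pasting it; title_text is a hypothetical name used only here.

```r
# Round the mean before pasting it so the title stays short
mean_mpg <- mean(mtcars$mpg)
title_text <- paste("Average mpg is", round(mean_mpg, 2))
title_text
## [1] "Average mpg is 20.09"
```

The resulting string can then be passed to labs(title = title_text).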
Two additional details can make your graph more explicit: the subtitle and the caption. The subtitle goes right below the title. The caption can state who did the computation and the source of the data.
my_graph +
  labs(
    title = "Relation between miles per gallon and drat",
    subtitle = "Relationship broken down by gear class",
    caption = "Author's own computation"
  )
Variable names in the dataset might not always be explicit, or by convention they use an underscore (_) when there are multiple words (e.g. GDP_CAP). You don't want such names to appear in your graph. It is important to change the names or add more details, like the units.
my_graph +
  labs(
    # my_graph maps log(mpg) to x and log(drat) to y
    x = "Miles per gallon, in log",
    y = "Rear axle ratio (drat), in log",
    color = "Gear",
    title = "Relation between miles per gallon and drat",
    subtitle = "Relationship broken down by gear class",
    caption = "Author's own computation"
  )
You can control the scale of the axis.
The function seq() is convenient when you need to create a sequence of numbers. The basic syntax is:
seq(begin, last, by = x)
arguments:
- begin: First number of the sequence
- last: Last number of the sequence
- by = x: The step. For instance, if x is 2, the code adds 2 to `begin` repeatedly until it reaches `last`
For instance, if you want to create a range from 0 to 12 with a step of 4, you will have four numbers: 0, 4, 8 and 12.
seq(0, 12, 4)
Output:
## [1] 0 4 8 12
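As a small sketch of what happens when the step does not land exactly on the last value, seq() stops at the last number that does not exceed it. The names steps and y_breaks are hypothetical.

```r
# The step is added to the first value; the sequence never goes past `last`
steps <- seq(0, 10, by = 3)
steps
## [1]  0  3  6  9

# Decimal steps work the same way and are handy for axis breaks
y_breaks <- seq(1, 1.6, by = 0.1)
length(y_breaks)
## [1] 7
```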
You can control the scale of the x-axis and y-axis as below:
my_graph +
  scale_x_continuous(breaks = seq(1, 3.6, by = 0.2)) +
  scale_y_continuous(breaks = seq(1, 1.6, by = 0.1)) +
  labs(
    x = "Miles per gallon, in log",
    y = "Rear axle ratio (drat), in log",
    color = "Gear",
    title = "Relation between miles per gallon and drat",
    subtitle = "Relationship broken down by gear class",
    caption = "Author's own computation"
  )
Finally, R allows us to customize our plot with different themes. The ggplot2 library includes eight themes; the example below uses theme_dark():
my_graph +
  theme_dark() +
  labs(
    x = "Miles per gallon, in log",
    y = "Rear axle ratio (drat), in log",
    color = "Gear",
    title = "Relation between miles per gallon and drat",
    subtitle = "Relationship broken down by gear class",
    caption = "Author's own computation"
  )
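Other built-in themes are applied the same way. The sketch below, which rebuilds a base plot under the hypothetical name base_plot so that it runs on its own, swaps in theme_minimal() and theme_classic().

```r
library(ggplot2)

# Rebuild the scatter plot so the example is self-contained
base_plot <- ggplot(mtcars, aes(x = log(mpg), y = log(drat))) +
  geom_point(aes(color = factor(gear)))

# Swap the theme function to change the overall look of the plot
plot_minimal <- base_plot + theme_minimal()
plot_classic <- base_plot + theme_classic()
plot_minimal
plot_classic
```

Because a theme is just another layer added with +, the data and mappings stay untouched while the appearance changes.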
After all these steps, it is time to save and share your graph. Add ggsave("NAME OF THE FILE") right after you plot the graph, and it will be stored on the hard drive.
The graph is saved in the working directory. To check the working directory, you can run this code:
directory <- getwd()
directory
Let's plot your fantastic graph, save it and check its location.
my_graph +
  theme_dark() +
  labs(
    x = "Miles per gallon, in log",
    y = "Rear axle ratio (drat), in log",
    color = "Gear",
    title = "Relation between miles per gallon and drat",
    subtitle = "Relationship broken down by gear class",
    caption = "Author's own computation"
  )
ggsave("my_fantastic_plot.png")
Output:
## Saving 5 x 4 in image
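By default ggsave() saves the last plot displayed, at the size of the current graphics device. As a sketch, you can instead pass the plot explicitly and fix the dimensions. The object p, the file name and the temporary folder are assumptions chosen so that the example does not overwrite anything in your working directory.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = log(mpg), y = log(drat))) +
  geom_point(aes(color = factor(gear)))

# Hypothetical output path in a temporary directory
out_file <- file.path(tempdir(), "my_fantastic_plot_small.png")

# Pass the plot explicitly and set the size (in inches by default) and the resolution
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 150)

# Check that the file was written
file.exists(out_file)
```

Setting width and height explicitly makes the saved image reproducible regardless of the size of the plotting window.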
Note: For pedagogical purposes only, we created a function called open_folder() to open the directory folder for you. You just need to run the code below to see where the picture is stored. You should see a file named my_fantastic_plot.png.
# Run this code to create the function
open_folder <- function(dir) {
  if (.Platform$OS.type == "windows") {
    # On Windows, open the folder in the file explorer
    shell.exec(dir)
  } else {
    # Elsewhere, rely on the program set in the R_BROWSER environment variable
    system(paste(Sys.getenv("R_BROWSER"), dir))
  }
}
# Call the function to open the folder
open_folder(directory)
You can summarize the arguments to create a scatter plot in the table below:
| Objective | Code |
|---|---|
| Basic scatter plot | ggplot(df, aes(x = x1, y = y)) + geom_point() |
| Scatter plot with color group | ggplot(df, aes(x = x1, y = y)) + geom_point(aes(color = factor(x1))) |
| Add fitted values | ggplot(df, aes(x = x1, y = y)) + geom_point(aes(color = factor(x1))) + stat_smooth(method = "lm") |
| Add title | ggplot(df, aes(x = x1, y = y)) + geom_point() + labs(title = paste("Hello gtupapers")) |
| Add subtitle | ggplot(df, aes(x = x1, y = y)) + geom_point() + labs(subtitle = paste("Hello gtupapers")) |
| Rename x | ggplot(df, aes(x = x1, y = y)) + geom_point() + labs(x = "X1") |
| Rename y | ggplot(df, aes(x = x1, y = y)) + geom_point() + labs(y = "y1") |
| Control the scale | ggplot(df, aes(x = x1, y = y)) + geom_point() + scale_y_continuous(breaks = seq(10, 35, by = 10)) + scale_x_continuous(breaks = seq(2, 5, by = 1)) |
| Create logs | ggplot(df, aes(x = log(x1), y = log(y))) + geom_point() |
| Theme | ggplot(df, aes(x = x1, y = y)) + geom_point() + theme_classic() |
| Save | ggsave("my_fantastic_plot.png") |