The RStudio console on the left side of RStudio provides direct access to the R interpreter. We can use it as a simple calculator or to interact with the current R session. Type or paste the commands in the dark grey boxes into the console.
print("Hello World!")
## [1] "Hello World!"
x <- 2 * (1:10)
print(x)
## [1] 2 4 6 8 10 12 14 16 18 20
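As a further illustration of using the console as a calculator, the sketch below applies a few base R functions to the same vector. The names total, average and squares are chosen here purely for illustration.

```r
# Arithmetic in R is vectorised: an operation on a vector applies to every element.
x <- 2 * (1:10)      # the even numbers from 2 to 20
total <- sum(x)      # add up all the elements
average <- mean(x)   # arithmetic mean of the elements
squares <- x^2       # square each element separately
print(total)
print(average)
print(squares)
```

Each of these variables appears in the Environment tab as soon as it is created.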
The Environment tab shows the data that has been loaded into the current R session. We can see the x variable that we just created. This data is lost once we end the session.
The Files tab is a simple file manager for your project. You can use it to organise your files.
You can get more info on the RStudio IDE by looking at the cheatsheet at “Help > Cheatsheets > RStudio IDE Cheat Sheet”.
We shall use an R script file to record our commands for future use. In the script, type x <- sum(1:10) and on a new line type print(x). Save the file.
We shall use the “titanic.csv” file to illustrate how we can manipulate larger data sets in R.
To load the data, click on the “Import Dataset” button in the Environment tab in RStudio, and then click on “From Text (readr)”. This opens a dialog from which you should browse to the “titanic.csv” file and then click on the Import button.
You should see the following commands on the console.
library(readr)
titanic <- read_csv("titanic.csv")
## Parsed with column specification:
## cols(
## Survived = col_double(),
## Pclass = col_double(),
## Name = col_character(),
## Sex = col_character(),
## Age = col_double(),
## `Siblings/Spouses Aboard` = col_double(),
## `Parents/Children Aboard` = col_double(),
## Fare = col_double()
## )
View(titanic)
This loads the data into the R session as a table called a “tibble”. You can view the tibble in RStudio by clicking on the grid icon to the right of the dataset in the Environment tab.
You can get more information on loading data into R by looking at the Data Import Cheatsheet.
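The import dialog generates a read_csv call for you, but the same call can be written by hand in a script. The sketch below is one possible way to do this, with the column types declared explicitly rather than guessed. To keep it self-contained it reads a small temporary file whose three rows are made up for illustration; they are not taken from “titanic.csv”.

```r
library(readr)

# Hypothetical three-row sample written to a temporary file (illustrative data only).
sample_file <- tempfile(fileext = ".csv")
writeLines(c("Survived,Pclass,Sex,Age",
             "0,3,male,22",
             "1,1,female,38",
             "1,3,female,26"),
           sample_file)

# Declare the column types up front instead of letting readr guess them.
sample_data <- read_csv(sample_file,
                        col_types = cols(Survived = col_double(),
                                         Pclass = col_double(),
                                         Sex = col_character(),
                                         Age = col_double()))
print(sample_data)
```

Supplying col_types also suppresses the column specification message, which can be convenient in scripts that are run repeatedly.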
The R package “dplyr” allows us to manipulate tibbles. Often it is best to build up the manipulation of data in stages so that you can see the effect of each command.
Type the following into your R script “code.R”. This loads the “dplyr” library and summarises the data.
library(dplyr)
titanic %>%
  summarise(Total_Survived = sum(Survived),
            Number_Passengers = n(),
            Survival_Rate = 100 * Total_Survived / Number_Passengers)
## # A tibble: 1 x 3
## Total_Survived Number_Passengers Survival_Rate
## <dbl> <int> <dbl>
## 1 342 887 38.6
The pipe operator %>% takes the output from its left-hand side and uses it as the first argument of the function on its right-hand side. The summarise function aggregates the titanic tibble and returns a new tibble containing columns of aggregated data. In this case, we find the survival rate for all passengers.
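To make the role of the pipe concrete, the sketch below compares a piped call with the equivalent ordinary function call. The tibble toy and its values are hypothetical and serve only to keep the example self-contained.

```r
library(dplyr)

# Hypothetical toy data: 1 means the passenger survived, 0 means they did not.
toy <- tibble(Survived = c(1, 0, 1, 1))

# These two calls are equivalent: the pipe passes `toy` as the first argument.
piped <- toy %>% summarise(Total_Survived = sum(Survived))
nested <- summarise(toy, Total_Survived = sum(Survived))

print(piped)
print(all.equal(piped, nested))
```

The piped form becomes easier to read than the nested form once several steps are chained together.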
If we want to find the survival rate for each passenger class, we first group_by this column and then summarise. Type the following into the R console.
titanic_Pclass_rate <- titanic %>%
  group_by(Pclass) %>%
  summarise(Survival_Rate = 100 * sum(Survived) / n())
titanic_Pclass_rate
## # A tibble: 3 x 2
## Pclass Survival_Rate
## <dbl> <dbl>
## 1 1 63.0
## 2 2 47.3
## 3 3 24.4
We see the survival rate per ticket class in percent.
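The grouped calculation can be checked by hand on a small example. The sketch below uses a hypothetical six-row tibble (not the Titanic data) and computes the rate twice, once with sum and n and once with mean, to show that the two agree when the column contains only 0 and 1.

```r
library(dplyr)

# Hypothetical toy data: two passengers in class 1, three in class 2, one in class 3.
toy <- tibble(Pclass = c(1, 1, 2, 2, 2, 3),
              Survived = c(1, 0, 1, 1, 0, 0))

toy_rate <- toy %>%
  group_by(Pclass) %>%                                # one group per passenger class
  summarise(Rate_Sum = 100 * sum(Survived) / n(),     # survivors divided by group size
            Rate_Mean = 100 * mean(Survived))         # the same quantity via mean
print(toy_rate)
```

Class 1 has one survivor out of two passengers, so both columns give 50 for that row.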
You can find more information on data manipulation on the cheatsheet.
We can also use the mean function from R to calculate survival rates rather than using sum.
Often we can gain additional insight into the data by visualisation techniques. We shall use the R package “ggplot2” for this task.
The “ggplot2” package rivals commercial plotting applications such as Tableau in its plotting capabilities. The package is a language for building plots based on the “Grammar of Graphics”. Unlike “dplyr”, it uses the + operator to build up a plot.
As a simple example, type the following into the R console.
library(ggplot2)
ggplot(data = titanic_Pclass_rate)
ggplot(data = titanic_Pclass_rate, aes(x = Pclass, y = Survival_Rate))
ggplot(data = titanic_Pclass_rate, aes(x = Pclass, y = Survival_Rate)) + geom_col()
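The plot is built up in stages. The first command draws an empty canvas, the second adds the axes, and the third adds the bars. The function aes maps columns of the data to aesthetic properties of the plot (here the x and y axes), while geom_col selects the type of graphic, which in this case is a bar chart.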
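Because a plot is an ordinary R object, the stages can also be stored in variables and extended with +. The sketch below is one way to do this; the data frame rates simply copies the values from the titanic_Pclass_rate output shown earlier, and the variable names and axis labels are chosen here for illustration.

```r
library(ggplot2)

# Values copied from the titanic_Pclass_rate output shown earlier.
rates <- data.frame(Pclass = c(1, 2, 3),
                    Survival_Rate = c(63.0, 47.3, 24.4))

# Stage 1: data and aesthetic mapping only (axes, no graphic yet).
base_plot <- ggplot(data = rates, aes(x = Pclass, y = Survival_Rate))

# Stage 2: add a layer of bars.
bar_plot <- base_plot + geom_col()

# Stage 3: replace the default axis labels.
labelled_plot <- bar_plot +
  labs(x = "Passenger class", y = "Survival rate (%)")

print(labelled_plot)
```

Storing the intermediate plots makes it easy to try different layers on top of the same base.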
We can generate a more sophisticated plot by splitting the bars based on the Sex column. The argument position = "dodge" ensures that the bars are placed side by side rather than stacked on each other.
titanic_Sex_Pclass_rate <- titanic %>%
  group_by(Pclass, Sex) %>%
  summarise(Survival_Rate = 100 * mean(Survived))
ggplot(data = titanic_Sex_Pclass_rate, aes(x = Pclass, y = Survival_Rate, fill = Sex)) +
  geom_col(position = "dodge")
More information can be found in the ggplot cheat sheet. You can see some more features of ggplot used in the plot below.