Open source (free!) statistical programming language/software
It can be used for data management, visualization, statistical analysis, and communication
It is constantly growing!
Has a strong online support community
Since it’s one programming language, it is versatile enough to take you from raw data to publishable research using free, reproducible code!
RStudio is a free, open source IDE (integrated development environment) for R. (You must install R before you can install RStudio.)
Its interface is organized so that the user can clearly view graphs, tables, R code, and output all at the same time.
It also offers an Import-Wizard-like feature that allows users to import CSV, Excel, SPSS (*.sav), and Stata (*.dta) files into R without having to write the code to do so.
Excel and SPSS are convenient for data entry, and for quickly manipulating rows and columns prior to statistical analysis. However, they are a poor choice for statistical analysis beyond the simplest descriptive statistics, or for datasets with more than a very few columns.
Some of the reasons for choosing R over other options are:
Health Data Science is an emerging discipline, combining mathematics, statistics, epidemiology and informatics.
R is widely used in the field of health data science, especially in healthcare industry domains like genetics, drug discovery, bioinformatics, vaccine research, deep learning, epidemiology, and public health.
As data-generating technologies have proliferated throughout society and industry, leading hospitals are trying to ensure this data is harnessed to achieve the best outcomes for patients. These internet of things (IoT) technologies include everything from sensors that monitor patient health and the condition of machines to wearables and patients’ mobile phones. All these comprise the “Big Data” in healthcare.
Research is considered to be reproducible when the exact results can be reproduced if given access to the original data, software, or code.
Reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results. Reproducibility is a minimum necessary condition for a finding to be believable and informative. — U.S. National Science Foundation (NSF) subcommittee on Replicability in Science
There are four key elements of reproducible research:
Factors behind irreproducible research
While reproducibility is the minimum requirement and can be solved with “good enough” computational practices, replicability/ robustness/ generalisability of scientific findings are an even greater concern involving research misconduct, questionable research practices (p-hacking, HARKing, cherry-picking), sloppy methods, and other conscious and unconscious biases.
What are the good practices of reproducible research?
How to make your work reproducible?
Reproducible workflows give you credibility!
Go here: https://cran.rstudio.com/
Choose the correct “Download R for…” option from the top (probably Windows or macOS), then…
For Windows users, choose “Install R for the first time” (next to the base subdirectory) and then “Download R 4.4.2 for Windows”
For macOS users, select the appropriate version for your operating system (e.g. the latest release, version 4.4.2, will look something like R-4.4.2-arm64.pkg), then choose to Save or Open
Once downloaded, open the installer, agree to the license, and install like you would any other software.
RStudio is a user-friendly interface for working with R. That means you must have R already installed for RStudio to work. Make sure you’ve successfully installed R in Step 1, then…
Go to https://www.rstudio.com/products/rstudio/download/ to download RStudio Desktop (Open Source License). You’ll know you’re clicking the right one because it says “FREE” right above the download button.
Click download, which takes you just down the page to where you can select the correct version under Installers for Supported Platforms (almost everyone will choose one of the first two options, RStudio for Windows or macOS).
Click on the correct installer version, save, open once downloaded, agree to the license, and install like you would any other software. The version should be at least RStudio 2024.09 “Cranberry Hibiscus”.
The RStudio environment consists of multiple windows, and each window consists of certain panels.
Panels in RStudio
It is important to understand that not all panels will be used routinely, by you or by us during the workshop. The workshop focuses on using R as a data management, visualization, and communication tool for healthcare professionals. The panels which require the most attention are the Source, Console, Environment, History, Files, Packages, Help, Tutorial, and Viewer panels.
You are requested to make your own notes during the workshop. Let us dive deep into understanding the environment further in the workshop.
The most commonly used file types are:
.R : script file
.Rmd : RMarkdown file
.qmd : Quarto file
.rds : a single R object saved in a file
.RData : multiple R objects saved in a single file

R is easiest to use when you know how the R language works. This section will teach you the implicit background knowledge that informs every piece of R code. You’ll learn about:
To do anything in R, we call functions to work for us. For example, to compute the square root of 5197, we call the function sqrt().
sqrt(5197)
[1] 72.09022
Important things to know about functions include:
Typing the name of a function without parentheses and running it shows the function’s body, which helps us understand what the function does in the background.
sqrt
function (x) .Primitive("sqrt")
To run a function, we add parentheses () after its name. Within the parentheses we supply the inputs, such as the number in the example above.
Placing a question mark before the function name takes you to its help page. This is an important habit to build. When calling a help page, parentheses are not needed. Help pages will enable you to learn about new functions throughout your journey!
?sqrt
Tip: Annotations (comments) are meant for humans to read, not machines. They enable us to take notes as we write. As a result, the next time you open your code, even after a long time, you will know what you did last summer :)
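For example, anything on a line after a # is ignored by R:

# compute the square root of 5197 (R ignores this line)
sqrt(5197) # annotations can also follow code on the same line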
Arguments are inputs provided to the function. There are functions which take no arguments, some take a single argument and some take multiple arguments. When there are two or more arguments, the arguments are separated by a comma.
# No argument
Sys.Date()
[1] "2024-11-11"
# One argument
sqrt(5197)
[1] 72.09022
# Two arguments
sum(2,3)
[1] 5
# Multiple arguments
seq(from = 1,
    to = 10,
    by = 2)
[1] 1 3 5 7 9
Matching arguments: When argument names are omitted, R matches inputs to arguments by position. For example, generating a sequence involves three arguments: from, to, and by. Inputs supplied in that order are automatically matched to the right arguments.
seq(1,10,2)
[1] 1 3 5 7 9
Caution: Inputs supplied in the wrong order are matched just as readily, producing unintended results. Best practice is to be explicit, especially at early stages: use argument names!
seq(2,10,1)
[1] 2 3 4 5 6 7 8 9 10
seq(by = 2,
to = 10,
from = 1)
[1] 1 3 5 7 9
Optional arguments: Some arguments are optional; they may be added or omitted as required. When omitted, R uses their default values. For example, in the sum() function, na.rm = FALSE is an optional argument: by default NA values are not removed, so the sum is not returned (it is NA) whenever NA values are present. These defaults can be overridden by mentioning the argument explicitly.
sum(2,3,NA)
[1] NA
sum(2, 3, NA, na.rm = TRUE)
[1] 5
In contrast, arguments which need to be supplied explicitly are mandatory! Without them, an error is returned as output.
sqrt()
If we want to use results rather than just view them in the console, we need to store them as objects. To create an object, type the name of the object (choose wisely; let it be explicit and self-explanatory!), then provide an assignment operator. Everything to the right of the operator will be assigned to the object. You can save a single value, the output of a function, multiple values, or an entire data set in a single object.
# Single value
x <- 3
x
[1] 3
# Output from function
x <- seq(from = 1,
         to = 10,
         by = 2)

# Better name:
sequence_from_1_to_10 <- seq(from = 1,
                             to = 10,
                             by = 2)
Creating an object helps us view its contents as well as apply additional functions to it.
Tip: While typing function or object names, RStudio offers auto-completion prompts. Choose from the prompts rather than typing the whole name; it will prevent many typos later!
sequence_from_1_to_10
[1] 1 3 5 7 9
sum(sequence_from_1_to_10)
[1] 25
R stores values as vectors, which are one-dimensional arrays. Arrays can be two-dimensional (similar to Excel/tabular data) or multidimensional, but vectors are always one-dimensional!
Vectors can hold a single value or a combination of values. We can create our own vectors using the c() function.
single_number <- 3
single_number
[1] 3

number_vector <- c(1, 2, 3)
number_vector
[1] 1 2 3
Creating personalized vectors is powerful, as a lot of functions in R take vectors as inputs.
mean(number_vector)
[1] 2
Vectorized functions: The function is applied to each element of the vector:
sqrt(number_vector)
[1] 1.000000 1.414214 1.732051
If we have two vectors of the same length (such as columns of a research dataset), vectorised functions let us compute new columns: the function is applied element-wise across both vectors and returns a vector of the same length (consider this a new column in the research data).
number_vector2 <- c(3, -4, 5.4)
number_vector + number_vector2
[1] 4.0 -2.0 8.4
R recognizes different types of vectors based on the values in the vector.
If all values are numbers (positive numbers, negative numbers, decimals), R considers the vector numerical and allows you to carry out mathematical operations/functions. You can find the class of a vector using the class() function. R labels these vectors as “double”, “numeric”, or “integer”.
class(number_vector)
[1] "numeric"
class(number_vector2)
[1] "numeric"
If the values are within quotation marks, the vector is of character type. This is equivalent to a nominal variable.
<- c("a", "b", "c")
alphabets_vector class(alphabets_vector)
[1] "character"
Appending L to a whole number creates an integer vector:
integer_vector <- c(1L, 2L)
class(integer_vector)
[1] "integer"
Logical vectors contain TRUE and FALSE values
logical_vector <- c(TRUE, FALSE)
class(logical_vector)
[1] "logical"
Factor vectors represent categorical variables. Other variable types can be converted to factor type using the factor() function.
factor_vector <- factor(number_vector)
factor_vector
[1] 1 2 3
Levels: 1 2 3
We can add labels to factor vectors using optional arguments
factor_vector <- factor(number_vector,
                        levels = c(1, 2, 3),
                        labels = c("level1",
                                   "level2",
                                   "level3"))
factor_vector
[1] level1 level2 level3
Levels: level1 level2 level3
One vector = one type. For example, when there is a mix of numbers and characters, R will treat everything as character.
<- c(1,"a")
mix_vector class(mix_vector)
[1] "character"
Note that the number 1 has been converted into character class.
mix_vector[1]
[1] "1"
mix_vector[1] |> class()
[1] "character"
Double, character, integer, logical, complex, raw, dates, etc… There are many other data types and objects, but for now, let’s start with these. You will pick up additional types as you proceed in your R journey!
In addition to vectors, lists are another powerful kind of object. A list can be considered a vector of vectors!! Lists enable you to store multiple types of vectors together. A list is made using the list() function, which is similar to c() but creates a list rather than a vector. It is good practice to name the vectors in a list.
example_list <- list(numbers = number_vector,
                     alphabets = alphabets_vector)
class(example_list)
[1] "list"
example_list
$numbers
[1] 1 2 3
$alphabets
[1] "a" "b" "c"
The elements of a named list or a named vector can be accessed using the $ operator.
example_list$numbers
[1] 1 2 3
There are thousands of functions in R. To be computationally efficient, R does not load all functions on startup; it loads only the base functions. When you want to use additional functions, you need to load their packages using the library() function. Additional packages are installed once but loaded every time you start an R session.
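A minimal sketch of this install-once, load-every-session pattern (using dplyr as an example package):

# Install once: downloads the package from CRAN
install.packages("dplyr")

# Load at the start of every R session in which you need it
library(dplyr)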
With these basics, let’s dive deep into the workshop!! Are you ready?
In this section, we will learn how to import and export data from R. We will also talk about different file types. This section is based on the relevant chapters from two renowned textbooks on the tidyverse.1 These textbooks take different approaches to importing and working with data in RStudio using tidyverse packages. We present to you the most optimal workflows to facilitate reproducibility and ease of understanding.
Before diving into data analysis and working with R, it’s crucial to establish a well-organized workflow. Setting up an R project for each analysis in RStudio is one of the best practices for maintaining this structure. Here’s why it matters:
Organized Workspace: An R project creates a dedicated workspace, keeping all files, scripts, and data for each analysis in one place. This structure makes it easier to locate and manage your resources and helps prevent clutter on your computer.
Consistent File Paths: When working within an R project, file paths become relative to the project’s root directory. This avoids the need for absolute paths (e.g., C:/Users/YourName/ProjectFolder), making your code portable. For example, using relative paths allows you to share your project with others without requiring adjustments to file paths.
Enhanced Reproducibility: With an R project, you can easily recreate your analysis environment. The .Rproj file saves specific project settings, allowing you to return to the project later and pick up where you left off with minimal setup. This is particularly valuable when revisiting analyses or sharing work with collaborators.
To create a new project, open RStudio, go to the File menu, select New Project, and follow the prompts. You’ll see that RStudio sets up a unique working directory, which helps maintain consistency and clarity throughout your analysis.
Or you could try the approach linked here.
By following this practice, you set up a solid foundation for a clean, organized, and reproducible workflow in R.
Reading and writing files often involves the use of file paths. A file path is a string of characters that points R and RStudio to the location of a file on your computer. These file paths can be a complete location (C:/Users/Arun/RIntro_Book.Rmd) or just the file name (RIntro_Book.Rmd). If you pass R a partial file path, R will append it to the end of the file path that leads to your working directory. The working directory is the directory where your .Rproj file is.
When working with files in R, defining paths correctly is essential for accessing your data and saving outputs. The here package is a powerful tool that simplifies file paths, especially within R projects, by automatically locating the root directory of your project.
Why here?
Simplifies Paths: Instead of typing out long, complex file paths, here constructs paths relative to the root of your project. This makes your code cleaner and easier to read.
Improves Portability: Using here makes your code more portable. When sharing your project with others or switching between computers, the paths generated by here adjust automatically based on the project’s root, so there’s no need to modify paths manually.
Avoids Path Errors: Typing out file paths can lead to errors if you move files around or change directories. The here function helps prevent these issues by always starting paths from the same project root.
Using here in Practice
The here package creates paths by combining the project root directory with any subdirectories or file names you specify. For example:
#install.packages("pacman")
::p_load(here)
pacman
here()
[1] "D:/RWorkshops/research_methodology_data_analysis/rmda_book"
When you run here::here() in your R project, it returns the full file path up to the directory where your R project was created. This directory is known as the project root.
If you have a file named nhanes_modified_df.rds stored inside a folder called data within your project, you can easily reference it using the here function. By writing:
here("data", "nhanes_modified_df.rds")
[1] "D:/RWorkshops/research_methodology_data_analysis/rmda_book/data/nhanes_modified_df.rds"
you’re creating a file path that points directly to the nhanes_modified_df.rds file within the data folder, starting from the root of your project. This method keeps things neat and adaptable, and prevents hardcoding of long file paths. Whether you move your project to another computer or share it with someone else, this path will still work without any changes. It’s a simple way to make your workflow more efficient!
The RStudio IDE provides an Import Dataset button in the Environment pane, which appears in the top right corner of the IDE by default. You can use this button to import data that is stored in plain text files as well as in Excel, SAS, SPSS, and Stata files.
We recommend using the .csv file type to read and write your data as a best practice. This ensures cross-compatibility between various programs, since a .csv is just a raw text file with values separated by commas.
R data files (.rds) using a library
.rds is a file format native to R for saving compressed content. .rds files are not text files and are not human readable in their raw form. Each .rds file contains a single object, which makes it easy to assign its output directly to a single R object. This is not necessarily the case for .RData files, which makes .rds files safer to use.
We can use the read_rds() and write_rds() functions from the readr package to read and write .rds files. The write_rds() function saves previously loaded data as an .rds file. You can look at the help menu to learn more about the syntax, or type ?write_rds in the Console pane.
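For instance, a minimal sketch of saving a data frame df as an .rds file (the file name here is only illustrative):

readr::write_rds(df, here("data", "nhanes_modified_df.rds"))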
For example, to read the file back into an object df:
df <- readr::read_rds(here("data", "nhanes_modified_df.rds"))
In the above line of code we are instructing R to:
Look inside the project folder: here::here("data", "nhanes_modified_df.rds") tells R to look in the data folder within your project for a file named nhanes_modified_df.rds.
Read the .rds file: readr::read_rds() is used to load this .rds file into the object df.
However, if there is:
a spelling mistake in either the folder name (data) or the file name (nhanes_modified_df.rds), or
the file doesn’t exist at the specified location,
then R will not be able to find the file, and you’ll encounter an error message, typically saying the file cannot be found.
Similarly, you can use the write_csv() function from the readr package to write a .csv file.
Try it!!!
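A minimal sketch (the output file name is just an example):

readr::write_csv(df, here("data", "nhanes_modified_df.csv"))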
Note
There are different packages to import different types of data:
haven : SPSS, Stata, or SAS files
readxl : Excel spreadsheets
readr : csv, txt, tsv etc.

When working with data in R, you’ll frequently encounter two common types of data structures: tibbles and data.frames. While both are used to store tabular data, they have some important differences that affect how they behave and how you interact with them. Understanding these differences can help streamline your data analysis and avoid potential pitfalls.
To learn more in-depth about tibbles, you can run vignette("tibble") in your R console, which provides a comprehensive overview.
Some major differences are:
data.frame converts strings to factors by default (before R 4.0); tibble never does
data.frame will remove spaces or add “x” before numeric column names; tibble will not
row.names() is not used for a tibble
tibble prints the first ten rows and only the columns that fit on one screen

Tidy data is a way to describe data that’s organized with a particular structure – a rectangular structure, where each variable has its own column, and each observation has its own row. — Hadley Wickham, 2014
These three rules are interrelated because it’s impossible to only satisfy two of the three.
Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham
Working with messy data can be messy! You need to build custom tools from scratch each time you work with a new dataset.
Illustrations from: https://github.com/allisonhorst/stats-illustrations
Packages like tidyr and dplyr enable you to get on with analysing your data and answering key questions, rather than spending time trying to clean the data.
Note
Tidy data allows you to be more efficient by using specialised tools built for the tidy workflow. There are a lot of tools specifically built to wrangle untidy data into tidy data.
One other advantage of working with Tidy data is that it makes it easier for collaboration, as your colleagues can use the same familiar tools rather than getting overwhelmed with all the work you did from scratch. It is also helpful for your future self as it becomes a consistent workflow and takes less adjustment time for any incremental changes.
Tidy data also makes it easier to reproduce analyses because they are easier to understand, update, and reuse. By using tools together that all expect tidy data as inputs, you can build and iterate really powerful workflows.
When loading data into R using the RStudio GUI with tidyverse, the data is automatically saved as a tibble. A tibble is a data frame, but with some new functionalities and properties that make our life easier. It is the single most important workhorse of the tidyverse.
You can change data.frame objects to a tibble using the as_tibble() function.
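A minimal sketch, where df_base is a hypothetical base data frame:

df_base <- data.frame(id = 1:3, weight = c(65, 80, 72))
tb <- tibble::as_tibble(df_base)
class(tb)
[1] "tbl_df"     "tbl"        "data.frame"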
Now that you have imported data into RStudio, it’s good practice to have a look at the data. There are many ways you can do this within RStudio.
Using the View() function.
Some other things you can do to have a look at your data are:
Checking the class of the dataset using the class() function
Checking the structure of the dataset using the str() function
Note
class() and str() are not limited to datasets; they can be used on any R object.
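For example, a quick sketch using the df we imported earlier:

class(df) # class of the whole dataset
str(df) # structure: each column, its type, and the first few values
class(df$age) # also works on a single column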
Some additional tips for quickly looking at your data:
head()
tail()
glimpse()
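For instance, a minimal sketch (assuming your dataset is named df):

head(df) # first 6 rows
tail(df) # last 6 rows
glimpse(df) # compact overview: every column, its type, and first values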
To recap what we learnt in the previous sessions: we now know how to work within the R Project environment, and here::here() makes it easy to manage file paths. You can quickly have a look at your data using the View() and glimpse() functions. Most tidy data is read as a tibble, the workhorse of the tidyverse.
#install.packages("pacman")
::p_load(tidyverse, here)
pacman
#tidyverse required for tidy workflows
#rio required for importing and exporting data
#here required for managing file paths
Note
The shortcut for code commenting is Ctrl+Shift+C.
The dataset we will be working with has been cleaned (to an extent) for the purposes of this workshop. It is a dataset taken from NHANES and cleaned up and modified for our use.
# Check the file path
here::here("data", "nhanes_basic_info.csv")
[1] "D:/RWorkshops/research_methodology_data_analysis/rmda_book/data/nhanes_basic_info.csv"

# Read Data
df <- read_csv(here("data", "nhanes_basic_info.csv"))
Try the following functions using df as the argument:
glimpse()
head()
names()
Now, we will introduce you to three new packages:
dplyr
skimr
DataExplorer
dplyr Package
The dplyr is a powerful R package to manipulate, clean, and summarize unstructured data. In short, it makes data exploration and data manipulation easy and fast in R.
There are many useful verbs in dplyr; some of them are given here…
|> or %>%
Note
The pipe |> means THEN…
The pipe is an operator in R that allows you to chain together functions in dplyr.
Let’s find the bottom 50 rows of df without and with the pipe.
Tip: The native pipe |> is preferred.
# without the pipe
tail(df, n = 50)

# with the pipe
df |> tail(n = 50)
Now let’s see what the code looks like if we need 2 functions. Find the unique ages in the bottom 50 rows of df.
# without the pipe
unique(tail(df, n = 50)$age)

# with the pipe
df |>
  tail(50) |>
  distinct(age)
Note
The shortcut for the pipe is Ctrl+Shift+M
You will notice that we used different functions to complete our task. The code without the pipe uses functions from base R, while the code with the pipe uses a mixture (tail() from base R and distinct() from dplyr). Not all functions work with the pipe, but we will usually opt for those that do when we have a choice.
distinct() and count()
The distinct() function returns the distinct values of a column, while count() provides both the distinct values of a column and the number of times each value shows up. The following example investigates the different races (race) in the df dataset:
df |>
  distinct(race)

df |>
  count(race)
Notice that there is a new column produced by the count function called n.
arrange()
The arrange() function does what it sounds like: it takes a data frame or tbl and arranges (or sorts) it by column(s) of interest. The first argument is the data, and subsequent arguments are the columns to sort on. Use the desc() function to arrange in descending order.
The following code would get the number of times each race is in the dataset:
df |>
  count(race) |>
  arrange(n)

# Since the default is ascending order,
# we are not getting the results that are probably useful,
# so let's use the desc() function
df |>
  count(race) |>
  arrange(desc(n))

# shortcut for desc() is -
df |>
  count(race) |>
  arrange(-n)
filter()
If you want to return rows of the data where some criteria are met, use the filter() function. This is how we subset in the tidyverse. (The base R equivalent is subset().)
Here are the logical criteria in R:
==
: Equal to!=
: Not equal to>
: Greater than>=
: Greater than or equal to<
: Less than<=
: Less than or equal toIf you want to satisfy all of multiple conditions, you can use the “and” operator, &
.
The “or” operator | (the vertical pipe character, shift-backslash) will return a subset that meets any of the conditions.
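For instance, a sketch of the “or” operator (the “and” operator appears in the examples below):

# rows where race is "White" OR age is 60 or above
df |>
  filter(race == "White" | age >= 60)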
Let’s see all the data for age 60 or above
df |>
  filter(age >= 60)
Let’s just see the data for race “White”:
df |>
  filter(race == "White")
Both White and age 60 or more
df_60_plus_white <- df |>
  filter(age >= 60 & race == "White")
%in%
To filter() a categorical variable for only certain levels, we can use the %in% operator.
Let’s check which race groups are in the dataset.
df |>
  select(race) |>
  unique()
# A tibble: 5 × 1
race
<chr>
1 White
2 Mexican
3 Hispanic
4 Other
5 Black
Now we’ll create a vector of races we are interested in
<- c("Mexican",
others "Hispanic",
"Other")
And use that vector to filter() df for races %in% others:

df |>
  filter(race %in% others)
You can also save the results of a pipeline. Notice that the rows belonging to minority races are returned in the console. If we wanted to do something with those rows, it might be helpful to save them as their own dataset. To create a new object, we use the <- operator.
others_df <- df |>
  filter(race %in% others)
drop_na()
The drop_na() function is extremely useful when we need to subset a variable to remove missing values.
Return the NHANES dataset without rows that were missing on the education variable
df |>
  drop_na(education)
Return the dataset without any rows that had an NA in any column. Use with caution, because this will remove a lot of data!
df |>
  drop_na()
select()
Whereas the filter() function allows you to return only certain rows matching a condition, the select() function returns only certain columns. The first argument is the data, and subsequent arguments are the columns you want.
See just the id, age, and education columns:
# list the column names you want to see separated by a comma
df |>
  select(id, age, education)
Use the - sign to drop columns:
df |>
  select(-age_months, -poverty, -home_rooms)
select() helper functions
The starts_with(), ends_with() and contains() functions provide very useful tools for dropping/keeping several variables at once without having to list each and every column you want to keep. They return columns that either start with a specific string of text, end with a certain string of text, or contain a certain string of text.
# these functions are all case sensitive
df |>
  select(starts_with("home"))

df |>
  select(ends_with("t"))

df |>
  select(contains("_"))

# columns that do not contain _
df |>
  select(-contains("_"))
summarize()
The summarize() function summarizes multiple values into a single value. On its own, the summarize() function doesn’t seem to be all that useful. The dplyr package provides a few convenience functions called n() and n_distinct() that tell you the number of observations or the number of distinct values of a particular variable.
Note: summarize() is the same as summarise().
Notice that summarize takes a data frame and returns a data frame. In this case it’s a 1x1 data frame with a single row and a single column.
df |>
  summarize(mean(age))

# watch out for NAs. Use na.rm = TRUE to run the calculation after excluding NAs.
df |>
  summarize(mean(weight, na.rm = TRUE))
The name of the column is the expression used to summarize the data. This usually isn’t pretty, and if we wanted to work with this resulting data frame later on, we’d want to name that returned value something better.
df |>
  summarize(mean_age = mean(age, na.rm = TRUE))
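A minimal sketch of the n() and n_distinct() convenience functions mentioned above:

df |>
  summarize(n_rows = n(), # number of observations
            n_races = n_distinct(race)) # number of distinct race values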
group_by()
We saw that summarize() isn’t that useful on its own. Neither is group_by(). All it does is take an existing data frame and convert it into a grouped data frame, where operations are subsequently performed by group.
df |>
  group_by(gender)

df |>
  group_by(gender, race)
group_by() and summarize() together
The real power comes when group_by() and summarize() are used together. First, write the group_by() statement. Then pipe the result to a call to summarize().
Let’s summarize the mean height for each race:
df |>
  group_by(race) |>
  summarize(mean_height = mean(height, na.rm = TRUE))

# sort the output by descending mean_height
df |>
  group_by(race) |>
  summarize(mean_height = mean(height, na.rm = TRUE)) |>
  arrange(desc(mean_height))
mutate()
Mutate creates a new variable or modifies an existing one.
Let’s create a column called elderly, which is “Yes” if the age is greater than or equal to 65 and “No” otherwise.
df |>
  mutate(elderly = if_else(age >= 65,
                           "Yes",
                           "No"))
The same thing can be done using case_when().
df |>
  mutate(elderly = case_when(
    age >= 65 ~ "Yes",
    age < 65 ~ "No",
    TRUE ~ NA))
Let’s do it again, but this time make it 1 and 0: 1 if age is greater than or equal to 65, 0 otherwise.
df |>
  mutate(old = case_when(
    age >= 65 ~ 1,
    age < 65 ~ 0,
    TRUE ~ NA))
Note
The if_else() function may result in slightly shorter code if you only need to code for 2 options. For more options, nested if_else() statements become hard to read and can result in mismatched parentheses, so case_when() is the more elegant solution.
As a second example of case_when(), let’s say we wanted to create a new income variable that is low, medium, or high. See income_hh broken into 3 roughly equally sized portions:
quantile(df$income_hh, probs = c(.33, .66), na.rm = TRUE)
Note
See the help file for the quantile() function or type ?quantile in the console.
We’ll say:
df |>
  mutate(income_cat = case_when(
    income_hh <= 30000 ~ "low",
    income_hh > 30000 & income_hh <= 70000 ~ "medium",
    income_hh > 70000 ~ "high",
    TRUE ~ NA))
join()
Typically, in a data science or data analysis project, one has to work with many sources of data, and the researcher must be able to combine multiple datasets to answer the questions of interest. Collectively, these multiple tables of data are called relational data because, more than the individual datasets, it is the relations between them that matter.
As with the other dplyr verbs, there are different families of verbs designed to work with relational data, and one of the most commonly used families is the mutating joins. These include:
left_join(x, y) combines all columns in data frame x with those in data frame y, but only retains rows from x.
right_join(x, y) also keeps all columns but operates in the opposite direction, returning only rows from y.
full_join(x, y) combines all columns of x with all columns of y and retains all rows from both data frames.
inner_join(x, y) combines all columns present in either x or y but only retains rows that are present in both data frames.
anti_join(x, y) returns the columns from x only and retains rows of x that are not present in y.
anti_join(y, x) returns the columns from y only and retains rows of y that are not present in x.
Apart from specifying the data frames to be joined, we also need to specify the key column(s) to be used for joining the data. Key columns are specified with the by argument, e.g. inner_join(x, y, by = "subject_id") adds columns of y to x for all rows where the values of the “subject_id” column (present in each data frame) match. If the name of the key column differs between the two data frames, e.g. “subject_id” in x and “subj_id” in y, then you have to specify both names using by = c("subject_id" = "subj_id").
Example
Let’s try to join the basic information dataset (nhanes_basic_info.csv) with the clinical dataset (nhanes_clinical_info.rds).
basic <- read_csv(
  here("data",
       "nhanes_basic_info.csv"))
Rows: 5679 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): gender, race, education, marital_status, home_own, work, bmi_who
dbl (7): unique_id, age, income_hh, poverty, home_rooms, height, weight
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
clinical <- read_rds(
  here("data",
       "nhanes_clinical_info.rds"))
df <- basic |>
  left_join(clinical)
Joining with `by = join_by(unique_id)`
Try to join the behaviour dataset (nhanes_behaviour_info.rds).
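One possible sketch for this exercise, assuming the behaviour file shares the unique_id key used above:

behaviour <- read_rds(here("data", "nhanes_behaviour_info.rds"))
df <- df |>
  left_join(behaviour, by = "unique_id")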
pivot()
Most often, when working with our data, we may have to reshape it from long format to wide format and back. We can use the pivot family of functions to achieve this task. What we mean by “the shape of our data” is how the values are distributed across rows or columns. Here’s a visual representation of the same data in two different shapes:
“Long” format is where we have a column for each of the types of things we measured or recorded in our data. In other words, each variable has its own column.
“Wide” format occurs when we have data relating to the same measured thing in different columns. In this case, we have values related to our “metric” spread across multiple columns (a column each for a year).
Let us now use the pivot functions to reshape the data in practice. The two pivot functions are:
pivot_wider() : from long to wide format.
pivot_longer() : from wide to long format.

Let’s try pivot_longer(). Suppose we need a long data format for the bp_sys and bp_sys_post variables:
df_long <- df |>
  pivot_longer(
    cols = c(bp_sys, bp_sys_post),
    names_to = "bp_sys_cat",
    values_to = "bp_value")
Let’s try pivot_wider(). Suppose we need a wide data format for the height variable based on the race variable.
df_wider <- df |>
  pivot_wider(names_from = "race",
              values_from = "height",
              names_prefix = "height_")
Check out the Data Wrangling cheatsheet that covers dplyr and tidyr functions: https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Review the Tibbles chapter of the excellent, free R for Data Science book: https://r4ds.had.co.nz/tibbles.html
Check out the Transformations chapter to learn more about the dplyr package. Note that this chapter also uses the graphing package ggplot2, which we covered yesterday: https://r4ds.had.co.nz/transform.html
Check out the Relational Data chapter to learn more about joins: https://r4ds.had.co.nz/relational-data.html
skimr Package
skimr is designed to provide summary statistics about variables in data frames, tibbles, data tables, and vectors. The core function of skimr is the skim() function, which is designed to work with (grouped) data frames, and will try to coerce other objects to data frames if possible.
Give skim() a try.
df |>
  skimr::skim()
Check out the names of the output of skimr:

df |>
  skimr::skim() |>
  names()
It also works with dplyr verbs.
df |>
  group_by(race) |>
  skimr::skim()

df |>
  skimr::skim() |>
  dplyr::select(skim_type, skim_variable, n_missing)
DataExplorer Package
The DataExplorer package aims to automate most of data handling and visualization, so that users can focus on studying the data and extracting insights.2
The single most important function from the DataExplorer package is create_report(). Try it for yourself.
# install.packages("pacman")
pacman::p_load(DataExplorer)

create_report(df)