5  Introduction to R and RStudio

5.1 What is R?

  • Open source (free!) statistical programming language/software

  • It can be used for:

    • Working with data - cleaning, wrangling and transforming
    • Conducting analyses including advanced statistical methods
    • Creating high-quality tables & figures
    • Communicating research with R Markdown
  • It is constantly growing!

  • Has a strong online support community

  • Since it is a single programming language, it is versatile enough to take you from raw data to publishable research using free, reproducible code!

5.2 What is RStudio?

  • RStudio is a free, open source IDE (integrated development environment) for R. (You must install R before you can install RStudio.)

  • Its interface is organized so that the user can clearly view graphs, tables, R code, and output all at the same time.

  • It also offers an Import-Wizard-like feature that allows users to import CSV, Excel, SPSS (*.sav), and Stata (*.dta) files into R without having to write the code to do so.

5.3 R versus Other Software

Excel and SPSS are convenient for data entry, and for quickly manipulating rows and columns prior to statistical analysis. However, they are a poor choice for statistical analysis beyond the simplest descriptive statistics, or for datasets with more than a handful of columns.

Proportion of articles in health decision sciences using the identified software

5.4 Why should you learn R?

  • R is becoming the “lingua franca” of data science
  • One of the most widely used tools for data analysis, and still rising in popularity
  • R is also the tool of choice for data scientists at Microsoft, Google, Facebook, Amazon
  • R’s popularity in academia is important because it creates a pool of talent that feeds industry.
  • Learning the “skills of data science” is easiest in R

Increasing use of R in scientific research

Some of the limitations of spreadsheet software that make R the better choice are:

  • Missing values are handled inconsistently, and sometimes incorrectly.
  • Data organisation is difficult.
  • Analyses can only be done on one column at a time.
  • Output is poorly organised.
  • There is no record of how an analysis was accomplished.
  • Some advanced analyses are impossible.

5.5 Health Data Science

Health Data Science is an emerging discipline, combining mathematics, statistics, epidemiology and informatics.

R is widely used in the field of health data science, especially in healthcare industry domains like genetics, drug discovery, bioinformatics, vaccine research, deep learning, epidemiology, and public health.


Applications of Data Science in Healthcare

As data-generating technologies have proliferated throughout society and industry, leading hospitals are trying to ensure this data is harnessed to achieve the best outcomes for patients. These internet of things (IoT) technologies include everything from sensors that monitor patient health and the condition of machines to wearables and patients’ mobile phones. All these comprise the “Big Data” in healthcare.

5.6 Reproducible Research

Research is considered to be reproducible when the exact results can be reproduced given access to the original data, software, and code.

  • The same results should be obtained under the same conditions
  • It should be possible to recreate the same conditions

Reproducibility refers to the ability of a researcher to duplicate the results of a prior study using the same materials as were used by the original investigator. That is, a second researcher might use the same raw data to build the same analysis files and implement the same statistical analysis in an attempt to yield the same results. Reproducibility is a minimum necessary condition for a finding to be believable and informative. — U.S. National Science Foundation (NSF) subcommittee on Replicability in Science

There are four key elements of reproducible research:

  • data documentation
  • data publication
  • code publication
  • output publication

Baker, M. 1,500 scientists lift the lid on reproducibility. Nature 533, 452–454 (2016)

Flavours of Reproducible Research

Factors behind irreproducible research

  • Not enough documentation on how experiment is conducted and data is generated
  • Data used to generate original results unavailable
  • Software used to generate original results unavailable
  • Difficult to recreate software environment (libraries, versions) used to generate original results
  • Difficult to rerun the computational steps

Threats to Reproducibility (Munafò et al., 2017)

While reproducibility is the minimum requirement and can be achieved with “good enough” computational practices, the replicability, robustness, and generalisability of scientific findings are an even greater concern, involving research misconduct, questionable research practices (p-hacking, HARKing, cherry-picking), sloppy methods, and other conscious and unconscious biases.

What are the good practices of reproducible research?

How to make your work reproducible?

Reproducible workflows give you credibility!

Cartoon created by Sidney Harris (The New Yorker)

Reproducibility spectrum for published research. Source: Peng, RD Reproducible Research in Computational Science Science (2011)

5.7 Getting Comfortable with R and RStudio

5.7.1 Install R

  1. Go here: https://cran.rstudio.com/

  2. Choose the correct “Download R for…” option from the top (probably Windows or macOS), then…

  1. For Windows users, choose “Install R for the first time” (next to the base subdirectory) and then “Download R 4.4.2 for Windows”

  2. For macOS users, select the appropriate version for your operating system (e.g. the latest release is version 4.4.2, will look something like R-4.4.2-arm64.pkg), then choose to Save or Open

  3. Once downloaded, open the installer, agree to the license, and install like you would any other software.

If it installs, you should be able to find the R icon in your applications.

5.7.2 Install RStudio

RStudio is a user-friendly interface for working with R. That means you must have R already installed for RStudio to work. Make sure you’ve successfully installed R in Step 1, then…

  1. Go to https://www.rstudio.com/products/rstudio/download/ to download RStudio Desktop (Open Source License). You’ll know you’re clicking the right one because it says “FREE” right above the download button.

  2. Click download, which takes you just down the page to where you can select the correct version under Installers for Supported Platforms (almost everyone will choose one of the first two options, RStudio for Windows or macOS).

  3. Click on the correct installer version, open it once downloaded, agree to the license, and install like you would any other software. The version should be at least RStudio 2024.09 “Cranberry Hibiscus”.

If it installs, you should be able to find the RStudio icon in your applications.

5.8 Understanding the RStudio environment

5.8.1 Pane layout

The RStudio environment consists of multiple windows, and each window consists of certain panels.

Panels in RStudio

  1. Source
  2. Console
  3. Environment
  4. History
  5. Files
  6. Plots
  7. Connections
  8. Packages
  9. Help
  10. Build
  11. Tutorial
  12. Viewer

It is important to understand that you will not routinely use all of these panels, nor will we use all of them during the workshop. The workshop focuses on R as a data management, visualization, and communication tool for healthcare professionals. The panels that require the most attention are Source, Console, Environment, History, Files, Packages, Help, Tutorial, and Viewer.

5.8.2 A guided tour

You are requested to make your own notes during the workshop as we dive deeper into the environment together.

5.8.3 File types in R

The most commonly used file types are:

  1. .R : Script file
  2. .Rmd : R Markdown file
  3. .qmd : Quarto file
  4. .rds : A single R object saved to file
  5. .RData : Multiple R objects saved to a single file

5.8.4 Programming basics

R is easiest to use when you know how the R language works. This section will teach you the implicit background knowledge that informs every piece of R code. You’ll learn about:

  1. Functions and their arguments
  2. Objects
  3. R’s basic data types
  4. R’s basic data structures including vectors and lists
  5. R’s package system

5.8.5 Functions and their arguments

To do anything in R, we call functions to do the work for us. For example, to compute the square root of 5197, we call the function sqrt().

sqrt(5197)
[1] 72.09022

Important things to know about functions include:

  1. Code body.

Typing a function’s name without parentheses and running it prints the function’s body, which shows what the function does behind the scenes.

sqrt
function (x)  .Primitive("sqrt")
  2. Run a function.

To run a function, we add parentheses () after its name. Inside the parentheses we supply the inputs, such as the number in the example above.

  3. Help page.

Placing a question mark before a function name takes you to its help page; note that no parentheses are used when calling the help page. This is an important habit to develop: help pages will let you learn about new functions throughout your R journey!

?sqrt 

Tip:

Comments (annotations) are meant to be read by humans, not machines: R ignores them. They let us take notes as we write, so that the next time you open your code, even after a long time, you will know what you did last summer :)
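
In R, everything on a line after a # (hash) sign is a comment. For example:

# compute the square root of 5197 (this note is for humans; R skips it)
sqrt(5197)
[1] 72.09022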


Arguments are inputs provided to the function. There are functions which take no arguments, some take a single argument and some take multiple arguments. When there are two or more arguments, the arguments are separated by a comma.

# No argument
Sys.Date()
[1] "2024-11-11"
# One argument
sqrt(5197)
[1] 72.09022
# Two arguments
sum(2,3)
[1] 5
# Multiple arguments
seq(from=1,
    to = 10, 
    by  = 2)
[1] 1 3 5 7 9

Matching arguments: Inputs can also be matched to arguments by position. For example, generating a sequence involves three arguments, viz. from, to, and by; when the names are omitted, the inputs are automatically matched to the arguments in order.

seq(1,10,2)
[1] 1 3 5 7 9

Caution: Inputs supplied in the wrong order are matched all the same, so you can silently get the wrong result. Best practice is to be explicit at the early stages: use argument names!

seq(2,10,1)
[1]  2  3  4  5  6  7  8  9 10
seq(by = 2,
    to = 10,
    from = 1)
[1] 1 3 5 7 9

Optional arguments: Some arguments are optional and may be supplied or omitted as required. When omitted, R uses their default values. For example, in the sum() function, na.rm = FALSE is an optional argument: by default NA values are not removed, so the sum returns NA whenever NA values are present. These defaults can be overridden by setting the argument explicitly.

sum(2,3,NA)
[1] NA
sum(2,3,NA, na.rm = T)
[1] 5

In contrast, arguments without defaults are mandatory: they must be supplied explicitly, otherwise an error is returned.

sqrt()
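# Error: 0 arguments passed to 'sqrt' which requires 1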

5.8.6 Objects

If we want to use results beyond just viewing them in the console, we need to store them as objects. To create an object, type the name of the object (choose wisely; make it explicit and self-explanatory!), followed by the assignment operator <-. Everything to the right of the operator is assigned to the object. A single object can hold a single value, the output of a function, multiple values, or an entire dataset.

# Single value
x <- 3
x
[1] 3
# Output from function
x <- seq(from=1,
    to = 10, 
    by  = 2)
# Better name:
sequence_from_1_to_10 <- seq(from=1,
    to = 10, 
    by  = 2)

Creating an object lets us view its contents later and makes it easier to apply additional functions to the stored result.

Tip. While typing function or object names, RStudio offers autocomplete prompts. Choose from the prompts rather than typing the entire name; it will ease out many things later!

sequence_from_1_to_10
[1] 1 3 5 7 9
sum(sequence_from_1_to_10)
[1] 25

5.8.7 Vectors

R stores values as vectors, which are one-dimensional arrays. Arrays can be two-dimensional (similar to Excel/tabular data) or multidimensional, but vectors are always one-dimensional!

Vectors can hold a single value or a combination of values. We can create our own vectors using the c() function.

single_number <- 3
single_number
[1] 3
number_vector <- c(1,2,3)
number_vector
[1] 1 2 3

Creating your own vectors is powerful, because a lot of functions in R take vectors as inputs.

mean(number_vector)
[1] 2

Vectorized functions: The function is applied to each element of the vector:

sqrt(number_vector)
[1] 1.000000 1.414214 1.732051

If we have two vectors of the same length (such as two columns of a research dataset), vectorised functions let us compute a new column: the function is applied element-wise to both vectors and returns a vector of the same length (think of this as a new column in the research data).

number_vector2 <- c(3,-4,5.4)
number_vector + number_vector2
[1]  4.0 -2.0  8.4

5.8.8 Data Types

R recognizes different types of vectors based on the values in the vector.

If all the values are numbers (positive numbers, negative numbers, decimals), R treats the vector as numerical and allows you to carry out mathematical operations/functions. You can find the class of a vector using the class() function. R labels these vectors as “double”, “numeric”, or “integer”.

class(number_vector)
[1] "numeric"
class(number_vector2)
[1] "numeric"

If the values are within quotation marks, the vector is of character type, the equivalent of a nominal variable. Whole numbers suffixed with L are stored as integers:

alphabets_vector <- c("a", "b", "c")
class(alphabets_vector)
[1] "character"
integer_vector <- c(1L,2L)
class(integer_vector)
[1] "integer"

Logical vectors contain TRUE and FALSE values

logical_vector <- c(TRUE, FALSE)
class(logical_vector)
[1] "logical"

Factor vectors represent categorical variables. Other variable types can be converted to factors using the factor() function.

factor_vector <- factor(number_vector)
factor_vector
[1] 1 2 3
Levels: 1 2 3

We can add labels to factor vectors using optional arguments

factor_vector <- factor(number_vector,
                        levels =c(1,2,3),
                        labels = c("level1", 
                                   "level2", 
                                   "level3"))
factor_vector
[1] level1 level2 level3
Levels: level1 level2 level3

One vector = one type. For example, when there is a mix of numbers and characters, R treats all the values as characters.

mix_vector <- c(1,"a")
class(mix_vector)
[1] "character"

Note that the number 1 has been converted into character class.

mix_vector[1]
[1] "1"
mix_vector[1] |> class()
[1] "character"

Double, character, integer, logical, complex, raw, dates, etc.: there are many other data types and objects, but for now, let’s start with these. You will pick up additional types as you proceed in your R journey!

5.8.9 Lists

In addition to vectors, lists are another powerful type of object. A list can be considered a vector of vectors!! Lists enable you to store multiple types of vectors together. A list is made using the list() function, which is similar to c() but creates a list rather than a vector. It is good practice to name the vectors in a list.

example_list <- list(numbers = number_vector, 
                     alphabets = alphabets_vector)
class(example_list)
[1] "list"
example_list
$numbers
[1] 1 2 3

$alphabets
[1] "a" "b" "c"

The elements of a named list can be accessed using the $ operator.

example_list$numbers
[1] 1 2 3

5.8.10 Packages

There are thousands of functions in R. To be computationally efficient, R does not load all of them at startup; it loads only the base functions. To use additional functions, we load the packages that provide them with the library() function.

Additional packages are installed once, but must be loaded every time you start an R session.
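
A minimal sketch (tidyverse here stands in for any package you need):

# install once; this downloads the package from CRAN
install.packages("tidyverse")

# load at the start of every R session in which you need it
library(tidyverse)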

With these basics, let’s dive deep into the workshop!! Are you ready?

5.9 Fundamentals of Working with Data

In this section, we will learn how to import and export data with R. We will also talk about different file types. This section is based on the relevant chapters from two renowned textbooks on the tidyverse.1 These textbooks take different approaches to importing and working with data in RStudio using tidyverse packages. We present the workflows that best facilitate reproducibility and ease of understanding.

5.9.1 The Importance of Setting Up a Project

Before diving into data analysis and working with R, it’s crucial to establish a well-organized workflow. Setting up an R project for each analysis in RStudio is one of the best practices for maintaining this structure. Here’s why it matters:

Organized Workspace: An R project creates a dedicated workspace, keeping all files, scripts, and data for each analysis in one place. This structure makes it easier to locate and manage your resources and helps prevent clutter on your computer.

Consistent File Paths: When working within an R project, file paths become relative to the project’s root directory. This avoids the need for absolute paths (e.g., C:/Users/YourName/ProjectFolder), making your code portable. For example, using relative paths allows you to share your project with others without requiring adjustments to file paths.

Enhanced Reproducibility: With an R project, you can easily recreate your analysis environment. The .Rproj file saves specific project settings, allowing you to return to the project later and pick up where you left off with minimal setup. This is particularly valuable when revisiting analyses or sharing work with collaborators.

To create a new project, open RStudio, go to the File menu, select New Project, and follow the prompts. You’ll see that RStudio sets up a unique working directory, which helps maintain consistency and clarity throughout your analysis.

Creating Project using RStudio


By following this practice, you set up a solid foundation for a clean, organized, and reproducible workflow in R.

5.9.2 File Paths using here

Reading and writing files often involves the use of file paths. A file path is a string of characters that point R and RStudio to the location of the file on your computer.

These file paths can be a complete location (C:/Users/Arun/RIntro_Book.Rmd) or just the file name (RIntro_Book.Rmd). If you pass R a partial file path, R appends it to the file path of your working directory. The working directory is the directory where your .Rproj file is.

When working with files in R, defining paths correctly is essential for accessing your data and saving outputs. The here package is a powerful tool that simplifies file paths, especially within R projects, by automatically locating the root directory of your project.

5.9.2.1 Why Use here?

Simplifies Paths: Instead of typing out long, complex file paths, here constructs paths relative to the root of your project. This makes your code cleaner and easier to read.

Improves Portability: Using here makes your code more portable. When sharing your project with others or switching between computers, the paths generated by here adjust automatically based on the project’s root, so there’s no need to modify paths manually.

Avoids Path Errors: Typing out file paths can lead to errors if you move files around or change directories. The here function helps prevent these issues by always starting paths from the same project root.

Using here in practice: The here package creates paths by combining the project root directory with any subdirectories or file names you specify. For example:

#install.packages("pacman")

pacman::p_load(here)

here()
[1] "D:/RWorkshops/research_methodology_data_analysis/rmda_book"

When you run here::here() in your R project, it returns the full file path up to the directory where your R project was created. This directory is known as the project root.

If you have a file named nhanes_modified_df.rds stored inside a folder called data within your project, you can easily reference it using the here function. By writing:

here("data", "nhanes_modified_df.rds")
[1] "D:/RWorkshops/research_methodology_data_analysis/rmda_book/data/nhanes_modified_df.rds"

you’re creating a file path that points directly to the nhanes_modified_df.rds file within the data folder, starting from the root of your project. This method keeps things neat, adaptable, and prevents hardcoding of long file paths. Whether you move your project to another computer or share it with someone else, this path will still work without any changes. It’s a simple way to make your workflow more efficient!

5.9.3 Importing data using the RStudio GUI

The RStudio IDE provides an Import Dataset button in the Environment pane, which appears in the top right corner of the IDE by default. You can use this button to import data that is stored in plain text files as well as in Excel, SAS, SPSS, and Stata files.

We recommend the .csv file type for reading and writing your data as a best practice. Because a .csv is just a raw text file with values separated by commas, it ensures cross-compatibility between various programs.

5.9.4 Importing and Exporting Data (.rds) using a Library

.rds is a file format native to R for saving compressed content. .rds files are not text files and are not human readable in their raw form. Each .rds file contains a single object, which makes it easy to assign its output directly to a single R object. This is not necessarily the case for .RData files, which makes .rds files safer to use.

We can use the read_rds() and write_rds() functions from the readr package to read and write .rds files. The write_rds() function saves previously loaded data as an .rds file. You can look at the help menu for more on the syntax, or type ?write_rds in the Console pane.

For example:

df <- readr::read_rds(here("data", "nhanes_modified_df.rds"))

In the above line of code we are instructing R to:

  • Look inside the project folder: here::here("data", "nhanes_modified_df.rds") tells R to look in the data folder within your project for a file named nhanes_modified_df.rds.

  • Read the .rds file: readr::read_rds() is used to load this .rds file into the object df.

However, if there is:

  • A spelling mistake in either the folder name (data) or the file name (nhanes_modified_df.rds), or

  • The file doesn’t exist at the specified location,

R will not be able to find the file, and you’ll encounter an error message, typically saying the file cannot be found.

Similarly, you can use the write_csv() function from the readr package to write a .csv file.
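
A minimal sketch (the file names here are illustrative):

# write df out as an R-native .rds file in the project's data folder
readr::write_rds(df, here("data", "nhanes_modified_df.rds"))

# or as a plain-text .csv for cross-program compatibility
readr::write_csv(df, here("data", "nhanes_modified_df.csv"))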

Try it!

Note

There are different packages to import different types of data.

  • haven : SPSS, Stata, or SAS
  • readxl : Excel spreadsheets
  • readr : csv, txt, tsv etc.

5.9.5 Tibble and Data.frames

When working with data in R, you’ll frequently encounter two common types of data structures: tibbles and data.frames. While both are used to store tabular data, they have some important differences that affect how they behave and how you interact with them. Understanding these differences can help streamline your data analysis and avoid potential pitfalls.

To learn more in-depth about tibbles, you can run vignette("tibble") in your R console, which provides a comprehensive overview.

Some major differences are:

  • Input types remain unchanged - data.frame (before R 4.0) converted strings to factors; tibble never does
  • Variable names remain unchanged - data.frame will remove spaces or prepend “X” to numeric column names; tibble will not
  • There are no row.names() for a tibble
  • A tibble prints only the first ten rows, and only the columns that fit on one screen
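
A quick sketch of the column-name difference (assuming the tibble package is loaded with library(tibble)):

# data.frame adjusts non-syntactic column names; tibble leaves them alone
names(data.frame(`my col` = 1))
[1] "my.col"
names(tibble(`my col` = 1))
[1] "my col"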

5.9.6 Principles of Tidy Data

5.9.6.1 What is Tidy Data?

Tidy data is a way to describe data that’s organized with a particular structure – a rectangular structure, where each variable has its own column, and each observation has its own row. — Hadley Wickham, 2014

5.9.6.2 Three Rules of Tidy Data

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

These three rules are interrelated because it’s impossible to only satisfy two of the three.

5.9.6.3 Messy Data vs Tidy Data

Tidy datasets are all alike, but every messy dataset is messy in its own way. - Hadley Wickham

Source: R for Data Science (http://r4ds.had.co.nz/)

Working with messy data can be messy! You need to build custom tools from scratch each time you work with a new dataset.

Illustrations from : https://github.com/allisonhorst/stats-illustrations

5.9.6.4 Tidy data for more efficient data science

Packages like tidyr and dplyr enable you to get on with analysing your data and answering key questions, rather than spending time trying to clean the data.

Note

Tidy data allows you to be more efficient by using specialised tools built for the tidy workflow. There are a lot of tools specifically built to wrangle untidy data into tidy data.

One other advantage of working with Tidy data is that it makes it easier for collaboration, as your colleagues can use the same familiar tools rather than getting overwhelmed with all the work you did from scratch. It is also helpful for your future self as it becomes a consistent workflow and takes less adjustment time for any incremental changes.

Tidy data also makes it easier to reproduce analyses because they are easier to understand, update, and reuse. By using tools together that all expect tidy data as inputs, you can build and iterate really powerful workflows.

5.9.7 A word on Tibble

When loading data into R through the RStudio GUI with tidyverse, the data is automatically saved as a tibble. A tibble is a data frame, but with some added functionality and properties that make our lives easier. It is the single most important workhorse of the tidyverse.

tibble() vs data.frame()

You can change data.frame objects to a tibble using the as_tibble() function.
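
A minimal sketch, using the built-in mtcars data frame (assuming the tibble package is loaded):

mtcars_tbl <- as_tibble(mtcars)
class(mtcars_tbl)
[1] "tbl_df"     "tbl"        "data.frame"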

5.9.8 Working with Tibbles

Now that you have imported data into RStudio, it’s good practice to have a look at it. There are many ways you can do this within RStudio.

  1. Through the Environment pane
  2. View() function
  3. Simply typing the name of the dataset in the Console

Some other things you can do to have a look at your data are:

  1. Checking the class of the dataset using class() function

  2. Checking the structure of the dataset using str() function

Note

class() and str() are not just limited to datasets, they can be used for any R objects.

Some additional tips for quickly looking at your data:

  • head()
  • tail()
  • glimpse()
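
For example, on the dataset imported earlier:

class(df)    # what kind of object is df?
str(df)      # structure: column types and a preview of the values
head(df)     # first six rows
tail(df)     # last six rows
glimpse(df)  # transposed preview from dplyr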

5.10 Exploring Data with R

To recap what we learnt in the previous sessions: we now know how to work within the R Project environment; here::here() makes it easy to manage file paths; we can quickly look at our data using the View() and glimpse() functions; and most tidy data is read in as a tibble, the workhorse of the tidyverse.

here::here() is better than setwd()

here::here() lets us build file paths very easily

5.11 Getting Started with the Data Exploration Pipeline

5.11.1 Set-up

#install.packages("pacman")


pacman::p_load(tidyverse, here)

#tidyverse required for tidy workflows
#here required for managing file paths

Note

The shortcut for code commenting is Ctrl+Shift+C.

5.11.2 Load Data

The dataset we will be working with has been cleaned (to an extent) for the purposes of this workshop. It is taken from the NHANES survey and has been cleaned up and modified for our use.

# Check the file path
here::here("data", "nhanes_basic_info.csv")
[1] "D:/RWorkshops/research_methodology_data_analysis/rmda_book/data/nhanes_basic_info.csv"
# Read Data
df <- read_csv(here("data", "nhanes_basic_info.csv"))

Try the following functions using df as the argument:

  • glimpse()
  • head()
  • names()

Now, we will introduce you to a few new packages:

  1. dplyr
  2. skimr
  3. DataExplorer

5.12 dplyr Package

dplyr is a powerful R package for manipulating, cleaning, and summarizing data. In short, it makes data exploration and data manipulation easy and fast in R.

There are many useful verbs in dplyr; some of them are given here…

Important functions of the dplyr package to remember

Syntax structure of the dplyr verb

5.12.1 Getting used to the pipe |> or %>%

The pipe operator in dplyr

Note

The pipe |> means THEN…

The pipe is an operator in R that allows you to chain together functions in dplyr.

Let’s find the bottom 50 rows of df, without and with the pipe.

Tip: The native pipe |> is preferred.

#without the pipe
tail(df, n = 50)

#with the pipe
df |> tail(n = 50)

Now let’s see what the code looks like if we need two functions. Find the unique ages in the bottom 50 rows of df:

#without the pipe
unique(tail(df, n = 50)$age)

# with the pipe
df |> 
  tail(50) |>
  distinct(age)

Note

The shortcut for the pipe is Ctrl+Shift+M

You will notice that we used different functions to complete our task. The code without the pipe uses functions from base R while the code with the pipe uses a mixture (tail() from base R and distinct() from dplyr). Not all functions work with the pipe, but we will usually opt for those that do when we have a choice.

5.12.2 distinct() and count()

The distinct() function returns the distinct values of a column, while count() returns both the distinct values of a column and the number of times each value appears. The following example investigates the different race groups (race) in the df dataset:

df |> 
  distinct(race) 

df |> 
  count(race)

Notice that the count() function produces a new column called n.

5.12.3 arrange()

The arrange() function does what it sounds like. It takes a data frame or tbl and arranges (or sorts) by column(s) of interest. The first argument is the data, and subsequent arguments are columns to sort on. Use the desc() function to arrange by descending.

The following code would get the number of times each race is in the dataset:

df |> 
  count(race) |> 
  arrange(n)

# Since the default is ascending order,
# we are probably not getting the most useful results,
# so let's use the desc() function
df |> 
  count(race) |> 
  arrange(desc(n))

# shortcut for desc() is -
df |> 
  count(race) |> 
  arrange(-n)

5.12.4 filter()

If you want to return rows of the data where some criteria are met, use the filter() function. This is how we subset in the tidyverse. (Base R function is subset())

Here are the logical criteria in R:

  • ==: Equal to
  • !=: Not equal to
  • >: Greater than
  • >=: Greater than or equal to
  • <: Less than
  • <=: Less than or equal to

If you want to satisfy all of multiple conditions, you can use the “and” operator, &.

The “or” operator | (the vertical pipe character, shift-backslash) returns the rows that meet any of the conditions.
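
For example, a quick sketch combining two conditions with |:

# rows where age is under 20 OR over 70
df |> 
  filter(age < 20 | age > 70)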

Let’s see all the data for age 60 or above

df |> 
  filter(age >= 60)

Let’s just see data for white

df |> 
  filter(race == "White")

Both White and age 60 or more

df_60_plus_white <- df |> 
  filter(age >= 60 & race == "White")

5.12.5 %in%

To filter() a categorical variable for only certain levels, we can use the %in% operator.

Let’s check which race groups are in the dataset.

df |> 
  select(race) |> 
  unique()
# A tibble: 5 × 1
  race    
  <chr>   
1 White   
2 Mexican 
3 Hispanic
4 Other   
5 Black   

Now we’ll create a vector of races we are interested in

others <- c("Mexican", 
              "Hispanic", 
              "Other")

And use that vector to filter() df for races %in% others

df |> 
  filter(race %in% others)

You can also save the results of a pipeline. Notice that the rows belonging to these race groups are printed in the console. If we wanted to do something with those rows, it would be helpful to save them as their own dataset. To create a new object, we use the <- operator.

others_df <- df |> 
  filter(race %in% others)

5.12.6 drop_na()

The drop_na() function is extremely useful for when we need to subset a variable to remove missing values.

Return the NHANES dataset without rows that were missing on the education variable

df |> 
  drop_na(education)

Return the dataset without any rows that had an NA in any column. Use with caution, because this can remove a lot of data.

df |> 
  drop_na()

5.12.7 select()

Whereas the filter() function allows you to return only certain rows matching a condition, the select() function returns only certain columns. The first argument is the data, and subsequent arguments are the columns you want.

See just the id, age, and education columns

# list the column names you want to see separated by a comma

df |>
  select(id, age, education)

Use the - sign to drop columns you don’t want

df |>
  select(-age_months, -poverty, -home_rooms)

5.12.8 select() helper functions

The starts_with(), ends_with() and contains() functions provide very useful tools for dropping/keeping several variables at once, without having to list each and every column you want to keep. They return the columns whose names start with, end with, or contain a given string of text.

# these functions are all case sensitive
df |>
  select(starts_with("home"))

df |>
  select(ends_with("t"))

df |>
  select(contains("_"))

# columns that do not contain _
df |>
  select(-contains("_"))

5.12.9 summarize()

The summarize() function summarizes multiple values to a single value. On its own the summarize() function doesn’t seem to be all that useful. The dplyr package provides a few convenience functions called n() and n_distinct() that tell you the number of observations or the number of distinct values of a particular variable.

Note summarize() is the same as summarise()

Notice that summarize takes a data frame and returns a data frame. In this case it’s a 1x1 data frame with a single row and a single column.

df |>
  summarize(mean(age))

# watch out for NAs. Use na.rm = TRUE to run the calculation after excluding NAs.

df |>
  summarize(mean(weight, na.rm = TRUE))

The name of the column is the expression used to summarize the data. This usually isn’t pretty, and if we wanted to work with this resulting data frame later on, we’d want to name that returned value something better.

df |>
  summarize(mean_age = mean(age, na.rm = TRUE))
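
The n() and n_distinct() helpers mentioned above work the same way; a quick sketch:

# n() counts observations; n_distinct() counts distinct values
df |>
  summarize(n_rows = n(),
            n_races = n_distinct(race))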

5.12.10 group_by()

We saw that summarize() isn’t that useful on its own. Neither is group_by(). All it does is take an existing data frame and convert it into a grouped data frame, on which subsequent operations are performed by group.

df |>
  group_by(gender) 

df |>
  group_by(gender, race)

5.12.11 group_by() and summarize() together

The real power comes in where group_by() and summarize() are used together. First, write the group_by() statement. Then pipe the result to a call to summarize().

Let’s summarize the mean height for each race group

df |>
  group_by(race) |>
  summarize(mean_height = mean(height, na.rm = TRUE))

#sort the output by descending mean_height
df |>
  group_by(race) |>
  summarize(mean_height = mean(height, na.rm = TRUE))|>
  arrange(desc(mean_height))

5.12.12 mutate()

Mutate creates a new variable or modifies an existing one.

Let’s create a column called elderly that is “Yes” if age is greater than or equal to 65 and “No” otherwise.

df |>
  mutate(elderly = if_else(
    age >= 65,
    "Yes", 
    "No"))

The same thing can be done using case_when().

df |>
  mutate(elderly = case_when(
    age >= 65 ~ "Yes",
    age < 65 ~ "No",
    TRUE ~ NA))

Let’s do it again, but this time with 1 and 0: 1 if age is greater than or equal to 65, and 0 otherwise.

df |>
  mutate(old = case_when(
    age >= 65 ~ 1,
    age < 65 ~ 0,
    TRUE ~ NA))

Note

The if_else() function may result in slightly shorter code if you only need to code for 2 options. For more options, nested if_else() statements become hard to read and could result in mismatched parentheses so case_when() will be a more elegant solution.

As a second example of case_when(), let’s say we wanted to create a new income variable that is low, medium, or high.

See the income_hh broken into 3 equally sized portions

quantile(df$income_hh, probs = c(.33, .66), na.rm = TRUE)

Note

See the help file for the quantile() function or type ?quantile in the console.

We’ll say:

  • low = 30000 or less
  • medium = between 30000 and 70000
  • high = above 70000

df |>
  mutate(income_cat = case_when(
    income_hh <= 30000 ~ "low",
    income_hh > 30000 & income_hh <= 70000 ~ "medium",
    income_hh > 70000 ~ "high",
    TRUE ~ NA)) 

5.12.13 join()

Typically, in a data science or data analysis project, one has to work with many sources of data. The researcher must be able to combine multiple datasets to answer the questions they are interested in. Collectively, these multiple tables of data are called relational data, because it is the relations between them, more than the individual datasets, that matter.

As with the other dplyr verbs, there are families of verbs designed to work with relational data, and one of the most commonly used is the family of mutating joins.

Different types of joins, represented by a series of Venn diagrams

These include:

  • left_join(x, y) which combines all columns in data frame x with those in data frame y but only retains rows from x.

  • right_join(x, y) also keeps all columns but operates in the opposite direction, returning only rows from y.

  • full_join(x, y) combines all columns of x with all columns of y and retains all rows from both data frames.

  • inner_join(x, y) combines all columns present in either x or y but only retains rows that are present in both data frames.

  • anti_join(x, y) returns the columns from x only and retains rows of x that are not present in y.

  • anti_join(y, x) returns the columns from y only and retains rows of y that are not present in x.

Visual representation of the join() family of verbs

Apart from specifying the data frames to be joined, we also need to specify the key column(s) that is to be used for joining the data. Key columns are specified with the by argument, e.g. inner_join(x, y, by = "subject_id") adds columns of y to x for all rows where the values of the “subject_id” column (present in each data frame) match. If the name of the key column is different in both the dataframes, e.g. “subject_id” in x and “subj_id” in y, then you have to specify both names using by = c("subject_id" = "subj_id").

Example

Let’s try joining the basic information dataset (nhanes_basic_info.csv) with the clinical dataset (nhanes_clinical_info.rds).

basic <- read_csv(
  here("data", 
       "nhanes_basic_info.csv"))
Rows: 5679 Columns: 14
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): gender, race, education, marital_status, home_own, work, bmi_who
dbl (7): unique_id, age, income_hh, poverty, home_rooms, height, weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
clinical <- read_rds(
  here("data", 
       "nhanes_clinical_info.rds"))

df <- basic |> 
  left_join(clinical)
Joining with `by = join_by(unique_id)`
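
The message shows that R matched the two tables on their shared unique_id column. Stating the key explicitly is safer practice and silences the message:

df <- basic |> 
  left_join(clinical, by = "unique_id")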

Try joining the behaviour dataset (nhanes_behaviour_info.rds).

5.12.14 pivot()

Most often, when working with our data, we have to reshape it from long format to wide format and back. We can use the pivot family of functions to achieve this task. By “the shape of our data” we mean how the values are distributed across rows and columns. Here’s a visual representation of the same data in two different shapes:

Long and Wide format of our data

  • “Long” format is where we have a column for each of the types of things we measured or recorded in our data. In other words, each variable has its own column.

  • “Wide” format occurs when we have data relating to the same measured thing in different columns. In this case, we have values related to our “metric” spread across multiple columns (a column each for a year).

Let us now use the pivot functions to reshape the data in practice. The two pivot functions are:

  • pivot_wider(): from long to wide format.
  • pivot_longer(): from wide to long format.

Let’s try pivot_longer(). Suppose we need a long data format for the bp_sys and bp_sys_post variables:

df_long <- df |> 
  pivot_longer(
    cols = c(bp_sys, bp_sys_post),
    names_to = "bp_sys_cat",
    values_to = "bp_value")

Let’s try pivot_wider(). Suppose we need a wide data format for the height variable, spread based on the race variable:

df_wider <- df |> 
  pivot_wider(names_from = "race",
              values_from = "height",
              names_prefix = "height_")

Resources for learning more dplyr

  • Check out the Data Wrangling cheatsheet that covers dplyr and tidyr functions.(https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

  • Review the Tibbles chapter of the excellent, free R for Data Science book.(https://r4ds.had.co.nz/tibbles.html)

  • Check out the Transformations chapter to learn more about the dplyr package. Note that this chapter also uses the graphing package ggplot2, which we covered yesterday. (https://r4ds.had.co.nz/transform.html)

  • Check out the Relational Data chapter to learn more about the joins.(https://r4ds.had.co.nz/relational-data.html)

5.13 skimr Package

skimr is designed to provide summary statistics about variables in data frames, tibbles, data tables, and vectors. The core function of skimr is skim(), which is designed to work with (grouped) data frames and will try to coerce other objects to data frames if possible.

Give skim() a try.

df |> 
  skimr::skim()

Check out the names of the output of skim()

df |> 
  skimr::skim() |> 
  names()

It also works with dplyr verbs

df |> 
  group_by(race) |> 
  skimr::skim()
df |> 
  skimr::skim() |>
  dplyr::select(skim_type, skim_variable, n_missing)

5.14 DataExplorer Package

The DataExplorer package aims to automate most data handling and visualization, so that users can focus on studying the data and extracting insights.2

The single most important function from the DataExplorer package is create_report().

Try it for yourself.

pacman::p_load(DataExplorer)

create_report(df)

  1. Tidyverse Skills for Data Science and The Tidyverse Cookbook↩︎

  2. DataExplorer Package↩︎