Description

This workshop will provide an introduction to R!

R is a popular programming language that many researchers use for organizing data, visualizing data, and carrying out statistical analyses.

By the end of this workshop series, my hope is that you will feel comfortable enough to work independently in R!

[What people think coding is versus what it actually is]

Outline

Pre-Workshop: Downloading R and RStudio Software | R: https://ftp.osuosl.org/pub/cran/ RStudio: https://posit.co/download/rstudio-desktop/
Week 1: Intro to R | Learning how to navigate the software: R, RStudio, and creating scripts in R Markdown. We will also work on installing and loading packages.
Week 2: Working Directories | Learn how to navigate working directories and read data into R
Week 3: Subsetting | Understand how to access rows and columns and filter observations
Week 4: If Else statements | Using the ifelse() function to create new columns
Week 5: Intro to For Loops | Learning the structure and application of For loops in R
Week 6: Pivoting data from wide to long and long to wide | Understanding the differences between data in wide-format and long-format
Week 7: Merging data frames | Merging two data frames together
Week 8: Data cleaning | Learning how to apply previously learned functions toward cleaning a raw dataset
Week 9: Analyzing Data w/ Categorical Independent Variables | Conducting statistical analyses with categorical predictors
Week 10: Analyzing Data w/ Continuous Independent Variables | Conducting statistical analyses with continuous and categorical predictors
Week 11: Visualizing data: Intro to ggplot | Learn how to create ggplot visualizations and customize plots
Final Project? | TBD
Conclusion | Closing and general notes

Are you ready to start learning R?!

Pre-Workshop: Downloading R and RStudio Software

Before the workshop, we’ll need to download R and RStudio. Throughout the workshop, we’ll be working in RStudio, which will allow us to write code in R. So let’s make sure we have both R and RStudio installed before we begin!

  1. Download R from a CRAN mirror, which hosts the R programming language that we will be using in RStudio. https://cran.r-project.org/

  2. Download RStudio, which is the main software that we will be using to work with R. https://posit.co/download/rstudio-desktop/

  3. Download the CABLAB-R-Workshop-Series folder from the CABLAB R Workshop Series Github page (https://github.com/steventmartinez/CABLAB-R-Workshop-Series) by pressing the green Code button and downloading the ZIP folder. This folder contains all the files we will be working with for the purposes of this workshop.

  4. Open up a new R Markdown document by clicking File > New File > R Markdown. First time R users will be asked to download packages once they open up an R Markdown file. Click “Yes” to downloading those packages!

Week 1: Intro to R

Opening a new R Markdown File

To get things started, open RStudio. Then, let’s try opening a new R Markdown document by clicking File > New File > R Markdown…

First time R users will be asked to download packages once they open up an R Markdown file. Click “Yes” to downloading those packages!

This should produce a dialogue box where you can enter the name of the script and your name before selecting OK.

Next, let’s clear out all of the default text that appears in a new R Markdown document, which I have highlighted below:

Intro to R Markdown

In a typical coding script, every line must contain code that the language can interpret. If you want to include notes, you have to place a hash mark (#) before the text so that the program ignores that line. Having to comment out every note can get a bit annoying. An R Markdown script does the same things as a typical coding script, but it’s more user friendly.
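For example, this is what notes look like in a plain R script:

```r
# Lines that start with a hash mark are ignored by R.
x <- 2 + 2  # a comment can also follow code on the same line
x
```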

With R Markdown, any code that you would like R to interpret belongs in the coding chunk as illustrated below!

If we want to leave notes, we don’t have to “comment it out”. We can just write long-winded narration that can help others understand why we coded what we coded and what that code does.

That’s because a typical script will interpret any text as a command, unless the text is otherwise marked by a hash mark (#). An R Markdown script only interprets things as code when we tell it to, and we tell it what is code by creating a chunk. Chunks are marked by three backticks (```) followed by a {r} and, on another line, three more backticks.
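A minimal chunk looks like this (only the line between the backtick markers is run as R code):

````
```{r}
2 + 2
```
````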

A typical script can’t make sense of this, though. We need to use R markdown scripts to do it. You might be thinking, though, that manually denoting code from non-code seems like extra work, and it is a little bit, but it can also be a lot more convenient because the output of any given chunk will appear in the R Studio Console Window. By output, we just mean the product, sum, or status of whatever calculation or item you are asking R to compute and show you.

R Markdown grants us greater control over what we see and when we see it. To demonstrate, let’s start by creating a new chunk in our markdown document and entering what we see in the image above; you can then follow along with the next bit:

2 + 2
## [1] 4

With a typical script, if we want to know the output of a line we ran a while ago, we either have to rerun it or scroll through the console to find it. With R Markdown, we can minimize entire chunks and their output by using the minimization button [Minimization Arrow] on the left side of the window.

If we want to hide output, we can use the expand/collapse button [Minimize Command] on the right side of the output window.

We can choose exactly what we want to run using the “Run” command [Run Command] in the upper right corner of the chunk.

Also of note, the down-facing arrow (the second icon in the upper right corner of the code chunk) tells R “Run all of the chunks that come before this one” [Run All Chunks Command]. It can be helpful if you make a mistake and don’t want to manually rerun all of the previous chunks one by one to get back to where you were. It also makes your code very easy for other people to run; they can quite literally do it with the click of a button!

If we click the cog icon in the same tray, we can access the output options and manipulate where output appears and what it looks like, but that’s beyond the scope of this review [Settings Command].

What’s a “Package”?

Packages in R are synonymous with libraries in other languages. They are more or less convenient short-cuts or functions someone else already programmed to save us some work. Somebody else already figured out a very quick way to compute a function so now we don’t have to! We just use their tools to do it.

Installing packages

New packages are centralized in R’s repository (CRAN), so even though thousands of people are working on them independently, you don’t need to leave R to find them. Before a package can be used, it must be installed, and you can do that pretty simply:

install.packages("PACKAGENAME")

If you need to update a package, you can just re-run the code above. If you’re using RStudio, you can also see a list of your packages and their associated descriptions in the ‘Packages’ tab of your Viewer window.

Packages tab of viewer window where one can visualize previously installed packages

Loading packages

Now that we’ve installed a package, that doesn’t mean we can use it yet. We need to tell R “We want access to the functions this package has during this session” by calling it with the library() command.

library(PACKAGENAME)

Notice that we drop the quotation marks now. We just specify the (case-sensitive) package name and it lets R know we are planning on using that this session.

You might be wondering why we need to take this extra step. Sometimes different packages use the same function names, so having more than one of those loaded at the same time could confuse R (when this happens, R will usually warn you). And loading ALL of your packages at once takes up memory, which might leave your computer running extremely slowly. It’s the same for most languages.
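When a naming conflict does happen, you can sidestep it with the :: operator, which calls a function from a specific package regardless of what is loaded. For example, dplyr’s filter() masks base R’s stats::filter(), but both stay reachable:

```r
# stats::filter() is base R's moving-average/time-series filter.
# The package::function() syntax always calls that package's version,
# even if another loaded package defines a function with the same name.
smoothed <- stats::filter(1:10, rep(1/3, 3))
smoothed[2]  # the average of 1, 2, and 3
```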

If we ever want to explore the functions contained within a package in conjunction with examples, we can either go to the R documentation website or type ‘??PackageName’ into the Console, which will then populate the Help Tab of the Viewer Window with information on the package.

Let’s try installing and loading a few packages for practice. Let’s install and load the following packages in R: naniar, report, tidyverse, dplyr, Matrix, lme4, lmerTest, and ggplot2.
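A sketch of what that can look like: install.packages() accepts a character vector, so all eight packages can be installed in one call. The install and library lines are commented out here because installation only needs to happen once per machine.

```r
pkgs <- c("naniar", "report", "tidyverse", "dplyr",
          "Matrix", "lme4", "lmerTest", "ggplot2")

# install.packages(pkgs)                       # run once; downloads from CRAN
# lapply(pkgs, library, character.only = TRUE) # run at the start of each session
length(pkgs)  # 8 packages to work with
```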

Week 1 Exercise: Installing and Loading Packages

'



'

Click for solution

Week 1 Assignment: Install and Load “swirl” library and complete “Module 1: Basic Building Blocks”

Swirl is a really cool package in R that teaches you R programming and data science interactively, at your own pace, and right in the R console! For our first assignment, I think swirl explains some fundamental concepts in a better way than I can, so let’s tackle the “R Programming: The basics of programming in R” course and complete Module 1: Basic Building Blocks in swirl.

Some of it will make sense, and some of it won’t (and that’s okay!), but I think swirl does a pretty good job of orienting people to how basic operations in R work, and I think this is especially helpful before we start working with any actual data.

Let’s give this a try and we can talk through any problems people ran into during our next workshop. I’ve attached some screenshots below demonstrating how to install and load swirl().

Week 2: Working Directories

Working Directories in R: What is a Working Directory?

Hopefully swirl() has helped you feel a bit more comfortable in navigating R. Today we will focus on working with directories in R.

A working directory is a fancy term that refers to the default location where R will look for files you want to load and where it will put any files you save. Like any other language or program, R needs to be told where the data that we’d like to work with is located on our computer. It doesn’t just know automatically.

Below we’ll use the getwd() command to check where your current working directory is.

Using the list.files() command will show you what files exist in your current working directory.

getwd() #get your current working directory
## [1] "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main"
list.files() #Use list.files() to check the contents of your working directory
## [1] "CABLAB_R_online.Rmd" "datasets"            "exercise_solutions" 
## [4] "images"              "index.html"          "misc"               
## [7] "R memes"             "README.md"

Working Directories in R: Specifying your Working Directory

In order to work with the data that we want, we’ll have to tell R where the files are located. To make this simple, we can create a new variable containing the filepath so we aren’t writing it out multiple times. Filepaths differ depending on whether you are using Windows or a Mac. On a Windows computer, your filepath will likely start with “C:/”; on a Mac, it will likely start with a forward slash (“/”). If you’re not sure of your path, R makes it relatively easy to find.

You can press Tab when your cursor is just after the slash (inside the quotation marks) to see a list of the directories on your computer.

# For Windows
Path <- "C:/"

# For Mac
Path <- "/"

Here’s an example of what you should see:

An example of R’s Tab-Controlled drop-down menus

Pressing tab again will enter into a directory, thus showing me the contents of that directory. From there, I can keep hitting tab until I get to the directory, or folder, that contains the files I want to work with. I can then save this filepath, which is just what we call a string (i.e., text that does not contain a quantitative value), as an object named Path. We do so by placing the object on the left of an equal sign (=) or an arrow (<-) and the value that object is taking on the right side of it.

Below, let’s assign the filepath where our CABLAB R Workshop Series folder exists to an object called “Path”.

# For Windows
Path <- "C:/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"

# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"

This format of assigning a value to an object is really important and we’ll keep coming back to it throughout this tutorial!

Intro to “Fright Night” dataset

For the purposes of this project, we are going to work with the Fright Night dataset! The Fright Night project took place in 2021 at the Eastern State Penitentiary’s annual “Halloween Nights” haunted house event in Philadelphia. 116 participants completed a haunted house tour as part of a research study assessing the relationship between threat and memory.

Specifically, we explored 2 main research questions: 1) How does naturalistic threat affect memory accuracy? 2) Does naturalistic threat affect the way in which we communicate our memories?

Participants toured four haunted house segments (Delirium, Take 13, Machine Shop, and Crypt) that included low-threat and high-threat segments. Delirium and Take 13 were low-threat segments, whereas Machine Shop and Crypt were high-threat segments.

To assess memory accuracy, we focused on temporal memory accuracy specifically. Temporal memory refers to memory for the order in which events occur. To measure temporal memory within our study, we focused on accuracy on the recency discrimination task that participants completed for each haunted house segment. As part of the recency discrimination task, participants were shown pairs of trial-unique events within each haunted house segment and asked to select which event came first. In this way, we can determine the accuracy of people’s temporal memory for the order of the events they experienced.

To assess communication styles during memory recall, we focused on the free recall memory task, where we asked participants to freely recall their memory for each haunted house segment. We fed the free recall transcripts into a natural language processing instrument called the Linguistic Inquiry and Word Count (LIWC) software. LIWC calculates the percentage of words in a given text that belong to linguistic categories shown to index psychosocial constructs. In the example attached below, you can see the percentage of words that contribute to a linguistic category called “Authenticity”, which is thought to reflect perceived honesty and genuineness, and the percentage of words that belong to a linguistic category called “Analytical Thinking”, which is thought to reflect formal or logical thinking.

There were also 3 experimental conditions: Control, Share, and Test.

Control condition: Participants were instructed to tour the haunted house segment as they normally would.

Share condition: Participants were instructed to tour the haunted house segment in anticipation of an opportunity to post about their experience on social media afterwards.

Test condition: Participants were instructed to tour the haunted house segment in anticipation of being tested on their knowledge of the haunted house segment afterwards.

For the first two segments (Delirium and Take 13), all participants toured in the Control condition. For the last two segments (Crypt and Machine Shop), however, some participants toured both in the Control condition, some toured Machine Shop in the Share condition and Crypt in the Test condition, and others toured Machine Shop in the Test condition and Crypt in the Share condition.

After completing the haunted house tour, participants were assessed at two time points: immediately afterwards and again one week later. During the Immediate assessment, participants completed a recency discrimination task and freely recalled their memory for 1 low-threat and 1 high-threat haunted house segment. During the One-Week Delay assessment, participants completed a recency discrimination task and freely recalled their memory for all haunted house segments. Check out the study design below, as well as the vignette illustrating when the three experimental conditions (i.e., Control, Share, and Test) took place throughout the haunted house tour.

Now that we have a better idea about the study design, we can finally start working with some data!

The dataset that we start off working with for the purposes of the workshop is titled frightnight_practice.csv.

What is a “data frame”?

Before we load in the data, I want to highlight a little terminology. Tabular data in R is contained within what we call a ‘data frame’. A data frame represents the same thing that a spreadsheet represents in Excel: it contains many cells arranged into columns (which have names) and rows (which may or may not have names).
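To make this concrete, here’s a tiny hand-built data frame with made-up values (the column names echo the Fright Night data, but nothing here comes from the real dataset):

```r
# Two hypothetical participants, one row each.
toy <- data.frame(
  PID         = c(1001, 1002),
  Section     = c("Infirmary", "Asylum"),
  Fear.rating = c(3, 5)
)
toy        # print the whole data frame
nrow(toy)  # number of rows: 2
ncol(toy)  # number of columns: 3
```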

How do I load data into R?

There are many ways to load data into R, and they all depend on the format the data is in. R can handle data from .csv, .xlsx, .txt, .html, .json, SPSS, Stata, and SAS files, among others. R also has its own data formats (.RDA, .RData). With the exception of .RDA, .csv is often the cleanest format for reading in data. We won’t cover the other formats, but they are covered fairly exhaustively here: https://www.datacamp.com/tutorial/r-data-import-tutorial

Before reading in our fright night practice data CSV file, we need to use the setwd() function to tell R where to look for our CSV file. Let’s use the Path object that we created earlier to set our working directory to where the frightnight_practice.csv file is located on our computer.

In the most basic sense, we can load our fright night practice data CSV data file using the read.csv() function like this:

setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory
df <- read.csv(file = "frightnight_practice.csv") #Load in the fright night practice csv file

The setwd() command accepts our Path variable and tells R where to look for our .csv file. The read.csv() command actually loads in the data. If done correctly, we should see our R Environment populate with a dataframe labeled df.

A visualization of the Environment Window

Since we’re all using the same dataset, the number of observations and variables should be the same as in the picture above. Here, you can think of observations as “rows” and variables as “columns”. If you click on df in the Environment, it will open in a new tab of your Source Window (the same window you are likely writing your script in), where you can view it. However, we can also look at the data in our markdown file by entering the head() command from base R, which will show us the first few lines:

head(df) #will show you a subset of rows within the Data Frame
View(df) #will open up the full data frame like you would in Excel

Amazing! Now we have hundreds of columns of data, like we should. We might also notice that the first column is PID, which refers to each participant’s ID. You’ll see that each participant has 6 rows. Remember that there were two stages of assessment: 1) immediately after the haunted house tour; and 2) a delay one week later. Participants were tested on 2 of the 4 haunted house segments during the Immediate stage, and they were tested on all 4 haunted house segments during the One-Week Delay stage. As a result, every participant should have 6 rows.
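One quick way to verify that structure is table(), which counts how many rows belong to each value; on the real data you would run table(df$PID). A minimal sketch with made-up IDs:

```r
# Three hypothetical participants with six rows each.
pid <- rep(c(1001, 1002, 1003), each = 6)
counts <- table(pid)
counts             # rows per participant
all(counts == 6)   # TRUE when everyone has exactly 6 rows
```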

PID column – The participant IDs.

Section column – The name of each haunted house segment.

Stage column – Whether the assessment took place immediately afterwards or 1 week later.

Condition column – Whether the participant completed the haunted house segment in the Baseline (i.e., Control), Share, or Test condition.

Fear.rating column – How fearful participants rated each haunted house segment immediately afterwards.

TOAccuracy column – Participants’ accuracy scores on the recency discrimination task for each haunted house segment.

Recall column – Participants’ free recall for each haunted house segment.

Week 2 Exercise: Working Directories

1) Read in the frightnight_wide_exercise.csv CSV file and store it in an object called “df_wide”

2) Print out the first few rows using the head() function

3) Open up the df_wide dataframe by using the View() function OR by clicking on the df_wide dataframe in the global environment

'



'

Click for solution

Week 2 Assignment: Working Directories

There will be no week 2 assignment :)

Week 3: Subsetting data

For the Week 3 workshop, let’s read in the frightnight_practice CSV file

# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"

#set working directory
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory

df <- read.csv(file = "frightnight_practice.csv") #Load in the fright night practice csv file

By looking at the dataframe, we can see that we aren’t working with a perfectly clean dataset: some of the rows have missing data! And we don’t really need all of the columns in the dataframe to do the analyses that we’re interested in doing.

So how do we access rows? How do we access columns? And how can we check which data are missing? Learning how to access specific elements of a data frame is an extremely important part of learning R!
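We’ll tackle rows and columns next. For the missing-data question, base R’s is.na() flags missing cells, and colSums(is.na(...)) totals them per column; on the real data, colSums(is.na(df)) works the same way. A minimal sketch with a made-up data frame:

```r
# Toy data with one deliberately missing fear rating.
toy <- data.frame(
  PID         = c(1001, 1002, 1003),
  Fear.rating = c(3, NA, 5)
)
is.na(toy$Fear.rating)   # FALSE  TRUE FALSE
colSums(is.na(toy))      # count of missing values in each column
```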

dataframe$column will print out all the rows in that column. Let’s print out all the participant IDs that exist in the data frame.

df$PID 
##   [1] 1001 1001 1001 1001 1001 1001 1002 1002 1002 1002 1002 1002 1003 1003 1003
##  [16] 1003 1003 1003 1004 1004 1004 1004 1004 1004 1005 1005 1005 1005 1005 1005
##  [31] 1006 1006 1006 1006 1006 1006 1007 1007 1007 1007 1007 1007 1008 1008 1008
## ... (output truncated: the pattern continues, six rows per participant, through PID 1124)

What if we want to see a specific row? Let’s say row 2 within the PID column? To reference a specific row in a given column, I can add brackets and the number of that row behind it:

The code below will print out the second row in the PID column.

df$PID[2]
## [1] 1001

However, we can also index the column using its relative position. Knowing that the PID column is the first column, I can use bracket notation. Bracket notation is super helpful once you understand its structure. It helps to think of it as [rows, columns]: any number before the comma accesses rows, and any number after the comma accesses columns.

By including the name of the data frame before the bracket notation, we can pull certain rows and columns from that data frame

df[1,] # print the first row across all columns
df[,2] # print column 2
##   [1] "Infirmary"      "Infirmary"      "Asylum"         "DevilsDen"     
##   [5] "GhostlyGrounds" "GhostlyGrounds" "Infirmary"      "Asylum"        
##   [9] "Asylum"         "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds"
## ... (output truncated: segment names repeat across each participant's six rows)
## [385] "Infirmary"      "Asylum"         "Asylum"         "DevilsDen"     
## [389] "GhostlyGrounds" "GhostlyGrounds" "Infirmary"      "Asylum"        
## [393] "DevilsDen"      "GhostlyGrounds" NA               "Infirmary"     
## [397] "Asylum"         "Asylum"         "DevilsDen"      "DevilsDen"     
## [401] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [405] "DevilsDen"      "DevilsDen"      "GhostlyGrounds" "Infirmary"     
## [409] "Asylum"         "Asylum"         "DevilsDen"      "DevilsDen"     
## [413] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [417] "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds" "Infirmary"     
## [421] "Infirmary"      "Asylum"         "DevilsDen"      "DevilsDen"     
## [425] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [429] "DevilsDen"      "DevilsDen"      "GhostlyGrounds" "Infirmary"     
## [433] "Asylum"         "Asylum"         "DevilsDen"      "DevilsDen"     
## [437] "GhostlyGrounds" "Infirmary"      "Asylum"         "DevilsDen"     
## [441] "GhostlyGrounds" NA               "Infirmary"      "Asylum"        
## [445] "DevilsDen"      "GhostlyGrounds" NA               "Infirmary"     
## [449] "Infirmary"      "Asylum"         "DevilsDen"      "GhostlyGrounds"
## [453] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [457] "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds" "Infirmary"     
## [461] "Asylum"         "Asylum"         "DevilsDen"      "DevilsDen"     
## [465] "GhostlyGrounds" "Infirmary"      "Asylum"         "Asylum"        
## [469] "DevilsDen"      "DevilsDen"      "GhostlyGrounds" "Infirmary"     
## [473] "Asylum"         "Asylum"         "DevilsDen"      "GhostlyGrounds"
## [477] "GhostlyGrounds" "Infirmary"      "Asylum"         "Asylum"        
## [481] "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds" "Infirmary"     
## [485] "Infirmary"      "Asylum"         "DevilsDen"      "DevilsDen"     
## [489] "GhostlyGrounds" "Infirmary"      "Asylum"         "Asylum"        
## [493] "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds" "Infirmary"     
## [497] "Infirmary"      "Asylum"         "DevilsDen"      "DevilsDen"     
## [501] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [505] "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds" "Infirmary"     
## [509] "Asylum"         "Asylum"         "DevilsDen"      "GhostlyGrounds"
## [513] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [517] "DevilsDen"      "DevilsDen"      "GhostlyGrounds" "Infirmary"     
## [521] "Infirmary"      "Asylum"         "DevilsDen"      "GhostlyGrounds"
## [525] "GhostlyGrounds" "Infirmary"      "Asylum"         "Asylum"        
## [529] "DevilsDen"      "DevilsDen"      "GhostlyGrounds" "Infirmary"     
## [533] "Infirmary"      "Asylum"         "DevilsDen"      "DevilsDen"     
## [537] "GhostlyGrounds" "Infirmary"      "Asylum"         "Asylum"        
## [541] "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds" "Infirmary"     
## [545] "Infirmary"      "Asylum"         "DevilsDen"      "GhostlyGrounds"
## [549] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [553] "DevilsDen"      "DevilsDen"      "GhostlyGrounds" "Infirmary"     
## [557] "Asylum"         "Asylum"         "DevilsDen"      "DevilsDen"     
## [561] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [565] "DevilsDen"      "DevilsDen"      "GhostlyGrounds" "Infirmary"     
## [569] "Asylum"         "Asylum"         "DevilsDen"      "DevilsDen"     
## [573] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [577] "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds" "Infirmary"     
## [581] "Asylum"         "Asylum"         "DevilsDen"      "GhostlyGrounds"
## [585] "GhostlyGrounds" "Infirmary"      "Asylum"         "Asylum"        
## [589] "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds" "Infirmary"     
## [593] "Infirmary"      "Asylum"         "DevilsDen"      "DevilsDen"     
## [597] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [601] "DevilsDen"      "DevilsDen"      "GhostlyGrounds" "Infirmary"     
## [605] "Infirmary"      "Asylum"         "DevilsDen"      "GhostlyGrounds"
## [609] "GhostlyGrounds" "Infirmary"      "Asylum"         "Asylum"        
## [613] "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds" "Infirmary"     
## [617] "Asylum"         "Asylum"         "DevilsDen"      "DevilsDen"     
## [621] "GhostlyGrounds" "Infirmary"      "Asylum"         "Asylum"        
## [625] "DevilsDen"      "DevilsDen"      "GhostlyGrounds" "Infirmary"     
## [629] "Infirmary"      "Asylum"         "DevilsDen"      "DevilsDen"     
## [633] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [637] "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds" "Infirmary"     
## [641] "Asylum"         "Asylum"         "DevilsDen"      "GhostlyGrounds"
## [645] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [649] "DevilsDen"      "DevilsDen"      "GhostlyGrounds" "Infirmary"     
## [653] "Asylum"         "Asylum"         "DevilsDen"      "DevilsDen"     
## [657] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [661] "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds" "Infirmary"     
## [665] "Asylum"         "Asylum"         "DevilsDen"      "GhostlyGrounds"
## [669] "GhostlyGrounds" "Infirmary"      "Asylum"         "Asylum"        
## [673] "DevilsDen"      "DevilsDen"      "GhostlyGrounds" "Infirmary"     
## [677] "Asylum"         "Asylum"         "DevilsDen"      "GhostlyGrounds"
## [681] "GhostlyGrounds" "Infirmary"      "Infirmary"      "Asylum"        
## [685] "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds" "Infirmary"     
## [689] "Infirmary"      "Asylum"         "DevilsDen"      "DevilsDen"     
## [693] "GhostlyGrounds" "Infirmary"      "Asylum"         "Asylum"        
## [697] "DevilsDen"      "GhostlyGrounds" "GhostlyGrounds" "Infirmary"     
## [701] "Infirmary"      "Asylum"         "DevilsDen"      "GhostlyGrounds"
## [705] "GhostlyGrounds"
df[1,2] # print the value in the first row of column 2
## [1] "Infirmary"

Now that we know how to access rows and columns, let’s talk about subsetting! Subsetting is a technique for filtering the rows or columns of a data frame.

Conditional Subsetting

Let’s say we only cared about participants’ experiences for the Infirmary section of the haunted house. To subset those rows, we first need to understand how comparison operators work in R.

#print TRUE or FALSE for whether each row in the Section column equals "Infirmary"
df$Section == "Infirmary" 
##   [1]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [13]  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
##  ...  (output truncated for readability -- one TRUE or FALSE per row, 705 values in total)
## [697] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE

Notice the two equals signs (==). In R, as in many other languages, a single equals sign (=) assigns a value to an object, while the doubled comparison operators (==, as well as >=, <=, and !=) compare the values of two objects instead. With ==, if the two values are equal, the comparison produces TRUE; if not, FALSE. A value that can only be TRUE or FALSE is called a logical (or boolean) value. When we tell R to compare a column against a specific value, what it is mechanically doing is iterating through each row of that column, comparing the row’s value, and noting whether the condition is TRUE or FALSE.
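To see the difference between = and == in isolation, here is a minimal sketch using a small made-up vector (not the workshop dataset):

```r
# Toy example: element-wise comparison with ==
sections <- c("Infirmary", "Asylum", "Infirmary", "DevilsDen")

sections == "Infirmary"  # compares each element to "Infirmary"
## [1]  TRUE FALSE  TRUE FALSE

# A single = (or <-) assigns; a double == compares
x <- 5   # assignment: x now holds 5
x == 5   # comparison: returns TRUE
```

Because the comparison is applied to every element of the vector, the result is a logical vector of the same length as the input.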

Subsetting rows and columns using bracket notation

So, we could theoretically plug just about any conditional statement in our subset approaches and subset the data as we wish:

We can subset specific rows that we care about using bracket notation.

We can also subset specific columns using bracket notation.

Let’s use bracket notation to subset the rows that belong to participant 1001 and store these rows in a new data frame called “df_1001”.

Let’s also use bracket notation to subset the PID, Section, Stage, and Recall columns and store these columns in a new data frame called “df_sub”

# The nrow() command outputs how many rows the data frame has
# We're doing this to show that both approaches yield the same result

#Subsetting rows using bracket notation
df_1001 <- df[df$PID == "1001",]
nrow(df_1001)
## [1] 6
#Subsetting columns using bracket notation
cols <- c("PID", "Section", "Stage", "Recall") #create a vector of column names that we want to subset
df_sub <- df[, cols] #use bracket notation to pull the columns that we included in the "cols" vector from the df data frame.

Understanding the structure of bracket notation [rows, columns] is super important and this structure will be used to carry out more complicated functions that we’ll talk about later on in the workshop series!
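In fact, the [rows, columns] structure lets us subset rows and columns in a single call. Here is a small sketch using a made-up data frame whose column names mirror the workshop data:

```r
# Toy data frame (column names mirror the workshop data; the values are made up)
toy <- data.frame(PID     = c("1001", "1001", "1002"),
                  Section = c("Infirmary", "Asylum", "Infirmary"),
                  Recall  = c(10, 12, 8))

# [rows, columns]: keep participant 1001's rows AND only the PID and Recall columns
toy[toy$PID == "1001", c("PID", "Recall")]
```

The condition before the comma filters rows, and the vector after the comma selects columns, all in one bracket call.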

Subsetting rows and columns using the subset() function

As mentioned above, we can also use the subset() function to subset rows and columns.

We will first use the subset() function to subset specific rows.

We will also use the subset() function to subset specific columns.

Let’s use the subset() function to subset the rows that belong to participant 1001 and store these rows in a new data frame called “df_1001”.

Let’s also use the subset() function to subset the following columns: PID, Section, Stage, Recall and store these columns in a new data frame called “df_sub”

#Subsetting rows using the subset() function
df_1001 <- subset(df, PID == "1001")
nrow(df_1001)
## [1] 6
#Subsetting columns using the subset() function
df_sub <- subset(df, select=c(PID, Section, Stage, Recall))

Most people prefer the subset() function over bracket notation because it’s a little more readable, but it’s totally okay to use whichever makes the most sense to you. Both accomplish the same thing, just in slightly different ways.

Subsetting rows based on multiple conditions

What if, rather than subsetting based on one condition (i.e., rows that belong to participant 1001), we wanted to subset based on multiple conditions?

We can take advantage of OR ( | ) and AND ( & ) operators using the subset() function.

Below, we will be subsetting all rows where the assessment is based on the Infirmary OR Asylum haunted house segments.

df_multiple_conditions <- subset(df, Section == "Infirmary" | Section == "Asylum")

Here, we are telling R to subset all rows where Section is equal to Infirmary OR Asylum. As you can tell, leveraging the OR ( | ) and AND ( & ) operators within the subset() function can be especially powerful.
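To make the difference between the two operators concrete, here is a minimal sketch on a made-up data frame: & keeps a row only when both conditions hold, while | keeps a row when either condition holds.

```r
# Toy data frame with made-up values
toy <- data.frame(Section = c("Infirmary", "Asylum", "DevilsDen", "Infirmary"),
                  Recall  = c(5, 20, 15, 25))

# AND (&): both conditions must be TRUE -- keeps only the Infirmary row with Recall 25
subset(toy, Section == "Infirmary" & Recall > 10)

# OR (|): either condition may be TRUE -- keeps all four rows here
subset(toy, Section == "Infirmary" | Recall > 10)
```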

Week 3 Exercise: Subsetting data

1) Create a new data frame called “df2” and subset the following columns from the df data frame: PID, Section, Stage, Fear.rating, and TOAccuracy.

2) Do this using bracket notation

3) Repeat this using the subset() function.

'



'

Click for solution

Missing data

What if we wanted to see which rows had missing values (e.g., NA) or not? What if, for whatever reason, some participants were not able to complete the temporal memory accuracy assessment for the haunted house events?

We can use the is.na() function to determine which rows have missing values in the Temporal Accuracy (TOAccuracy) column

is.na(df$TOAccuracy)
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  ...  (output truncated for readability -- almost every value is FALSE)
## [385] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  ...
## [433] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [445]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  ...
## [697] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

This produces a logical vector with one TRUE or FALSE per row of the data frame, because each value tells us whether the corresponding row meets the condition we defined. If we see a FALSE in the first position, we know that the first row does NOT have a missing value. If we see a TRUE in a given position, we know that that row IS missing its Temporal Memory accuracy score.

But how can we create a data frame that does not have any missing data (i.e., rows that are blank or have an ‘NA’ in it)?

Here, we can use bracket notation to create a new data frame called “df_complete” that only includes data that is NOT missing in the TOAccuracy column in the df data frame. By putting an exclamation point in front of the is.na() function, this is our way of telling R that we want it to do the inverse of the is.na() function!

The exclamation point (!) is R’s logical NOT operator, so this trick generalizes: it can flip the output of any function or comparison that returns TRUE/FALSE values, not just is.na().

#is.na() function
df_complete <- df[!is.na(df$TOAccuracy),]
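The NOT operator can be sketched on a tiny made-up vector to show exactly what the negation does:

```r
# ! flips each logical value (NA stays NA)
!c(TRUE, FALSE, NA)
## [1] FALSE  TRUE    NA

# So !is.na(x) is TRUE wherever x is NOT missing
x <- c(1, NA, 3)
!is.na(x)
## [1]  TRUE FALSE  TRUE
```

Using that logical vector inside the brackets keeps exactly the rows where the value is not missing.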

What if, instead of removing rows that have a missing value in ONE column, we wanted to remove any rows that have a missing value in ANY column?

Rather than using the is.na() function, I personally like to use the complete.cases() function for situations like this.

df_complete <- df[complete.cases(df), ]

Here, we are again using bracket notation to tell R: within the df data frame, keep only the rows that have no missing values in ANY column, and store the remaining complete rows in a new data frame called “df_complete”.
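To see what complete.cases() returns, here is a minimal sketch on a made-up data frame with missing values in different columns:

```r
# Toy data frame with NAs scattered across columns
toy <- data.frame(PID    = c("1001", "1002", NA, "1004"),
                  Recall = c(10, NA, 7, 12))

complete.cases(toy)  # TRUE only when the entire row is non-missing
## [1]  TRUE FALSE FALSE  TRUE

toy[complete.cases(toy), ]  # keeps only rows 1 and 4
```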

Week 3 Assignment: Subsetting data

For this week’s assignment, let’s continue focusing on subsetting in R.

1) Read in the frightnight_practice.csv dataset

2) Create a new data frame and subset the following columns from the df data frame: PID, Section, Stage, Recall, TOAccuracy

3) From this new data frame, subset only the rows that contain TOAccuracy scores greater than .40

'



'

Week 4: If Else statements

For the Week 4 workshop, let’s read in the frightnight_practice CSV file

# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"

#set working directory
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory

df <- read.csv(file = "frightnight_practice.csv") #Load in the fright night practice csv file

If else statements

Let’s use an If Else statement to create a new column that represents whether a Section was a high threat or low threat section.

The ifelse() function takes three arguments: ifelse(condition, value_if_true, value_if_false). In plain English, the statement below reads: if a cell in the Section column equals “Infirmary”, insert a value of “Low” in the new Threat column for that row; else, insert a value of “High” to represent high threat.

df$Threat <- ifelse(df$Section == "Infirmary", "Low", "High")

However, Infirmary wasn’t the only low threat section! We need a way to use the ifelse() function to tell R: if the Section is equal to Infirmary OR Asylum, assign a value of “Low”, else, assign a value of “High”.

We can combine more than one condition using the OR (i.e., |) operator or the AND (i.e., &) operator.

The “|” operator means OR in R language. Using the “|” operator allows you to include multiple conditions.

df$Threat <- ifelse(df$Section == "Infirmary" | df$Section == "Asylum", "Low Threat", "High Threat")

Here, if Section == “Infirmary” OR Section == “Asylum”, assign a value of “Low Threat” in the Threat column; else, assign “High Threat”.
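As the list of sections grows, chaining many | comparisons gets verbose. R’s %in% operator checks membership in a vector and, used inside ifelse(), gives the same result. Here is a small sketch on a made-up vector:

```r
sections <- c("Infirmary", "Asylum", "DevilsDen", "GhostlyGrounds")

# %in% is TRUE for elements found anywhere in the right-hand vector
ifelse(sections %in% c("Infirmary", "Asylum"), "Low Threat", "High Threat")
## [1] "Low Threat"  "Low Threat"  "High Threat" "High Threat"
```

This is purely a stylistic alternative; the | version above behaves the same way.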

Re-organizing the position of columns

So we just created this Threat column. Any time you create a new column, it appears at the end of the data frame.

What if we wanted to organize our columns in a certain order?

We can do this in multiple ways:

Approach 1: Re-organize multiple columns in a data frame using the subset() function.

Approach 2: Re-organize one specific column in a data frame using the relocate() function.

#Approach 1: Re-organize multiple columns in a data frame
df_reorganized <- subset(df, select=c(PID, Stage, Section, Group, Threat, Recall)) #if we want to include all the columns, this may take a while...

#Approach 2: Re-organize one specific column in a data frame
#Note: relocate() and the %>% pipe come from the dplyr package, so run library(dplyr) first
df_reorganized <- df %>% relocate(Threat, .after = Group) #Can relocate columns *after* a certain column
df_reorganized <- df %>% relocate(Threat, .before = Recall) #Can relocate columns *before* a certain column

Week 4 Exercise: If Else statements

Given that Eastern State Penitentiary updates its haunted house segments every year, let’s clarify which year haunted house segments were introduced. Infirmary and Ghostly Grounds were introduced in 2019, whereas Asylum and Devil’s Den are newer segments and were introduced in 2021.

1) Use the ifelse() function to create a new column called Year, where, if the Section was equal to Infirmary or Ghostly Grounds, assign a value of “2019”, else, assign a value of “2021”.

'



'

Click for solution

Week 4: More advanced ifelse statements

For the purposes of this example, let’s subset a data frame with the following columns: PID, Section, Stage, Condition, TOAccuracy.

Using the ifelse() function, we’re going to categorize Temporal Memory Accuracy performance in 3 groups: High, Medium, or Low.

Let’s make a new column called “MemoryStrength” where a Temporal Memory Accuracy score less than or equal to .3 is “Low”, any Temporal Memory Accuracy score between .3 and .7 is “Medium”, and a Temporal Memory Accuracy score greater than or equal to .7 is “High”.

#Subset a data frame with the following columns: PID, Section, Stage, Condition, TOAccuracy.
df_memory <- subset(df, select=c(PID, Section, Stage, Condition, TOAccuracy))


#a Temporal Memory Accuracy score less than or equal to .3 is "Low"
df_memory$MemoryStrength <- ifelse(df_memory$TOAccuracy <= .3, "Low", NA)


#any Temporal Memory Accuracy score between .3 and .7 is "Medium"
df_memory$MemoryStrength <-ifelse(df_memory$TOAccuracy > .3 & df_memory$TOAccuracy < .7, "Medium", df_memory$MemoryStrength)


#a Temporal Memory Accuracy score greater than or equal to .7 is "High".
df_memory$MemoryStrength <-ifelse(df_memory$TOAccuracy >= .7, "High", df_memory$MemoryStrength)
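The three assignments above can also be collapsed into a single nested ifelse() call. Here is a minimal sketch on a made-up vector of scores (note that NA scores stay NA):

```r
scores <- c(.2, .5, .9, NA)

# The else branch of the outer ifelse() is itself another ifelse()
ifelse(scores <= .3, "Low",
       ifelse(scores < .7, "Medium", "High"))
## [1] "Low"    "Medium" "High"   NA
```

Both approaches give the same result; the step-by-step version is easier to read, while the nested version is more compact.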

Week 4 Assignment: If Else statements

1) Read in the frightnight_practice.csv file

2) Create a new data frame and subset the following columns from the df data frame: PID, Section, Stage, Group, Recall, WordCount

3) We need to categorize Word Count during free recall in 3 groups: Long, Medium, or Short.

4) Use the ifelse() function to create a new column called “RecallLength” that meets the following criteria: Word count less than or equal to 40 is “Short”, word count in between 40 and 60 is “Medium”, and word count greater than or equal to 60 is “Long”

'



'

Week 5: Intro to For Loops

For the Week 5 workshop, let’s read in the frightnight_practice CSV file. Before we start working with actual data, we’ll work with some general examples first.

# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"

#set working directory
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory

df <- read.csv(file = "frightnight_practice.csv") #Load in the fright night practice csv file

A for-loop is one of the main control-flow constructs of the R programming language. It is used to iterate over a collection of objects, such as a vector, a list, a matrix, or a dataframe, and apply the same set of operations on each item of a given data structure.

Below, let’s walk through the general structure of a for loop and run a quick example of a for loop that loops through and prints an array of numbers.

# -- For Loop general expression ---
for (variable in sequence) {
    expression
}


# --- Using a for loop on an array of numbers ---
for (i in 1:10) {
    print(i)
}

As you can see, i is a temporary variable that takes on each value of the 1:10 sequence in turn.

Given that we are using the print() function to print i, the output should print the numbers 1 through 10.
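For loops are not limited to numeric sequences; they can iterate over any vector. Here is a quick sketch using a made-up character vector:

```r
# A for loop can iterate directly over the elements of a character vector
sections <- c("Asylum", "Infirmary", "DevilsDen")

for (s in sections) {
    print(s)
}
## [1] "Asylum"
## [1] "Infirmary"
## [1] "DevilsDen"
```

On each pass, s holds the current element, so we never need a numeric index at all.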

If Else statements in For Loops

Before we continue with for loops, let’s do a quick refresher on “if else” statements because they are integral to for loops.

Last week, we went through how to use the ifelse() function to do “if else” statements, which we can do pretty concisely. However, using an “if else” statement within a for loop is a bit different.

#General structure of if statement
if (condition) {
  expression
} else {
  expression
}

Next, let’s go through some additional examples to get a better idea of how these “if else” statements actually work!

# --- Example of if statement ---
team_A <- 3 # Number of goals scored by Team A
team_B <- 1 # Number of goals scored by Team B

if (team_A > team_B){
  print ("Team A wins")
}
## [1] "Team A wins"
# --- Example of if statement with the else statement explicitly mentioned ---
team_A <- 1 # Number of goals scored by Team A
team_B <- 3 # Number of goals scored by Team B

if (team_A > team_B){
    print ("Team A will make the playoffs")
} else {
    print ("Team B will make the playoffs")
}
## [1] "Team B will make the playoffs"

So far so good. Next, let’s wrap these if else statements in a for loop, which is what makes them especially powerful.

#Create a vector that includes the numbers ranging from 1 to 10.
x2 <- 1:10                      


#For loop where, if x2[i] == 1, print "The if condition is TRUE", else, print "The if condition is FALSE"
for (i in 1:length(x2)) {  
  if (x2[i] == 1) {
      print("The if condition is TRUE")
  } else {
      print("The if condition is FALSE")
  }
}
## [1] "The if condition is TRUE"
## [1] "The if condition is FALSE"
## [1] "The if condition is FALSE"
## [1] "The if condition is FALSE"
## [1] "The if condition is FALSE"
## [1] "The if condition is FALSE"
## [1] "The if condition is FALSE"
## [1] "The if condition is FALSE"
## [1] "The if condition is FALSE"
## [1] "The if condition is FALSE"

Let’s break this code down in more detail.

1) for (i in 1:length(x2)) { — “i” is a temporary variable that stores the value at the current position in the for loop’s range. In this case, we are telling R that we want “i” to represent each position within the length of the x2 vector, starting at 1 and going up to 10. “i” will iterate across each of these values (1-10).

2) if (x2[i] == 1) { — this if statement asks: does the value at position i in x2 equal 1? The { marks the start of the if block.

3) print("The if condition is TRUE") } — print this message when the condition is true; the closing } marks the end of the if block.

4) else { — the { marks the start of the else block.

5) print("The if condition is FALSE") } — print this message when the condition is false; the closing } marks the end of the else block.

6) } — the final } marks the end of the for loop!
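One caution worth adding: 1:length(x2) works here, but it misbehaves if the vector is ever empty, because 1:0 counts down and produces c(1, 0), so the loop body would run twice on a vector with nothing in it. The built-in seq_along() function avoids this. Below is a sketch of the same loop rewritten with seq_along():

```r
# 1:length() on an empty vector counts DOWN, producing c(1, 0):
length(c())     # 0
1:length(c())   # 1 0         -- a loop over this would run twice!
seq_along(c())  # integer(0)  -- a loop over this is skipped, as intended

# The same loop as above, written with seq_along():
x2 <- 1:10
for (i in seq_along(x2)) {
  if (x2[i] == 1) {
    print("The if condition is TRUE")
  } else {
    print("The if condition is FALSE")
  }
}
```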

Week 5 Exercise: Intro to For Loops

1) Create a vector that includes the following letters: “A”, “B”, “C”, “D”, “E”, “F”

2) Create a for loop where, if the value equals “A”, print “This value represents A”, else, print “This value does not represent A”

'



'

Click for solution

Next, let’s map the for loop onto the data we’ve been using! I’ll also try to convert the structure of the ifelse() function into the structure of a for loop “if else” statement in case that context would be helpful.

#An if else statement using the ifelse() function like we did earlier
#ifelse() function
df$newcolumn <- ifelse(df$Section == "GhostlyGrounds", 1, 0)

#Let's subset data just to make things easier to see
df_example <- subset(df, select=c(PID, Section))

#Remove any rows that have a missing value in any column
df_example <- df_example[complete.cases(df_example), ]

#Create an empty new column called "newcolumn"
df_example$newcolumn <- NA

#For Loop if else statement structure
for (i in 1:nrow(df_example)) { 
  if (df_example$Section[i] == "GhostlyGrounds") { 
      df_example$newcolumn[i] <- 1 
} else {
      df_example$newcolumn[i] <- 0 
}
}

Let’s break down this for loop code in some more detail:

1) for (i in 1:nrow(df_example)) { — iterate over the number of rows in the df_example data frame

2) if (df_example$Section[i] == "GhostlyGrounds") { — if the value in the Section column for row i is equal to "GhostlyGrounds"

3) Assign a value of 1 to the “newcolumn” column for that row

4) else

5) Assign a value of 0 to the “newcolumn” column for that row.

Hopefully these examples provide a helpful understanding of how for loops actually work in R!

Week 5 Assignment: Intro to For Loops

1) Let’s create a vector called x4 that contains the following values: “A”, “B”, “A”, “D”, “A”, “F”

2) Create a for loop that iterates across all values in x4. If the value == “A”, print “This value represents A”, else, print “This value represents B, D, or F”

'



'

Week 6: Pivoting data from wide to long and long to wide

Pivot a data frame from wide to long

When data exists in a “wide” format, each participant has only one row, and repeated measurements are spread across multiple columns.

When data exists in a “long” format, each participant has more than one row: one row per measurement.

For the purposes of understanding how to pivot data frames, we will be using a new CSV file in which each participant has only one row. For illustrative purposes, the columns reflect each participant’s Temporal Memory Accuracy score for a given haunted house segment.

Importantly, in order to do most of the analyses that we’ll do later (e.g., bivariate linear regression, multiple linear regression, linear mixed effects regression), we need to have the data in long format. The pivot_longer() function from the tidyr package in R can be used to pivot a data frame from a wide format to a long format.
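Before we touch the real file, here is a minimal sketch of pivot_longer() on a tiny made-up data frame (the toy column names mimic the real ones but the values are assumptions for illustration; this assumes the tidyr package is installed and loaded):

```r
# A toy wide data frame: one row per participant, one column per section.
# Column names mimic the real file but values are made up for illustration.
library(tidyr)

toy_wide <- data.frame(PID = c(1001, 1002),
                       Asylum_TOAccuracy    = c(0.8, 0.6),
                       Infirmary_TOAccuracy = c(0.5, 0.4))

toy_long <- pivot_longer(toy_wide,
                         cols      = c(Asylum_TOAccuracy, Infirmary_TOAccuracy),
                         names_to  = "Section",
                         values_to = "TOAccuracy")

toy_long  # 4 rows: one per participant per section
```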

So let’s walk through how to convert data from wide format to long format using the new CSV file. Let’s read in the “frightnight_wide.csv” file, as well as the “frightnight_practice.csv” file that we’ve been working with over the past few weeks.

# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"

#set working directory
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory

#Read in the frightnight_wide.csv and frightnight_practice.csv files
df_wide <- read.csv(file = "frightnight_wide.csv")
df <- read.csv(file = "frightnight_practice.csv")


#Approach 1: Basic Pivot longer
df_long <- df_wide %>% pivot_longer(
                        cols=c("Asylum_TOAccuracy", "GhostlyGrounds_TOAccuracy", "DevilsDen_TOAccuracy", "Infirmary_TOAccuracy"), #The names of the columns to pivot
                        names_to = "Section", #The name for the new character column
                        values_to = "TOAccuracy") #The name for the new values column


#Approach 2: Pivot longer using grep function
df_long <- df_wide %>% pivot_longer(
                          cols = grep("_", colnames(df_wide)),
                          names_to = c("Section", ".value"), 
                          names_sep = "_",
                          values_drop_na = TRUE)

Here, we pivoted the data frame from wide to long format using two approaches

In Approach 1, we did a basic pivot_longer to convert the data from wide to long. The new “Section” column holds the names of the columns we listed in the cols argument of pivot_longer(), and the new “TOAccuracy” column holds the values that those columns contained.

In Approach 2, we used grep() (a pattern-matching function) together with colnames() (which returns the column names of a data frame) to find naming patterns in the column names and build the “Section” and “TOAccuracy” columns from them.

By using grep("_") to find all columns with an "_" in their names, you can use the names_to = c("Section", ".value") and names_sep = "_" arguments so that R splits each column name at the "_" to make two columns. The text before the "_" becomes a row value in the new “Section” column; for example, for the column "Asylum_TOAccuracy", "Asylum" will appear in the rows of the new “Section” column. The text after the "_" becomes the name of the other new column (that’s what the special ".value" marker means), and its row values come from each column that cols = grep("_", colnames(df_wide)) captures.

pivot_longer() took me a long time to fully understand, but hopefully this example helps!

Pivot a data frame from long to wide

Data can also be converted from long to wide!

The pivot_wider() function from the tidyr package in R can be used to pivot a data frame from a long format to a wide format.

For the purposes of this example and to make things easier, let’s use the df data frame and just focus on one participant and subset a few select columns.

#Subset all rows where PID == 1001
df_one_sub <- subset(df, PID == "1001")


#Subset the PID, Stage, Section, and TOAccuracy columns
df_long <- subset(df_one_sub, select=c(PID, Stage, Section, TOAccuracy))


#Pivot wider
df_wide <- df_long %>% pivot_wider(names_from = Section, #names_from: The column whose values will be used as column names
                                      values_from = TOAccuracy) #values_from: The column whose values will be used as cell values

Here, we converted the data frame from long to wide!

By passing the “Section” column to the names_from argument, we are telling R that this column’s values will be used to generate the new column names. By passing the “TOAccuracy” column to the values_from argument, we are telling R that this column’s values will be used to fill in the cells of those new columns.

As you can see, we’ve generated two rows per participant:

1 row that reflects the participant’s Temporal Memory Accuracy score during the Immediate study visit (which only included two haunted house segments as part of the study design)

Another row that reflects the participant’s Temporal Memory Accuracy score during the Delay study visit (all 4 haunted house segments as part of the study design).
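To see pivot_wider() in isolation, here is a minimal sketch on a tiny made-up long data frame (the toy values are assumptions for illustration; this assumes the tidyr package is installed and loaded):

```r
# A toy long data frame: one row per participant per section.
# Values are made up for illustration.
library(tidyr)

toy_long <- data.frame(PID        = c(1001, 1001, 1002, 1002),
                       Section    = c("Asylum", "Infirmary", "Asylum", "Infirmary"),
                       TOAccuracy = c(0.8, 0.5, 0.6, 0.4))

toy_wide <- pivot_wider(toy_long,
                        names_from  = Section,
                        values_from = TOAccuracy)

toy_wide  # one row per participant, one column per section
```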

Week 6 Exercise: Pivoting data from wide to long and long to wide

1) Read in the “frightnight_wide_exercise.csv” file

2) Pivot the data frame from wide to long

3) You should end up with 4 columns: PID, Group, Section, and TOAccuracy

'



'

Click for solution

Week 6 Assignment: Pivoting data from wide to long and long to wide

1) Read in the “frightnight_wide_assignment.csv” file

2) Use the pivot_longer function to convert the data from wide to long format

3) You should end up with a dataframe that has 4 columns: PID, Group, Section, and Word Count.

'



'

Week 7: Merging data frames

For the Week 7 workshop, let’s read in the “frightnight_practice.csv” file.

# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"

#set working directory
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory

df <- read.csv(file = "frightnight_practice.csv") #Load in the fright night practice csv file

Learning how to merge multiple data frames is another integral part of data cleaning.

Let’s create a data frame called “df_new” that has the following columns from the df data frame: “PID”, “Section”, “Stage”, “Group”, “Condition”, and “TOAccuracy”

Next, let’s subset and create two new data frames from the df_new data frame: one data frame that only has assessments from the Immediate study visit, and a second data frame that has all the assessments from the one-week Delay study visit.

#Subset a data frame with the following columns: "PID", "Section", "Stage", "Group", "Condition", and "TOAccuracy"
df_new <- subset(df, select=c("PID", "Section", "Stage", "Group", "Condition", "TOAccuracy"))

#Create data frame that only has assessments from the Immediate study visit
Immediate.df <- subset(df_new, Stage == "Immediate")


#Create data frame that only has assessments from the Delay study visit
Delay.df <- subset(df_new, Stage == "Delay")

Next, let’s create a new column in each data frame for illustrative purposes. Let’s use an ifelse statement to create a new column called “MemoryStrength”, where if TOAccuracy is greater than .50, we’ll classify that memory as a “Strong” memory, else, we’ll classify it as a “Weak” memory.

After creating the column, we’ll try two approaches to merging the two data frames (Immediate.df and Delay.df) into one data frame.

Approach 1 will illustrate an incorrect approach toward merging data frames as it will yield duplicate columns

Approach 2 will illustrate a correct approach toward merging data frames as it will not yield duplicate columns

Immediate.df$MemoryStrength <- ifelse(Immediate.df$TOAccuracy > .50, "Strong", "Weak")

Delay.df$MemoryStrength <- ifelse(Delay.df$TOAccuracy > .50, "Strong", "Weak")


#Approach 1 -- WRONG
data.merged <- merge(Immediate.df, Delay.df, by="PID")


#Approach 2  -- CORRECT
data.merged <- merge(Immediate.df, Delay.df, by=c("PID", "Section", "Stage", "Group", "Condition", "TOAccuracy", "MemoryStrength"), all.x=TRUE, all.y=TRUE)

As we can see, Approach 1 yielded duplicate columns (with .x and .y suffixes) for every column except PID, because we explicitly told R to merge by the PID column only.

In Approach 2, because we passed every column shared by the Immediate.df and Delay.df data frames to the by argument, R understood that we wanted to merge the two data frames by all of those columns.

Here’s some more info about the all.x and all.y arguments of the merge() function that may be helpful: all.x is a logical argument; when all.x=TRUE, every row of the first (x) data frame that has no matching row in the second (y) data frame is still kept in the merged result (with NAs filling the unmatched columns). The same philosophy applies to all.y=TRUE for rows of the y data frame.

The main takeaway here is that R needs to understand which columns you wish to merge the two data frames by. In other words, if columns from two different data frames share the same column name, we need to tell R that those columns are in fact the same, which we can do by feeding R the exact column names that we wish to merge by. Importantly, there are many different ways to go about merging data frames (some don’t involve writing each individual column name, which saves a lot of time), but for the purposes of an introduction into this space, it’s super important to be mindful of how R merges columns.
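As one example of a time-saving approach hinted at above, you can compute the shared column names with intersect() instead of typing them all out. This is a sketch on made-up data frames, not the workshop data:

```r
# Merge two data frames by every column name they share, computed with
# intersect(). df_a and df_b are made-up data frames for illustration.
df_a <- data.frame(PID = c(1, 2), Score = c(10, 20), Visit = "Immediate")
df_b <- data.frame(PID = c(3, 4), Score = c(30, 40), Visit = "Delay")

shared_cols <- intersect(colnames(df_a), colnames(df_b))

# all = TRUE is shorthand for all.x = TRUE and all.y = TRUE
data_merged <- merge(df_a, df_b, by = shared_cols, all = TRUE)

data_merged  # 4 rows, 3 columns, no duplicate columns
```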

Week 7 Exercise: Merging data frames

1) Subset two data frames from the df_new data frame that we created earlier: 1 data frame called “low_df” that only contains rows where the Temporal Memory Accuracy score is less than .5. A second data frame that only contains rows where the Temporal Memory Accuracy score is greater than .5.

2) Merge the two data frames back together and create a new data frame called “merged_data”

'



'

Click for solution

Week 7 Assignment: Merging data frames

1) Read in the frightnight_practice CSV file.

2) Subset the following columns from the original df data frame: PID, Section, Stage, Group, Condition, TOAccuracy

3) Create two new data frames. One data frame will contain rows that reflect touring the Infirmary or Devil’s Den haunted house sections. The other data frame will contain rows that reflect touring the Asylum or Ghostly Grounds haunted house sections.

4) Create a new column called fear_level in both data frames. If the Section is equal to Devil’s Den, assign a value of “scary” to the new fear_level column, else, assign a value of “not scary”. For the second data frame, if the Section is equal to Ghostly Grounds, assign a value of “scary” to the new fear_level column, else, assign a value of “not scary”.

5) Merge the two data frames back together and store it in a new data frame!

'



'

Week 8: Data cleaning

Up until now, we’ve learned a lot of isolated functions, but we haven’t really ‘cleaned up’ any data in the traditional sense. This workshop will focus on cleaning up data and for that, we’ll plan to use the frightnight_raw.csv file, so let’s read it in!

# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"

setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory

df_raw <- read.csv(file = "frightnight_raw.csv") #Load in the fright night raw csv file

As we can see, this data frame is pretty messy! No column names, columns are not in order, there are missing values, and the data is in wide format! Let’s work on cleaning this up and start by adding some column names (which we haven’t talked about, yet!).

#Use the colnames() function to print out the column names in an existing data frame
colnames(df_raw)
##  [1] "Q1"  "Q2"  "Q3"  "Q4"  "Q5"  "Q6"  "Q7"  "Q8"  "Q9"  "Q10"
#Let's print out the first row which has the header for the columns we're interested in.
df_raw[1,]
#Two ways you can replace column names.
#Approach 1: replace one column name at a time
colnames(df_raw)[colnames(df_raw) == "Q1"] = "PID"

#Approach 2: Create a vector of column names and replace the old column names with the new column names
cols <- c("PID", "Aslyum.WordCount", "GhostlyGrounds.WordCount", "DevilsDen.WordCount", "Infirmary.WordCount", "Group", "Aslyum_TOAccuracy", "GhostlyGrounds_TOAccuracy", "DevilsDen_TOAccuracy", "Infirmary_TOAccuracy")

#Use the colnames() function to replace the old column names with the new column names stored in the "cols" vector
colnames(df_raw) <- cols

Okay, now that we have some decent column names, it looks like we don’t need that first row anymore. Let’s remove that row using bracket notation.

df_raw <- df_raw[-1,]

Next, let’s re-order the columns in a way that’s more readable. Although there are many different ways to re-order a data frame, using the subset() function is often the easiest. While we’ve mainly used subset() to pull out certain columns, we can also use it to re-order columns in a data frame, like below.

df_raw <- subset(df_raw, select=c(PID, Group, Aslyum.WordCount, GhostlyGrounds.WordCount, DevilsDen.WordCount, Infirmary.WordCount, Aslyum_TOAccuracy, GhostlyGrounds_TOAccuracy, DevilsDen_TOAccuracy, Infirmary_TOAccuracy))

Here, we’ll leverage bracket notation and use the complete.cases() function to remove any rows that have a missing value in ANY column of the df_raw data frame. We’ll store the remaining complete rows in a new object called df_complete.

df_complete <- df_raw[complete.cases(df_raw), ]

Nice! At this point our dataset is starting to come together, but it’s still in wide format. Let’s use the pivot_longer() function to convert this data from wide format to long format. In the example below, we’ll use the grep() function to leverage the patterns in our column names and make things easier to pivot.

#Subset the Word Count columns
df_wordcount <- subset(df_complete, select=c(PID, Group, Aslyum.WordCount, GhostlyGrounds.WordCount, DevilsDen.WordCount, Infirmary.WordCount))

#Pivot the Word Count columns
df_wordcount_long <- df_wordcount %>% pivot_longer(
                          cols = grep("\\.", colnames(df_wordcount)),
                          names_to = c("Section", ".value"), 
                          names_sep = "\\.",
                          values_drop_na = TRUE)


#Subset the TOAccuracy columns
df_TOAccuracy <- subset(df_complete, select=c(PID, Group, Aslyum_TOAccuracy, GhostlyGrounds_TOAccuracy, DevilsDen_TOAccuracy, Infirmary_TOAccuracy))

#Pivot the TOAccuracy columns
df_TOAccuracy_long <- df_TOAccuracy %>% pivot_longer(
                          cols = grep("_", colnames(df_TOAccuracy)),
                          names_to = c("Section", ".value"), 
                          names_sep = "_",
                          values_drop_na = TRUE)


#Merge the data frames back together!
df_complete_clean <- merge(df_wordcount_long, df_TOAccuracy_long, by=c("PID", "Group", "Section"), all.x=TRUE, all.y=TRUE)

Amazing! Next, let’s use the ifelse() function to create a new “Threat” column, where, if the Section is equal to Infirmary or Asylum, we assign a value of “Low”, else, we assign a value of “High”.

#Use the ifelse() function to create a new "Threat" column to define the low-threat and high-threat segments.
#Note: the Section values inherit the "Aslyum" spelling from the column names we assigned earlier, so we match that spelling here.
df_complete_clean$Threat <- ifelse(df_complete_clean$Section == "Infirmary" | df_complete_clean$Section == "Aslyum", "Low", "High")


#Let's also re-order the Threat column
df_complete_clean <- df_complete_clean %>% relocate(Threat, .after = Group)

Look at how nice our data frame looks now! If we wanted to, we could move forward with conducting some analyses. Speaking of analyses, we’ll start talking about how to complete statistical analyses in R starting next week!
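As a side note, when a condition checks the same column against several values, the %in% operator can replace a chain of == comparisons joined by |. A small sketch on a made-up vector:

```r
# The %in% operator checks membership in a set, replacing chained == tests.
# sections is a made-up vector for illustration.
sections <- c("Infirmary", "Asylum", "DevilsDen", "GhostlyGrounds")

threat <- ifelse(sections %in% c("Infirmary", "Asylum"), "Low", "High")

threat  # "Low" "Low" "High" "High"
```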

Week 8 Assignment: Data cleaning

1) Read in the frightnight_raw_assignment CSV file

2) Rename the columns in a way that makes sense to you

3) Remove the second row since we no longer need the column headers

4) Remove any row that has a missing value in any column!

5) Pivot the data from wide to long such that you end up with a data frame that has 4 columns: PID, Group, Section, Fear

6) Re-organize the position of the columns so that the final data frame appears as follows: PID, Section, Group, Fear

'



'

Week 9: Analyzing Data w/ Categorical Independent Variables

Next, we are going to start talking about different types of statistical analyses in R!

Before we do that, let’s read in the frightnight_analyses CSV file for the purposes of the Week 9 workshop!

# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"

#set working directory
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory

df <- read.csv(file = "frightnight_analyses.csv") #Load in the fright night analyses csv file

It’s super important to recognize the different data types that R uses. Some of the common data types include: character, factor, numeric, integer.

We can use the str() function to see which columns are which data types across the entire data frame.

We’ll also learn how to convert columns into different data types ahead of our analyses. For example, the Section column represents 4 haunted house sections: Infirmary, Asylum, Devil’s Den, and Ghostly Grounds. When reading the dataset into R, R will treat the Section column as a character column (because the column contains words). However, for the purposes of our analyses, the Section column represents more than just characters: it represents 4 categories that we think could lead to differences in a dependent variable (i.e., TOAccuracy).

The same logic would apply to other columns that R treats as characters, but we actually consider to represent different categories. These columns include: Group, Threat, Condition, Time_HH. Let’s convert those columns from character to factor.

We can also treat columns as numeric. Although R correctly treats the TOAccuracy, Authentic, Analytic, Clout, and Tone columns as numeric columns, let’s use the as.numeric() function to make sure these columns are treated numerically, just for the sake of practice.

Data types in R

str(df) #str() function is a helpful tool for seeing the data types for a data frame
## 'data.frame':    705 obs. of  119 variables:
##  $ X                     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ PID                   : int  1001 1001 1001 1001 1001 1001 1002 1002 1002 1002 ...
##  $ Section               : chr  "Infirmary" "Infirmary" "Asylum" "DevilsDen" ...
##  $ Stage                 : chr  "Immediate" "Delay" "Delay" "Delay" ...
##  $ Group                 : chr  "Control" "Control" "Control" "Control" ...
##  $ Condition             : chr  "Baseline" "Baseline" "Baseline" "Baseline" ...
##  $ Code                  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Time_HH               : chr  "06:15pm" "06:15pm" "06:15pm" "06:15pm" ...
##  $ Fear.rating           : int  2 2 2 3 2 2 5 5 5 5 ...
##  $ Threat                : chr  "Low Threat" "Low Threat" "Low Threat" "High Threat" ...
##  $ Repeat                : int  0 1 0 0 0 1 0 0 1 0 ...
##  $ TOAccuracy            : num  0.8 0.8 0.5 0.4 0.8 0.2 0.6 0.5 0.25 0.6 ...
##  $ Recall                : chr  "Infirmary was very colorful. It started with an entry room with neon color murals. You were then given 3D glass"| __truncated__ "Infirmary was very colorful. There were pictures painted on wood canvases and the walls had trippy looking mura"| __truncated__ "Asylum had a woman that was on a couch in a room with a director chair. It was 1920s ish themed. There was a pe"| __truncated__ "The Devil's Den section was very jumpy (trying to scare you with loud nosies). There was one point we went outs"| __truncated__ ...
##  $ Segment               : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Analytic              : num  89.5 58.4 86.6 93.3 94.3 ...
##  $ Clout                 : num  40.1 85.7 82.4 94.2 99 ...
##  $ Authentic             : num  91.21 28.56 6.58 93.83 95.6 ...
##  $ Tone                  : num  20.23 78.09 1 1.4 1.92 ...
##  $ Drives                : num  0 1.25 1.96 2.42 2.9 2.29 5.6 5.41 7.81 7.94 ...
##  $ affiliation           : num  0 1.25 0 0.81 0 0 4.8 4.05 5.47 3.17 ...
##  $ achieve               : num  0 0 0 1.61 2.9 2.29 0 0 0 3.17 ...
##  $ power                 : num  0 0 1.96 0 0 0 0.8 1.35 2.34 1.59 ...
##  $ Cognition             : num  9.3 3.75 3.92 6.45 4.35 5.34 10.4 6.76 4.69 9.52 ...
##  $ allnone               : num  0 0 0 0 0 0 0 0 0.78 0 ...
##  $ cogproc               : num  9.3 3.75 3.92 6.45 4.35 5.34 10.4 6.76 3.91 9.52 ...
##  $ insight               : num  1.16 1.25 1.96 1.61 0 0 4 1.35 1.56 3.17 ...
##  $ cause                 : num  3.49 1.25 0 1.61 0 0 1.6 0 0 0 ...
##  $ discrep               : num  0 0 0 1.61 2.9 4.58 0 1.35 0 1.59 ...
##  $ tentat                : num  0 0 0 1.61 0 0.76 1.6 1.35 0 0 ...
##  $ certitude             : num  0 1.25 1.96 0 0 0 0 0 0 0 ...
##  $ differ                : num  3.49 0 0 0 1.45 0 1.6 4.05 2.34 4.76 ...
##  $ memory                : num  0 0 1.96 0 0 0 1.6 1.35 0.78 3.17 ...
##  $ Affect                : num  2.33 3.75 3.92 3.23 2.9 2.29 1.6 1.35 3.13 1.59 ...
##  $ tone_pos              : num  1.16 3.75 0 0 0 0 0 0 0.78 0 ...
##  $ tone_neg              : num  1.16 0 3.92 3.23 2.9 2.29 1.6 1.35 2.34 1.59 ...
##  $ emotion               : num  0 0 0 1.61 2.9 2.29 1.6 0 1.56 1.59 ...
##  $ emo_pos               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ emo_neg               : num  0 0 0 1.61 2.9 2.29 1.6 0 1.56 1.59 ...
##  $ emo_anx               : num  0 0 0 1.61 2.9 2.29 0.8 0 0 1.59 ...
##  $ emo_anger             : num  0 0 0 0 0 0 0.8 0 0.78 0 ...
##  $ emo_sad               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ swear                 : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Social                : num  2.33 6.25 13.73 10.48 11.59 ...
##  $ socbehav              : num  0 0 3.92 0 1.45 0 1.6 6.76 7.81 0 ...
##  $ prosocial             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ polite                : num  0 0 0 0 0 0 0.8 2.7 2.34 0 ...
##  $ conflict              : num  0 0 0 0 0 0 0.8 0 0.78 0 ...
##  $ moral                 : num  0 0 0 0 1.45 0 0 0 0 0 ...
##  $ comm                  : num  0 0 3.92 0 0 0 0.8 4.05 6.25 0 ...
##  $ socrefs               : num  2.33 6.25 9.8 10.48 10.14 ...
##  $ family                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ friend                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ female                : num  0 2.5 1.96 2.42 0 0 0 2.7 2.34 1.59 ...
##  $ male                  : num  1.16 0 0 0.81 1.45 0.76 0 2.7 1.56 0 ...
##  $ need                  : num  0 0 0 0 0 0 3.2 0 0 0 ...
##  $ want                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ acquire               : num  0 0 1.96 0 0 0 0 0 0 0 ...
##  $ lack                  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fulfill               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fatigue               : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ reward                : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ risk                  : num  0 0 0 0 0 0 0.8 0 0 0 ...
##  $ curiosity             : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ allure                : num  2.33 2.5 1.96 3.23 5.8 1.53 5.6 2.7 0 3.17 ...
##  $ Perception            : num  24.42 21.25 5.88 26.61 26.09 ...
##  $ attention             : num  0 1.25 0 0 0 0 0 0 0 0 ...
##  $ motion                : num  3.49 2.5 0 4.84 8.7 3.05 5.6 1.35 1.56 4.76 ...
##  $ space                 : num  16.28 8.75 5.88 16.94 17.39 ...
##  $ visual                : num  4.65 8.75 0 0 0 0.76 4 2.7 2.34 0 ...
##  $ auditory              : num  0 0 0 4.03 0 0.76 0.8 1.35 0 0 ...
##  $ feeling               : num  0 1.25 0 0.81 1.45 0.76 3.2 0 0 0 ...
##  $ States                : num  0 0 1.96 0 0 0 3.2 0 0 0 ...
##  $ Motives               : num  2.33 2.5 1.96 3.23 5.8 1.53 6.4 2.7 0 3.17 ...
##  $ Narrativity_Overall   : num  -39.18 19.03 -7.36 -6.22 NA ...
##  $ Narrativity_Staging   : num  -23.5 15.5 61.2 -36.2 65.3 ...
##  $ Narrativity_PlotProg  : num  -38.8 10.7 -28 -12.4 36.5 ...
##  $ Narrativity_CogTension: num  -55.3 30.9 -55.3 29.9 NA ...
##  $ Peak_Staging          : int  3 5 3 2 1 2 3 2 5 3 ...
##  $ Peak_PlotProg         : int  5 4 5 4 3 4 4 5 4 2 ...
##  $ Peak_CogTension       : int  5 4 5 5 1 1 2 5 3 5 ...
##  $ Valley_Staging        : int  1 2 5 1 2 4 2 5 1 2 ...
##  $ Valley_PlotProg       : int  3 5 2 2 1 2 2 2 5 3 ...
##  $ Valley_CogTension     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ Staging_1             : num  0 20 87.9 0 100 ...
##  $ Staging_2             : num  68.4 0 66.7 100 0 ...
##  $ Staging_3             : num  100 60 100 100 0 ...
##  $ Staging_4             : num  5.26 40 66.67 66.67 100 ...
##  $ Staging_5             : num  5.26 100 0 8.33 65.38 ...
##  $ PlotProg_1            : num  91.7 80 24.2 50 0 ...
##  $ PlotProg_2            : num  25 80 0 0 50 ...
##  $ PlotProg_3            : num  0 80 0 0 100 ...
##  $ PlotProg_4            : num  75 100 0 100 0 ...
##  $ PlotProg_5            : num  100 0 100 57.3 11.5 ...
##  $ CogTension_1          : num  0 0 0 0 50 50 0 0 0 0 ...
##  $ CogTension_2          : num  0 0 0 0 50 50 100 0 50 0 ...
##  $ CogTension_3          : num  0 0 0 96 50 50 0 0 100 0 ...
##  $ CogTension_4          : num  0 100 0 96 50 50 50 0 0 0 ...
##  $ CogTension_5          : num  100 0 100 100 50 50 100 100 0 100 ...
##  $ Event.Int             : int  1 1 1 6 4 6 3 4 2 3 ...
##   [list output truncated]
#Convert Section, Group, Threat, and Condition, and Time_HH columns into Factor columns
df$Section <- as.factor(df$Section)
df$Group <- as.factor(df$Group)
df$Threat <- as.factor(df$Threat)
df$Condition <- as.factor(df$Condition)
df$Time_HH <- as.factor(df$Time_HH)


#Convert TOAccuracy, Authentic, Analytic, Clout, and Tone columns to numeric
df$TOAccuracy <- as.numeric(df$TOAccuracy)
df$Authentic <- as.numeric(df$Authentic)
df$Analytic <- as.numeric(df$Analytic)
df$Clout <- as.numeric(df$Clout)
df$Tone <- as.numeric(df$Tone)

As you can see, converting columns into different data types isn’t too tricky at all! It’s a good habit to ensure your variables are in the correct data type before carrying out any analyses.
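A quick way to double-check your conversions is to apply class() over every column with sapply(). A small sketch on a made-up two-column data frame:

```r
# Verify data type conversions by mapping class() over every column.
# toy is a made-up data frame for illustration.
toy <- data.frame(Section    = c("Asylum", "Infirmary"),
                  TOAccuracy = c("0.8", "0.5"))  # numbers stored as text

toy$Section    <- as.factor(toy$Section)
toy$TOAccuracy <- as.numeric(toy$TOAccuracy)

sapply(toy, class)  # Section: "factor", TOAccuracy: "numeric"
```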

Create data frame for analyses

Next, we will move forward with learning some common statistical analyses in R including t-tests, ANOVAs, linear regressions, and multiple linear regressions. As a reminder, this workshop won’t dive too deep into the statistical theory behind each analysis, but will help us learn how to run these analyses in R.

Before we start any analyses, let’s subset a dataframe with the columns we want to focus on as we explore some research questions.

#Subset the following columns: PID, Section, Stage, Threat, Group, Condition, Recall, Time_HH, Fear.rating, TOAccuracy, Analytic, Authentic, Clout, Tone
df_analyses <- subset(df, select=c(PID, Section, Stage, Threat, Group, Condition, Recall, Time_HH, Fear.rating, TOAccuracy, Analytic, Authentic, Clout, Tone))

#Subset only the One-Week Delay
df_analyses <- subset(df_analyses, Stage == "Delay")

#Subset only rows for Ghostly Grounds
df_analyses <- subset(df_analyses, Section == "GhostlyGrounds")

#Let's also remove rows with NA in any column of data frame
df_analyses <- df_analyses[complete.cases(df_analyses), ]

T-Tests!

A t-test can be used when the predictor variable consists of two categorical options and the outcome (dependent) variable is numeric. A t-test tells you whether the difference between these two categories or groups is significant; in other words, it lets you know whether the difference between the means of two groups could have been observed by chance. Imagine a situation where an evil teacher told half of a class the right chapter to study for a test and told the other half the wrong chapter. The two categories or groups would be Right Chapter and Wrong Chapter, and the outcome variable would be Test Score. Using a t-test, we could determine whether studying from the right materials produces higher test scores. Within the context of our data, we could ask whether participants assigned to an Experimental condition (e.g., Goal-assigned) versus a Control condition demonstrate significant differences in temporal memory accuracy.

QUESTION: Do goal-assigned and control participants differ significantly in their temporal memory accuracy?

HYPOTHESIS: “On average, goal-assigned participants who were assigned to a specific goal while touring the haunted house segment will have better temporal memory accuracy compared to control participants”

RELEVANT VARIABLES: Dependent: TOAccuracy (numeric) Independent: Group (Factor)

ANALYSIS: Two-Sample T-Test

model1 <- t.test(x = df_analyses$TOAccuracy[df_analyses$Group == "Goal-assigned"],
                y = df_analyses$TOAccuracy[df_analyses$Group == "Control"],
                paired = FALSE,
                alternative = "two.sided")

Okay, so let’s run the actual t-test. We’ll need to use conditional statements again to specify our variables. What we are comparing here are the mean values of temporal memory accuracy for goal-assigned and control participants. As such, we specify that we want temporal memory accuracy when Group == "Goal-assigned" and when Group == "Control". The paired argument asks whether this study is a within-subjects or a between-subjects design. This question is between-subjects, since each participant is either goal-assigned or control, so we mark that as FALSE. Lastly, R asks us to define our alternative hypothesis, which is a little beyond the scope of this review, so you will have to take my word that "two.sided" is the right call.
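As a side note, t.test() also accepts a formula interface, which saves us the bracket subsetting. Assuming Group contains exactly the two levels we subset above, this call is equivalent (though the sign of the difference may flip depending on the factor level order):

```r
#Equivalent two-sample t-test using the formula interface
t.test(TOAccuracy ~ Group, data = df_analyses, alternative = "two.sided")
```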

Lastly, when we look at T-Tests, standard deviations are very important, but the t.test() function won’t automatically generate those. We can use the sd() function to capture the standard deviation of temporal memory accuracy, adding the argument na.rm = T to tell R to ignore any NA values.

sd(df_analyses$TOAccuracy, na.rm = T)
## [1] 0.2134029
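Since we are comparing two groups, the group-wise standard deviations are usually even more informative than the overall one. One way to get them is tapply(), which applies sd() within each level of Group:

```r
#Standard deviation of TOAccuracy within each Group
tapply(df_analyses$TOAccuracy, df_analyses$Group, sd, na.rm = TRUE)
```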

So it looks like our hypothesis did NOT pan out! Judging by the p-value, we do not see a statistically significant difference (t(110.01) = 1.47, p = 0.145). To make interpretations even easier, we can pull out a nifty little tool from the report package. report() will save us the trouble and summarize the results of our test (albeit a little imperfectly) for us.

report(model1)
## Warning: Unable to retrieve data from htest object.
##   Returning an approximate effect size using t_to_d().
## Effect sizes were labelled following Cohen's (1988) recommendations.
## 
## The Welch Two Sample t-test testing the difference between
## df_analyses$TOAccuracy[df_analyses$Group == "Goal-assigned"] and
## df_analyses$TOAccuracy[df_analyses$Group == "Control"] (mean of x = 0.55, mean
## of y = 0.49) suggests that the effect is positive, statistically not
## significant, and small (difference = 0.06, 95% CI [-0.02, 0.14], t(110.01) =
## 1.47, p = 0.145; Cohen's d = 0.28, 95% CI [-0.10, 0.65])

ANOVAs

An ANOVA, or Analysis of Variance, can be used when the predictor variable or variables consist of two or more categorical options and the outcome or dependent variable is numeric. Much like a T-Test, an ANOVA tells you how significant the differences between these categories or groups are. The advantage over T-Tests is that we can compare multiple groups or categories in one analysis. We could revisit our last horrible example and imagine that the evil teacher tells one group the right chapter to study from, one group the wrong chapter to study from, and one group not to study at all. An ANOVA will tell us whether any of these three groups differ from one another (but not necessarily which specific groups differ). Within the context of our data, we could use this type of test to determine whether there are any condition-related differences in temporal memory accuracy.

QUESTION: Are there differences in temporal memory accuracy by experimental condition?

HYPOTHESIS: Differences will exist in temporal memory accuracy by condition

RELEVANT VARIABLES: Dependent: TOAccuracy (numeric) Independent: Condition (Factor)

ANALYSIS: ANOVA

Pay close attention to the formatting of the syntax here. It is the standard way in which we specify most statistical models in R, whether for regression, ANOVA, hierarchical modeling etc.

Due to the study design, the only time participants ever had a goal assigned was in the last two segments (DevilsDen and GhostlyGrounds). As a result, for these analyses, we’ll have to subset rows that reflect either Devils Den or Ghostly Grounds.

model2 <- aov(TOAccuracy ~ Condition, data = df_analyses) #create ANOVA model and store in an object called model2

The Base R command we use to specify that this is an ANOVA model is aov(). We then place our outcome/criterion/dependent variable next, followed by a tilde (~). Following the tilde comes the predictor/independent variable(s). Once the model has been specified, we note its end with a comma and tell R where the data lives. Now if we run this line, we’ll see a new object named model2 in our environment. But when we call that object…

model2
## Call:
##    aov(formula = TOAccuracy ~ Condition, data = df_analyses)
## 
## Terms:
##                 Condition Residuals
## Sum of Squares   0.353990  4.837662
## Deg. of Freedom         2       112
## 
## Residual standard error: 0.2078302
## Estimated effects may be unbalanced

… the information isn’t really formatted in a way that’s immediately meaningful or understandable. We typically look towards metrics like p values or means to understand ANOVA results and those are not present here. In order to see those, we need to summarize the ANOVA object.

summary(model2)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## Condition     2  0.354 0.17700   4.098 0.0192 *
## Residuals   112  4.838 0.04319                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Now we can see the variable we were exploring on the left, with its related degrees of freedom, sum of squares, mean square value, F value, and p value to its right, respectively. Using the significance codes at the bottom, we can see that the effect of Condition is statistically significant (p = 0.019).

The report() function works equally well on ANOVA objects.

report(model2)
## The ANOVA (formula: TOAccuracy ~ Condition) suggests that:
## 
##   - The main effect of Condition is statistically significant and medium (F(2,
## 112) = 4.10, p = 0.019; Eta2 = 0.07, 95% CI [6.43e-03, 1.00])
## 
## Effect sizes were labelled following Field's (2013) recommendations.
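Remember that a significant ANOVA only tells us that the conditions are not all equal, not which pairs of conditions differ. Base R’s TukeyHSD() function runs pairwise post-hoc comparisons on an aov object:

```r
#Pairwise post-hoc comparisons with Tukey's Honest Significant Difference
TukeyHSD(model2)
```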

We can add more variables to our model, though. What if we wanted to explore the main effects of Condition and the time of the haunted house tour (Time_HH) on Temporal Memory Accuracy? There were 3 nightly slots for the haunted house tour: 6:30PM, 7:30PM, and 8:30PM. What if, for whatever reason, completing the haunted house tour at 7:30PM was more memorable than completing it at 6:30PM or 8:30PM? Maybe 7:30PM was the peak time of the haunted house tour and added to the memorability of the experience. Adding the Time_HH variable to the model will help us look at that.

model3 <- aov(TOAccuracy ~ Condition  + Time_HH, data = df_analyses)

summary(model3)
##              Df Sum Sq Mean Sq F value Pr(>F)  
## Condition     2  0.354 0.17700   4.035 0.0204 *
## Time_HH       2  0.013 0.00640   0.146 0.8645  
## Residuals   110  4.825 0.04386                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

When we add Time_HH to the model, we see that Condition remains a significant predictor of Temporal Memory Accuracy (p = 0.020), but Time_HH itself does not significantly predict Temporal Memory Accuracy (p = 0.865). So the time slot of the tour, on its own, doesn’t seem to matter for how well people remembered the order of events.
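To get a feel for which condition is driving the Condition effect, we can also look at the group means directly:

```r
#Mean temporal memory accuracy per condition
aggregate(TOAccuracy ~ Condition, data = df_analyses, FUN = mean)
```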

Week 9 Exercise: Analyzing Data w/ Categorical Independent Variables

1) Conduct an ANOVA test assessing whether experimental condition predicted differences in how Analytically someone recalls their memory and store the model in an object called m1

2) Focus on these relevant variables: Dependent: Analytic (numeric) Independent: Condition (Factor)

3) Use the summary() function to print the model output and try to interpret the model yourself

4) Use the report() function to print the model interpretation


Click for solution

Week 9 Assignment: Analyzing Data w/ Categorical Independent Variables

I’m curious whether differences in Group (Goal-assigned versus Control) are associated with differences in how Authentically someone recalls their memory.

1) Read in the frightnight_analyses CSV file.

2) Subset a new data frame with the following columns: PID, Section, Stage, Threat, Group, Condition, Recall, Time_HH, Fear.rating, TOAccuracy, Analytic, Authentic, Clout, Tone

3) Subset rows that reflect assessments for Ghostly Grounds during the Delay study visit.

4) Remove rows with a missing value in ANY column

5) Run a t-test to test whether differences in Group (Goal-assigned versus Control) are associated with differences in how Authentically someone recalls their memory.

6) Store the t-test in a model called “model2”

7) Print model 2 and try to interpret the output yourself.

8) Run the report() function to try to see whether your interpretation matches.


Week 10: Analyzing Data w/ Continuous Independent Variables

We just finished covering analyses that use qualitative, categorical predictors. Next, we’ll cover analyses that use quantitative, numeric predictors, the most common of which is probably linear regression. Regression comes in many flavors, including bivariate linear regression, multivariate linear regression, and binary logistic regression. We won’t get too much into the theory, but we’ll work through some useful tools and syntax to get you prepared to use R on your own.

For the Week 10 workshop, let’s read in the frightnight_analyses CSV file!

# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"

#set working directory
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory

df <- read.csv(file = "frightnight_analyses.csv") #Load in the fright night practice csv file

Create data frame for analyses

And just like we did last week, let’s subset a data frame with the columns that we want to focus on for our Week 10 analyses.

#Subset the following columns: PID, Section, Stage, Threat, Group, Condition, Recall, Time_HH, Fear.rating, TOAccuracy, Analytic, Authentic, Clout, Tone
df_analyses <- subset(df, select=c(PID, Section, Stage, Threat, Group, Condition, Recall, Time_HH, Fear.rating, TOAccuracy, Analytic, Authentic, Clout, Tone))

#Subset only the One-Week Delay
df_analyses <- subset(df_analyses, Stage == "Delay")

#Subset only rows for Ghostly Grounds
df_analyses <- subset(df_analyses, Section == "GhostlyGrounds")

#Let's also remove rows with NA in any column of data frame
df_analyses <- df_analyses[complete.cases(df_analyses), ]

Bivariate Linear Regression

A bivariate linear regression can be used when both the predictor variable (X) and the outcome variable (Y) consist of continuous numeric values. A linear regression tells us how well X predicts Y. For example, if we measured temperature and ice cream sales, we might find, using linear regression, that as temperature increases, ice cream sales increase as well, and we could predict how many ice cream sales to expect for any one value of temperature. Within the context of our data, we could ask whether the accuracy of someone’s memory (i.e., temporal memory accuracy) predicts how authentically that person communicates their memory.
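The temperature and ice cream example can be sketched with simulated data (all numbers here are invented for illustration):

```r
#Hypothetical data: ice cream sales rise with temperature
set.seed(1)
temperature <- runif(50, min = 10, max = 35) #temperatures in Celsius
sales <- 20 + 5 * temperature + rnorm(50, sd = 10) #true slope of 5, plus noise

toy_model <- lm(sales ~ temperature) #fit the bivariate regression
coef(toy_model) #the estimated slope should land near the true value of 5
predict(toy_model, newdata = data.frame(temperature = 30)) #predicted sales at 30 degrees
```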

QUESTION: Does temporal memory accuracy predict an Authentic style of communication?

HYPOTHESIS: As temporal memory accuracy increases, Authenticity during memory recall will increase.

RELEVANT VARIABLES: Dependent: Authentic (numeric) Independent: Temporal Memory Accuracy (numeric)

ANALYSIS: Bivariate Linear Regression

You’re going to notice right off the bat that the structure of the syntax here looks awfully similar to what we just did with ANOVA. We start by noting our method with the lm() function (for Linear Model). We then note our outcome variable, add a tilde (~), note our predictor(s), and finally note our data source.

m1 <- lm(Authentic ~ TOAccuracy, data = df_analyses) #create bivariate linear regression and store in an object called "m1"

Just like ANOVA, we need to use a summary() function to read the data.

summary(m1) #use summary() function to print summary for m1 bivariate linear model
## 
## Call:
## lm(formula = Authentic ~ TOAccuracy, data = df_analyses)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -80.303  -4.755  12.574  16.894  18.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   82.508      6.379  12.934   <2e-16 ***
## TOAccuracy    -2.008     11.389  -0.176     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.95 on 113 degrees of freedom
## Multiple R-squared:  0.0002751,  Adjusted R-squared:  -0.008572 
## F-statistic: 0.03109 on 1 and 113 DF,  p-value: 0.8603

Lastly, we can run the report() function for linear models as well!

report(m1) #use report() function to print report for m1 bivariate linear model
## We fitted a linear model (estimated using OLS) to predict Authentic with
## TOAccuracy (formula: Authentic ~ TOAccuracy). The model explains a
## statistically not significant and very weak proportion of variance (R2 =
## 2.75e-04, F(1, 113) = 0.03, p = 0.860, adj. R2 = -8.57e-03). The model's
## intercept, corresponding to TOAccuracy = 0, is at 82.51 (95% CI [69.87, 95.15],
## t(113) = 12.93, p < .001). Within this model:
## 
##   - The effect of TOAccuracy is statistically non-significant and negative (beta
## = -2.01, 95% CI [-24.57, 20.56], t(113) = -0.18, p = 0.860; Std. beta = -0.02,
## 95% CI [-0.20, 0.17])
## 
## Standardized parameters were obtained by fitting the model on a standardized
## version of the dataset. 95% Confidence Intervals (CIs) and p-values were
## computed using a Wald t-distribution approximation.

Multivariate Linear Regression

A multivariate linear regression builds upon bivariate regression by allowing for multiple predictors. If we measured temperature, ice cream sales, and the time since someone last ate, we might find that our previously specified model is now even more accurate because of the addition of the “time since last ate” variable. Within the context of our data, let’s examine how the time of the haunted house tour interacts with Temporal Memory Accuracy to predict Authenticity during memory recall.
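One note on syntax before we run the model: in an R formula, + adds main effects only, while * adds the main effects plus their interaction. The two model objects below (hypothetical names) are equivalent ways of writing the same model:

```r
#Main effects plus their interaction, written two ways
m_interaction <- lm(Authentic ~ TOAccuracy * Time_HH, data = df_analyses)
m_explicit <- lm(Authentic ~ TOAccuracy + Time_HH + TOAccuracy:Time_HH, data = df_analyses)
```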

QUESTION: Does the interaction of Temporal Memory Accuracy and Time of Haunted House Tour predict how Authentically someone will recall their memory?

HYPOTHESIS: The time of the haunted house tour will interact with temporal memory accuracy to predict differences in how Authentically people recall their memory

RELEVANT VARIABLES: Dependent: Authentic (numeric) Independent: Temporal Memory Accuracy (numeric) Independent: Time_HH (factor)

ANALYSIS: Multivariate Linear Regression

m2 <- lm(Authentic ~ TOAccuracy*Time_HH, data = df_analyses)

summary(m2)
## 
## Call:
## lm(formula = Authentic ~ TOAccuracy * Time_HH, data = df_analyses)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -82.42  -3.74  10.18  15.40  28.54 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  64.51      10.83   5.958 3.19e-08 ***
## TOAccuracy                   29.71      19.77   1.503   0.1357    
## Time_HH07:30pm               33.51      15.35   2.184   0.0311 *  
## Time_HH08:30pm               19.96      15.64   1.276   0.2047    
## TOAccuracy:Time_HH07:30pm   -61.15      27.30  -2.240   0.0271 *  
## TOAccuracy:Time_HH08:30pm   -31.47      28.43  -1.107   0.2708    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25.78 on 109 degrees of freedom
## Multiple R-squared:  0.04846,    Adjusted R-squared:  0.004806 
## F-statistic:  1.11 on 5 and 109 DF,  p-value: 0.3593
report(m2)
## We fitted a linear model (estimated using OLS) to predict Authentic with
## TOAccuracy and Time_HH (formula: Authentic ~ TOAccuracy * Time_HH). The model
## explains a statistically not significant and weak proportion of variance (R2 =
## 0.05, F(5, 109) = 1.11, p = 0.359, adj. R2 = 4.81e-03). The model's intercept,
## corresponding to TOAccuracy = 0 and Time_HH = 06:15pm, is at 64.51 (95% CI
## [43.05, 85.98], t(109) = 5.96, p < .001). Within this model:
## 
##   - The effect of TOAccuracy is statistically non-significant and positive (beta
## = 29.71, 95% CI [-9.47, 68.90], t(109) = 1.50, p = 0.136; Std. beta = 0.25, 95%
## CI [-0.08, 0.57])
##   - The effect of Time HH07 × 30pm is statistically significant and positive
## (beta = 33.51, 95% CI [3.10, 63.93], t(109) = 2.18, p = 0.031; Std. beta =
## 0.07, 95% CI [-0.39, 0.53])
##   - The effect of Time HH08 × 30pm is statistically non-significant and positive
## (beta = 19.96, 95% CI [-11.04, 50.96], t(109) = 1.28, p = 0.205; Std. beta =
## 0.14, 95% CI [-0.32, 0.60])
##   - The effect of TOAccuracy × Time HH07 × 30pm is statistically significant and
## negative (beta = -61.15, 95% CI [-115.25, -7.04], t(109) = -2.24, p = 0.027;
## Std. beta = -0.51, 95% CI [-0.95, -0.06])
##   - The effect of TOAccuracy × Time HH08 × 30pm is statistically non-significant
## and negative (beta = -31.47, 95% CI [-87.83, 24.88], t(109) = -1.11, p = 0.271;
## Std. beta = -0.26, 95% CI [-0.73, 0.21])
## 
## Standardized parameters were obtained by fitting the model on a standardized
## version of the dataset. 95% Confidence Intervals (CIs) and p-values were
## computed using a Wald t-distribution approximation.

Hmm, so it looks like something is happening with the 7:30PM time slot of the haunted house tours! The significant interaction term tells us that the relationship between Temporal Memory Accuracy and Authenticity differs for participants in the 7:30PM time slot compared to participants in the other time slots.

While linear regression is a very popular statistical analysis, sometimes it’s preferable to use a linear mixed effects model!

Linear Mixed Effects Models

Up until this point, we’ve been working with a dataset where each participant only has 1 row. However, most of the time, our research questions revolve around repeated measures, which involves each participant having multiple rows. For example, we’re not just interested in how participants remembered the Ghostly Grounds segment; we actually care about how they remembered all the haunted house segments.

Linear mixed effect models involve multiple measures per subject – each person has multiple temporal memory accuracy responses.

This violates the independence assumption: Multiple responses from the same subject cannot be regarded as independent from each other.

Since every person has a slightly different memory store, this is going to be an idiosyncratic factor that affects all responses from the same subject, thus rendering these different responses inter-dependent rather than independent.

The way we’re going to deal with this situation is to add a random effect for subject.

This allows us to resolve this non-independence by assuming a different ‘baseline’ memory value for each subject.

So subject 1 may have a mean temporal memory accuracy score of .5 across the haunted house segments and subject 2 may have a mean temporal memory accuracy score of .2

We can model these individual differences by assuming different random intercepts for each subject. By assuming different random intercepts for each subject, we are essentially saying that everyone’s “starting point” or “baseline” is different.

Linear mixed effects models rely on fixed effects and random effects. Fixed effects are the traditional independent variables whose relationship with the dependent variable we want to examine. Random effects allow us to account for individual differences. In other words, above and beyond individual differences per subject, does the fixed effect X predict Y?

However, linear mixed effects models work a bit differently than linear regressions or multiple linear regressions. For linear mixed effect models, we need to include two models: a “baseline” model and a “testing” model. We’ll revisit this structure in the example below.
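The lmer() function we will use below comes from the lme4 package (and report() comes from the report package), so make sure those are installed and loaded first:

```r
#Install (once) and load the packages used for linear mixed effects models
#install.packages("lme4")
#install.packages("report")
library(lme4)
library(report)
```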

For this analysis, we will use a different dataset that includes multiple measures per subject.

QUESTION: Do differences in Threat lead to differences in Temporal Memory Accuracy?

HYPOTHESIS: People will remember the order of high-threat events better compared to low-threat events.

RELEVANT VARIABLES: Dependent: TOAccuracy (numeric) Independent: Threat (factor)

ANALYSIS: Mixed effects regression

Create data frame for analyses

For this analysis, let’s create a new data frame that subsets the following columns from the original df data frame:

#Create a new dataframe that includes the following columns: PID, Section, Stage, Threat, Group, Condition, Recall, Time_HH, Fear.rating, TOAccuracy, Analytic, Authentic, Clout, Tone
df_mixed <- subset(df, select=c(PID, Section, Stage, Threat, Group, Condition, Recall, Time_HH, Fear.rating, TOAccuracy, Analytic, Authentic, Clout, Tone))

#Remove rows that have a missing value in any column
df_mixed <- df_mixed[complete.cases(df_mixed), ]

Let’s establish a baseline model that includes our dependent variable (Temporal Memory Accuracy scores) and the random effect (1|PID). Let’s also establish a testing model that includes our dependent variable, a fixed effect (Threat), and the random effect (1|PID). By building two models that differ by one variable (Threat), we can use the anova() function to determine whether the effect of Threat on Temporal Memory Accuracy scores is significant.

#Baseline model: Dependent variable and random effect
m1 <- lmer(TOAccuracy ~ (1|PID), data = df_mixed)

#Testing model: Dependent variable, fixed effect(s) and random effect
m2 <- lmer(TOAccuracy ~ Threat + (1|PID), data = df_mixed)

#Use the anova() function to determine the significance of the fixed effect
anova(m1, m2)
## refitting model(s) with ML (instead of REML)

It also may be helpful to recognize that in this example, Threat is the fixed effect, whereas (1|PID) is a random effect that assumes different random intercepts for each subject. We also see that our hypothesis was correct! People demonstrated greater temporal memory accuracy for high-threat segments (Devil’s Den, Ghostly Grounds) compared to low-threat segments (Infirmary, Asylum).
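If you want to look inside the fitted model rather than just compare models, summary() works on lmer objects too, and lme4 provides fixef() and ranef() for pulling out the fixed-effect coefficients and the per-subject random intercepts:

```r
summary(m2) #full model summary: fixed effects and random effect variance
fixef(m2) #fixed-effect coefficients (the effect of Threat)
ranef(m2) #each participant's deviation from the overall intercept
```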

We can also add in more fixed effects!

QUESTION: Does temporal memory accuracy interact with threat and lead to differences in how Authentically someone communicates their memory?

HYPOTHESIS: Temporal memory accuracy will interact with threat to predict differences in how people Authentically recall their memory.

RELEVANT VARIABLES: Dependent: Authentic (numeric) Independent: TOAccuracy (numeric) Independent: Threat (factor)

ANALYSIS: Mixed effects regression

#Baseline model: Dependent variable, fixed effects, and random effect
m3 <- lmer(Authentic ~ TOAccuracy + Threat + (1|PID), data = df_mixed)

#Testing model: Dependent variable, interaction between fixed effects, and random effect
m4 <- lmer(Authentic ~ TOAccuracy*Threat + (1|PID), data = df_mixed)

#Use the anova() function to determine the significance of the fixed effect
anova(m3, m4)
## refitting model(s) with ML (instead of REML)

In this case, TOAccuracy and Threat are our fixed effects, whereas (1|PID) is our random effect. This time, our hypothesis did not pan out! The interaction between Temporal Memory Accuracy and Threat was unrelated to how Authentically people recalled their memories.

Week 10 Exercise: Analyzing Data w/ Continuous Independent Variables

1) Conduct a multiple linear regression assessing whether Time of Haunted house and Temporal Memory Accuracy independently predicted differences in how Analytically someone recalled their memory and store the model in an object called m1.

2) Focus on these relevant variables: Dependent: Analytic (numeric) Independent: Temporal Memory Accuracy (numeric), Independent: Time of Haunted House (Factor)

3) Use the summary() function to print the model output and try to interpret the model yourself

4) Use the report() function to print the model interpretation


Click for solution

Week 10 Assignment: Analyzing Data w/ Continuous Independent Variables

As part of the Week 10 R assignment, let’s explore a new research question.

QUESTION: Do temporal memory accuracy or threat predict differences in how negatively someone communicates their memory?

HYPOTHESIS: Temporal memory accuracy will not predict differences in how negatively people recall their memory. Threat will predict differences in how negatively people recall their memory.

RELEVANT VARIABLES: Dependent: Tone (numeric) Independent: TOAccuracy (numeric) Independent: Threat (factor)

ANALYSIS: Mixed effects regression

1) Read in the frightnight_analyses CSV file

2) Create a new dataframe that includes the following columns: PID, Section, Stage, Threat, Group, Condition, Recall, Time_HH, Fear.rating, TOAccuracy, Analytic, Authentic, Clout, Tone

3) Remove rows that have a missing value in any column

4) Run a linear mixed effects model that assesses the relationship between Tone and Temporal Memory Accuracy and Threat. Make sure to include a random effect that accounts for individual differences.

5) Store the model in a data object called “m5”

6) Print the summary of the model and try to interpret the model yourself

7) Use the report() function to print out an interpretation of the model and see if your interpretation matches


Week 11: Visualizing data: Intro to ggplot

For the Week 11 workshop, let’s read in the frightnight_analyses CSV file!

# For Mac
Path <- "/Users/tuh20985/Desktop/CABLAB-R-Workshop-Series-main/datasets/"

#set working directory
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory

df <- read.csv(file = "frightnight_analyses.csv") #Load in the fright night practice csv file

Before we start making any graphs, let’s create a new data frame with the following columns from the original df dataframe: PID, Section, Stage, Threat, Group, Condition, Recall, Time_HH, Fear.rating, TOAccuracy, Analytic, Authentic, Clout, Tone

#Create a new dataframe that includes the following columns: PID, Section, Stage, Threat, Group, Condition, Recall, Time_HH, Fear.rating, TOAccuracy, Analytic, Authentic, Clout, Tone
df_plot <- subset(df, select=c(PID, Section, Stage, Threat, Group, Condition, Recall, Time_HH, Fear.rating, TOAccuracy, Analytic, Authentic, Clout, Tone))

#Remove rows that have a missing value in any column
df_plot <- df_plot[complete.cases(df_plot), ]

Visualizing data! Navigating ggplot2

Next we’ll start talking about how to make graphs in R!

ggplot2 is a popular plotting package in R that makes it fairly easy to create complex plots from data in a data frame.

ggplot2 refers to the name of the package itself, whereas we use the function ggplot() to generate the plots. We’re going to start off with building a very simple plot, and then we will add in some more lines to organize a plot like you would for a manuscript/publication!

Let’s revisit our t-test, where we were interested in whether there were differences in Temporal Memory Accuracy between Goal-assigned and Control participants.

ggplot(data = df_plot, aes(x = Group, y = TOAccuracy)) + #Plot the variables we care about
          geom_bar(stat="identity") #Generate a bar plot

Here, we see that we’re plotting how TOAccuracy varies according to Group (Control versus Goal-assigned).

geom_bar(stat = “identity”) is a ggplot2 layer used to create bar plots; stat = “identity” tells ggplot to use the y values in the data directly as bar heights (when a group has multiple rows, those values are stacked, so the bar shows their sum).
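A small shortcut worth knowing: geom_col() is ggplot2’s shorthand for geom_bar(stat = "identity"), so this produces the same plot:

```r
ggplot(data = df_plot, aes(x = Group, y = TOAccuracy)) +
          geom_col() #identical to geom_bar(stat = "identity")
```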

Now that we know how to plot data, let’s clean the plot up. Next, we’re going to add a few more layers to our ggplot call to add titles, axis labels, and legend labels.

ggplot(data = df_plot, aes(x = Group, y = TOAccuracy, fill = Group)) + #Using fill = Group allows us to color-code the plot according to Group
          geom_bar(stat="identity") + #Generate a bar plot
          labs(x = 'Experimental Group', y = 'Temporal Memory Accuracy', title = "Temporal Memory Accuracy by Group") + #Assign axis titles for the x- and y-axis
          scale_x_discrete(labels = c("Control", "goal-assigned")) + #Change the x-axis labels
          scale_fill_discrete("Experimental Group", labels = c("Control", "goal-assigned")) #Change the labels for the legend

It’s important to note that in the previous graphs, we have conflated the Temporal Memory Accuracy scores during the Immediate and One-Week Delay study visits. In other words, the bars that reflect the Temporal Memory Accuracy scores contain scores from both the Immediate and 1-week Delay study visit.

Luckily, we can use a pretty simple function called facet_wrap() to split graphs based on a variable.

Let’s apply the facet_wrap() function to visualize Temporal Memory Accuracy scores by Group during the Immediate and 1-week Delay study visits.

ggplot(data = df_plot, aes(x = Group, y = TOAccuracy, fill = Group)) + #Using fill = Group allows us to color-code the plot according to Group
          geom_bar(stat="identity") + #Generate a bar plot
          labs(x = 'Experimental Group', y = 'Temporal Memory Accuracy', title = "Temporal Memory Accuracy by Group") + #Define a plot title, an x-axis title, and a y-axis title
          scale_x_discrete(labels = c("Control", "goal-assigned")) + #Change the x-axis labels
          scale_fill_discrete("Experimental Group", labels = c("Control", "goal-assigned")) + #Change the labels for the legend
          facet_wrap(~Stage) #Split the graphs based on the Stage variable

Week 11 Exercise: Visualizing data: Intro to ggplot

1) Create a bar plot visualizing how Temporal Memory Accuracy scores differ by experimental condition across The Immediate and 1-week Delay study visits

2) Add a plot title, x-axis title, and y-axis title


Click for solution

So we already learned how to add a plot title, an x-axis title, and a y-axis title, as well as how to change x-axis text labels and legend text labels. We can still customize additional plot aesthetics that we’ll talk about below.

Let’s work on changing the background of the plot, the colors of the bars in the bar plot, and changing the size and color of the text.

ggplot(data = df_plot, aes(x = Group, y = TOAccuracy, fill = Group)) + #Using fill = Group allows us to color-code the plot according to Group
          geom_bar(stat="identity") + #Generate a bar plot
          scale_fill_manual(values = c("#E69F00", "#56B4E9"), name = "Experimental Group", labels = c("Control", "goal-assigned")) + #Customize the colors of the bars in the bar plot, add a legend title, and change the legend labels
          labs(x = 'Experimental Group', y = 'Temporal Memory Accuracy', title = "Temporal Memory Accuracy by Group") + #Assign axis titles for the x- and y-axis
          scale_x_discrete(labels = c("Control", "goal-assigned")) + #Change the x-axis labels
theme_classic() + ## We can use a theme to customize the background of the plot. theme_classic() makes the background white and removes gridlines
theme(
plot.title = element_text(size=15, face = "bold", color="red"), #customize plot title
axis.title.x = element_text(size=15, face = "italic", color="green"), #Customize x-axis title
axis.title.y = element_text(size=15, face = "bold", color="#5C5CD1"), #Customize y-axis title
axis.text.x = element_text(size=15, face = "italic", color="#D15CD1"), #Customize x-axis text labels
axis.text.y = element_text(size=15, face = "bold", color="black"), #Customize y-axis text labels
legend.title = element_text(size = 13, face = "italic", color = "#ED7557"), #Customize legend title
legend.text = element_text(size = 13, face = "bold", color = "#ABA7A6") #Customize legend text labels
)

As you can see, there is a lot we can adjust when it comes to text and graphics using ggplot2, and it’s fairly straightforward once you get the hang of it. When it comes to colors in R, you can either use 6-digit HEX codes or reference colors by name!

Here’s a helpful resource for color names in R: https://www.datanovia.com/en/blog/awesome-list-of-657-r-color-names/ Here’s another resource for HEX color codes in R: https://r-charts.com/colors/

I like to use this website which lets you pick the color visually and then outputs the corresponding HEX code: https://ssc.wisc.edu/shiny/users/jstruck2/colorpicker/
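To see both approaches side by side, here’s a quick sketch using a made-up two-row data frame (the values are hypothetical, just to make the example self-contained) rather than our real data:

```r
library(ggplot2)

#A tiny stand-in data frame with made-up values, just for illustration
demo <- data.frame(Group = c("Control", "Goal-assigned"),
                   TOAccuracy = c(0.55, 0.54))

#Colors can be specified by name...
p1 <- ggplot(demo, aes(x = Group, y = TOAccuracy, fill = Group)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("orange", "steelblue"))

#...or by 6-digit HEX code ("#FFA500" and "#4682B4" are those same two colors)
p2 <- ggplot(demo, aes(x = Group, y = TOAccuracy, fill = Group)) +
  geom_bar(stat = "identity") +
  scale_fill_manual(values = c("#FFA500", "#4682B4"))
```

Either way produces the same plot, so it mostly comes down to whether a named color happens to match the shade you want.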

Okay, now that we’ve played around with customizing aesthetics in R, we’re going to put that aside and focus on best practices for visualizing summary statistics in R.

Plotting the mean and standard deviation in R

So far, our y-axis has reflected an aggregate (summed) score for temporal memory accuracy within each group, rather than the average score. For example, we’re not necessarily interested in the combined score within the Control and goal-assigned groups, but rather whether there are differences in average temporal memory accuracy scores.

#Research Question: On average, are there differences in Temporal Memory Accuracy by Group?
model1 <- t.test(x = df_plot$TOAccuracy[df_plot$Group == "Goal-assigned"],
                y = df_plot$TOAccuracy[df_plot$Group == "Control"],
                paired = FALSE,
                alternative = "two.sided")

report(model1)
## Warning: Unable to retrieve data from htest object.
##   Returning an approximate effect size using t_to_d().
## Effect sizes were labelled following Cohen's (1988) recommendations.
## 
## The Welch Two Sample t-test testing the difference between
## df_plot$TOAccuracy[df_plot$Group == "Goal-assigned"] and
## df_plot$TOAccuracy[df_plot$Group == "Control"] (mean of x = 0.54, mean of y =
## 0.55) suggests that the effect is negative, statistically not significant, and
## very small (difference = -0.01, 95% CI [-0.05, 0.02], t(645.74) = -0.83, p =
## 0.409; Cohen's d = -0.07, 95% CI [-0.22, 0.09])
#Plot the average Temporal Memory Accuracy by Group
ggplot(data = df_plot, aes(x = Group, y = TOAccuracy, fill = Group)) + #Using fill = Group allows us to color-code the plot according to Group
          geom_bar(stat='summary', fun='mean') + #plot the mean of temporal memory accuracy
          labs(x = 'Experimental Group', y = 'Temporal Memory Accuracy', title = "Average Temporal Memory Accuracy by Group") #Assign axis titles for the x- and y-axis

By plotting the average Temporal Memory Accuracy per group, this bar plot is the correct visualization of the t-test we ran earlier, which assessed whether average temporal memory accuracy differed between goal-assigned and Control participants.

As the graph shows, the difference is not large at all, and the t-test confirms that it is not statistically significant (p = .409).

Adding standard deviation error bars

A traditional way to better represent how scores vary within each level of the independent variable (i.e., Group) is by adding error bars that reflect the standard deviation. For example, maybe the standard deviation of average temporal memory accuracy for Control participants is .2, whereas the standard deviation for goal-assigned participants is .9. These differences in standard deviation can help contextualize how to interpret the findings.

Let’s apply a function in order to plot the mean and standard deviation per group using the data_summary() function described here: http://www.sthda.com/english/wiki/ggplot2-barplots-quick-start-guide-r-software-and-data-visualization

# Don't change anything to this function, just run it as is!

#+++++++++++++++++++++++++
# Function to calculate the mean and the standard deviation
  # for each group
#+++++++++++++++++++++++++
# data : a data frame
# varname : the name of a column containing the variable
  #to be summarized
# groupnames : vector of column names to be used as
  # grouping variables
data_summary <- function(data, varname, groupnames){
  require(plyr)
  summary_func <- function(x, col){
    c(mean = mean(x[[col]], na.rm=TRUE),
      sd = sd(x[[col]], na.rm=TRUE))
  }
  data_sum<-ddply(data, groupnames, .fun=summary_func,
                  varname)
  data_sum <- rename(data_sum, c("mean" = varname))
 return(data_sum)
}


#Apply the data_summary() function on our dataset and create a new data object called df_summarized
#varname = dependent variable
#groupnames = independent variable(s) 
df_summarized <- data_summary(df_plot, varname="TOAccuracy", 
                    groupnames=c("Group"))
## Loading required package: plyr
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:purrr':
## 
##     compact
#Print df_summarized to see that we capture the mean and SD of Temporal Memory Accuracy per group
df_summarized
#The function geom_errorbar() can be used to produce a bar graph with error bars.

# Plot the standard deviation of the mean as error bars
ggplot(df_summarized, aes(x=Group, y=TOAccuracy, fill=Group)) + #plot the variables we are interested in from the df_summarized data object
   geom_bar(stat="identity", position=position_dodge()) + #generate a bar plot
  geom_errorbar(aes(ymin=TOAccuracy-sd, ymax=TOAccuracy+sd), width=.2, #use the geom_errorbar() function to plot the standard deviation error bars
                 position=position_dodge(.9))

Here, we applied the data_summary() function (which someone else kindly created so that it works with any dataset to produce the mean and SD of the variables you are interested in), and we used the geom_errorbar() function to plot the standard deviation error bars.
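As an aside: the console warned us about loading plyr after dplyr. If you’d rather avoid that conflict entirely, the same group-wise mean and SD can be computed with dplyr alone. Here’s a sketch (the demo data frame and its values are made up, standing in for df_plot):

```r
library(dplyr)

#Hypothetical stand-in for df_plot, just to make the example self-contained
demo <- data.frame(Group = rep(c("Control", "Goal-assigned"), each = 3),
                   TOAccuracy = c(0.50, 0.60, 0.55, 0.50, 0.58, 0.54))

#Group-wise SD and mean, analogous to what data_summary() returns
#(sd is computed first so that overwriting TOAccuracy with the mean doesn't affect it)
demo_summarized <- demo %>%
  group_by(Group) %>%
  summarise(sd = sd(TOAccuracy, na.rm = TRUE),
            TOAccuracy = mean(TOAccuracy, na.rm = TRUE))
```

The resulting data frame can be handed to the same geom_errorbar() code shown above.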

Now, while this is helpful, most researchers aren’t actually interested in standard deviation error bars, and instead prefer error bars that reflect the standard error of the mean.

Next, we will apply a similar function to generate and plot standard error bars
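Before we do, it’s worth noting that the standard error of the mean is just the standard deviation divided by the square root of the sample size. As a quick sketch (with made-up scores):

```r
#Standard error of the mean: SD / sqrt(N)
x  <- c(0.50, 0.60, 0.55, 0.58, 0.52)  #made-up accuracy scores
se <- sd(x) / sqrt(length(x))
```

This is exactly the calculation the function below performs for each group.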

Adding standard error bars

The summarySE function is appropriate when working with between-subjects variables. If you have within-subjects variables and want to adjust the error bars so that inter-subject variability is removed, as suggested by Loftus and Masson (1994), then two additional functions, normDataWithin and summarySEwithin, must also be added to your code; summarySEwithin is then the function that you call.

## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
##   data: a data frame.
##   measurevar: the name of a column that contains the variable to be summarized
##   groupvars: a vector containing names of columns that contain grouping variables
##   na.rm: a boolean that indicates whether to ignore NA's
##   conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
                      conf.interval=.95, .drop=TRUE) {
    library(plyr)

    # New version of length which can handle NA's: if na.rm==T, don't count them
    length2 <- function (x, na.rm=FALSE) {
        if (na.rm) sum(!is.na(x))
        else       length(x)
    }

    # This does the summary. For each group's data frame, return a vector with
    # N, mean, and sd
    datac <- ddply(data, groupvars, .drop=.drop,
      .fun = function(xx, col) {
        c(N    = length2(xx[[col]], na.rm=na.rm),
          mean = mean   (xx[[col]], na.rm=na.rm),
          sd   = sd     (xx[[col]], na.rm=na.rm)
        )
      },
      measurevar
    )

    # Rename the "mean" column    
    datac <- rename(datac, c("mean" = measurevar))

    datac$se <- datac$sd / sqrt(datac$N)  # Calculate standard error of the mean

    # Confidence interval multiplier for standard error
    # Calculate t-statistic for confidence interval: 
    # e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
    ciMult <- qt(conf.interval/2 + .5, datac$N-1)
    datac$ci <- datac$se * ciMult

    return(datac)
}



# Apply the summarySE() function on our dataset to create a data object called df_summarize.se
df_summarize.se <- summarySE(df_plot, measurevar="TOAccuracy", groupvars ="Group")


#Print the df_summarize.se to confirm that we captured the mean and standard error of Temporal Memory Accuracy by group
df_summarize.se
# Error bars represent standard error of the mean
ggplot(df_summarize.se, aes(x=Group, y=TOAccuracy, fill=Group)) + #plot the variables we are interested in from the df_summarized data object
    geom_bar(position=position_dodge(), stat="identity") + #generate a bar plot
    geom_errorbar(aes(ymin=TOAccuracy-se, ymax=TOAccuracy+se), #Plot standard error bars
                  width=.2,                    # Width of the error bars
                  position=position_dodge(.9))

Violin plots!

Violin plots are a great way to check the distribution of the data (e.g., outliers)

Let’s create a violin plot reflecting how Temporal Memory Accuracy differs by Group.

Just like we used geom_bar() to generate a bar plot, we will use geom_violin() to generate a violin plot.

One thing that we haven’t talked about yet is that we can also store plots in a data object, just like we do with statistical models!

Storing a plot in a data object can improve the readability of the plot code. In other words, rather than putting all of the code in one giant chunk, you can add in customizations one line at a time.

Below, we will create a violin plot that shows the distribution of Temporal Memory Accuracy scores by group.

#Create violin plots and store it in an object called "p"
p <- ggplot(data=df_plot, aes(x=Group, y=TOAccuracy, fill=Group)) + 
  geom_violin(trim=FALSE) 

#Add in the jittered points that reflect each individual participant
p + geom_jitter(shape=16, position=position_jitter(0.2))

While we didn’t add too many customizations to this plot, hopefully you can see why some people prefer to store plots in data objects and add in their customizations one line at a time, rather than all at once.

I’ll also provide some additional information about the violin plot arguments below.

geom_jitter() will plot the individual data points as dots. shape = the type of point to draw; 16 is a filled circle and 2 is a triangle. position_jitter() controls the amount of jitter in the x direction.

While we’ve only covered bar plots and violin plots in this section, there are tons of other plots that may work better depending on which research question you’re interested in visualizing!

Week 11 Assignment: Visualizing data: Intro to ggplot

Let’s create a violin plot that plots Temporal Memory Accuracy Scores by Condition! Let’s customize this plot to meet the following criteria:

1) Read in the frightnight_analyses CSV file.

2) Create a new data frame with the following columns from the original df dataframe: PID, Section, Stage, Threat, Group, Condition, TOAccuracy

3) Remove rows that have a missing value in any column

4) Plot Temporal Memory Accuracy Score by Condition

5) Use the geom_violin() and geom_jitter() functions to generate a violin plot

6) Update the colors of each violin to reflect the following colors: “#555599”, “#66BBBB”, “#DD4444”

7) Change the title of the legend to “Experimental Condition”

8) Change the legend labels to: “Baseline Condition”, “Share Condition”, “Test Condition”

9) Apply the theme_minimal() background to the plot

10) Change the plot title to “Temporal Memory Accuracy by Condition”, the x-axis title to “Experimental Condition”, and the y-axis title to “Temporal Memory Accuracy”.

11) Change the x-axis text labels to “1) Baseline”, “2) Share”, “3) Test”

12) Change the plot title, x-axis title, y-axis title, x-axis text, and y-axis text font size to 17

13) Change the legend title and legend text font size to 15

'



'

Final Project?

TBD :)

Conclusion

Congrats! You made it through an entire R workshop series! Hopefully you feel like you’re starting to pick up on some things! It’s also totally okay if things aren’t clicking just yet. Regardless of how you may be feeling, I will always be happy to work with you one-on-one if you have any specific questions as you move forward on your coding journey. My email is , or you can always send me a message on Slack!