Issue
I'm trying to run grep across multiple columns to create a new binary variable in my dataset. I can't share my real dataset, but I've created a sample one to demonstrate my issue:
breakfast <- c("apple orange", "orange banana", "apple")
lunch <- c("orange", "apple orange", "apple banana")
df <- data.frame(breakfast, lunch)
In this example, my goal is to create a new binary variable in this dataframe called "apple" that is 1 if either the "breakfast" or "lunch" columns contain "apple" and 0 if they do not.
I can achieve this by using nested ifelse statements and grepl:
df$apple <- ifelse(grepl("apple", df$breakfast), 1,
                   ifelse(grepl("apple", df$lunch), 1, 0))
In my real dataset though, I need to scan more than just two columns and repeat the process for multiple strings, so I'm hoping to create a function that will run it through the columns for me. What's the best way to do this?
I've found several posts that address similar questions, but many of them match against variables that hold a single value rather than concatenated strings (== "apple" rather than contains "apple"). I'm also struggling to adapt the existing examples to create the binary variable I'm looking for.
Solution
A general solution would be to (i) define a vector with all possible fruits:
fruits <- c("apple", "orange", "banana", "lemon")
and (ii) run a for loop that checks, for each fruit, whether it is present in any of the columns and creates a new column for that fruit:
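library(stringr)
# Paste the original text columns together row by row, once,
# so the new 0/1 columns are not included in the search
combined <- apply(df[c("breakfast", "lunch")], 1, paste, collapse = " ")
for (fruit in fruits) {
  # The unary + converts TRUE/FALSE to 1/0
  df[fruit] <- +str_detect(combined, fruit)
}
df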
      breakfast        lunch apple orange banana lemon
1  apple orange       orange     1      1      0     0
2 orange banana apple orange     1      1      1     0
3         apple apple banana     1      0      1     0
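Since the question asks for a function that can scan any number of columns, here is a minimal base R sketch of one way to wrap the same idea. The name flag_terms and its arguments are made up for illustration and are not part of any package; it assumes the search terms are plain substrings rather than regular expressions.

```r
# Rebuild the sample data from the question
breakfast <- c("apple orange", "orange banana", "apple")
lunch <- c("orange", "apple orange", "apple banana")
df <- data.frame(breakfast, lunch)

# flag_terms() is a hypothetical helper, not part of any package.
# For each term it adds a 0/1 column that is 1 when any of the
# chosen columns contains the term as a substring.
flag_terms <- function(data, cols, terms) {
  for (term in terms) {
    # One logical vector per column; fixed = TRUE matches the term literally
    hits <- lapply(data[cols], function(x) grepl(term, x, fixed = TRUE))
    # Combine the per-column results with a row-wise OR, then convert to 0/1
    data[[term]] <- as.integer(Reduce(`|`, hits))
  }
  data
}

result <- flag_terms(df, cols = c("breakfast", "lunch"),
                     terms = c("apple", "orange", "banana", "lemon"))
result
```

To scan more columns, add their names to cols. Note that substring matching means "apple" would also match "pineapple"; if that matters for your data, you would probably need a regular expression with word boundaries instead of fixed = TRUE.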
For more solutions, see "Detecting key words across multiple columns and flagging them each in new columns".
Answered By - Chris Ruehlemann Answer Checked By - Mary Flores (WPSolving Volunteer)