Issue
I have the following df
A B
"Axon guidance" 1
"Chemical carcinogenesis - reactive oxygen species" 2
"Electron Transport Chain (OXPHOS system in mitochondria)" 3
"The citric acid (TCA) cycle and respiratory electron transport" 4
Using
grep(paste0("^", df[3,1], "$"), df[,1]))
Returns integer(0), i.e. no match
Using
grep(paste0("^", df[2,1], "$"), df[,1]))
Finds the exact match (it returns the index of the row containing the match)
Why can't grep find an exact match when the string contains parentheses?
Solution
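As already noted, the problem here is that round brackets are regex metacharacters: in a search pattern they define a capture group instead of matching a literal "(" or ")". The pattern built from row 3 therefore looks for the text without its brackets and finds nothing.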
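The effect can be seen in isolation with a short string. This is a minimal sketch; the test string is made up for illustration and is not taken from the question.

```r
# Unescaped brackets form a capture group; they are not matched literally.
pattern <- "^The (TCA) cycle$"

grepl(pattern, "The (TCA) cycle")                  # FALSE: literal brackets are not matched
grepl(pattern, "The TCA cycle")                    # TRUE: the group just captures "TCA"
grepl("^The \\(TCA\\) cycle$", "The (TCA) cycle")  # TRUE: escaped brackets match literally
```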
Two approaches you may wish to consider are:
- Sanitise both the text being searched and the text used to build the search patterns by stripping the relevant characters
- Double escape the RegEx control characters in the search patterns
Generate Sample Data
df <- data.frame(A=c("Axon guidance",
"Chemical carcinogenesis - reactive oxygen species",
"Electron Transport Chain (OXPHOS system in mitochondria)",
"The citric acid (TCA) cycle and respiratory electron transport"),
B=1:4)
Demonstrate problem
grep(paste0("^", df[2,1], "$"), df[,1]) # <- the OP has an extra bracket here
grep(paste0("^", df[3,1], "$"), df[,1]) # returns integer(0): the brackets are read as a capture group
Option 1
Here we sanitise both the text being searched and the patterns used to search it.
Only round brackets are removed here, but regex has other special characters (and complex Unicode characters can also cause problems).
df$sanitised_text <- gsub("[()]", "", df$A)
Demonstrate Solution
grep(paste0("^", df[2, "sanitised_text"], "$"), df[, "sanitised_text"]) # returns 2
grep(paste0("^", df[3, "sanitised_text"], "$"), df[, "sanitised_text"]) # returns 3
Option 2 - Double escape the regex control characters
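# Prefix each round bracket with a backslash so the regex engine treats it as a literal
sanitise_search_patterns <- function(x){
  y <- gsub("\\(", "\\\\(", x)  # "(" becomes "\("
  gsub("\\)", "\\\\)", y)       # ")" becomes "\)"
}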
df$sanitised_search_patterns <- sanitise_search_patterns(df$A)
Demonstrate Solution
grep(paste0("^", df[2, "sanitised_search_patterns"], "$"), df[, "A"]) # returns 2
grep(paste0("^", df[3, "sanitised_search_patterns"], "$"), df[, "A"]) # returns 3
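Option 2 escapes round brackets only. As a rough sketch of how the same idea might be extended to the other regex metacharacters, the helper below (escape_regex, a hypothetical name that is not part of the original answer) prefixes each of them with a backslash. The pathways vector is just a stand-in for df$A.

```r
# Hypothetical helper: prefix every regex metacharacter with a backslash
# so that the text is matched literally when used as a pattern.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

# Stand-in for df$A
pathways <- c("Axon guidance",
              "Electron Transport Chain (OXPHOS system in mitochondria)",
              "The citric acid (TCA) cycle and respiratory electron transport")

escaped <- escape_regex(pathways[2])
grep(paste0("^", escaped, "$"), pathways)  # 2
```

If all you need is whole-string equality, which(pathways == pathways[2]) avoids regex altogether. grep(..., fixed = TRUE) also treats the pattern literally, but it cannot be combined with the ^ and $ anchors.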
You could use either approach here, but non-control characters can cause similar false negatives - e.g. the many Unicode variants of whitespace and hyphens, or complex characters built from more than one code point - so sanitising the search text may still be worth considering alongside double escaping.
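To illustrate the whitespace and hyphen case, here is a minimal sketch of one possible normalisation step. The helper name normalise_text and the sample strings are assumptions for illustration, and it relies on R's PCRE build supporting Unicode properties (\p{...}), which is usual but not guaranteed.

```r
# Hypothetical helper: map Unicode space and dash variants to plain ASCII.
normalise_text <- function(x) {
  x <- enc2utf8(x)
  x <- gsub("\\p{Zs}", " ", x, perl = TRUE)  # any Unicode space separator -> " "
  gsub("\\p{Pd}", "-", x, perl = TRUE)       # any Unicode dash -> "-"
}

# Looks identical on screen, but uses a no-break space and an en dash
a <- "Chemical carcinogenesis\u00a0\u2013 reactive oxygen species"
b <- "Chemical carcinogenesis - reactive oxygen species"

a == b                  # FALSE
normalise_text(a) == b  # TRUE
```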
Answered By - Sef Answer Checked By - Robin (WPSolving Admin)