Sunday, September 18, 2022

[SOLVED] Using grep can't find exact match when string contains parentheses ()

September 18, 2022 grep, r

Issue

I have the following df

A                                                                   B
"Axon guidance"                                                     1
"Chemical carcinogenesis - reactive oxygen species"                 2
"Electron Transport Chain (OXPHOS system in mitochondria)"          3
"The citric acid (TCA) cycle and respiratory electron transport"    4

Using

 grep(paste0("^", df[3,1], "$"), df[,1]))

Gives integer(0), i.e. no match.

Using

 grep(paste0("^", df[2,1], "$"), df[,1]))

Finds the exact match (it returns an integer, the index of the row containing the match).

Why can't grep find an exact match when the string contains parentheses?

Solution

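The problem here is that round brackets are regex metacharacters: in a search pattern they define capture groups, so they are not matched as literal characters. The pattern built from row 3 therefore looks for the text without the brackets, which does not occur in the data.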
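
To see the effect in isolation, here is a minimal sketch using a made-up string (not taken from the data in the question). The brackets in the pattern form a group, so the pattern matches the text without brackets rather than the text with them.

```r
# Made-up example pattern; "(TCA)" is read as a group containing "TCA"
pattern <- "^acid (TCA) cycle$"

grepl(pattern, "acid (TCA) cycle")   # FALSE: the literal brackets are not expected
grepl(pattern, "acid TCA cycle")     # TRUE: the group matches plain "TCA"
```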

Two approaches you may wish to consider are:

1. Sanitise both the text being searched and the text used to build the search patterns by removing the relevant characters
2. Double escape the regex metacharacters in the search patterns

Generate Sample Data

df <- data.frame(A=c("Axon guidance", 
                     "Chemical carcinogenesis - reactive oxygen species", 
                     "Electron Transport Chain (OXPHOS system in mitochondria)",
                     "The citric acid (TCA) cycle and respiratory electron transport"),
                 B=1:4)

Demonstrate problem

grep(paste0("^", df[2,1], "$"), df[,1]) # returns 2 (the OP's version has an extra closing bracket)
grep(paste0("^", df[3,1], "$"), df[,1]) # returns integer(0): the brackets are read as a group

Option 1 - Sanitise the text and the search patterns

Here we sanitise both the text being searched and the patterns used to search it.

This example only strips round brackets, but regex has other metacharacters (. \ | [ ] { } ^ $ * + ?), and complex Unicode characters can also cause problems.

df$sanitised_text <- gsub("[()]", "", df$A) # remove every round bracket

Demonstrate Solution

grep(paste0("^", df[2, "sanitised_text"], "$"), df[, "sanitised_text"]) # returns 2
grep(paste0("^", df[3, "sanitised_text"], "$"), df[, "sanitised_text"]) # returns 3

Option 2 - Double escape the regex control characters

sanitise_search_patterns <- function(x){
  # "\\(" matches a literal "("; the replacement "\\\\(" writes it back as "\("
  y <- gsub("\\(", "\\\\(", x)
  # same for ")"
  gsub("\\)", "\\\\)", y)
}
 
df$sanitised_search_patterns <- sanitise_search_patterns(df$A)

Demonstrate Solution

grep(paste0("^", df[2, "sanitised_search_patterns"], "$"), df[, "A"]) # returns 2
grep(paste0("^", df[3, "sanitised_search_patterns"], "$"), df[, "A"]) # returns 3
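
The function above only handles round brackets. As a rough sketch, assuming you want every regex metacharacter in the search text treated literally, the same idea can be extended to escape them all in one pass. Note that escape_regex is a hypothetical helper name chosen for illustration, not a base R function, and the vector A below simply repeats three of the strings from the question.

```r
# Hypothetical helper: put a backslash in front of every regex metacharacter.
# perl = TRUE is used so the backslash escapes inside [...] behave predictably.
escape_regex <- function(x) {
  gsub("([.\\\\|()\\[\\]{}^$*+?])", "\\\\\\1", x, perl = TRUE)
}

A <- c("Axon guidance",
       "Electron Transport Chain (OXPHOS system in mitochondria)",
       "The citric acid (TCA) cycle and respiratory electron transport")

grep(paste0("^", escape_regex(A[2]), "$"), A)   # 2
grep(paste0("^", escape_regex(A[3]), "$"), A)   # 3
```

Strings without metacharacters pass through unchanged, so the helper can be applied to a whole column before the patterns are built.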

Either approach works here. However, characters that are not regex metacharacters can cause similar false negatives, for example the many Unicode variants of whitespace and hyphens, or characters composed of more than one code point. Sanitising the text being searched may therefore still be worth doing alongside double escaping.
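
As an illustration of the Unicode case, here is a minimal sketch that maps a few common look-alike characters to their ASCII equivalents before matching. The helper name normalise_text and the choice of code points are assumptions made for this example; real data may need a longer list.

```r
# Hypothetical helper: replace a few Unicode look-alikes with ASCII characters
normalise_text <- function(x) {
  # no-break space, figure space, narrow no-break space -> ordinary space
  x <- gsub("[\u00a0\u2007\u202f]", " ", x)
  # Unicode hyphen, non-breaking hyphen, figure dash, en dash, em dash -> "-"
  gsub("[\u2010\u2011\u2012\u2013\u2014]", "-", x)
}

a <- "carcinogenesis \u2013 reactive\u00a0oxygen"   # en dash and no-break space
b <- "carcinogenesis - reactive oxygen"             # plain ASCII

a == b                    # FALSE
normalise_text(a) == b    # TRUE
```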

Answered By - Sef

Answer Checked By - Robin (WPSolving Admin)

This answer was collected from Stack Overflow and tested by the PythonFixing community admins. It is licensed under CC BY-SA 2.5, CC BY-SA 3.0 and CC BY-SA 4.0.
