Friday, January 26, 2024

[SOLVED] Why bracket expressions [\w] and [\d] behave differently in GNU sed

Issue

I would like to know why using bracket expressions [\w] and [\d] in GNU sed produces different results:

$ echo '\weds4'|sed -E 's/[\w]/_/g'
__eds4
$ echo '\weds4'|sed -E 's/[\d]/_/g'
\we_s4
$ echo '\weds4'|sed -E 's/[\s]/_/g'
_wed_4

I expected that echo '\weds4'|sed -E 's/[\d]/_/g' would produce _we_s4 and not \we_s4

It is described here that it should match both \ and d, as I expect:

So in POSIX, the regular expression [\d] matches a \ or a d.

Why is it happening?

Demo here.

Side note: using BRE instead of ERE doesn't change anything.


Solution

From https://www.gnu.org/software/sed/manual/sed.html :

Regex syntax clashes (problems with backslashes)

[...]

In addition, this version of sed supports several escape characters (some of which are multi-character) to insert non-printable characters in scripts (\a, \c, \d, \o, \r, \t, \v, \x). These can cause similar problems with scripts written for other seds.

And

5.8 Escape Sequences - specifying special characters

[...]

\dxxx

Produces or matches a character whose decimal ASCII value is xxx.
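
As a minimal sketch of the documented form, assuming GNU sed is the sed on your PATH (the sample input aAb is made up for illustration): \d065 denotes the character with decimal value 65, which is A.

```shell
# GNU sed extension: \dxxx stands for the character with decimal value xxx.
# Decimal 65 is 'A', so only the A in the sample input is replaced.
out_decimal=$(printf '%s\n' 'aAb' | sed 's/\d065/X/')
printf '%s\n' "$out_decimal"   # prints aXb with GNU sed
```

Other sed implementations may not support \d at all, so this should not be relied on in portable scripts.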

Why is it happening?

When \d is written with nothing after it, the case at https://github.com/mirror/sed/blob/master/sed/compile.c#L1345 matches and calls convert_number() (https://github.com/mirror/sed/blob/master/sed/compile.c#L1356). If there is nothing left in the buffer to convert, convert_number() just assigns the character itself to the result value, *result = *buf (https://github.com/mirror/sed/blob/master/sed/compile.c#L275), instead of converting the digits after the d.

This happens for every case in that switch, so \d, \x and \o with nothing after them match d, x and o. I would count /\d/ as undefined behavior in GNU sed: \d has to be followed by 3 decimal digits. As far as I can tell, the GNU sed documentation does not specify what should happen when \d, \x, \c or \o is not followed by digits, or is followed by invalid characters.
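
The fallback described above can be observed directly. This is only a sketch, assuming GNU sed; the input a1d is a made-up sample, while the second command is the one from the question.

```shell
# Assumes GNU sed. A bare \d has no digits to convert, so it collapses to a plain d.
bare=$(printf '%s\n' 'a1d' | sed 's/\d/_/')
# Inside a bracket expression the same conversion happens first, leaving [d],
# which is why the backslash in the input is not replaced.
bracket=$(printf '%s\n' '\weds4' | sed -E 's/[\d]/_/g')
printf '%s\n' "$bare" "$bracket"
```

With the behavior described in this answer, the first line printed should be a1_ and the second \we_s4, the same output as in the question.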

Why do I need a second slash?

In POSIX sed (https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sed.html), I think all three of your commands are invalid / undefined behavior. POSIX sed does not specify what should happen on \d, \s or \w; these are invalid escape sequences, so you can't expect them to work. If you want to match \, you have to escape it as \\; see https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap05.html#tagtcjh_2 .
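
A minimal sketch of the doubled backslash, using the input from the question and assuming GNU sed: writing \\ inside the bracket expression means no \d escape is formed, so both \ and d are matched.

```shell
# Write the backslash twice inside the bracket expression so that
# GNU sed does not read \d as the start of a decimal escape.
out_escaped=$(printf '%s\n' '\weds4' | sed 's/[\\d]/_/g')
printf '%s\n' "$out_escaped"   # prints _we_s4 with GNU sed
```

printf is used instead of echo here only because some shells' echo interprets backslashes itself.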

But it would be nicer to get an error message from GNU sed, as happens in the case of \c.



Answered By - KamilCuk
Answer Checked By - Timothy Miller (WPSolving Admin)