Issue
SLES features a supportconfig
command that was found to be very slow when creating the "Updates". First examination showed that sed
is using "100% CPU" for minutes, so I suspected something rather inefficient as sed
is usually quite efficient.
Here is my isolated test case (the file was copied from a temporary file that supportconfig
would create):
# time sed -i -e ' s/\(.*[P|p]ass"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s/\(.*[P|p]assword"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s/\(.*[P|p]ass[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s/\(.*[P|p]assword[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s/\(.*PASS=\).*/\1*REMOVED BY SUPPORTCONFIG*/g; s/\(.*_PASSWORD[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s!\(<user_password>\).*\(</user_password>\)!\1*REMOVED BY SUPPORTCONFIG*\2!g; s/\(^ProxyUser[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s/\(^credentials[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s/\(secret[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s/\({'\''[s]*password'\''}[[:space:]]*=[[:space:]]*'\''\).*\('\'';\)/\1*REMOVED BY SUPPORTCONFIG*\2/g; s/\(.*password[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s/\(.*password_in[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g; s/\(^echo -n\).*\(> \/sys\/kernel\/config\/target\/.*auth\/password.*\)/\1 *REMOVED BY SUPPORTCONFIG* \2/g' /tmp/uuu
real 6m53.283s
user 6m45.909s
sys 0m0.129s
# wc /tmp/uuu
12234 123937 2711538 /tmp/uuu
So it took almost 7 minutes to process a 2.7MB text file. The only special thing may be that the text lines are rather long.
I suspect that the .*
is causing massive backtracking when matching, and maybe the programmer was just a bit lazy providing better regular expressions.
While sed
was running, I also did an strace
, but that basically just showed brk
system calls (memory allocation).
So what might cause the terrible performance, and is there a way to improve it?
I can't provide the original input file as it seems to contain URLs for non-free downloads, unfortunately.
But I could try proposals for a different sed
invocation on the file I saved.
sed
version being used is sed-4.2.2-7.3.1.x86_64
.
Transformation for readability
Fixing the [P|p]
to [Pp]
, a more readable version of the sed
command will look like this:
sed -i -e '
s/\(.*[Pp]ass"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*[Pp]assword"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*[Pp]ass[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*[Pp]assword[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*PASS=\).*/\1*REMOVED BY SUPPORTCONFIG*/g;
s/\(.*_PASSWORD[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s!\(<user_password>\).*\(</user_password>\)!\1*REMOVED BY SUPPORTCONFIG*\2!g;
s/\(^ProxyUser[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(^credentials[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(secret[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\({'\''[s]*password'\''}[[:space:]]*=[[:space:]]*'\''\).*\('\'';\)/\1*REMOVED BY SUPPORTCONFIG*\2/g;
s/\(.*password[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(.*password_in[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(^echo -n\).*\(> \/sys\/kernel\/config\/target\/.*auth\/password.*\)/\1 *REMOVED BY SUPPORTCONFIG* \2/g'
So obviously the purpose is to remove sensitive information like passwords from the log file.
Note: it seems the code block does parse *
as markup in some cases...
Trying a solution
Doing the modifications suggested in https://stackoverflow.com/a/77820459/6607497, I got these results:
### original version of regex
# /tmp/uuu.sh
real 6m39.697s
user 6m39.092s
sys 0m0.032s
### modified version of regex
# /tmp/uuu.sh
real 0m0.237s
user 0m0.226s
sys 0m0.009s
When reducing the original input file to a smaller test case (about 9kB), the effect is still there, but less dramatic:
### original regexes
# /tmp/uuu.sh /tmp/eeee
real 0m0.119s
user 0m0.115s
sys 0m0.004s
### improved regexes
# /tmp/uuu.sh /tmp/eeee
real 0m0.007s
user 0m0.007s
sys 0m0.000s
Here is the obfuscated and trimmed test input file as BASE64 (due to very long lines); pipe input into base64 -d | gzip -d
to decode:
# gzip < /tmp/eeee | base64
H4sIAH/cp2UAA9VZbVPbuBb+vP4VZ9yZ7m6JnThQyk0J99JAu2x5yRDKbKfTySi2kmiQLY8kA7nL
7m+/R/ILSYAUmnRv6zKNrJfjo+ccPedIetZuf4KOiGOSRPC5/ajnmfMM6pmS9QFL6v+dpCmV4HmJ
SDyWaCpJqNkltTXeKB154ZiGFwpSKaIs1Mo5pUNJ1ZglI1BUXrKQws+9D739/iFLsuv+vpGRSqZo
v4fNVPaHQvZ7u93+bppyFhLNRKL6QbPf677sX29t9jc3fvadQ0EiI1LSVCimhZxARDTxfR8/mDex
RGnCOY0gJeEFGVFlWnsAN3B6O6p8buDAaJIQDsckpkVdVXzqcwPnVCpUHUu7MhznX1DwhijqeJ63
5t3z3Fd5b8fHPWt3S1XBYWuo0H96E6VpPKu44tSL0Xj4w+mIhBOsO8wLR7b6aTgETa+Rl3LjGVSF
bXNsbe9w3/ttF3uhfb2uELzUYUxmBRmnAes0cOs08BsbjWH3kjBOBowzPYH9a00TC33QBBRqdfBf
Gi3u0cHigDoUCqDj5UrYyp5xxC/rkDsuoOOC6T/tuAt1+EjVDA69GRTKynlAF+pQfO5BWzyIg/Ps
H6MGz7uOucj0LUds/3soxIBIZ1tpSUm842zHVClcs6AnKW27LBkKd+ceKnlOcCW/XpJPciH+dr34
6AOff5hzvjRyISVNDy4Q8ThT+vYVEqShtpsvChcuc2Zpu8acLqrDKbJK2224gJqFY1si0hTy+Zkh
SSRk281957ADz7l+PdY6Va16/erqyleZon4o4vrzkX7tgsrQD+SkHPD0RedalPLxS1imdYcd3NJr
OEtwzuY7IwSLygoG1H4spM4xw/HlcBeGnFwaFLonJ4cuMDWw/YeEK+re2qas2dk2oA05G6IxWUz7
GhF/td5YX29uNRBhTa+xptlobnhBw1sPzhpBq9HAv7VG4NZ3tkvF8q/u0BKh7fpcy3ZEVShZaua/
82TA0X+mx2/XC3zuuE/JZ9+DA32ZMVfoQHPU/nQP8m4l/H+dqH6frxwcf/jj4UhwH8RP8ZjvyFtW
7hi9pXiltzpaqRyh8UibzyMkhkO0EhBATFNJx4YaLqlTRn6VMW06VREXBhnjGpCuCSiMTJjUhSKi
YFT3q2FnYwopJxrRjIFEEQZfRc1YHIJFSCiNFAyliKshGoeo2ExYaXxhiRdyRhMNETXxWoEWts+V
kDzK466qBscCB6XiCqeecRgj53mYUZivkySkdm6ZRm2rASZViQnGWGnydGVtr/z7M6RqUIEVSotx
/igaQ29sVNToBsrK1DQcJ4KL0aQaFVKp2bAiKRJKgQjoKYRqZmjVnxJM+svlxBTQShFEhCj1z9P2
Ckm52DfMrRnMT9TXsune+d7t0tEym105tuKHptI7m6oZcy5lzJm92RctNK/FYnPlsr1S9mOZ7kF7
LSK6nVyzxemRCSSpCRrOLUXNbk7HlKcKJiKDmI0kMXwBZNqVDV9NUUQhaMrqQcOSQBDYH4FrXIKy
cFr6un/xBM0aDEq6QNAvmc34yzwfrsYMCYFIChFToUhQrwz3AchAi9gqd8AaGtsMVi3kuCRCxuM1
oxESlJeManDw5gh+J5dk01IQUnqSxQNUWgwLMZwNJJEM1fjFODG9JnGKUIkU44TiXsP/l7/1q+/8
ZNCt5/D+NI/0bhgazi/4exZz5DeWhDyLcEYsMeDLR0wLZzWoLOtbU8aFDYmJZREzTG2Y2fhQOAmx
RY9JstB4hWimFeVDH7ol/qiW0RtDkiq/YoyRqQx9trSbytIUfR9nYWAiOo9IeiwphQklNrbkXXIy
KIxhUS+NAJuFsCvGOdK+RaRHU3QfY5NmI3hlNcGogTFyBvSHyaQ+tzGsl/vkfNt+TmUVmx6xdzfb
9tPuEe79NRyUq7YFXA2qY4BkyEaZzAW+ZQjWIqn2GIDq0B4B2P8wt0uGzicTmz87ccbNOUB+HNYu
FgdVren6Xy6oTCj/daazn1cqHJSjVct/vKAmsyQxmcAqFA4ljdDLGNKYH9WPO52pCgcTVWm5EF6c
7h+dnO/vwZuP6IHd7snpWefk+O3BuxdOigEdM5poYaevt5VMY/DOrdvjRBOKftDw1/31TW/dD7b8
DT+PEE7Pf+njc+Yb97PnMrjw6zl1RcVv3aTeYSHlgtKUcITbNwQt8VMz2vU00RmSzl6xEE3+9lZk
mOWs0u1kNpg0/cAzv5iJeUZBr9BwyWMpZc6lpmADz1N2Ts6nP11mbTxkVLqt2yyr5hbOh5U2z6q5
JgrjWxGGa24uAmtOi0BFI6w1TmIEPTW1qpmzOG9zowaBh3mGCC8o5udXTI/hQ8JZzAwdnTOpkajg
CLNKk3vXoCuZkOYgoDdFojVkaVw9uUHe0wmqhcHUZPZG/5knn4fUqk80NhpaMolZIwCbM+EffDjr
YC96nTJM/MtuzZdeYwuzq/lu01zerxDa7ZwdnO9jszkNw3fM67n7V20O/LtpyawVFtnAONOUHf76
vCQjTHmLY8/JE1wumaQtsGmOk0negnIrq8Kw2sp+AyrqfSsqyqmgr8UFTR7DWauc03Lb9tXz8Tez
19/ft65fZ4f1H8IOXze3jR9hbt86PVnmBqjYAZI8BuAmwCQMnsIuN7DLGVH330s96lniLraSsJ+Q
gbl8uYF33XfQMbdR9jLY3iph6QyDFDhL3bneuXpdXsLaolonKG4El+DUlUgo7jKxZNQqEb+tRYEA
TrO48J06ki/vJJ9gx+UlFLeec9qW5RuYEKWbK81yTZpQ5ikrWGo0uUQ1R5Km4DGzpbqeVO+X0Dna
Ozw43neOT/rd05M/Pra5CAkf42a2FjRf+Q38F9Re/P7h+P0Lx+QyfSugbYqY1eQNrfWguVV3hlWr
0zvpvO8VAh2FiaoqW0YiHaOjFG82O1ogMhFF4wKt/gfLVc0SqSIAAA==
Another Experiment
Inspired by https://stackoverflow.com/a/77833314/6607497, I tried to find out the actual timing differences, so I created a file with one million random printable ASCII characters (assuming it will be large enough to show a difference).
Not knowing what's actually in the file, I tried the "foo variants" using grep
(assuming grep
would show similar effects):
> time grep -q 'foo' /dev/shm/random.txt
real 0m0.005s
user 0m0.004s
sys 0m0.001s
> time grep -q '.*foo' /dev/shm/random.txt
real 0m0.005s
user 0m0.000s
sys 0m0.006s
> time grep -q 'foo.*' /dev/shm/random.txt
real 0m0.005s
user 0m0.004s
sys 0m0.000s
> time grep -q '.*foo.*' /dev/shm/random.txt
real 0m0.005s
user 0m0.004s
sys 0m0.001s
Surprisingly there was no difference (and "foo" is not present in the file).
However there are matches for fo.
(position:match) like this:
763:foH
3440:foY
13789:fo;
...
958405:foE
965938:foF
967142:foD
The complete list is:
> grep -o fo. /dev/shm/random.txt | tr '\n' ' '
foH foY fo; fof fom foa foj foV fo9 foy fo2 foP fo9 fo4 foC for fox foG foh foa foj fo} fod foq fo} foR fow fok foy foh fo> foy foK fo` foW fo2 foJ foL fo+ for fo~ foP fo` foq fo[ foi fos fo~ fo: fo. foy foD fo@ fo= fo+ foT foO fo] fok fo^ foa foh fod foA fo8 foJ foX fo` foN fo3 foS fo* fog foK fo6 fof fo4 fog foE fo* foO fo] foB fo< fo# fo" fow fo8 fo> foA fo+ foO fog foj foD fo: for fo1 fo. fo' foi foO foW foE foF foD >
So I tried to use sed
actually:
> time sed -n -e 's/foo/XXX/p' /dev/shm/random.txt
real 0m0.005s
user 0m0.001s
sys 0m0.004s
> time sed -n -e 's/.*foo/XXX/p' /dev/shm/random.txt
real 1m20.032s
user 1m19.818s
sys 0m0.016s
> time sed -n -e 's/foo.*/XXX/p' /dev/shm/random.txt
real 0m0.006s
user 0m0.000s
sys 0m0.006s
> time sed -n -e 's/.*foo.*/XXX/p' /dev/shm/random.txt
real 1m19.490s
user 1m19.425s
sys 0m0.017s
Hunting for the constant string showed the same performance as grep
had, but prefixing .*
showed a performance of almost 80000 times worse!
In contrast suffixing .*
had almost no effect, and using both, prefix and suffix .*
is basically the same as just using the prefix variant.
Versions being used:
grep-2.16-4.6.1.x86_64
sed-4.2.2-7.3.1.x86_64
Not confusing enough?
Then I had the idea to quantify the performance hit of using prefix .*
, so I decided to truncate the random file to just using the first 1000 characters:
> head -c1000 /dev/shm/random.txt >/dev/shm/random-1000.txt
> time sed -n -e 's/foo/XXX/p' /dev/shm/random-1000.txt
real 0m0.001s
user 0m0.001s
sys 0m0.000s
> time sed -n -e 's/.*foo/XXX/p' /dev/shm/random-1000.txt
real 0m0.001s
user 0m0.001s
sys 0m0.000s
> time sed -n -e 's/foo.*/XXX/p' /dev/shm/random-1000.txt
real 0m0.001s
user 0m0.001s
sys 0m0.000s
> time sed -n -e 's/.*foo.*/XXX/p' /dev/shm/random-1000.txt
real 0m0.003s
user 0m0.003s
sys 0m0.000s
To my big surprise, there was no noticeable difference now!?
I was thinking:
- Is it because of the missing end-of-line character?
- Does
sed
have some internal limited buffer that allows fast matching, but when exceeding that matching will be very slow?
So I repeated the tests after having added an end-of-line character: Still it was about the same fast performance!
Then I re-did the experiments with sizes of 10000 and 100000 characters:
For 10000 characters the .*
prefix was only 9 times slower than the 1000 character variant, but the 100000 characters was about 90 times slower than as the 10000 character variant.
So my conclusion is that sed
has some fixed internal "fast buffer".
Is that correct?
Solution
As suspected, those .*
matches were very bad for performance:
After removing those, the performance did improve enormously (like factor 1500):
real 0m0.230s
user 0m0.225s
sys 0m0.005s
For reference this is the canonical command used (I also dropped -i
to preserve the input file for repeatable tests):
time sed -e '
s/\([Pp]ass"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\([Pp]assword"\?:\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\([Pp]ass[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\([Pp]assword[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(PASS=\).*/\1*REMOVED BY SUPPORTCONFIG*/g;
s/\(_PASSWORD[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s!\(<user_password>\).*\(</user_password>\)!\1*REMOVED BY SUPPORTCONFIG*\2!g;
s/\(^ProxyUser[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(^credentials[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(secret[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\({'\''[s]*password'\''}[[:space:]]*=[[:space:]]*'\''\).*\('\'';\)/\1*REMOVED BY SUPPORTCONFIG*\2/g;
s/\(password[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(password_in[[:space:]]*=\).*/\1 *REMOVED BY SUPPORTCONFIG*/g;
s/\(^echo -n\).*\(> \/sys\/kernel\/config\/target\/.*auth\/password.*\)/\1 *REMOVED BY SUPPORTCONFIG* \2/g
' "${1:-/tmp/uuu}" >"${2:-/dev/null}"
Answered By - U. Windl Answer Checked By - Clifford M. (WPSolving Volunteer)