Issue
I have a file that looks like this:
>alks|keep1|aoiuor|lskdjf
ldkfj
alksj
asdflkj
>jhoj_kl|keep2|kjghoij|adfjl
aldskj
alskj
alsdkj
I would like to edit just the lines starting with >, ideally in-place, to get a file:
>keep1
ldkfj
alksj
asdflkj
>keep2
aldskj
alskj
alsdkj
I know that in principle this is achievable with various combinations of sed/awk/cut, but I haven't been able to figure out the right combination. Ideally it should be fast - the file has many millions of lines, and many of the lines are also very long.
Key things about the lines I want to edit:
- Always start with >
- The bit I want to keep is always between the first and second pipe symbol | (hence thinking cut is going to help)
- The bit I want to keep has alphanumeric symbols and sometimes underscores. The rest of the string on the same line can have any symbols
What I've tried that seems helpful
(Most of my sed attempts are pure garbage)
cut -d '|' -f 2 test.txt
Gets me the bit of the string that I want, and it keeps the other lines too. So it's close, but (of course) it doesn't preserve the initial > on the lines where cut applies, so it's missing a crucial part of the solution.
Solution
With sed:
sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/'
- /^>/ to select lines starting with >, not strictly necessary for given sample but sometimes this provides faster result than using s alone
- ^[^|]+\| this will match non-| characters from the start of line, up to and including the first |
- ([^|]+) capture the second field
- .* rest of the line
- >\1 replacement string where \1 will have the contents of ([^|]+)
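The sed command above prints to standard output, while the question asks for an in-place edit. A minimal sketch of how it could be applied to a file follows; the temporary directory and the sample contents are only there to make the example reproducible, and the -i variant in the comment assumes GNU sed.

```shell
# Recreate the sample from the question in a scratch directory
# (the directory and file path are illustrative, not from the original answer).
workdir=$(mktemp -d)
file="$workdir/test.txt"
printf '%s\n' \
  '>alks|keep1|aoiuor|lskdjf' 'ldkfj' 'alksj' 'asdflkj' \
  '>jhoj_kl|keep2|kjghoij|adfjl' 'aldskj' 'alskj' 'alsdkj' > "$file"

# Portable approach: write the result to a temporary file, then replace the original.
sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/' "$file" > "$file.tmp" && mv "$file.tmp" "$file"

# With GNU sed, the same edit can be done directly in place:
# sed -i -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/' "$file"

cat "$file"
```

Writing to a temporary file first also means the original is only overwritten if sed exits successfully, which may matter for a file with many millions of lines.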
If your input has only ASCII characters, this would give you much faster results:
LC_ALL=C sed -E '/^>/ s/^[^|]+\|([^|]+).*/>\1/'
Timing
- Checking the timing results by creating a huge file from given input sample, awk is much faster and mawk is even faster
- However, OP reports that the sed solution is faster for the actual data
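The awk command used for the timing comparison is not shown here. The following is one possible awk equivalent, offered as an assumption rather than as the command that was actually timed: it splits on | and, for lines starting with >, prints > followed by the second field.

```shell
# Sample input from the question, written to a scratch file (path is illustrative).
sample=$(mktemp)
printf '%s\n' \
  '>alks|keep1|aoiuor|lskdjf' 'ldkfj' 'alksj' 'asdflkj' \
  '>jhoj_kl|keep2|kjghoij|adfjl' 'aldskj' 'alskj' 'alsdkj' > "$sample"

# -F'|' sets the field separator; header lines are rewritten, all others pass through.
result=$(awk -F'|' '/^>/ { print ">" $2; next } 1' "$sample")
printf '%s\n' "$result"

# If mawk is installed, the same program can be run with it for comparison:
# mawk -F'|' '/^>/ { print ">" $2; next } 1' "$sample"
```

To compare speeds on real data, each command can be prefixed with time and its output redirected to /dev/null; as noted above, which tool is faster may depend on the data.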
Answered By - Sundeep