Issue
I have a file containing the following text:
>seq1
GAAAT
>seq2
CATCTCGGGA
>seq3
GAC
>seq4
ATTCCGTGCC
If a line that doesn't start with ">" is no longer than 5 characters, I want to delete it and the line right above it.
Expected output:
>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC
I have tried sed -r '/^.{,5}$/d', but it also deletes the lines with ">".
Solution
With GNU sed, you can use
sed -E '/>/N;/\n[^>].{0,4}$/d'
Details:
/>/ - finds lines containing > (if it must be at the start of the line, add ^ before >)
N - reads the next line and appends it to the pattern space, separated by a newline
\n[^>].{0,4}$ - a newline, a char other than > (as the first char should not be >) and then zero to four more chars till the end of the string, i.e. a second line of one to five chars
d - deletes the pattern space, so neither of the two lines is printed.
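To see what the second regex is matched against, it can help to print the pattern space right after N. The snippet below is only an illustrative sketch: it uses sed's l command, which prints the pattern space in an unambiguous form, and the variable name joined is made up for this example.

```shell
#!/bin/bash
# After N, the header and the sequence sit in one pattern space,
# joined by a newline. The "l" command prints the pattern space
# unambiguously: "\n" marks the embedded newline, "$" marks the end.
joined=$(printf '>seq1\nGAAAT\n' | sed -n '/^>/N;l')
printf '%s\n' "$joined"   # prints: >seq1\nGAAAT$
```

The \n[^>].{0,4}$ part of the command is tested against exactly this two-line string, which is why d can drop the header together with the short sequence.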
See the online demo:
#!/bin/bash
s='>seq1
GAAAT
>seq2
CATCTCGGGA
>seq3
GAC
>seq4
ATTCCGTGCC'
sed -E '/>/N;/\n[^>].{0,4}$/d' <<< "$s"
Output:
>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC
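If the length limit needs to vary, the same command can be wrapped in a small shell function. This is a minimal sketch, not part of the original answer: the function name drop_short_records and its single argument are hypothetical, and it assumes every header is followed by exactly one sequence line.

```shell
#!/bin/bash
# Hypothetical helper (name and argument are illustrative only):
# reads records on stdin and drops every header/sequence pair whose
# sequence line has between 1 and MAXLEN characters.
drop_short_records() {
  local max=$1
  # One non-">" char plus {0,max-1} more chars = 1..max chars in total.
  # Inside double quotes, \n stays as \n for sed and \$ becomes $.
  sed -E "/^>/N;/\n[^>].{0,$((max - 1))}\$/d"
}

printf '>seq1\nGAAAT\n>seq2\nCATCTCGGGA\n>seq3\nGAC\n' | drop_short_records 5
```

With a limit of 5 this prints only the >seq2 record. Note that an empty sequence line is not matched by this pattern, because [^>] requires at least one character.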
Answered By - Wiktor Stribiżew
Answer Checked By - Katrina (WPSolving Volunteer)