Friday, July 29, 2022

[SOLVED] Delete lines shorter than a certain length and the one above it (remove short sequences in a FASTA file)

Issue

I have a file containing the following text:

>seq1
GAAAT
>seq2
CATCTCGGGA
>seq3
GAC
>seq4
ATTCCGTGCC

If a line that doesn't start with ">" is 5 characters or shorter, I want to delete it and the line right above it.

Expected output:

>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC

I have tried sed -r '/^.{,5}$/d', but it also deletes the lines with ">".


Solution

With GNU sed, you can use

sed -E '/>/N;/\n[^>].{0,4}$/d'

Details:

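  • />/ - matches lines that contain > (to require it at the start of the line, use /^>/)
  • N - reads the next line and appends it to the pattern space, separated from the current line by a newline
  • \n[^>].{0,4}$ - a newline, one char other than > (the second line must not be a header) and then zero to four more chars up to the end of the string, so the second line is one to five chars long
  • d - deletes the pattern space (both the header and the short line) and starts the next cycle without printing anything.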
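
If the length limit should not be hard-coded, the same command can be wrapped in a small function. This is only a sketch: the name drop_short and its MAX argument are hypothetical and not part of the original answer, and it anchors the header match with ^ as suggested above.

```shell
#!/bin/bash
# drop_short MAX: hypothetical wrapper around the sed command above.
# Reads FASTA text on stdin and drops every header/sequence pair whose
# sequence line has at most MAX characters (MAX must be >= 1).
drop_short() {
  local max=$1
  # Double quotes let the shell expand $((max - 1)); \n stays literal for
  # sed, and \$ becomes the end-of-string anchor $.
  sed -E "/^>/N;/\n[^>].{0,$((max - 1))}\$/d"
}

# Usage: with MAX=3 only the 3-char sequence is removed.
printf '>seq1\nGAAAT\n>seq3\nGAC\n' | drop_short 3
# prints:
# >seq1
# GAAAT
```

Calling drop_short 5 reproduces the original command exactly.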

See the online demo:

#!/bin/bash
s='>seq1
GAAAT
>seq2
CATCTCGGGA
>seq3
GAC
>seq4
ATTCCGTGCC'
sed -E '/>/N;/\n[^>].{0,4}$/d' <<< "$s"

Output:

>seq2
CATCTCGGGA
>seq4
ATTCCGTGCC
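
For comparison, the same filtering can be sketched with awk instead of sed. This is an alternative technique, not the method from the answer; the function name drop_short_awk is hypothetical, and the sketch assumes every header is followed by exactly one sequence line, as in the sample input.

```shell
#!/bin/bash
# drop_short_awk MAX: hypothetical awk alternative to the sed command.
# Assumes each ">" header is followed by exactly one sequence line.
drop_short_awk() {
  awk -v max="$1" '
    # Header line: remember it, but print nothing yet.
    /^>/ { header = $0; next }
    # Sequence line longer than max: print the stored header, then the line.
    length($0) > max { print header; print $0 }
  '
}

# Usage: sequences of 5 characters or fewer are dropped with their header.
printf '>seq1\nGAAAT\n>seq2\nCATCTCGGGA\n' | drop_short_awk 5
# prints:
# >seq2
# CATCTCGGGA
```

The awk version makes the length test explicit (length($0) > max), which may be easier to adjust than the regex quantifier.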


Answered By - Wiktor Stribiżew
Answer Checked By - Katrina (WPSolving Volunteer)