Friday, March 25, 2022

[SOLVED] How to find line index then rewrite it in bash

March 25, 2022 awk, bash, grep, linux

Issue

Hello I have a simple problem I need to find specific lines in txt file they have to contain 'LG' which look like this:

>NC_037638.1 Apis mellifera strain DH4 linkage group LG1, Amel_HAv3.1, whole genome shotgun sequence

then I need to replace number in this case NC_037638.1 with LG1 The LG and number will differ in each line

the result should look like this:

>LG1, Apis mellifera strain DH4 linkage group LG1, Amel_HAv3.1, whole genome shotgun sequence

I have like 3 mil of lines in a file and I need to find only those with LG followed by some number like in the example LG1

So basically i need to get from this:

To this:

I wrote something like this:

#!/bin/bash
while IFS= read -r line; do
    if [[ $line =~ "LG" ]]; then
        echo $line | awk ' { t = $1; $1 = $8; print; } '  | sed -e 's/^/>/' >> nowy.txt
    else
        echo $line >> nowy.txt
    fi
done < kopia_pliku_docelowego

and it works but its ultra slow it takes like 3 minutes for the script to end

I thought out about solution and i figured i can grep for line index and change only those lines then swap old lines on the same index as new rewritten one.

I know how to find index (grep -n) and i know how to change the line (talking about swaping number with LG) but I don't know how to put it all together.

I would really appreciate some help

Solution

I don't really understand the problem description. It sounds like you just want to replace the first column with the 8th column in any line that contains LG. If that's the case, just do:

awk '/LG/{ $1 = $8 }1' kopia_pliku_docelowego > nowy.txt

but perhaps you want to restrict the match so that you only do the replacements when 'LG' appears in the 8th column. You could do that with:

awk '$8 ~ /LG/{ $1 = $8 }1'

If you require that LG be followed by a string of digits, use:

awk '$8 ~ /LG[0-9]+/{ $1 = $8 }1'

If you have lines in which the 8th column is LGxxxAAA (non string values following the digits) and you only want to replace the first column with that portion of the string that matches LG[0-9+], you could use:

awk 'match($8,/LG[0-9]+/){ $1 = substr($8,0,RLENGTH) }1'

awk can undoubtedly solve your problem, but you need to clarify exactly what you're trying to match. Your sed solution seems to be inserting a leading > which does not seem necessary according to your description. More specificity is required.

Answered By - William Pursell

Answer Checked By - Robin (WPSolving Admin)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Friday, March 25, 2022

[SOLVED] How to find line index then rewrite it in bash

Issue

Solution

Popular Posts

Labels