Tuesday, July 26, 2022

[SOLVED] Removing everything after the second "_" but keeping other columns

Issue

I'm trying to format the family IDs on a fam file whose sample and family IDs are the same, and coded in the following way:

Continent_Breed_Ind-ID

The idea would be to transform column 1 into something that only contains continent+breed, but keeping the other columns.

Mock dataset:

Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2_Ind-ID2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9

Desired outcome:

Continent1_Breed1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1 Continent2_Breed1_Ind-ID1 0 0 0 -9

I have tried using sed as follows:

sed -r 's/_[^_]*//2g' file.fam

But that only gives me the first column.

Any ideas?


Solution

You may use this simple sed command:

sed 's/_[^_]* / /' file

Continent1_Breed1 Continent1_Breed1_Ind-ID1 0 0 0 -9
Continent1_Breed2 Continent1_Breed2_Ind-ID1 0 0 0 -0
Continent2_Breed1 Continent2_Breed1_Ind-ID1 0 0 0 -9

Here:

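  • "_[^_]* " (note the trailing space): Matches _ followed by 0 or more non-_ characters followed by a space. The first _ in column 1 cannot start a match, because "_Breed1" is followed by another _ rather than a space, so the first match on each line is "_Ind-ID1 " at the end of column 1
  • We replace this match with a single space to restore the separator between the first and second columns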
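
The substitution above finds the end of column 1 through the space that follows it. As a sketch of a more explicit variant (not part of the original answer), the pattern can be anchored at the start of the line so that only column 1 is ever considered. The function name trim_family_id is hypothetical, and -E assumes a sed that supports extended regular expressions (GNU and BSD sed both do):

```shell
# Hypothetical helper: drops the third "_"-separated part of column 1 only.
trim_family_id() {
  # ^([^_ ]*_[^_ ]*) captures "Continent_Breed" at the start of the line;
  # _[^ ]* then consumes "_Ind-ID" up to (not including) the first space.
  sed -E 's/^([^_ ]*_[^_ ]*)_[^ ]*/\1/'
}

# Mock dataset from the question, fed on stdin.
result=$(printf '%s\n' \
  'Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9' \
  'Continent1_Breed2_Ind-ID2 Continent1_Breed2_Ind-ID1 0 0 0 -0' \
  'Continent2_Breed1_Ind-ID1 Continent2_Breed1_Ind-ID1 0 0 0 -9' |
  trim_family_id)
printf '%s\n' "$result"
```

Because the match is anchored with ^, this variant should not depend on the second column containing an underscore, which the unanchored pattern implicitly relies on.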

PS: Note that no global (g) flag is used here, so only the first match on each line is replaced and the remaining columns are left untouched.
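
To see why the missing g flag matters, the following small comparison runs the same substitution with and without it on the first line of the mock dataset. The variable names are only illustrative:

```shell
line='Continent1_Breed1_Ind-ID1 Continent1_Breed1_Ind-ID1 0 0 0 -9'

# No g flag: only the first match ("_Ind-ID1 " in column 1) is replaced.
once=$(printf '%s\n' "$line" | sed 's/_[^_]* / /')

# With g: later matches are replaced too, which damages the other columns.
global=$(printf '%s\n' "$line" | sed 's/_[^_]* / /g')

printf 'once:   %s\nglobal: %s\n' "$once" "$global"
```

With g, the second match starts at the last _ of column 2 and greedily extends up to the final space on the line, so the line collapses to "Continent1_Breed1 Continent1_Breed1 -9".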



Answered By - anubhava
Answer Checked By - Pedro (WPSolving Volunteer)