Issue
I have a very large tab separated file formated like below (with roughly 300 columns and 1.2 million rows). The columns are all formated the same after column 4 (just with different numbers, but are all formated with #/#
or ./.
)
The headings seem to be to long resulting in a bit of shifted look
CHROM POS REF ALT LarsenP1_1_PI173627.sort LarsenP1_1_PI173630.sort LarsenP1_1_PI204384.sort LarsenP1_1_PI204386.sort LarsenP1_1_PI204550.sort
1J_LG1 393466 A [C] ./. 1/2 1/1 0/0 0/1
1J_LG1 393472 G [A, C] 5/6 ./. ./. 1/0 0/0
1J_LG1 393475 G [A] ./. ./. ./. ./. ./.
1J_LG1 393482 T [C] 0/0 1/1 1/2 2/2 ./.
1J_LG1 393762 A [C] 0/0 ./. ./. 2/2 ./.
I am trying to remove all rows with the values ./.
in 60% or more of every column after column 4 (not including column 4, so column 5 and onward). Essentially in the example above rows 3 and 5 would be removed. It should produce an output as seen below.
CHROM POS REF ALT LarsenP1_1_PI173627.sort LarsenP1_1_PI173630.sort LarsenP1_1_PI204384.sort LarsenP1_1_PI204386.sort LarsenP1_1_PI204550.sort
1J_LG1 393466 A [C] ./. 1/2 1/1 0/0 0/1
1J_LG1 393472 G [A, C] 5/6 ./. ./. 1/0 0/0
1J_LG1 393482 T [C] 0/0 ./. ./. 2/2 ./.
I am very new to programming and bioinformatics and would really appreciate the help. From doing some research I think that I should be using something like awk
or sed
, but I'm not sure.
Also I'm sorry if the formatting isn't great.
Solution
Since ./.
and 0/0
only ever appear in the fields after #4 and are always a full field, all you need is:
$ awk -F'\t' 'gsub(/\.\/\./,"&") < ((NF-4)*.6)' file
CHROM POS REF ALT LarsenP1_1_PI173627.sort LarsenP1_1_PI173630.sort LarsenP1_1_PI204384.sort LarsenP1_1_PI204386.sort LarsenP1_1_PI204550.sort
1J_LG1 393466 A [C] ./. 1/2 1/1 0/0 0/1
1J_LG1 393472 G [A, C] 5/6 ./. ./. 1/0 0/0
1J_LG1 393482 T [C] 0/0 1/1 1/2 2/2 ./.
$ awk -F'\t' 'gsub(/\.\/\.|0\/0/,"&") < ((NF-4)*.6)' file
CHROM POS REF ALT LarsenP1_1_PI173627.sort LarsenP1_1_PI173630.sort LarsenP1_1_PI204384.sort LarsenP1_1_PI204386.sort LarsenP1_1_PI204550.sort
1J_LG1 393466 A [C] ./. 1/2 1/1 0/0 0/1
1J_LG1 393482 T [C] 0/0 1/1 1/2 2/2 ./.
Answered By - Ed Morton