Issue
I am struggling to grasp the sed command.
I am working with gene annotation files. Specifically, I am converting GFF3 files to the GTF format required by cellranger-arc mkref. Neither gffread nor agat converts GFF3 files from NCBI perfectly, and the GTF file produced by agat does not contain a 'transcript_id' attribute.
GTF is a tab-delimited format whose final column holds the attributes, separated by semicolons. My agat-generated GTF file currently has 'locus_tag' attributes, which I want to replace with 'transcript_id', with the required quotation marks around the transcript name. For example, I want
... ; locus_tag AbcdE_f1 ; ...
to be replaced with
... ; transcript_id "AbcdE_f1" ; ...
I have tried
sed -i.bak "s/locus_tag\([0-9a-zA-Z ,._-]{1,}\);/transcript_id \"1\";/g" myFile.gtf
but it does nothing.
Thanks for any help.
As requested, here are two lines of typical input:
ChrPT RefSeq exon 956 981 . + . Dbxref "GeneID:38831453" ; ID "nbis-exon-1" ; Parent PhpapaC_p1 ; gbkey exon ; gene "3' rps12" ; locus_tag PhpapaC_p1 ; product "ribosomal protein S12"
ChrPT RefSeq gene 1033 1500 . + . Dbxref "GeneID:2546745" ; ID "nbis-gene-17" ; Name rps7 ; gbkey Gene ; gene rps7 ; gene_biotype protein_coding ; locus_tag PhpapaCp002
Desired output:
ChrPT RefSeq exon 956 981 . + . Dbxref "GeneID:38831453" ; ID "nbis-exon-1" ; Parent PhpapaC_p1 ; gbkey exon ; gene "3' rps12" ; transcript_id "PhpapaC_p1" ; product "ribosomal protein S12"
ChrPT RefSeq gene 1033 1500 . + . Dbxref "GeneID:2546745" ; ID "nbis-gene-17" ; Name rps7 ; gbkey Gene ; gene rps7 ; gene_biotype protein_coding ; transcript_id "PhpapaCp002"
Solution
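Using GNU sed, capture the value after locus_tag (everything up to the next space or tab) and write it back inside double quotes with \1: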
$ sed -E 's/\<locus_tag\>[ \t]([^ \t]*)/transcript_id "\1"/' input_file
ChrPT RefSeq exon 956 981 . + . Dbxref "GeneID:38831453" ; ID "nbis-exon-1" ; Parent PhpapaC_p1 ; gbkey exon ; gene "3' rps12" ; transcript_id "PhpapaC_p1" ; product "ribosomal protein S12"
ChrPT RefSeq gene 1033 1500 . + . Dbxref "GeneID:2546745" ; ID "nbis-gene-17" ; Name rps7 ; gbkey Gene ; gene rps7 ; gene_biotype protein_coding ; transcript_id "PhpapaCp002"
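For reference, the attempt in the question most likely fails for two reasons. Without -E, sed uses basic regular expressions, where {1,} is matched as literal characters rather than as a quantifier, so the pattern never matches. In the replacement, a bare 1 is just the digit 1; the captured text is \1. The pattern also requires a trailing semicolon, which the second sample line does not have. The following is a minimal sketch of the original command with those points repaired, run in place with a .bak backup. The scratch directory, the file name sample.gtf and the shortened attribute lines are assumptions made only for this demonstration.

```shell
# Work in a throwaway directory so no real annotation file is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Two shortened attribute fragments modelled on the sample input
# (sample.gtf is a hypothetical file name used only for this demo).
cat > sample.gtf <<'EOF'
gbkey exon ; locus_tag PhpapaC_p1 ; product "ribosomal protein S12"
gene_biotype protein_coding ; locus_tag PhpapaCp002
EOF

# -E      : extended regex, so + (or {1,}) works as a quantifier
# -i.bak  : edit in place, keeping the original as sample.gtf.bak
# \1      : the captured tag value, re-emitted inside double quotes
# No trailing ';' is required, so a tag at the end of a line is also handled.
sed -E -i.bak 's/locus_tag ([0-9a-zA-Z,._-]+)/transcript_id "\1"/g' sample.gtf

cat sample.gtf
```

Single quotes around the sed script avoid having to escape the double quotes in the replacement. It is worth checking the result against the .bak copy before deleting it.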
Answered By - HatLess Answer Checked By - Robin (WPSolving Admin)