Wednesday, October 20, 2021

[SOLVED] Finding a numbered file by range that falls within it - pattern matching - from command line

October 20, 2021 bash, command-line, linux

Issue

Folder with chunked files (>1000 files) with format:

text_chr[A]_[numberB]_[numberC]_text.vcf.gz

numberB to numberC is a range.

Example:

main_programme_ver2_chr1_1_1000000_VEPannot.vcf.gz 
main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz 
..
main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz 
main_programme_ver2_chr2_875833_100098325_VEPannot.vcf.gz 
etc

File with names of genes in 4th column, chromosome (col1) and coordinates(col2 and 3) (>20000 entries), few lines below:

chr1 1000848 3959403 HAT1 
chr2 83523 85382 JKLP

Format: A B C Gene_name, B to C is also a range

The info about that gene is within one of the files in folder 1 so I need to pattern match the gene location within the range of the file name. As an example I would like to know the file that will contain the genes HAT1 and JKLP, the answering being main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz and main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz respectively, at present I am manually doing this. I need these file names to input into some downstream analysis.

So I need to match A and from the gene list then select the chunk that encompasses the range B to C.

Is there a way of doing this from the command line?

Many thanks

Solution

UPDATE: OP has updated the question with a more representative set of *.vcf.gz file names.

At this point I'm going to assume the file names are of the format ...

some_text_chrXXX_number1_number2_more_text.vcf.gz
          ^^^^^^^^^^^^^^^^^^^^^^

... and we're interested in this part of the file name: chrXXX_number1_number2.

In the proposed code (below) we're going to loop through these file names, parsing the name into chunks and then processing those chunks. Steps we'll be using to parse the file names:

f=some_text_chrXXX_number1_number2_more_text.vcf.gz
g=${f//*_chr/chr}                                    # strip off 'some_text_'
h=${g//.vcf.gz}                                      # strip off '.vcf.gz'
echo "f=${f}"
echo "g=${g}"
echo "h=${h}"

IFS='_' read -r cx n1 n2 stuff <<< "${h}"            # break $h into 4 variables

echo "cx=${cx}"
echo "n1=${n1}"
echo "n2=${n2}"
echo "stuff=${stuff}"                                # catch all for rest of '$h'

This generates:

f=some_text_chrXXX_number1_number2_more_text.vcf.gz
g=chrXXX_number1_number2_more_text.vcf.gz
h=chrXXX_number1_number2_more_text
cx=chrXXX
n1=number1
n2=number2
stuff=more_text

Assumptions:

details hashed out in the comments section for this question apply to this question
a given gene may show up in file1 multiple times
a given range of numbers from file1 may match more than 1 *.vcf.gz filename
all files of interest are in the current directory (OP can add appropriate cd commands to scripting as needed)

Sample data:

$ cat file1
chr1 1000848 3959403 HAT1 
chr2 83523 85382 JKLP
chr3 20000 40000 STEV
chrX 23456 78901 WXYZ

$ ls -1v *vcf.gz
main_programme_ver2_chr1_1_1000000_VEPannot.vcf.gz
main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz  # match HAT1

main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz           # match JKLP
main_programme_ver2_chr2_875833_100098325_VEPannot.vcf.gz 

main_programme_ver2_chr3_1_25000_VEPannot.vcf.gz            # match STEV
main_programme_ver2_chr3_25001_80000_VEPannot.vcf.gz        # match STEV
main_programme_ver2_chr3_80001_100000_VEPannot.vcf.gz

main_programme_ver2_chrX_100000_999999_VEPannot.vcf.gz

One idea using a couple bash loops:

$ cat gene.bash
#!/usr/bin/bash

read -p "Gene to search for: " gene

echo "+++++++++++ gene: ${gene}"

found=0

while read -r chr b c stuff                        # read fields from file1
do
    for f in *_${chr}_*                            # for all files that match 'chr' string from file1 ...
    do
        g=${f//*_chr/chr}
        h=${g//.vcf.gz}

        IFS='_' read -r cx n1 n2 stuff <<< "${h}"  # break 'h' into chunks based on delimiter '_'

        # check each value from file1 (b,c) for inclusion in filename ranges (n1-n2)

        [[ "${b}" -ge "${n1}" ]] && [[ "${b}" -le "${n2}" ]] && echo "${f}" && found=1 && continue
        [[ "${c}" -ge "${n1}" ]] && [[ "${c}" -le "${n2}" ]] && echo "${f}" && found=1
    done

done < <(grep -w "${gene}" file1)                  # search file1 for rows containing 'gene'

[[ "${found}" -ne 1 ]] && echo "WARNING: no files found for gene = '${gene}'"

Test runs:

$ ./gene.bash
Gene to search for: HAT1
+++++++++++ gene: HAT1
main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz

$ ./gene.bash
Gene to search for: JKLP
+++++++++++ gene: JKLP
main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz

$ ./gene.bash
Gene to search for: STEV
+++++++++++ gene: STEV
main_programme_ver2_chr3_1_25000_VEPannot.vcf.gz
main_programme_ver2_chr3_25001_80000_VEPannot.vcf.gz

$ ./gene.bash
Gene to search for: WXYZ
+++++++++++ gene: WXYZ
WARNING: no files found for gene = 'WXYZ'

$ ./gene.bash
Gene to search for: ZZZZ
+++++++++++ gene: ZZZZ
WARNING: no files found for gene = 'ZZZZ'

Answered By - markp-fuso

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0

Wednesday, October 20, 2021

[SOLVED] Finding a numbered file by range that falls within it - pattern matching - from command line

Issue

Solution

Popular Posts

Labels