Issue
Folder with chunked files (>1000 files) with format:
text_chr[A]_[numberB]_[numberC]_text.vcf.gz
numberB to numberC is a range.
Example:
main_programme_ver2_chr1_1_1000000_VEPannot.vcf.gz
main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz
..
main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz
main_programme_ver2_chr2_875833_100098325_VEPannot.vcf.gz
etc
File with names of genes in 4th column, chromosome (col1) and coordinates(col2 and 3) (>20000 entries), few lines below:
chr1 1000848 3959403 HAT1
chr2 83523 85382 JKLP
Format: A B C Gene_name, B to C is also a range
The info about that gene is within one of the files in folder 1 so I need to pattern match the gene location within the range of the file name. As an example I would like to know the file that will contain the genes HAT1
and JKLP
, the answering being main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz
and main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz
respectively, at present I am manually doing this. I need these file names to input into some downstream analysis.
So I need to match A
and from the gene list then select the chunk that encompasses the range B
to C
.
Is there a way of doing this from the command line?
Many thanks
Solution
UPDATE: OP has updated the question with a more representative set of *.vcf.gz
file names.
At this point I'm going to assume the file names are of the format ...
some_text_chrXXX_number1_number2_more_text.vcf.gz
^^^^^^^^^^^^^^^^^^^^^^
... and we're interested in this part of the file name: chrXXX_number1_number2
.
In the proposed code (below) we're going to loop through these file names, parsing the name into chunks and then processing those chunks. Steps we'll be using to parse the file names:
f=some_text_chrXXX_number1_number2_more_text.vcf.gz
g=${f//*_chr/chr} # strip off 'some_text_'
h=${g//.vcf.gz} # strip off '.vcf.gz'
echo "f=${f}"
echo "g=${g}"
echo "h=${h}"
IFS='_' read -r cx n1 n2 stuff <<< "${h}" # break $h into 4 variables
echo "cx=${cx}"
echo "n1=${n1}"
echo "n2=${n2}"
echo "stuff=${stuff}" # catch all for rest of '$h'
This generates:
f=some_text_chrXXX_number1_number2_more_text.vcf.gz
g=chrXXX_number1_number2_more_text.vcf.gz
h=chrXXX_number1_number2_more_text
cx=chrXXX
n1=number1
n2=number2
stuff=more_text
Assumptions:
- details hashed out in the comments section for this question apply to this question
- a given gene may show up in
file1
multiple times - a given range of numbers from
file1
may match more than 1*.vcf.gz
filename - all files of interest are in the current directory (OP can add appropriate
cd
commands to scripting as needed)
Sample data:
$ cat file1
chr1 1000848 3959403 HAT1
chr2 83523 85382 JKLP
chr3 20000 40000 STEV
chrX 23456 78901 WXYZ
$ ls -1v *vcf.gz
main_programme_ver2_chr1_1_1000000_VEPannot.vcf.gz
main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz # match HAT1
main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz # match JKLP
main_programme_ver2_chr2_875833_100098325_VEPannot.vcf.gz
main_programme_ver2_chr3_1_25000_VEPannot.vcf.gz # match STEV
main_programme_ver2_chr3_25001_80000_VEPannot.vcf.gz # match STEV
main_programme_ver2_chr3_80001_100000_VEPannot.vcf.gz
main_programme_ver2_chrX_100000_999999_VEPannot.vcf.gz
One idea using a couple bash
loops:
$ cat gene.bash
#!/usr/bin/bash
read -p "Gene to search for: " gene
echo "+++++++++++ gene: ${gene}"
found=0
while read -r chr b c stuff # read fields from file1
do
for f in *_${chr}_* # for all files that match 'chr' string from file1 ...
do
g=${f//*_chr/chr}
h=${g//.vcf.gz}
IFS='_' read -r cx n1 n2 stuff <<< "${h}" # break 'h' into chunks based on delimiter '_'
# check each value from file1 (b,c) for inclusion in filename ranges (n1-n2)
[[ "${b}" -ge "${n1}" ]] && [[ "${b}" -le "${n2}" ]] && echo "${f}" && found=1 && continue
[[ "${c}" -ge "${n1}" ]] && [[ "${c}" -le "${n2}" ]] && echo "${f}" && found=1
done
done < <(grep -w "${gene}" file1) # search file1 for rows containing 'gene'
[[ "${found}" -ne 1 ]] && echo "WARNING: no files found for gene = '${gene}'"
Test runs:
$ ./gene.bash
Gene to search for: HAT1
+++++++++++ gene: HAT1
main_programme_ver2_chr1_1000001_987325982_VEPannot.vcf.gz
$ ./gene.bash
Gene to search for: JKLP
+++++++++++ gene: JKLP
main_programme_ver2_chr2_1_875832_VEPannot.vcf.gz
$ ./gene.bash
Gene to search for: STEV
+++++++++++ gene: STEV
main_programme_ver2_chr3_1_25000_VEPannot.vcf.gz
main_programme_ver2_chr3_25001_80000_VEPannot.vcf.gz
$ ./gene.bash
Gene to search for: WXYZ
+++++++++++ gene: WXYZ
WARNING: no files found for gene = 'WXYZ'
$ ./gene.bash
Gene to search for: ZZZZ
+++++++++++ gene: ZZZZ
WARNING: no files found for gene = 'ZZZZ'
Answered By - markp-fuso