Issue
I have a text file chunk_names.txt
that looks like this:
chr1_12334_64321
chr1_134435_77474
chr10_463252_74754
chr10_54265_423435
chr13_5464565_547644567
This is an example but all chromosomes are represented (1...22, X and Y). All entries follow the same formatchr{1..22, X or Y}_*string of numbers*__*string of numbers*
.
I would like to split these into per chromosome files e.g. all of the chunks starting chr10 to be put into a file called chr10.txt
:
In Linux I have tried :
for i in {1..22}
do
grep chr$i chunk_names.txt > chr$i.txt
done
However, the chr1.txt
output file now contains all the chromosome chunks with 1 in them (1,10,11,12, etc).
How would I modify this script to separate out the chromosomes?
I also haven't tackled how to include chromosome X or Y within the same script and am currently running that separately
Things I have tried :
grep -o gives me just "chr$i" as an output
grep 'chr$i' gives me blank files
grep "chr$i" has the initial problem
Many thanks for your time.
Solution
If you include the _
following the number you can distinguish between chr1_
and e.g. chr10_
. To include X
and Y
, simply include these in the loop
for i in {1..22} X Y
do
grep "chr${i}_" chunk_names.txt > chr$i.txt
done
To search at the beginning of the line only you can add a leading ^
to the pattern
grep "^chr${i}_" chunk_names.txt > chr$i.txt
Explanation about your attempts:
grep chr$i
searches for the pattern anywhere in the line. The shell replaces $i
with the value of the variable i
, so you get chr1
, chr2
etc.
If you enclose the pattern in double quotes as grep "chr$i"
the shell will not do any file name globbing or splitting of the string, but still expand variables. In your case it is the same as without quotes.
If you use single quotes, the shell takes the literal string as is, so you always search for a line that contains chr$i
(instead of chr1
etc.) which does not occur in your file.
Explanation about quotes:
The quotes in my proposed solution are not necessary in your case, but it is a good habit to quote everything. If your pattern would contain spaces or characters that are special to the shell, the quoting will make a difference.
Example:
If your file would contain a chr1*
instead of the chr1_
, the pattern chr${i}*
would be replaced by the list of matching files.
When you already created your output files chr1.txt
etc., try these commands
$ i=1; echo chr$i*
chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt
$ i=1; echo "chr$i*"
chr1*
In the first case, the grep
command
grep chr${i}* chunk_names.txt
would be expanded as
grep chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt chunk_names.txt
which would search for the pattern chr10.txt
in files chr11.txt
... chr1.txt
and chunk_names.txt
.
Answered By - Bodo Answer Checked By - David Goodson (WPSolving Volunteer)