Thursday, February 17, 2022

[SOLVED] Grep for specific numbers within a text file and output per number text file

Issue

I have a text file chunk_names.txt that looks like this:

chr1_12334_64321
chr1_134435_77474   
chr10_463252_74754
chr10_54265_423435 
chr13_5464565_547644567

This is an example, but all chromosomes are represented (1...22, X and Y). All entries follow the same format: chr{1..22, X or Y}_*string of numbers*_*string of numbers*.

I would like to split these into per-chromosome files, e.g. all of the chunks starting with chr10 should go into a file called chr10.txt.

In Linux I have tried:

for i in {1..22}
do 
    grep chr$i chunk_names.txt > chr$i.txt 
done 

However, the chr1.txt output file now contains all the chromosome chunks with 1 in them (1, 10, 11, 12, etc.).

How would I modify this script to separate out the chromosomes?

I also haven't tackled how to include chromosome X or Y within the same script and am currently running that separately.

Things I have tried:

grep -o gives me just "chr$i" as an output 
grep 'chr$i' gives me blank files
grep "chr$i" has the initial problem 

Many thanks for your time.


Solution

If you include the _ following the number, you can distinguish between chr1_ and e.g. chr10_. To include X and Y, simply add them to the loop:

for i in {1..22} X Y
do 
    grep "chr${i}_" chunk_names.txt > chr$i.txt 
done 

To search at the beginning of the line only, you can add a leading ^ to the pattern:

    grep "^chr${i}_" chunk_names.txt > chr$i.txt 
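As a rough end-to-end sketch of the loop with the anchored pattern, the following runs against the sample lines from the question. Several things here are assumptions made for illustration only: the scratch directory from mktemp, the use of seq in place of {1..22} so the loop also works in shells without brace expansion, and the || true, which only keeps the loop going if the script is run with set -e.

```shell
# Sketch: split chunk names into one file per chromosome.
# Assumption: work in a throwaway directory so no real files are overwritten.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Sample input copied from the question.
printf '%s\n' \
    chr1_12334_64321 \
    chr1_134435_77474 \
    chr10_463252_74754 \
    chr10_54265_423435 \
    chr13_5464565_547644567 > chunk_names.txt

# seq 1 22 stands in for {1..22}; X and Y are appended to the same list.
for i in $(seq 1 22) X Y
do
    # ^ anchors the match at the start of the line;
    # the trailing _ stops chr1 from also matching chr10, chr11, ...
    # grep exits non-zero when nothing matches, hence the || true.
    grep "^chr${i}_" chunk_names.txt > "chr$i.txt" || true
done

cat chr1.txt
```

With this input, chr1.txt holds only the two chr1_ lines. Chromosomes that do not occur in the input still get an output file, but it is empty.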

Explanation about your attempts:

grep chr$i searches for the pattern anywhere in the line. The shell replaces $i with the value of the variable i, so you get chr1, chr2 etc.

If you enclose the pattern in double quotes as grep "chr$i" the shell will not do any file name globbing or splitting of the string, but still expand variables. In your case it is the same as without quotes.

If you use single quotes, the shell takes the literal string as is, so you always search for a line that contains chr$i (instead of chr1 etc.) which does not occur in your file.
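The three quoting variants can be compared without grep by printing the word the shell would pass on as the pattern. This is only an illustrative sketch; the variable names unquoted, double and single are made up for the demonstration.

```shell
i=1

# Unquoted and double-quoted: the shell expands $i before the command runs.
unquoted=$(printf '%s\n' chr$i)
double=$(printf '%s\n' "chr$i")

# Single-quoted: no expansion, the word stays the literal text chr$i.
single=$(printf '%s\n' 'chr$i')

echo "$unquoted"   # chr1
echo "$double"     # chr1
echo "$single"     # chr$i
```

The last line shows why grep 'chr$i' produced empty files: no line in chunk_names.txt contains the literal text chr$i.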

Explanation about quotes:

The quotes in my proposed solution are not necessary in your case, but it is a good habit to quote everything. If your pattern contained spaces or characters that are special to the shell, the quoting would make a difference.

Example:

If your file contained chr1* instead of chr1_, you would search for chr${i}*, and the shell would replace that unquoted pattern with the list of matching file names.

Once you have created your output files chr1.txt etc., try these commands:

$ i=1; echo chr$i*
chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt
$ i=1; echo "chr$i*"
chr1*

In the first case, the grep command

    grep chr${i}* chunk_names.txt

would be expanded as

    grep chr10.txt chr11.txt chr12.txt chr13.txt chr14.txt chr15.txt chr16.txt chr17.txt chr18.txt chr19.txt chr1.txt chunk_names.txt

which would search for the pattern chr10.txt in files chr11.txt ... chr1.txt and chunk_names.txt.
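The same effect can be reproduced in isolation by counting how many words the shell produces from the pattern. This is a sketch under assumptions: the scratch directory and the two empty files are created only for the demonstration, and set -- is used here just to capture the expanded words.

```shell
# Assumption: a throwaway directory with two of the output files present.
globdir=$(mktemp -d)
cd "$globdir" || exit 1
touch chr1.txt chr10.txt

i=1

# Unquoted: the shell replaces chr1* with every matching file name.
set -- chr$i*
unquoted_count=$#

# Quoted: the pattern is passed through as the single word chr1*.
set -- "chr$i*"
quoted_count=$#
quoted_word=$1

echo "unquoted: $unquoted_count words, quoted: $quoted_count word ($quoted_word)"
```

With two matching files, the unquoted form yields two words, so grep would treat the first as the pattern and the second as a file to search. The quoted form yields the single word chr1*.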



Answered By - Bodo
Answer Checked By - David Goodson (WPSolving Volunteer)