Issue
I am trying to write a script that counts certain patterns, such as aac, tgt, and ttg. The DNA file should be a text file containing a valid DNA string with no newline or other whitespace characters within it (it is terminated by a single newline character). The file should contain nothing but a sequence of the bases a, c, g, and t, in any order.
#! /bin/bash
diffchar=$(grep -cv 'aac\|gtt\|tgt\|cag' $1 )
if [[ $diffchar -eq 1 ]]
then
    echo "error"
elif [[ $diffchar -ne 1 ]]
then
    count=$(grep -o 'aac\|gtt\|tgt\|cag' $1 | sort -k1,1nr -k2,2 | uniq -c)
    # newcount=$(tr -d '\n' $count | awk -f histogram.awk | sort | uniq -c)
    # echo "$newcount"
    echo $count
fi
Solution
Since all your search patterns are 3 characters long, you can split your file into 3-character blocks and use the idiom sort | uniq -c to create a histogram.
sed '1s/.../\n&/g' yourFile | sed 1d | sort | uniq -c
The above command prints something like:
1 aac
5 cag
1 gtt
3 tgt
1 XYZ
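As a possible alternative for the splitting step: the \n in the sed replacement is, as far as I know, a GNU sed extension, so on other systems fold may be a more portable way to cut the line into 3-character blocks. A minimal sketch, where the function name codon_histogram is made up for illustration and the file is assumed to hold the whole sequence on one line:

```shell
#!/bin/bash
# Hypothetical helper (name is illustrative, not from the original answer):
# print a histogram of the 3-character blocks found in the file given as $1.
codon_histogram() {
    # fold -w 3 wraps each input line after every 3 characters,
    # so each block ends up on its own line; sort | uniq -c then counts them.
    fold -w 3 "$1" | sort | uniq -c
}
```

You would call it as codon_histogram yourFile. Note that a trailing block shorter than 3 characters shows up as its own (short) line in the histogram.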
For everything that isn't one of your patterns (XYZ in this case), you can automatically throw an error using:
histogram=$(sed '1s/.../\n&/g' yourFile | sed 1d | sort | uniq -c)
if echo "$histogram" | grep -vEq '^[0-9 ]+(aac|gtt|tgt|cag)$'; then
    echo error
else
    echo "$histogram"
fi
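If you want to pass the file as an argument, as your original script did with $1, the snippet above could be wrapped roughly like this. This is only a sketch: the function name count_patterns, the error message, and the return codes are my own choices, and it assumes GNU sed (for \n in the replacement).

```shell
#!/bin/bash
# Hypothetical wrapper around the snippet above; the name and messages are illustrative.
count_patterns() {
    local file=$1
    local histogram
    # Split line 1 into 3-character blocks (GNU sed: \n in the replacement is a newline),
    # delete the empty first line this produces, then count each distinct block.
    histogram=$(sed '1s/.../\n&/g' "$file" | sed 1d | sort | uniq -c)
    # Any line that is not "<count> <one of the four patterns>" is treated as an error.
    if printf '%s\n' "$histogram" | grep -vEq '^[0-9 ]+(aac|gtt|tgt|cag)$'; then
        echo "error: unexpected block in $file" >&2
        return 1
    fi
    printf '%s\n' "$histogram"
}
```

Quoting "$file" keeps file names with spaces intact, and writing the error to stderr with a non-zero return status lets a caller tell a bad file from a valid histogram. An empty file is also reported as an error, because the empty histogram line does not match the pattern.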
Answered By - Socowi
Answer Checked By - Senaida (WPSolving Volunteer)