Issue
I have a list of ~28,000 gene transcripts, e.g.:
4R79.1b
4R79.2b
AC3.1a
AC3.2
AC3.3
AC3.5a
I need to get gene names by removing the last character only if it's a letter. I've been googling for days and haven't found a solution that comes close, so I must have missed something.
I thought there must be a simple solution, but my best attempt was:
sed 's/[[:alpha:]]$//' transcripts.txt > genes.txt
It did not seem to do anything: genes.txt is the same size as the original file.
Solution
With awk:
$ echo '4R79.1b 4R79.2b AC3.1a AC3.2 AC3.3 AC3.5a' |
awk '{for(i=1;i<=NF;i++) sub(/[[:alpha:]]$/,"",$i)} 1'
Prints:
4R79.1 4R79.2 AC3.1 AC3.2 AC3.3 AC3.5
Or sed:
sed -E 's/[[:alpha:]]([[:space:]]|$)/\1/g'
For a new file, just redirect:
sed -E 's/[[:alpha:]]([[:space:]]|$)/\1/g' file > new_file
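One possible reason the attempt in the question had no effect, which the question itself does not confirm, is that each line ends in something other than the letter, such as a carriage return from Windows (CRLF) line endings or a trailing space. The sketch below uses a made-up sample file (crlf_sample.txt is a hypothetical name) to compare the two patterns on CRLF input:

```shell
# Hypothetical sample with CRLF line endings; not the asker's real data.
printf '4R79.1b\r\nAC3.2\r\nAC3.5a\r\n' > crlf_sample.txt

# Anchoring on $ alone: the last character of each line is \r, not a letter,
# so no substitution happens and the letters are still present.
sed 's/[[:alpha:]]$//' crlf_sample.txt | od -c

# Allowing trailing whitespace: \r belongs to [[:space:]], so the letter
# before it is removed and the \r itself is kept via \1.
sed -E 's/[[:alpha:]]([[:space:]]|$)/\1/g' crlf_sample.txt | od -c
```

Running od -c on the real file is a quick way to check whether hidden trailing characters are present.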
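If you want to edit the file in place, you can use sed -i. Attaching the backup suffix directly to the option (-i.bak) works with both GNU and BSD/macOS sed and keeps the original as file.bak; the separated form -i bak is accepted only by BSD/macOS sed:
sed -i.bak -E 's/[[:alpha:]]([[:space:]]|$)/\1/g' file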
Or use awk, redirecting to a temporary file and then overwriting the original (which is essentially what sed -i does):
awk '{for(i=1;i<=NF;i++) sub(/[[:alpha:]]$/,"",$i)} 1' file > TEMP_FILE && mv -f TEMP_FILE file
You can also use GNU awk (gawk 4.1.0 and later), which has an in-place extension as well, enabled with -i inplace.
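As a minimal sketch of the GNU awk route, assuming gawk 4.1.0 or later is installed under the name gawk, the same awk program can be run with the inplace extension. The file name gawk_demo.txt and its contents are hypothetical and used only for illustration:

```shell
# Hypothetical sample file, one transcript per line.
printf '4R79.1b\nAC3.2\nAC3.5a\n' > gawk_demo.txt

if command -v gawk >/dev/null 2>&1; then
  # -i inplace loads gawk's inplace extension, so the output of the
  # program replaces the contents of gawk_demo.txt.
  gawk -i inplace '{for(i=1;i<=NF;i++) sub(/[[:alpha:]]$/,"",$i)} 1' gawk_demo.txt
else
  # Fall back to the temp-file approach if gawk is not available.
  echo "gawk not found; use the temp-file approach instead" >&2
fi

cat gawk_demo.txt
```

Note that assigning to $i makes awk rebuild each line with single spaces between fields, which makes no difference when there is one transcript per line.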
Answered By - dawg Answer Checked By - Mary Flores (WPSolving Volunteer)