Issue
I have a tab-delimited file. In the 3rd column, I want to remove the first word if both of the following are true: (1) the first word is all lowercase, and (2) the second word is capitalized. After that, I want to trim the value down to at most 2 words, unless the whole string is lowercase. Ideally I want to do this with awk, but something like sed would also work.
I managed to get it working with sed and awk together, but only on that column in isolation; I would like to output the whole line. I wasn't sure how to make sed match the pattern only in that column, and from what I read, awk doesn't support backreferences, so I wasn't sure how to do it without them.
cat text.txt |
awk -F"\t" '{print $3}' |
sed -E 's/^[a-z]* ([A-Z])/\1/' |
awk '{if($1 ~ /^[[:upper:]]/) {print $1, $2} else {print $0}}'
What I have:
col1  col2          col3
123   a string      James jones MD MSc
154   string        mister George smith
163   String        mrs anne jones
193   String        john
157   big string 1  dude George
What I want to get:
col1  col2          col3
123   a string      James jones
154   string        George smith
163   String        mrs anne jones
193   String        john
157   big string 1  George
Solution
You may use this awk solution:
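awk '
BEGIN {FS=OFS="\t"}   # read and write tab-delimited fields
# skip the header and all-lowercase values; act only on values with 2+ words
NR > 1 && tolower($3) != $3 && split($3, a, / +/) >= 2 {
   p = (a[1] == tolower(a[1]) ? 2 : 1)           # start at word 2 if word 1 is lowercase
   $3 = a[p] (p < length(a) ? " " a[p+1] : "")   # keep at most two words from there
}
1' file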
Output:
col1  col2          col3
123   a string      James jones
154   string        George smith
163   String        mrs anne jones
193   String        john
157   big string 1  George
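The command above drops a lowercase first word without checking whether the second word is capitalized, which is enough for the sample data. If you also want to enforce the second condition from the question, a stricter variant might look like the sketch below. The function name trim_col3 and the inline sample rows are only for illustration. It also stores the return value of split() in n instead of calling length() on the array, since some older awk implementations do not accept an array argument there.

```shell
# Hypothetical helper: reads tab-delimited text from stdin or from file arguments.
trim_col3() {
  awk '
  BEGIN { FS = OFS = "\t" }
  # Skip the header line and leave all-lowercase values untouched.
  NR > 1 && tolower($3) != $3 {
    n = split($3, a, / +/)
    p = 1
    # Drop word 1 only if it is all lowercase AND word 2 starts with an uppercase letter.
    if (n >= 2 && a[1] == tolower(a[1]) && substr(a[2], 1, 1) != tolower(substr(a[2], 1, 1)))
      p = 2
    # Keep at most two words starting at position p.
    $3 = a[p] (p < n ? " " a[p+1] : "")
  }
  1' "$@"
}

# Sample rows from the question; printf reuses the format for every three arguments.
printf '%s\t%s\t%s\n' \
  col1 col2 col3 \
  123 'a string' 'James jones MD MSc' \
  154 'string' 'mister George smith' \
  163 'String' 'mrs anne jones' \
  193 'String' 'john' \
  157 'big string 1' 'dude George' |
  trim_col3
```

For the sample data this prints the same rows as the output above. The two versions differ only on values such as "dr anne Jones", where the first word is lowercase but the second is not capitalized: this variant keeps "dr anne", while the original command would produce "anne Jones".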
Answered By - anubhava Answer Checked By - Dawn Plyler (WPSolving Volunteer)