Issue
Is it possible to sort text horizontally? For example, I have a hunspell file that lists English words, each followed by tags. (It may contain Unicode text and millions of words.)
test/BACac
this/QPR
line/MNP
again/Xx
I need to sort the tags within each line (preferably lowercase letters first, then capitals). Expected:
test/acABC
this/PQR
line/MNP
again/xX
I can do this in pandas, but I would like to know whether I can complete the task using only Linux commands!
import pandas as pd
df = pd.read_csv('test.csv', sep='/', header=None)
df.columns = ['word', 'tags']
df['tags'] = df['tags'].map(lambda x: ''.join(sorted(x)))
df['final'] = df['word'] + '/' + df['tags']
df['final'].to_csv('result.csv', index=False, header=None)
Solution
Using GNU awk, both for PROCINFO["sorted_in"] and for splitting a string into characters when the separator is the null string:
$ cat tst.awk
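BEGIN {
    FS=OFS="/"
    # make "for (i in array)" visit elements in ascending order of their string values
    PROCINFO["sorted_in"] = "@val_str_asc"
}
{
    # a null separator makes gawk put one character in each array element
    split($2,lets,"")
    $2 = ""
    for (i in lets) {
        $2 = $2 lets[i]
    }
    print
}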
$ awk -f tst.awk file
test/ABCac
this/PQR
line/MNP
again/Xx
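The capitals come first here, which is consistent with the characters being compared by character code, where "A" to "Z" precede "a" to "z". As a rough cross-check outside awk, here is a minimal Python sketch of the same per-line operation. The function name `sort_tags_by_code_point` is made up for this illustration, and it splits at the first "/" only, which is an assumption about the input rather than exactly what the awk field splitting does.

```python
def sort_tags_by_code_point(line):
    # Split "word/tags" at the first "/"; a line without "/" passes through unchanged.
    word, sep, tags = line.partition("/")
    # sorted() on a str compares characters by code point, so "A".."Z" precede "a".."z".
    return word + sep + "".join(sorted(tags))


for line in ["test/BACac", "this/QPR", "line/MNP", "again/Xx"]:
    print(sort_tags_by_code_point(line))
```

For the four sample lines this prints the same result as the awk run above.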
To get output where lowercase letters sort before uppercase, you'd have to do one of the following: find a locale with such a collation order and set LC_ALL=<that locale> before running the script; or convert all uppercase to lowercase and vice versa first, then do the sort, then convert them back before printing; or do something similar by putting a decorator char in front of each real char, such that all lowercase letters get a leading "A" while uppercase letters get a leading "a", to again force a different order, e.g.:
$ cat tst.awk
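BEGIN {
    FS=OFS="/"
    PROCINFO["sorted_in"] = "@val_str_asc"
}
{
    split($2,lets,"")
    # decorate: prefix lowercase letters with "A" and everything else with "a"
    for (i in lets) {
        lets[i] = ( lets[i] ~ /[[:lower:]]/ ? "A" : "a" ) lets[i]
    }
    $2 = ""
    # visit in sorted order and strip the one-character decorator again
    for (i in lets) {
        $2 = $2 substr(lets[i],2)
    }
    print
}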
$ awk -f tst.awk file
test/acABC
this/PQR
line/MNP
again/xX
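The case-swapping alternative mentioned earlier (swap case, sort, swap back) is not shown in awk above. As a hedged illustration of the idea only, here is a small Python sketch that gets the same effect by sorting on a case-swapped key, so nothing has to be converted back afterwards. The function name `sort_tags_lower_first` is hypothetical, and splitting at the first "/" is an assumption about the input.

```python
def sort_tags_lower_first(line):
    # Split "word/tags" at the first "/"; a line without "/" passes through unchanged.
    word, sep, tags = line.partition("/")
    # Compare each character by its case-swapped form: lowercase letters are then
    # compared as uppercase (smaller code points) and so sort first. The original
    # characters, not the swapped ones, are what get joined into the result.
    return word + sep + "".join(sorted(tags, key=str.swapcase))


for line in ["test/BACac", "this/QPR", "line/MNP", "again/Xx"]:
    print(sort_tags_lower_first(line))
```

On the sample input this gives the same four lines as the decorator version. Characters without a case, such as digits, keep their own code points as keys, so where they land relative to letters may differ from the awk decorator script.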
Answered By - Ed Morton Answer Checked By - Clifford M. (WPSolving Volunteer)