Monday, January 15, 2024

[SOLVED] BASH: File sorting according to file name

January 15, 2024 awk, bash, sed, shell, sorting

Issue

I need to sort 12000 files into 1000 groups according to their names and create, for each group, a new folder containing the files of that group. Each file name is in multi-column format (with _ as the separator), where the second column varies from 01 to 12 (the part number) and the last column ranges from 1 to 1000 (the system number), indicating that 1000 different systems (last column) were initially split into 12 separate parts (second column).

Here is an example for a small subset of 3 systems, each divided into 12 parts, 36 files in total.

7000_01_lig_cne_1.dlg
7000_02_lig_cne_1.dlg
7000_03_lig_cne_1.dlg
...
7000_12_lig_cne_1.dlg

7000_01_lig_cne_2.dlg
7000_02_lig_cne_2.dlg
7000_03_lig_cne_2.dlg
...
7000_12_lig_cne_2.dlg

7000_01_lig_cne_3.dlg
7000_02_lig_cne_3.dlg
7000_03_lig_cne_3.dlg
...
7000_12_lig_cne_3.dlg

I need to collect the 12 parts of each system (second column of the name: 01, 02, 03 .. 12) into one folder per system, thus creating 1000 folders that each contain 12 files, in the following manner:

 Folder 1, name: 7000_lig_cne_1, contains 12 files: 7000_{01 to 12}_lig_cne_1.dlg

 Folder 2, name: 7000_lig_cne_2, contains 12 files: 7000_{01 to 12}_lig_cne_2.dlg
...
 Folder 1000, name: 7000_lig_cne_1000, contains 12 files: 7000_{01 to 12}_lig_cne_1000.dlg

Assuming that all *.dlg files are in the same directory, I propose a bash loop workflow that lacks only the sorting step (sed, awk?), organized as follows:

#set the name of folder with all DLG
home=$PWD
FILES=${home}/all_DLG/7000_CNE
# set the name of protein and ligand library to analyse
experiment="7000_CNE"

#name of the output
output=${home}/sub_folders_to_analyse

#now here all magic comes
rm -r "${output}"
mkdir "${output}"

# sed solution
for i in "${FILES}"/*.dlg        # define this better to suit your needs
do 
    n=$( <<<"$i" sed 's/.*[^0-9]\([0-9]*\)\.dlg$/\1/' )
    # copy the file to proper dir
    mkdir -p "${output}/${experiment}_lig$n"
    cp "$i" "${output}/${experiment}_lig$n"
done

Note: here I set the beginning of each folder name to ${experiment} and append the number from the final column, $n. Would it be possible instead to derive the name of each new folder automatically from the names of the copied files?

Manually, this could be achieved by skipping the second column when building the folder name:

 cp ./all_DLG/7000_*_lig_cne_987.dlg ./output/7000_lig_cne_987

Solution

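Iterate over the files, extract the destination directory name from each filename, and move the file. The echo in front of mkdir and mv makes the loop a dry run that only prints the commands; remove it to actually create the directories and move the files.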

for i in *.dlg; do
    # extract last number with your favorite tool
    n=$( <<<"$i" sed 's/.*[^0-9]\([0-9]*\)\.dlg$/\1/' )
    # move the file to proper dir
    echo mkdir -p "folder$n"
    echo mv "$i" "folder$n"
done
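
As a possible alternative to calling sed once per file (which adds up over 12000 files), the trailing number can also be taken with bash parameter expansion. This is only a sketch: the function name system_number is made up for illustration, and it assumes every name ends in _<number>.dlg as in the question.

```shell
# Hypothetical helper (not part of the original answer): print the trailing
# system number of a name such as 7000_01_lig_cne_12.dlg.
system_number() {
    local name=${1##*/}            # drop any leading directory
    name=${name%.dlg}              # drop the .dlg extension
    printf '%s\n' "${name##*_}"    # keep what follows the last underscore
}

system_number "7000_01_lig_cne_12.dlg"    # prints 12
```

Unlike the sed expression, this keeps everything after the last underscore, so it matches the sed result only when that last field is purely numeric.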

Notes:

Do not use upper case variable names in your scripts; by convention they are used for environment and shell variables. Use lower case names.
Remember to quote variable expansions.
Check your scripts with http://shellcheck.net
Tested on repl
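
To illustrate the quoting note, here is a minimal sketch of what happens to a file name containing a space. The function count_args and the file name are invented for the demonstration, and the default IFS is assumed.

```shell
# Hypothetical demo: count how many arguments a command receives.
count_args() { echo "$#"; }

name="my file.dlg"                 # made-up name containing a space
unquoted=$(count_args $name)       # unquoted: the shell splits it into 2 words
quoted=$(count_args "$name")       # quoted: passed as a single argument
echo "unquoted=$unquoted quoted=$quoted"    # prints unquoted=2 quoted=1
```

The unquoted expansion is deliberate here; shellcheck would flag it.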

Update: for the OP's folder naming convention:

for i in *.dlg; do
    base=${i%.dlg}   # strip the extension so it does not end up in the folder name
    foldername="$HOME/output/${base%%_*}_${base#*_*_}"
    echo mkdir -p "$foldername"
    echo mv "$i" "$foldername"
done
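
Putting the pieces together with the question's layout, a sketch could look like the following. The function names dest_name and group_dlg are hypothetical, the paths in the usage comment are taken from the question, and the files are copied rather than moved, as in the question's script. The extension is stripped so that the folder is named 7000_lig_cne_1 rather than 7000_lig_cne_1.dlg.

```shell
# Hypothetical helper: map 7000_01_lig_cne_1.dlg to 7000_lig_cne_1
# (first field, then everything after the second field, without .dlg).
dest_name() {
    local base=${1##*/}            # drop any leading directory
    base=${base%.dlg}              # drop the extension
    printf '%s\n' "${base%%_*}_${base#*_*_}"
}

# Hypothetical helper: copy every .dlg file found in directory $1
# into $2/<group folder>/, creating the folders as needed.
group_dlg() {
    local src=$1 out=$2 f dir
    for f in "$src"/*.dlg; do
        [ -e "$f" ] || continue    # the glob matched nothing
        dir=$out/$(dest_name "$f")
        mkdir -p -- "$dir"
        cp -- "$f" "$dir"/
    done
}

# Usage with the paths from the question (assumed to exist):
# group_dlg "$PWD/all_DLG/7000_CNE" "$PWD/sub_folders_to_analyse"
```

Keeping the name computation in its own function makes it easy to check a few file names by hand before copying 12000 files.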

Answered By - KamilCuk

Answer Checked By - David Marino (WPSolving Volunteer)

This answer, collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0
