Issue
I need to sort 12000 files into 1000 groups, according to its name and create for each group a new folder containing files of this group. The name of each file is given in multi-column format (with _ separator), where the second column is varied from 1 to 12 (number of the part) and the last column ranged from 1 to 1000 (number of the system), indicating that initially 1000 different systems (last column) were split on 12 separate parts (second column).
Here is an example for a small subset based on 3 systems divided by 12 parts, totally 36 files.
7000_01_lig_cne_1.dlg
7000_02_lig_cne_1.dlg
7000_03_lig_cne_1.dlg
...
7000_12_lig_cne_1.dlg
7000_01_lig_cne_2.dlg
7000_02_lig_cne_2.dlg
7000_03_lig_cne_2.dlg
...
7000_12_lig_cne_2.dlg
7000_01_lig_cne_3.dlg
7000_02_lig_cne_3.dlg
7000_03_lig_cne_3.dlg
...
7000_12_lig_cne_3.dlg
I need to group these files based on the second column of their names (01, 02, 03 .. 12), thus creating 1000 folders, which should contain 12 files for each system in the following manner:
Folder1, name: 7000_lig_cne_1, it contains 12 files: 7000_{this is from 01 to 12}_lig_cne_1.dlg
Folder2, name: 7000_lig_cne_2, it contains 12 files 7000_{this is from 01 to 12}_lig_cne_2.dlg
...
Folder1000, name: 7000_lig_cne_1000, it contains 12 files 7000_{this is from 01 to 12}_lig_cne_1000.dlg
Assuming that all *.dlg files are present within the same dir, I propose bash loop workflow, which only lack some sorting function (sed, awk ??), organized in the following manner:
#set the name of folder with all DLG
home=$PWD
FILES=${home}/all_DLG/7000_CNE
# set the name of protein and ligand library to analyse
experiment="7000_CNE"
#name of the output
output=${home}/sub_folders_to_analyse
#now here all magic comes
rm -r ${output}
mkdir ${output}
# sed solution
for i in ${FILES}/*.dlg # define this better to suit your needs
do
n=$( <<<"$i" sed 's/.*[^0-9]\([0-9]*\)\.dlg$/\1/' )
# move the file to proper dir
mkdir -p ${output}/"${experiment}_lig$n"
cp "$i" ${output}/"${experiment}_lig$n"
done
Note: there I indicated beginning of the name of each folder as ${experiment}
to which I add the number of the final column $n at the end. Would it be rather possible to set up each time the name of the new folder automatically based on the name of the copied files?
Manually it could be achieved via skipping the second column in the name of the folder
cp ./all_DLG/7000_*_lig_cne_987.dlg ./output/7000_lig_cne_987
Solution
Iterate over files. Extract the destination directory name from the filename. Move the file.
for i in *.dlg; do
# extract last number with your favorite tool
n=$( <<<"$i" sed 's/.*[^0-9]\([0-9]*\)\.dlg$/\1/' )
# move the file to proper dir
echo mkdir -p "folder$n"
echo mv "$i" "folder$n"
done
Notes:
- Do not use upper case variables in your scripts. Use lower case variables.
- Remember to quote variables expansions.
- Check your scripts with http://shellcheck.net
- Tested on repl
update: for OP's foldernaming convention:
for i in *.dlg; do
foldername="$HOME/output/${i%%_*}_${i#*_*_}"
echo mkdir -p "$foldername"
echo mv "$i" "$foldername"
done
Answered By - KamilCuk Answer Checked By - David Marino (WPSolving Volunteer)