Issue
Given array1, I want to find the first occurrence of each unique csv entry. The array is already ordered by date, newest first, so the first occurrence is the most recent.
array1=(url://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ url://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ url://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ url://root/sub1/sub2/2022-07-22/ a.csv/)
I want to return an array with the most recent occurrence of each unique csv entry (with the full paths):
array2=(url://root/sub1/sub2/2022-10-22/a.csv/ url://root/sub1/sub2/2022-10-22/b.csv/ url://root/sub1/sub2/2022-08-22/c.csv/ url://root/sub1/sub2/2022-08-22/d.csv/)
and an array of all the duplicate entries (with the full paths):
array3=(url://root/sub1/sub2/2022-09-22/a.csv/ url://root/sub1/sub2/2022-09-22/b.csv/ url://root/sub1/sub2/2022-08-22/a.csv/ url://root/sub1/sub2/2022-08-22/b.csv/ url://root/sub1/sub2/2022-07-22/a.csv/)
My thought process is as follows: loop through the array; if the element is a url path, check the elements that follow it and write the url path and csv files to a new array. Stop when the next element is another url path. If a later url path contains csv files that were already seen, write them to a duplicate array. If it contains new csv files, append them to the new array.
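A minimal sketch of this loop, assuming bash 4 or later (for associative arrays). The names seen and prefix are chosen for this example only, and an element is treated as a url path simply because it contains "://":

```bash
#!/bin/bash
declare -A seen     # csv names that have already been encountered
array1=(url://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ url://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ url://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ url://root/sub1/sub2/2022-07-22/ a.csv/)
array2=(); array3=()
prefix=""           # most recently seen url path (hypothetical helper variable)

for elem in "${array1[@]}"; do
    if [[ $elem == *://* ]]; then
        prefix=$elem                        # url path: remember it for the csv entries that follow
    elif [[ -n ${seen[$elem]:-} ]]; then
        array3+=( "${prefix}${elem}" )      # csv name seen before: an older duplicate
    else
        seen[$elem]=1
        array2+=( "${prefix}${elem}" )      # first (most recent) occurrence
    fi
done

printf '%s\n' "${array2[@]}"    # unique, most recent entries
printf '%s\n' "${array3[@]}"    # duplicates
```

Because the array is ordered newest first, a single pass is enough: whichever date a csv name shows up under first is its most recent one.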
Solution
Would you please try the following:
#!/bin/bash
declare -A seen # check if the csv element has appeared
array1=(s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/ s3://root/sub1/sub2/2022-07-22/ a.csv/)
array2=(); array3=()
while read -r first others; do              # split the line into "s3:.." and others
    read -r -a ary <<< "$others"            # split others into list of csv's
    dup=(); new=()                          # temporary arrays
    for i in "${ary[@]}"; do                # loop over the csv's
        (( seen[$i]++ )) && dup+=( "$i" ) || new+=( "$i" )  # sort the csv's depending on the history
    done
    for i in "${new[@]}"; do                # loop over the array of unique entries
        array2+=( "${first}${i}" )          # append the full path to array2
    done
    for i in "${dup[@]}"; do                # loop over the array of duplicate entries
        array3+=( "${first}${i}" )          # append the full path to array3
    done
done < <(sed -E 's# (s3://)#'\\$'\n''\1#g' <<< "${array1[*]}")  # construct 2-d structure from array1
echo "${array2[@]}"
echo "${array3[@]}"
Output:
s3://root/sub1/sub2/2022-10-22/a.csv/ s3://root/sub1/sub2/2022-10-22/b.csv/ s3://root/sub1/sub2/2022-08-22/c.csv/ s3://root/sub1/sub2/2022-08-22/d.csv/
s3://root/sub1/sub2/2022-09-22/a.csv/ s3://root/sub1/sub2/2022-09-22/b.csv/ s3://root/sub1/sub2/2022-08-22/a.csv/ s3://root/sub1/sub2/2022-08-22/b.csv/ s3://root/sub1/sub2/2022-07-22/a.csv/
As array1 looks like it has a 2-d structure, I've first rearranged the elements with sed into:
s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/
s3://root/sub1/sub2/2022-09-22/ a.csv/ b.csv/
s3://root/sub1/sub2/2022-08-22/ a.csv/ b.csv/ c.csv/ d.csv/
s3://root/sub1/sub2/2022-07-22/ a.csv/
and then processed them line by line.
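To see the rearranging step on its own, here is a small sketch that runs only the sed command on a shortened input. The names sample and rows are made up for this illustration; the sed expression is the one from the script, with its quoting spelled out in the comments:

```bash
#!/bin/bash
# Shortened sample input, used only for this illustration.
sample=(s3://root/sub1/sub2/2022-10-22/ a.csv/ b.csv/ s3://root/sub1/sub2/2022-09-22/ a.csv/)

# The sed script is three shell words glued together:
#   's# (s3://)#'   pattern: a space followed by "s3://" (captured as \1)
#   \\$'\n'         a literal backslash plus a newline, which sed reads as
#                   a line break inside the replacement
#   '\1#g'          put the captured "s3://" back, for every match
# "${sample[*]}" joins all elements into one space-separated string first.
rows=$(sed -E 's# (s3://)#'\\$'\n''\1#g' <<< "${sample[*]}")

printf '%s\n' "$rows"
```

Each s3:// path ends up at the start of its own line, followed by its csv names, which is what lets `read -r first others` separate the path from the rest.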
Answered By - tshiono Answer Checked By - Clifford M. (WPSolving Volunteer)