Issue
I have a file ("dump_file") containing a list of paths (generated from the output of hadoop fs -ls), formatted like this:
d hdfs 0 2021-06-01-13:14 /dir1
d hdfs 0 2021-06-01-13:14 /dir1/dir2
d hdfs 0 2021-06-01-13:14 /dir1/dir2/dir3
- abcdef 1201 2021-06-01-13:15 /dir1/dir2/dir3/file1
- abcdef 78441 2021-06-01-13:16 /dir1/dir2/dir3/file2
d hdfs 0 2021-06-01-13:14 /dir1/dir2/dir4
d hdfs 0 2021-06-01-13:14 /dir1/dir2/dir4/dir5
- abcdef 1201 2021-06-01-13:15 /dir1/dir2/dir4/file11
- abcdef 78441 2021-06-01-13:16 /dir1/dir2/dir4/file22
d hdfs 0 2021-06-01-13:14 /dir1/dir6/dir7
My goal is to extract the first-level children of any given node. This is what I have so far (example with "dir1"):
grep -Eio "/dir1.[^\/]+" < dump_file | sort -u | awk -F "/" '{ print $NF }'
dir2
dir6
But I would also like to keep the leading fields of the matching lines, like this:
d hdfs 0 2021-06-01-13:14 dir2
d hdfs 0 2021-06-01-13:14 dir6
With "dir1/dir2" as the value, it should return:
d hdfs 0 2021-06-01-13:14 dir3
d hdfs 0 2021-06-01-13:14 dir4
And with "dir1/dir2/dir4":
d hdfs 0 2021-06-01-13:14 dir5
- abcdef 1201 2021-06-01-13:15 file11
- abcdef 78441 2021-06-01-13:16 file22
Do you have any idea how I can do this? Thanks!
Solution
With your shown samples, please try the following awk code. Pass the string you want to look for in your Input_file through the awk variable named value.
awk -v value="dir1" '
BEGIN{ len=length(value) }
match($0,"/"value"/[^/]*"){
matVal=substr($0,RSTART+len+2,RLENGTH-len-2)
if(!arr[matVal]++){
print substr($0,1,RSTART-1) matVal
}
}
' Input_file
Explanation: a detailed, line-by-line explanation of the code above.
awk -v value="dir1" ' ##Start the awk program; value holds the node we want to look for.
BEGIN{ len=length(value) } ##In the BEGIN section, store the length of value in len.
match($0,"/"value"/[^/]*"){ ##Use match to find /value/ followed by the next path component.
matVal=substr($0,RSTART+len+2,RLENGTH-len-2) ##Extract that next component (the child name) into matVal.
if(!arr[matVal]++){ ##Continue only the first time this child name is seen.
print substr($0,1,RSTART-1) matVal ##Print the part of the line before the path, then the child name.
}
}
' Input_file ##Input_file name goes here.
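To see where the +2 and -2 offsets come from, here is a minimal sketch that prints RSTART and RLENGTH for a single sample line. The show_match function name is hypothetical and only used for illustration; it is not part of the program above.

```shell
# show_match is a hypothetical helper, for illustration only.
# $1 = one line of the listing, $2 = node to look for (no leading slash).
show_match() {
  printf '%s\n' "$1" | awk -v value="$2" '
    match($0, "/" value "/[^/]*") {
      len = length(value)
      # RSTART points at the slash before value; RLENGTH covers "/value/child".
      print "RSTART=" RSTART, "RLENGTH=" RLENGTH
      # Skip "/" + value + "/" (len + 2 characters) to reach the child name.
      print "child=" substr($0, RSTART + len + 2, RLENGTH - len - 2)
    }'
}

show_match 'd hdfs 0 2021-06-01-13:14 /dir1/dir2/dir3' dir1
```

For this sample line it should print RSTART=27 RLENGTH=10 and then child=dir2: the match is /dir1/dir2, and skipping the 6 characters of /dir1/ leaves the 4 characters of dir2.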
EDIT: With OP's samples of passing dir1/dir2, dir1, or dir1/dir2/dir3, and as per the comments asking to ignore paths such as foo/dir1/dir2 (where the passed value only appears deeper in the path, as a subdirectory), one could try the following. Beware: this will fail if your path contains regexp metacharacters (I will try to fix that at some point, if I can).
awk -v value="dir1/dir2" '
match($0,"[[:space:]]+/"value"/[^/]*"){
matVal=substr($0,RSTART,RLENGTH)
sub(/^[[:space:]]+/,"",matVal)
sub("^/"value"/","",matVal)
if(!arr[matVal]++){
print substr($0,1,RSTART-1) OFS matVal
}
}
' Input_file
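Regarding the regexp metacharacter caveat, one possible workaround (a sketch only, not the answer's own fix) is to avoid building a regex from value at all and to compare literal strings with index() and substr() instead. The function name children_literal is hypothetical, and the code assumes the path starts at the first " /" of each line, as in the samples shown. Note that awk still interprets backslash escapes in -v assignments, so a value containing backslashes would need extra care.

```shell
# children_literal is a hypothetical helper, for illustration only.
# $1 = node without leading or trailing slash, $2 = listing file.
children_literal() {
  awk -v value="$1" '
    BEGIN { prefix = "/" value "/"; plen = length(prefix) }
    {
      p = index($0, " /")                        # assumption: the path starts at the first " /"
      if (p == 0) next
      path = substr($0, p + 1)
      if (substr(path, 1, plen) != prefix) next  # literal comparison, no regex involved
      child = substr(path, plen + 1)
      slash = index(child, "/")
      if (slash > 0) child = substr(child, 1, slash - 1)  # keep the first component only
      if (child == "" || seen[child]++) next     # print each child once
      print substr($0, 1, p) child               # leading fields, then the child name
    }
  ' "$2"
}
```

It would be called as children_literal "dir1/dir2/dir4" dump_file. Because the path has to start with /value/, a path such as /foo/dir1/dir2 is ignored here too. As in the programs above, the leading fields come from the first line in which the child appears.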
Answered By - RavinderSingh13 Answer Checked By - Willingham (WPSolving Volunteer)