Tuesday, October 4, 2022

[SOLVED] Add values from a "key" JSON file to other files based on partial string matching using JQ

Issue

The goal is to compare a JSON file against a "key" of standard values and add those values to objects in another JSON file if certain strings match. The purpose is to merge two sets of analytics that have complementary data.

The condition I am trying to match on is that href in index-of-pages.json contains the url string from key.json.

index-of-pages.json

[
    {
      "href": "articles/guide1/page1.html",
      "name": "Page 1",
      "views": "204"
    },
    {
      "href": "articles/guide2/page2.html",
      "name": "Page 2",
      "views": "180"
    },
    {
      "href": "articles/guide2/page3.html",
      "name": "Page 3",
      "views": "121"
    },
    {
      "href": "apis/apiguide1/subguide1/page4.html",
      "name": "Page 4",
      "views": "101"
    },
    {
      "href": "apis/apiguide2/subguide2/page5.html",
      "name": "Page 5",
      "views": "103"
    },
    {
      "href": "articles/guide1/about.html",
      "name": "Page 6",
      "views": "103"
    },
    {
      "href": "index.html",
      "name": "Page 7",
      "views": "400"
    }
]

key.json

[
    {
        "url": "/guide1/",
        "guide": "Guide 1",
        "tag": "how-to"
    },
    {
        "url": "/guide2/",
        "guide": "Guide 2",
        "tag": "how-to"
    },
    {
        "url": "/apiguide1/subguide1/",
        "guide": "API Guide 1",
        "subguide": "Subguide 1",
        "tag": "api"
    },
    {
        "url": "/guide1/about",
        "guide": "Guide 1",
        "tag": "about"
    }
]

Note there is no trailing slash on url in the last object.

Desired result:

[
    {
        "href": "articles/guide1/page1.html",
        "name": "Page 1",
        "views": "204",
        "url": "/guide1/",
        "guide": "Guide 1",
        "tag": "how-to"
    },
    {
        "href": "articles/guide2/page2.html",
        "name": "Page 2",
        "views": "180",
        "url": "/guide2/",
        "guide": "Guide 2",
        "tag": "how-to"
    },
    {
        "href": "articles/guide2/page3.html",
        "name": "Page 3",
        "views": "121",
        "url": "/guide2/",
        "guide": "Guide 2",
        "tag": "how-to"
    },
    {
        "href": "apis/apiguide1/subguide1/page4.html",
        "name": "Page 4",
        "views": "101",
        "url": "/apiguide1/subguide1/",
        "guide": "API Guide 1",
        "subguide": "Subguide 1",
        "tag": "api"
    },
    {
        "href": "apis/apiguide2/subguide2/page5.html",
        "name": "Page 5",
        "views": "103"
    },
    {
        "href": "articles/guide1/about.html",
        "name": "Page 6",
        "views": "103",
        "url": "/guide1/about",
        "guide": "Guide 1",
        "tag": "about"
    },
    {
        "href": "index.html",
        "name": "Page 7",
        "views": "400"
    }
]

Objects in index-of-pages.json that do not match anything in the key should still be included, unchanged, in the output.

I'm not sure whether it is best practice to include every key in each output object even when its value is empty.

This has brought me closest, but I cannot figure out how to incorporate a step to match on the key:

jq --argfile uid key.json '
 ($uid | INDEX(.url)) as $dict
 | map( $dict[.href] + del(.href) )
 ' index-of-pages.json

Other attempts, such as the following, do not produce a 1:1 match of objects; instead, they produce a huge list of every possible combination of every key (the output was nested, so I labeled it key; not all desired output keys are shown in this script):

(.[].href/"/"?|{key: ("/" + .[-2] + "/")}) as $abc | {name: .[].name, level: $abc}

I have also tried variations on while/if loops, with no success:

jq -r '.[] | "\(.url)|\(.guide)|\(.tag)|\(.subguide)"' key.json |
while IFS="|" read -r url guide tag subguide; do
cat index-of-pages.json | jq --arg url "$url" --arg guide "$guide" --arg subguide "$subguide" '.[] | if (.href | contains('\"$url\"')) then . + {guide: '\"$guide\"', tag: '\"$tag\"', subguide: '\"$subguide\"'} else . end'
done

Thank you for any insight or guidance.


Solution

I don't think INDEX can help here: it builds a lookup table keyed on exact values, whereas you need substring matching.
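As a quick illustration (a sketch using a one-entry key cut down from the question's key.json; the variable names are only for this example), looking up the exact url finds the entry, while looking up a full href finds nothing:

```shell
# INDEX(.url) turns the array into an object keyed by each exact url string.
result=$(jq -nc '
  [{"url": "/guide1/", "guide": "Guide 1"}] | INDEX(.url) as $dict
  | [ $dict["/guide1/"].guide, $dict["articles/guide1/page1.html"] ]
')
echo "$result"   # prints: ["Guide 1",null]
```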

What I'd do instead is this:

sort_by(.url | -length) as $c    # key entries, longest url first
| inputs                         # the next file on the command line
| map(. + (.href as $s
    | first($c[] | select(.url as $ss | $s | index($ss))) // {}))
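Sorting by descending url length matters when more than one key entry matches the same href. A small sketch, assuming just the two overlapping entries from the question's key.json and the about page's href:

```shell
# Both urls occur in the href; the longer, more specific one sorts first,
# so first(...) picks it.
winner=$(jq -nr '
  [ {"url": "/guide1/",      "tag": "how-to"},
    {"url": "/guide1/about", "tag": "about"} ]
  | sort_by(.url | -length) as $c
  | "articles/guide1/about.html" as $s
  | first($c[] | select(.url as $ss | $s | index($ss)))
  | .tag
')
echo "$winner"   # prints: about
```

Without the sort, "/guide1/" would come first in the key and the about page would be tagged how-to.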

In case it's unclear, the jq invocation looks like this, with key.json given first:

jq '...' key.json index-of-pages.json
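For an end-to-end check, here is a sketch that runs the filter on trimmed copies of the question's files (three pages, two key entries). The scratch directory and the merged.json output name are assumptions made for this example only:

```shell
# Work in a scratch directory so nothing in the current directory is overwritten.
cd "$(mktemp -d)"

cat > key.json <<'EOF'
[
  {"url": "/guide1/", "guide": "Guide 1", "tag": "how-to"},
  {"url": "/guide1/about", "guide": "Guide 1", "tag": "about"}
]
EOF
cat > index-of-pages.json <<'EOF'
[
  {"href": "articles/guide1/page1.html", "name": "Page 1", "views": "204"},
  {"href": "articles/guide1/about.html", "name": "Page 6", "views": "103"},
  {"href": "index.html", "name": "Page 7", "views": "400"}
]
EOF

# key.json is read first and bound to $c; inputs then reads index-of-pages.json.
jq 'sort_by(.url | -length) as $c | inputs | map(. + (.href as $s | first($c[] | select(.url as $ss | $s | index($ss))) // {}))' \
  key.json index-of-pages.json > merged.json

# Summarize which tag, if any, each page received.
jq -r '.[] | "\(.name): \(.tag // "no match")"' merged.json
# prints:
#   Page 1: how-to
#   Page 6: about
#   Page 7: no match
```

Pages with no matching key entry pass through unchanged, because the // {} fallback adds an empty object to them.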

Answered By - oguz ismail
Answer Checked By - Willingham (WPSolving Volunteer)