Monday, April 4, 2022

[SOLVED] How to grep to get multiple lines with a specific format, but if one of the lines has a specific word, do not include the lines as a result

Issue

I have a large directory of files, which I need to look through for specific lines, because they need to be updated.

The format I am looking for always starts with <topicref, followed by an href attribute whose value begins with ../ and continues with some text. For example: href="../example.md". After that, it might have scope="peer" and some other lines, and it ends with either > or />.

So far, I've come up with a regex that finds the lines I want:

pcregrep -HnM '<topicref(.*) href="..\/(.*).dita(.*)[^>]*'

However, I'm having trouble filtering out the results that have scope="peer". I tried doing

pcregrep -HnM '<topicref(.*) href="..\/(.*).dita(.*)[^>]*' directory | pcregrep -Mv 'scope="peer"' > file

But this only removes the individual lines that contain scope="peer" from the output of the first pcregrep. The remaining lines of those multi-line matches stay in the results even though they shouldn't be included, and I am also unable to track which files the results came from.

Is it possible to see all the <topicref href="../... > mentions without scope="peer"?

Three examples of lines with scope="peer":

<topicref href="../cat.md" scope="peer"
something />

<topicref href="../cat.md"
something scope="peer"
something />

<topicref href="../cat.md"
scope="peer"
something></topicref><map>

Solution

You can use

pcregrep -HnM '<topicref(?![^>]*\sscope="peer")(?:\s[^>]+)?\shref="\.\./([^"]*)\.dita[^>]*>' file

Details

  • <topicref - a literal string
  • (?![^>]*\sscope="peer") - a negative lookahead that fails the match if, immediately to the right of the current position, there are zero or more chars other than > followed by a whitespace and scope="peer"
  • (?:\s[^>]+)? - an optional sequence of a whitespace and one or more chars other than >
  • \shref="\.\./ - whitespace, href="../ string
  • ([^"]*) - Group 1: zero or more chars other than "
  • \.dita - .dita string (replace with \.md if you need to match .md)
  • [^>]*> - zero or more chars other than > and then a >.
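
To check the pattern against the three samples from the question, here is a minimal sketch using Python's re module, which supports the same lookahead and group syntax used here. It assumes \.md instead of \.dita so that it fits the sample data. The fourth sample (dog.md) and the helper name find_targets are made up for illustration and are not from the question.

```python
import re

# Same pattern as in the answer, with \.md swapped in for \.dita
# (an assumption, so that the question's sample data can match).
PATTERN = re.compile(
    r'<topicref(?![^>]*\sscope="peer")(?:\s[^>]+)?\shref="\.\./([^"]*)\.md[^>]*>'
)

# The first three samples come from the question; the last one is hypothetical.
samples = {
    "peer on first line": '<topicref href="../cat.md" scope="peer"\nsomething />',
    "peer on second line": '<topicref href="../cat.md"\nsomething scope="peer"\nsomething />',
    "peer on its own line": '<topicref href="../cat.md"\nscope="peer"\nsomething></topicref><map>',
    "no peer (hypothetical)": '<topicref href="../dog.md"\nsomething />',
}


def find_targets(text):
    """Return the Group 1 captures (href target without ../ and .md)."""
    return [m.group(1) for m in PATTERN.finditer(text)]


for label, text in samples.items():
    print(f"{label}: {find_targets(text)}")
```

Since [^>] also matches line breaks, the lookahead scans the whole tag up to its closing > no matter how many lines it spans, so only the tag without scope="peer" should be reported.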


Answered By - Wiktor Stribiżew
Answer Checked By - Senaida (WPSolving Volunteer)