Issue
Say I have the simplified XML file below and need to extract all of the string data that is within the <innerElement> and </innerElement> tags, only for the Id 1234.
<outerTag>
<innerElement>
<Id>1234</Id>
<fName>Kim</fName>
<lName>Scott</lName>
<customData1>Value1</customData1>
<customData2>Value2</customData2>
<position>North</position>
<title/>
</innerElement>
<innerElement>
<Id>5678</Id>
<fName>Brian</fName>
<lName>Davis</lName>
<customData3>value3</customData3>
<customData4>value4</customData4>
<customData5>value5</customData5>
<position>South</position>
<title/>
</innerElement>
</outerTag>
My expected output is:
<innerElement>
<Id>1234</Id>
<fName>Kim</fName>
<lName>Scott</lName>
<customData1>Value1</customData1>
<customData2>Value2</customData2>
<position>North</position>
<title/>
</innerElement>
Based on what I have read in other posts, I tried using grep -z to match multiline strings (treating the file contents as a single line) and -o to print only the matched part. But when I use the .* wildcard after the Id element, it ends up matching everything up to the end of the file instead of stopping at the first occurrence.
grep -zo '<innerElement>.*<Id>1234</Id>.*</innerElement>' myfile.xml
How can I make the pattern match only up to the first occurrence of the closing tag after Id 1234?
Solution
Don't use sed nor regex to parse XML
You cannot, and must not, parse structured text like XML/HTML with tools designed to process raw text lines. If you need to process XML/HTML, use an XML/HTML parser. The great majority of languages have built-in support for parsing XML, and there are dedicated tools like xidel, xmlstarlet or xmllint if you need a quick shot from a command-line shell.
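As an illustration of the built-in language support mentioned above, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The sample document is copied from the question; the function name find_inner_element is hypothetical and only for illustration. Note that ElementTree may serialize the empty element as <title /> rather than <title/>, which is equivalent XML.

```python
import xml.etree.ElementTree as ET

# Sample document from the question, inlined so the sketch is self-contained.
XML = """<outerTag>
<innerElement>
<Id>1234</Id>
<fName>Kim</fName>
<lName>Scott</lName>
<customData1>Value1</customData1>
<customData2>Value2</customData2>
<position>North</position>
<title/>
</innerElement>
<innerElement>
<Id>5678</Id>
<fName>Brian</fName>
<lName>Davis</lName>
<customData3>value3</customData3>
<customData4>value4</customData4>
<customData5>value5</customData5>
<position>South</position>
<title/>
</innerElement>
</outerTag>"""


def find_inner_element(xml_text, wanted_id):
    """Return the serialized <innerElement> whose <Id> child equals wanted_id.

    Hypothetical helper for illustration; returns None when nothing matches.
    """
    root = ET.fromstring(xml_text)
    # Walk every <innerElement> and compare the text of its <Id> child.
    for element in root.iter("innerElement"):
        if element.findtext("Id") == wanted_id:
            # strip() drops the trailing whitespace that follows the closing tag.
            return ET.tostring(element, encoding="unicode").strip()
    return None


result = find_inner_element(XML, "1234")
print(result)
```

Because the parser works on the element tree rather than on raw text, the match cannot spill over into the next <innerElement>, which is the problem the greedy .* pattern runs into.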
With a proper XML parser
Using xidel
xidel --xml -e '//innerElement[Id="1234"]' file.xml
Using xmlstarlet
xmlstarlet sel -t -c '//innerElement[Id="1234"]' file.xml
Using xmllint
xmllint --xpath '//innerElement[Id="1234"]' file.xml
Output
<innerElement>
<Id>1234</Id>
<fName>Kim</fName>
<lName>Scott</lName>
<customData1>Value1</customData1>
<customData2>Value2</customData2>
<position>North</position>
<title/>
</innerElement>
Answered By - Gilles Quénot Answer Checked By - Mary Flores (WPSolving Volunteer)