Issue
I want to parse an HTML page with bash and extract the link text from every tr element whose class is error, like the one below.
<tr class="error">
<td>
<a href="https://exmple.com">Test failed for AAA</a>
</td>
<td>1</td>
<td>1</td>
<td>0</td>
<td>1</td>
<td>0</td>
<td>0.0%</td>
<td>3.640 seconds</td>
</tr>
The desired output is "Test failed for AAA".
I tried a few things with sed, but they did not work as expected and returned empty values.
Any input would be helpful.
Solution
As always, using line-based, regular-expression-oriented tools to work with tree-structured documents like HTML and XML is the wrong approach. Use tools that understand the format: they are much easier to use, less error-prone, and simpler to maintain when the input data changes in the future.
For example, using xmllint and an XPath query:
$ xmllint --html --xpath '//tr[@class="error"]/td[1]/a/text()' input.html
Test failed for AAA
Or with W3C's HTML-XML Utils package and CSS selectors:
$ hxselect -c 'tr.error td:first-child a' < input.html
Test failed for AAA
(These commands might not print a trailing newline at the end of their output, which can be confusing when they are run interactively rather than with the result captured in a variable.)
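In a script, the trailing-newline question mostly goes away, because command substitution strips trailing newlines from whatever it captures. The following is a minimal sketch, not a definitive implementation. The file name input.html, the extra sample rows, and the variable names are assumptions made for illustration. It also assumes that hxselect's -s option prints the given separator after each match, and it skips a tool that is not installed.

```shell
#!/bin/sh
# Hypothetical sample input: two failing rows and one passing row.
cat > input.html <<'EOF'
<html><body><table>
<tr class="error"><td><a href="https://exmple.com">Test failed for AAA</a></td><td>1</td></tr>
<tr class="pass"><td><a href="https://exmple.com">Test passed for CCC</a></td><td>0</td></tr>
<tr class="error"><td><a href="https://exmple.com">Test failed for BBB</a></td><td>1</td></tr>
</table></body></html>
EOF

# Single value: $(...) strips trailing newlines, so the captured text is
# clean whether or not the tool printed one. The (...)[1] wrapper limits
# the XPath query to the first matching link.
if command -v xmllint >/dev/null 2>&1; then
    first=$(xmllint --html --xpath '(//tr[@class="error"]/td[1]/a)[1]/text()' input.html 2>/dev/null)
    printf 'first failure: %s\n' "$first"
fi

# Several values: ask hxselect to emit a newline after each match (-s),
# then read the results one line at a time.
if command -v hxselect >/dev/null 2>&1; then
    hxselect -c -s '\n' 'tr.error td:first-child a' < input.html |
    while IFS= read -r line; do
        printf 'failure: %s\n' "$line"
    done
fi
```

Using an explicit separator avoids depending on how a particular tool version joins multiple matches.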
Answered By - Shawn
Answer Checked By - Cary Denson (WPSolving Admin)