Friday, February 11, 2022

[SOLVED] How can I pdfgrep a pdf so that only bold matches are shown?

February 11, 2022 grep, pdf

Issue

I am trying to list all occurrences of a bold string in a pdf with its page number. But I don't want to list those occurrences where it is not bold.

So far I have:

pdfgrep -n -o "String" Input.pdf

But I don't know how to catch the bold aspect...

Link to the pdf: https://ilarisblog.files.wordpress.com/2021/07/ilaris.pdf (direct download, not my website)

Solution

If you are lucky, in some rare cases a tool will tell you that page x uses fonts such as a CID Bold and a CID Normal, i.e. they could be different fonts or different thicknesses. Let's take one example. It is contrived, but not that uncommon, and it illustrates several points.

There are command-line tools that dig into the fonts and text and report fine detail:

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
CIDFont+F1                           CID TrueType      yes no  yes     11  0

[List All Fonts], The number of fonts in this PDF file is: 1
CIDFont+F1                           CID TrueType      yes no  yes     11  0

    line:
      word: x=42.48..51.65 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: '1.'
      word: x=54.78..97.35 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'Surgical'
      word: x=100.55..133.15 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'rooms'
      word: x=136.37..158.21 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'and'
      word: x=161.46..203.55 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'services'
      word: x=206.77..212.91 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: '–'
      word: x=217.78..229.70 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'To'
      word: x=232.69..272.34 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'include'
      word: x=275.41..316.39 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'surgical'
      word: x=319.33..347.75 y=103.29..121.01 base=115.32 fontSize=11.04 space=0: 'suites'

line: x=42.48..347.75 y=103.29..121.01 base=115.32 '1. Surgical rooms and services – To include surgical suites'

However, none of that shows the difference midway through the line.

There is only one font here (F1), and all the text is one stream in 3 parts.

Looking closely at the second line, the thicker-looking glyphs on the left are 1 point thick, but so is the rest of that line, and it uses exactly the same font. The font name is CIDFont+F1 throughout. I could have let it use another name, but the point is that neither half of that line is "Bold". Well, what about that 1 point border width, can I test for that? No, both halves of line 2 have a 1 point border width. The difference is that the darker half is more opaque (stroke opacity is 100%) than the right half, so it just appears to be thicker. There is no way you could grep that externally without using a PDF library to disassemble the whole font structure and its application.

I am not saying you can't use an extraction tool to extract the text as "BOLDER" and "not so bold", or to report that some text has a font name with bold in it. But what use is a plain text stream that says some letters may be bold and some may not? You can see that from an image extraction.

You will need a library that can analyse the various ways bolder text can be rendered, then use that together with a plain-text grepper.

[Later Edit]

You provided a complex example where the fonts certainly are defined with many different names, for example MinionPro-Bold, so there is some hope. However, that style is applied to every page number, so we can say every page (even the blank one) has BOLD :-) How to pull out that page number and any other text in that font is well covered by many examples on SO, most using Python (often PDFMiner) with mixed results, but I would not use that as a first choice unless you know it well.

There may well be conflicting reports from some extractors, since the primary blocks of "body text" are reported overall as bold containers even when most of the text inside is not. They may need a secondary division into subgroups, perhaps discarding whatever is classed as MinionPro-Regular and keeping Aniron-Bold-SC700.
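
As a rough illustration of that secondary filtering, here is a minimal Python sketch. It assumes some extractor has already produced one record per text run with its page number and font name. The `Span` record, the `bold_matches` helper and the page numbers in the sample data are all hypothetical; only the font names come from the PDF discussed above.

```python
from typing import Iterable, List, NamedTuple


class Span(NamedTuple):
    """Hypothetical record: one run of text and the font it was drawn in."""
    page: int
    font: str
    text: str


def bold_matches(spans: Iterable[Span], needle: str) -> List[Span]:
    """Keep spans whose font name contains 'bold' and whose text contains needle."""
    return [
        s for s in spans
        if "bold" in s.font.lower() and needle in s.text
    ]


# Made-up sample data standing in for real extractor output.
spans = [
    Span(3, "MinionPro-Regular", "a String in body text"),
    Span(3, "MinionPro-Bold", "String"),
    Span(7, "Aniron-Bold-SC700", "Heading"),
]

for s in bold_matches(spans, "String"):
    print(f"{s.page}: {s.text}")
```

Note that this only tests the font name, so it would not catch text that merely looks bolder, such as the stroke-opacity example earlier.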

So what's the best answer?

Well, you need a format where inline changes of font are the norm, such as XML or HTML, and those can be found in a conversion to FB2 or ePub. So my first approach would be to convert to ePub, extract the HTML pages and grep those. There are many ways to convert PDF to ePub or HTML from the command line.
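
Once you have HTML, the grep step could look something like the following sketch, which uses only Python's standard `html.parser`. It assumes the converter marks bold text with `<b>` or `<strong>` tags; many converters use CSS classes or inline styles instead, in which case the tag test would need adapting. The names `BoldTextCollector` and `bold_runs_containing` are mine, not from any tool.

```python
from html.parser import HTMLParser
from typing import List

# Assumption: the converter emits these tags for bold text.
BOLD_TAGS = {"b", "strong"}


class BoldTextCollector(HTMLParser):
    """Collect the text of each outermost bold element."""

    def __init__(self) -> None:
        super().__init__()
        self._depth = 0                # nesting depth inside bold tags
        self._buffer: List[str] = []   # text pieces of the current bold run
        self.runs: List[str] = []      # completed bold runs

    def handle_starttag(self, tag, attrs):
        if tag in BOLD_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in BOLD_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.runs.append("".join(self._buffer))
                self._buffer.clear()

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def bold_runs_containing(html: str, needle: str) -> List[str]:
    """Return the bold runs in html whose text contains needle."""
    collector = BoldTextCollector()
    collector.feed(html)
    collector.close()
    return [run for run in collector.runs if needle in run]


sample = "<p>A String here, <b>a bold String</b> there, <strong>other</strong>.</p>"
print(bold_runs_containing(sample, "String"))
```

If the conversion writes one HTML file per page, running this per file would also give you the page number for each match.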

Answered By - K J

Answer Checked By - Marilyn (WPSolving Volunteer)

This Answer collected from stackoverflow and tested by PythonFixing community admins, is licensed under cc by-sa 2.5 , cc by-sa 3.0 and cc by-sa 4.0
