Issue
I am trying to list all occurrences of a bold string in a PDF, together with the page number. I don't want to list the occurrences where the string is not bold. So far I have:
pdfgrep -n -o "String" Input.pdf
But I don't know how to catch the bold aspect...
Link to the pdf: https://ilarisblog.files.wordpress.com/2021/07/ilaris.pdf (direct download, not my website)
Solution
In some lucky cases you can tell that page x uses fonts such as a CID Bold and a CID Normal, that is, different fonts or different thicknesses. Let's take one example. It is contrived, but not that uncommon, and it illustrates several points.
There are command-line tools that dig into the fonts and text and report fine detail:
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
CIDFont+F1                           CID TrueType      yes no  yes     11  0
[List All Fonts], The number of fonts in this PDF file is: 1
CIDFont+F1 CID TrueType yes no yes 11 0
line:
word: x=42.48..51.65 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: '1.'
word: x=54.78..97.35 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'Surgical'
word: x=100.55..133.15 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'rooms'
word: x=136.37..158.21 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'and'
word: x=161.46..203.55 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'services'
word: x=206.77..212.91 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: '–'
word: x=217.78..229.70 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'To'
word: x=232.69..272.34 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'include'
word: x=275.41..316.39 y=103.29..121.01 base=115.32 fontSize=11.04 space=1: 'surgical'
word: x=319.33..347.75 y=103.29..121.01 base=115.32 fontSize=11.04 space=0: 'suites'
line: x=42.48..347.75 y=103.29..121.01 base=115.32 '1. Surgical rooms and services – To include surgical suites'
However, none of that shows the change midway through the line.
There is only one font here (F1), and all the text is one stream in 3 parts.
Looking closely, the thicker-looking glyphs on the left of the second line have a 1 point border width, but so does the rest of that line, and it all uses exactly the same font. The font name is CIDFont+F1 throughout. I could have let it use another name, but the point is that neither half of that line is "bold". What about that 1 point border width, can I test for that? No: both halves of line 2 have a 1 point border width. The difference is that the darker half is more opaque (stroke opacity is 100%) than the right half, so it just appears thicker. There is no way to grep for that externally without using a PDF library to disassemble the whole font structure and how it is applied.
I am not saying you can't use an extraction tool to extract the text as "BOLDER" and "not so bold", or to report that some text uses a font with "bold" in its name. But what use is a plain text stream that says some letters may be bold and some may not? You can see that much from an image extraction.
You will need a library that can analyse the various ways bolder text can be produced, then use that together with a plain-text grepper.
[Later Edit]
You provided a complex example in which the fonts are defined under many different names, for example MinionPro-Bold, so there is some hope. However, that style is applied to every page number, so we can say every page (even the blank one) has BOLD :-) How to pull out the page number and any other text with it is well covered by many examples on SO, most of them using Python (often PDFMiner) with mixed results. I would not make that my first choice unless you know it well.
Some extractors may well give conflicting reports, since the primary blocks of "body text" are reported overall as bold containers even when most of the text inside is not. They may therefore need a secondary division into sub-groups, perhaps discarding whatever is classed as MinionPro-Regular and keeping Aniron-Bold-SC700.
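As a rough sketch of that secondary division, the filter below keeps only spans whose font looks bold and then greps them for the string. It assumes PyMuPDF as the extraction library (my choice for illustration, not something tested against the linked PDF), and the helper names and sample spans are made up. It only catches bold that is expressed through the font name or font flags, so it would miss the stroke/opacity trick in the contrived example above.

```python
import re

BOLD_FLAG = 16  # bit 4 of PyMuPDF's span "flags" marks a bold font


def is_bold_span(font_name, flags=0):
    """Heuristic: bold if the font name says so or the bold flag is set."""
    return "bold" in font_name.lower() or bool(flags & BOLD_FLAG)


def bold_hits(spans, needle):
    """Yield (page, font, text) for bold spans that contain `needle`.

    `spans` is an iterable of (page_number, font_name, flags, text) tuples.
    """
    pattern = re.compile(re.escape(needle))
    for page, font, flags, text in spans:
        if is_bold_span(font, flags) and pattern.search(text):
            yield page, font, text


def spans_from_pdf(path):
    """Read spans with PyMuPDF (assumed installed: pip install pymupdf)."""
    import fitz  # imported lazily so the filter above works without it

    with fitz.open(path) as doc:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no lines
                    for span in line["spans"]:
                        yield page.number + 1, span["font"], span["flags"], span["text"]


# Hypothetical spans, standing in for spans_from_pdf("Input.pdf")
sample = [
    (3, "MinionPro-Regular", 0, "String in body text"),
    (3, "MinionPro-Bold", 16, "String as a heading"),
    (7, "Aniron-Bold-SC700", 0, "Another String"),
]
for hit in bold_hits(sample, "String"):
    print(hit)
```

With a real file you would pass `spans_from_pdf("Input.pdf")` instead of `sample`. A string that is split across several spans will not match, so treat the result as a starting point.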
So what's the best answer?
You need a format where inline font changes are the norm, such as XML or HTML. Those can be had by converting to FB2 or ePub, so my first approach would be to convert to ePub, extract the HTML pages and grep those. There are many ways to convert PDF to ePub or HTML from the command line.
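Once the pages are HTML, the grep can be limited to bold runs. The sketch below uses only the Python standard library. It assumes the converter marks bold with <b> or <strong> tags (some use CSS classes instead, which this would miss) and that you have one HTML string per page; the `pages` mapping and its contents are made up for illustration.

```python
from html.parser import HTMLParser

BOLD_TAGS = {"b", "strong"}  # assumption: the converter marks bold with these tags


class BoldText(HTMLParser):
    """Collect the text found inside <b> or <strong> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # how many bold elements we are currently inside
        self.chunks = []  # one entry per outermost bold run

    def handle_starttag(self, tag, attrs):
        if tag in BOLD_TAGS:
            self.depth += 1
            if self.depth == 1:
                self.chunks.append("")

    def handle_endtag(self, tag):
        if tag in BOLD_TAGS and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks[-1] += data


def bold_matches(pages, needle):
    """Return (page, bold_text) pairs for bold runs that contain `needle`.

    `pages` maps a page number to that page's HTML.
    """
    hits = []
    for number, html in sorted(pages.items()):
        parser = BoldText()
        parser.feed(html)
        parser.close()
        hits.extend((number, text) for text in parser.chunks if needle in text)
    return hits


# Hypothetical converter output: page 1 has the string in regular text only
pages = {
    1: "<p>String in regular text</p>",
    2: "<p><b>String</b> in bold, then String again</p>",
}
print(bold_matches(pages, "String"))
```

For the sample above this reports only the bold occurrence on page 2 and ignores both non-bold ones.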
Answered By - K J Answer Checked By - Marilyn (WPSolving Volunteer)