Extracting coloured parts of my document

Raphael2304 · April 2021

Hey all

I have a sample of approx. 15,000 press releases in which different parts are coloured. Is there a way to extract these parts into a table where every release is a row and the different coloured parts are the columns?

Thanks a lot in advance and have a great weekend!

Best regards from Germany
Raphael

kayman · April 2021

Hi @Raphael2304, what exactly do you mean by coloured parts? Are these coloured fields in your original Excel file (for instance from conditional formatting), or different colours in RapidMiner?

In case of the latter, it means RapidMiner considers these to be special attributes (like an ID or a label). You can deal with this either by selecting 'include special attributes' in the operators that are sensitive to this, or by making them regular attributes with the Set Role operator.

If the colours are in the source data, please share a sample so we can get a better understanding.

Raphael2304 · April 2021

Hi @kayman, the parts of interest in my Word file are indeed just highlighted (the background colour is different). I attached a file with two different press releases so you can see what I mean. I would love to extract the yellow parts from each release into, e.g., an Excel file and (if possible) put every release in its own row with the information in columns. Do you think this would be possible somehow?

Thanks a lot in advance!

kayman · April 2021

Hi @Raphael2304, there is no simple way to do this, as PDFs store their content a bit differently compared to, for instance, Excel.

The normal Read Document operator will therefore not help, as by default it strips all of the layout from the PDF and leaves you with the text only.

So this leaves it to patterns: if there is a designated word or sentence in the yellow part, it becomes relatively easy. At first glance it appears that all paragraphs that contain the word Rorsted are marked, so there is your pattern.

-> Load your text, split it into paragraphs (basically, wherever there are one or more empty lines between text), and if a paragraph contains Rorsted it was yellow, otherwise white. I've attached a basic example that comes close, to give you the basic idea.
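For anyone who prefers to try the pattern idea outside RapidMiner, here is a minimal Python sketch of the same steps (this is not the attached process). It assumes each release is already available as plain text, that paragraphs are separated by blank lines, and that the keyword really identifies the yellow paragraphs. The function names, the release id and the sample text are made up for illustration.

```python
import re


def split_paragraphs(text):
    """Split text on one or more blank lines and normalise whitespace."""
    parts = re.split(r"\n\s*\n", text.strip())
    return [" ".join(p.split()) for p in parts if p.strip()]


def marked_paragraphs(text, keyword="Rorsted"):
    """Return paragraphs containing the keyword (assumed to be the yellow ones)."""
    return [p for p in split_paragraphs(text) if keyword.lower() in p.lower()]


def to_row(release_id, text, keyword="Rorsted"):
    """Build one table row: the release id, then one column per marked paragraph."""
    return [release_id] + marked_paragraphs(text, keyword)


# Made-up sample text standing in for one press release.
sample = (
    "Company reports a strong quarter.\n"
    "\n"
    "Rorsted said the results were solid.\n"
    "\n\n"
    "Further details follow next week.\n"
)

row = to_row("release_001", sample)
print(row)
```

Rows built this way could then be written out with Python's csv module, one row per release, and opened in Excel.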

If this is not the case and the marking is random, it becomes a whole lot more complex and you're left with 2 other options (that I'm aware of), both pretty advanced and both requiring Python knowledge.

- PDFs are not literally XML, but behind the scenes they build your page out of positioned boxes that carry the location and part of the markup. So, oversimplified, the code behind the document will be something like 'box with x-y coordinates containing yellow overlay with text'. There are quite a few Python PDF-to-text packages that can deal with this; pdfminer is one of them. It can convert the PDF to an XML representation, and then you can use XPath to separate the colored areas from the non-colored areas. If you're familiar with these you can basically do whatever you want with PDFs, but this works best with continuous text as in your sample.

- Another option would be to use computer vision (like OpenCV), where you split your PDF into smaller pieces based on the background. So if you have a PDF starting with a white background square, followed by for instance yellow, white, yellow, white and so on, you could split these and deal with them separately.
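To make the 'box with x-y coordinates' idea from the first option a bit more concrete, here is a rough Python sketch of only the matching step. It assumes a layout extractor such as pdfminer has already given you the text lines with their bounding boxes plus the bounding boxes of the yellow rectangles; how you obtain those is not shown here. The coordinates, the `min_fraction` threshold and all names are illustrative assumptions.

```python
from typing import List, Tuple

# (x0, y0, x1, y1) in page coordinates, assumed to come from a layout extractor.
Box = Tuple[float, float, float, float]


def overlap_area(a: Box, b: Box) -> float:
    """Area of the intersection of two boxes, 0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def is_highlighted(text_box: Box, highlights: List[Box], min_fraction: float = 0.5) -> bool:
    """True if a single highlight rectangle covers at least min_fraction of the text box."""
    area = (text_box[2] - text_box[0]) * (text_box[3] - text_box[1])
    if area <= 0:
        return False
    return any(overlap_area(text_box, h) / area >= min_fraction for h in highlights)


def highlighted_text(lines, highlights, min_fraction=0.5):
    """Keep only the text lines that sit on top of a highlight rectangle."""
    return [text for text, box in lines if is_highlighted(box, highlights, min_fraction)]


# Made-up layout data for one page: (text, bounding box) pairs.
lines = [
    ("Headline of the release", (50.0, 700.0, 400.0, 715.0)),
    ("Rorsted said ...", (50.0, 650.0, 400.0, 665.0)),
    ("Contact details", (50.0, 600.0, 400.0, 615.0)),
]
yellow_rects = [(48.0, 648.0, 402.0, 667.0)]

result = highlighted_text(lines, yellow_rects)
print(result)
```

The threshold is there because highlight rectangles rarely line up exactly with the text boxes; 0.5 is just a starting guess and would need tuning on real files.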

ceaperez · April 2021

Hi @Raphael2304,

Also check these resources. I have worked with them with good results:

Extract annotations and highlighted passages from PDF files - Steve Powell's blog (pogol.net)
How To: Extract Highlighted Text from a PDF File | francisco morales
GitHub - Samathy/pdfcommentextractor: Extracts highlighted text from PDF documents.

Regards

Altair RapidMiner Community
