What is the "Extract Information" Operator capable of?

RaffiH · March 2020

I've recently started using RapidMiner for my bachelor thesis. I want to use it to analyze the websites of specific companies, and it would be nice if someone could explain the "Extract Information" operator to me. I don't really understand in which cases I can use it.

Thank you very much in advance!

[Deleted User] · March 2020

@RaffiH
Hello

This operator extracts information from the structured content of a document.

The extracted information is added as metadata to the document and, if desired, can be added as an attribute later. There are several options for specifying which information should be extracted. In String Matching mode you specify a start string and an end string; if both are found in the document, the characters between them are extracted. Regular Expression mode lets you specify any expression and uses the first matching group as the extraction. If it is too difficult to include the intermediate characters in the expression in a well-defined way, you might find Regular Region mode useful, where you define two regular expressions. As in String Matching mode, the first defines the start and the second the end, and everything in between is extracted.

The most sophisticated variant is the XPath mode, where you can enter an arbitrary XPath expression. This proves useful especially when trying to extract information from a website. Since XPath expressions are only available for XML files, you have to make sure that the documents are well-formed XML. This can be ensured by the assume_html parameter of the Document Processing operator, which uses a special parser to correct errors in the HTML.

It is also possible to extract information from a JSON document with a JSONPath expression. As with the XPath mode, you have to make sure that the document provided is a valid JSON document.
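To make the difference between the modes more concrete, here is a rough sketch in plain Python of what each mode does conceptually. This is not RapidMiner code and not how the operator is implemented; the sample document, the function names and the patterns are all made up for illustration. The XPath part uses Python's standard `xml.etree.ElementTree`, which only supports a subset of XPath and requires well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample document (well-formed, XHTML-like markup).
DOC = (
    "<html><body>"
    "<h1>Acme GmbH</h1>"
    "<p class='founded'>Founded: 1998</p>"
    "<p class='staff'>Employees: 250</p>"
    "</body></html>"
)


def between_strings(text, start, end):
    """String Matching idea: characters between a start and an end string."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        return None
    return text[i:j]


def first_group(text, pattern):
    """Regular Expression idea: first capturing group of the first match."""
    m = re.search(pattern, text)
    return m.group(1) if m else None


def regular_region(text, start_pattern, end_pattern):
    """Regular Region idea: text between a start regex and an end regex."""
    s = re.search(start_pattern, text)
    if not s:
        return None
    # Look for the end pattern only after the start match.
    e = re.compile(end_pattern).search(text, s.end())
    if not e:
        return None
    return text[s.end():e.start()]


def xpath_text(xml_text, path):
    """XPath idea: text of the first element matching a (limited) XPath."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    node = root.find(path)
    return node.text if node is not None else None


name = between_strings(DOC, "<h1>", "</h1>")
founded = first_group(DOC, r"Founded:\s*(\d{4})")
staff = regular_region(DOC, r"<p class='staff'>\s*Employees:\s*", r"\s*</p>")
staff_line = xpath_text(DOC, ".//p[@class='staff']")
```

With the sample above, `name` is the company name from the heading, `founded` is the four-digit year, `staff` is just the number, and `staff_line` is the full text of the matching paragraph. In RapidMiner you would enter the equivalent strings or expressions in the operator's parameters instead of writing code.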

regards
mbs

MartinLiebig · March 2020

Hey,

Extract Information is actually one of the hidden gems, because it adds some tools you may want for advanced parsing. Most importantly, it offers you the option to use JSONPath.
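For anyone who has not seen JSONPath before, the following is a minimal Python sketch of the idea: a path such as `$.company.offices[1].city` walks from the document root down to one value. The sample JSON and the helper `json_path_get` are hypothetical, and the helper only handles simple dotted keys and numeric indexes, which is a small fraction of real JSONPath. It is meant to illustrate the concept, not to mirror what the operator does internally.

```python
import json
import re

# Hypothetical sample JSON document.
SAMPLE = (
    '{"company": {"name": "Acme GmbH", '
    '"offices": [{"city": "Berlin"}, {"city": "Munich"}]}}'
)

# One path step: either ".key" or "[index]".
_STEP = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def json_path_get(document, path):
    """Follow a very small JSONPath-like path ($.a.b[0].c) into a JSON string."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    node = json.loads(document)  # raises ValueError if the JSON is invalid
    pos = 1
    while pos < len(path):
        step = _STEP.match(path, pos)
        if step is None:
            raise ValueError(f"unsupported path syntax at position {pos}")
        if step.group(1) is not None:
            node = node[step.group(1)]       # object key
        else:
            node = node[int(step.group(2))]  # array index
        pos = step.end()
    return node


company_name = json_path_get(SAMPLE, "$.company.name")
second_city = json_path_get(SAMPLE, "$.company.offices[1].city")
```

Note that `json.loads` fails on malformed input, which is the same reason the document passed to a JSONPath query has to be valid JSON.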

Best,

Martin

RaffiH · March 2020

@mschmitz
What needs to be done to use those tools?

In my process I first use "Get Page" with a URL. Then I use "Extract Content". After filtering the stopwords I want to use "Extract Information", but I don't know how.

Thank you very much for your answer!

[Deleted User] · March 2020

@RaffiH

Hello

You can find more information in these two links:

https://community.rapidminer.com/discussion/55982/parsing-json-in-rapidminer-using-the-webautomation-extension-by-old-world-computing

https://community.rapidminer.com/discussion/56323/parsing-json-with-owcs-webautomation-extension-extracting-two-or-more-relational-example-sets

I hope this helps
mbs

Altair RapidMiner Community
