What is the "Extract Information" Operator capable of?
I've recently started using RapidMiner for my bachelor thesis. I want to use it to analyze the websites of specific companies, and it would be nice if someone could explain the "Extract Information" operator to me. I don't really understand in which cases I can use it.
Thank you very much in advance!
2
Answers
Hello
This operator extracts information from a document with structured content.
The extracted information is added to the document as meta data and, if desired, can be added as an attribute later. There are several options for specifying which information should be extracted. In String Matching mode you specify a start string and an end string; if both are found in the document, the characters between them are extracted. Regular Expression mode lets you specify any expression and uses the first matching group as the extraction. If it is too difficult to include the intermediate characters in the expression in a well-defined way, you might find Regular Region mode useful, where you define two regular expressions. As in String Matching mode, the first defines the start and the second the end, and everything in between is extracted. The most sophisticated variant is XPath mode, where you can enter an arbitrary XPath expression. This proves useful especially when trying to extract information from a website. Since XPath expressions are only available for XML files, you have to make sure that the documents are well-formed XML. This can be ensured by the assume_html parameter of the Document Processing operator, which uses a special parser to correct errors in the HTML. It is also possible to extract information from a JSON document with a JSONPath expression. As with XPath mode, you have to make sure that the document provided is valid JSON.
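To make the difference between the modes more concrete, here is a rough Python sketch of what each one does conceptually. This is not RapidMiner code and not how the operator is implemented; the function names, the sample page and the sample JSON record are all made up for illustration. The XPath part uses the limited XPath subset of Python's standard library, and the JSON part is a plain key lookup standing in for a real JSONPath expression.

```python
import json
import re
import xml.etree.ElementTree as ET


def between_strings(text, start, end):
    """String Matching analog: characters between a start and an end string."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return text[i:j] if j != -1 else None


def first_group(text, pattern):
    """Regular Expression analog: the first matching group of the pattern."""
    match = re.search(pattern, text)
    return match.group(1) if match else None


def regular_region(text, start_pattern, end_pattern):
    """Regular Region analog: everything between two regex matches."""
    start = re.search(start_pattern, text)
    if not start:
        return None
    rest = text[start.end():]
    end = re.search(end_pattern, rest)
    return rest[:end.start()] if end else None


def xpath_text(xml_text, path):
    """XPath analog: only works if the input is well-formed XML."""
    node = ET.fromstring(xml_text).find(path)
    return node.text if node is not None else None


def json_lookup(json_text, keys):
    """Stand-in for JSONPath: walk a list of keys through valid JSON."""
    value = json.loads(json_text)
    for key in keys:
        value = value[key]
    return value


# Hypothetical inputs, invented for this example.
page = "<html><body><h1>Example GmbH</h1><p>Founded: 1999 in Berlin</p></body></html>"
record = '{"company": {"name": "Example GmbH", "employees": 120}}'

print(between_strings(page, "<h1>", "</h1>"))    # Example GmbH
print(first_group(page, r"Founded: (\d{4})"))    # 1999
print(regular_region(page, r"<p>", r"</p>"))     # Founded: 1999 in Berlin
print(xpath_text(page, "./body/h1"))             # Example GmbH
print(json_lookup(record, ["company", "name"]))  # Example GmbH
```

Note that the XPath call would raise a parse error on typical real-world HTML that is not well-formed, which is the same reason the operator needs the document to be cleaned up first.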
regards
mbs
Dortmund, Germany
What needs to be done to use those tools?
In my process I first use "Get Page" with a URL. Then I use "Extract Content". After filtering the stopwords I want to use "Extract Information", but I don't know how.
Thank you very much for your answer!
Hello
You can find more information in these two links:
https://community.rapidminer.com/discussion/55982/parsing-json-in-rapidminer-using-the-webautomation-extension-by-old-world-computing
https://community.rapidminer.com/discussion/56323/parsing-json-with-owcs-webautomation-extension-extracting-two-or-more-relational-example-sets
I hope this helps
mbs