What is the "Extract Information" Operator capable of?

RaffiH Member Posts: 9 Learner I
I've recently started using RapidMiner for my bachelor thesis. I want to use RapidMiner to analyze the websites of specific companies, and it would be nice if someone could explain the "Extract Information" operator to me. I don't really understand in which cases I can use it.

Thank you very much in advance!

Answers

  • [Deleted User] Posts: 0 Learner III
    edited March 2020
    @RaffiH
    Hello

    This operator extracts information from the structured content of a document.

    The extracted information is added as metadata to the document and can later be added as an attribute if desired. There are several ways to specify what should be extracted. In String Matching mode you specify a start string and an end string; if both are found in the document, the characters between them are extracted. Regular Expression mode lets you specify any expression and uses the first matching group as the extracted value. If it is too difficult to include the intermediate characters in the expression in a well-defined way, Regular Region mode may be useful: you define two regular expressions, and as in String Matching mode, the first marks the start, the second marks the end, and everything in between is extracted. The most sophisticated variant is XPath mode, where you can enter an arbitrary XPath expression. This is especially useful when extracting information from a website. Since XPath expressions only work on XML, you have to make sure the documents are well-formed XML. This can be ensured with the assume_html parameter of the Document Processing operator, which uses a special parser to correct errors in the HTML. It is also possible to extract information from a JSON document with a JSONPath expression. As with XPath mode, you have to make sure the document provided is valid JSON.


    regards
    mbs
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,510 RM Data Scientist
    Hey,
    Extract Information is actually one of the hidden gems, because it adds some tools you may want for advanced parsing. Most importantly, it offers you the option to use JSONPath.

    Best,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • RaffiH Member Posts: 9 Learner I
    @mschmitz
    What needs to be done to use those tools? 

    In my process I first use "Get Page" with a URL. Then I use "Extract Content". After filtering the stopwords I want to use "Extract Information", but I don't know how.

    Thank you very much for your answer!
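
To make the extraction modes described in the first answer more concrete, here is a minimal Python sketch of what each mode does conceptually. This is not RapidMiner code and not the operator's implementation; in RapidMiner the modes are set through the operator's parameters. The helper names, the sample page, and the company "Acme GmbH" are all made up for illustration. The JSON helper only follows simple dotted paths and is a stand-in for a real JSONPath engine, and Python's ElementTree supports only a subset of XPath.

```python
import json
import re
import xml.etree.ElementTree as ET


def between_strings(text, start, end):
    """String Matching: characters between a start string and an end string."""
    i = text.find(start)
    if i == -1:
        return None
    i += len(start)
    j = text.find(end, i)
    return None if j == -1 else text[i:j]


def first_group(text, pattern):
    """Regular Expression: first capturing group of the first match."""
    m = re.search(pattern, text)
    return m.group(1) if m else None


def regular_region(text, start_pattern, end_pattern):
    """Regular Region: text between a start regex and an end regex."""
    s = re.search(start_pattern, text)
    if not s:
        return None
    e = re.compile(end_pattern).search(text, s.end())
    return text[s.end():e.start()] if e else None


def xpath_text(xml_text, path):
    """XPath: requires well-formed XML; ElementTree handles a subset of XPath."""
    node = ET.fromstring(xml_text).find(path)
    return node.text if node is not None else None


def json_dotted_path(json_text, path):
    """Stand-in for JSONPath: follows a dotted path such as '$.company.name'."""
    value = json.loads(json_text)
    for key in path.lstrip("$").strip(".").split("."):
        value = value[key]
    return value


# Hypothetical, already well-formed page content (assumption for this sketch).
page = "<html><body><h1>Acme GmbH</h1><p>Founded: 1999 in Dortmund</p></body></html>"

print(between_strings(page, "<h1>", "</h1>"))        # Acme GmbH
print(first_group(page, r"Founded: (\d{4})"))        # 1999
print(regular_region(page, r"<p>\s*", r"\s*</p>"))   # Founded: 1999 in Dortmund
print(xpath_text(page, "./body/h1"))                 # Acme GmbH
print(json_dotted_path('{"company": {"name": "Acme GmbH"}}', "$.company.name"))  # Acme GmbH
```

Note that the XPath and string-based examples run on the raw markup. If an earlier step has already stripped the HTML tags, only the string and regular-expression styles of extraction still have something to match against.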