How to read a text in Rapid Miner after

Karun Member Posts: 13 Newbie
Hi,
I am new to RapidMiner. I need to parse a web page (which I am able to do) and then read the content that comes after a certain phrase. For example, the web page contains:

Heading

Paragraph 1 

Automobiles stocks are A1,A2,A3,A4.

I want to read A1,A2,A3,A4, which comes after the string "Automobiles stocks are".

Please help!!

Thanks

Answers

  • kayman Member Posts: 662 Unicorn
    The most efficient way would be to use XPath, as this will allow you to pick out the content of your tags cleanly.

    If you have no experience with XPath (it's less complex than it looks), you have the option to use regex in combination with Generate Attributes.

    In both cases you just open your crawled webpage with the Read Document operator. For XPath you keep the tags; for regex you might be better off selecting 'text only' in the operator's settings.
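To illustrate the XPath route outside RapidMiner, here is a minimal Python sketch. It assumes a small, well-formed stand-in for the crawled page (the `PAGE` string and the helper name `stocks_via_xpath` are made up for illustration). Real pages are often not well-formed XML and would need an HTML-aware parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the crawled page (an assumption).
PAGE = """<html><body>
<h1>Heading</h1>
<p>Paragraph 1</p>
<p>Automobiles stocks are A1,A2,A3,A4.</p>
</body></html>"""

PREFIX = "Automobiles stocks are "


def stocks_via_xpath(page):
    """Return the comma-separated items that follow PREFIX in any <p> tag."""
    root = ET.fromstring(page)
    # ".//p" is an XPath expression: every <p> element at any depth.
    for paragraph in root.findall(".//p"):
        text = (paragraph.text or "").strip()
        if text.startswith(PREFIX):
            # Drop the prefix and the trailing dot, then split on commas.
            return text[len(PREFIX):].rstrip(".").split(",")
    return []


print(stocks_via_xpath(PAGE))
```

Because the lookup is by tag rather than by position, it keeps working if the paragraph moves within the page.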
  • Karun Member Posts: 13 Newbie
    Hi,

    Thanks a lot for your response. The position of the paragraph can change, so I feel regex is the better option. Please let me know which ETL stage I should use to implement the regex and attributes so that I can fetch the required information.

    Thanks
  • kayman Member Posts: 662 Unicorn
    I'd say something like this:

    Read URL -> Read Document (extract text only) -> Documents to Data -> Generate Attributes (using regex)
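As a rough analogue of the "extract text only" step in that chain, the following Python sketch strips the tags from an HTML string using only the standard library. The sample markup and the names `TextOnly` and `html_to_text` are assumptions for illustration; this is not how RapidMiner implements the option.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects the text between tags and discards the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text; skip whitespace-only runs.
        # Note: text inside <script> or <style> would be collected too.
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical sample page (an assumption, modelled on the question).
SAMPLE = (
    "<html><body><h1>Heading</h1><p>Paragraph 1</p>"
    "<p>Automobiles stocks are A1,A2,A3,A4.</p></body></html>"
)

print(html_to_text(SAMPLE))
```

The resulting plain text is what a regex would then be applied to in the last step of the chain.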
  • Karun Member Posts: 13 Newbie
    I am using

    Get Page --> Read Document, but I am getting the error "Expected File Object but received Document". Please help. I am also not using Read URL, as it expects a CSV file with comma-separated values. Please correct me if my understanding is wrong.
  • kayman Member Posts: 662 Unicorn
    edited February 2021
    My mistake, I mixed up Read URL and Read Webpage.
    If you already have the webpage in document format, you can skip the first step and attach it directly to Documents to Data.
  • Karun Member Posts: 13 Newbie
    Hi,

    Thanks for the reply.

    I have created a process:

    Read Excel -> Get Pages -> Data to Documents -> Documents to Data -> Generate Attributes

    Can you please let me know which attribute holds the HTML body (the content of the web page) so that I can parse it?

    Regards,
    Karun
  • Karun Member Posts: 13 Newbie
    I am getting attributes like
    1. URL
    2. Response Code
    3. Response Message

    ......

    But I am not able to find the attribute that holds the HTML body (the content of the web page) to parse the data.
  • Karun Member Posts: 13 Newbie
    Hi,

    I am able to read the attributes now. Lastly, please help me on the regex front: how do I get the data between two words in RapidMiner?
    Word 1: "Automobiles stocks are"
    Word 2: "."

    Thanks
  • kayman Member Posts: 662 Unicorn
    In regex that would be something like

    (?s)^.*stocks are (.*?)\..*$

    So start at the beginning, skip everything (the (?s) flag lets the dot match line breaks as well) until you find 'stocks are', and then capture everything up to the first dot.
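The same pattern can be tried out in Python, whose regex syntax agrees with this expression. This is only a sketch: the sample text and the helper name `extract_stocks` are assumptions, and the extra footer line is there to show that `(?s)` lets the trailing `.*` run across line breaks.

```python
import re

# The pattern from the post above, unchanged.
PATTERN = re.compile(r"(?s)^.*stocks are (.*?)\..*$")

# Hypothetical page text after tag removal (an assumption).
TEXT = (
    "Heading\n\n"
    "Paragraph 1\n\n"
    "Automobiles stocks are A1,A2,A3,A4.\n\n"
    "Footer text."
)


def extract_stocks(text):
    """Return what follows 'stocks are ' up to the next dot, or None."""
    match = PATTERN.match(text)
    # Group 1 is the lazy (.*?) part between 'stocks are ' and the dot.
    return match.group(1) if match else None


print(extract_stocks(TEXT))
# The replace-style equivalent: swap the whole match for group 1.
print(re.sub(PATTERN, r"\1", TEXT))
```

Because the leading `.*` is greedy, a text containing several 'stocks are' phrases would yield the last one that is still followed by a dot. Since the pattern spans the whole text from `^` to `$`, it is suited to a replace call that keeps only the first capture group.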
  • Karun Member Posts: 13 Newbie
    I am facing an issue while using the Cut stage. I am able to trim the attribute when the attribute filter type is "single" and the attribute is "Link", but when I set the attribute filter type to "regular expression" I am not able to map the regular expression to the attribute.

    Please help.
  • kayman Member Posts: 662 Unicorn
    Can you share something? 
  • Karun Member Posts: 13 Newbie
    Hi, I am attaching the .rmp file. I am doing a very basic proof of concept for now: hitting a Stack Overflow URL and fetching the substring between two substrings in the response.

    Regards,
    Karun
  • Karun Member Posts: 13 Newbie
    Also, the regex feature does not seem to be working.

    I have tried using

    1. Where(.*)Learn
    2. (?=Where).*(?=Learn)

    to fetch the string "Where developers learn", but no luck.

    Please help
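Two things in those patterns can be checked outside RapidMiner with a short Python sketch. It assumes the page text reads "Where developers learn" exactly as quoted in the post; the real page may be worded or capitalised differently. Under that assumption, "Learn" with a capital L does not match "learn", because regex is case-sensitive by default, and `(?=Where)` is a lookahead, so "Where" stays inside the match. The sample text and the name `try_patterns` are made up for illustration.

```python
import re

# Assumed sample text, taken from the wording in the post.
TEXT = "Where developers learn, share, and build careers"

PATTERNS = {
    # Pattern 1 from the post: fails on this text, "Learn" != "learn".
    "case_sensitive": r"Where(.*)Learn",
    # Same pattern with the (?i) flag to ignore case.
    "ignore_case": r"(?i)Where(.*)Learn",
    # Pattern 2 from the post plus (?i): the lookahead keeps "Where".
    "lookahead_only": r"(?i)(?=Where).*(?=Learn)",
    # Lookbehind + lookahead: only the text between the two words.
    "lookbehind": r"(?i)(?<=Where ).*(?= Learn)",
}


def try_patterns(text):
    """Return the full match of each pattern, or None if it fails."""
    results = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        results[name] = match.group(0) if match else None
    return results


for name, value in try_patterns(TEXT).items():
    print(f"{name}: {value!r}")
```

If the page really uses a different capitalisation than the pattern, adding `(?i)` is the smallest change that makes the first pattern match.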
  • Karun Member Posts: 13 Newbie
    It looks to me like the regex is working on the attributes, not on their content. Please correct me if I am wrong.
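If "attributes" here means attribute names, the observation fits how a regular-expression attribute filter is generally understood to work: the pattern picks which attributes (columns) an operator touches by name, and a pattern meant for the text inside a cell has to go into a separate step. The Python sketch below only illustrates that distinction. The table contents and the helper names are assumptions; only the attribute name "Link" comes from the thread.

```python
import re

# Hypothetical example set: attribute name -> list of values (assumed).
TABLE = {
    "URL": ["https://example.com"],
    "Response Code": ["200"],
    "Link": ["Where developers learn, share, and build careers"],
}


def select_columns_by_name(table, name_regex):
    """Filter-style use: the regex is matched against attribute NAMES."""
    return {
        name: values
        for name, values in table.items()
        if re.fullmatch(name_regex, name)
    }


def extract_from_values(table, column, value_regex):
    """Content-style use: the regex is matched against each VALUE."""
    extracted = []
    for value in table[column]:
        match = re.search(value_regex, value)
        extracted.append(match.group(1) if match else None)
    return extracted


# A content pattern used as a name filter selects nothing.
print(select_columns_by_name(TABLE, r"Where(.*)learn"))
# A name pattern selects the column; a second step handles the content.
print(list(select_columns_by_name(TABLE, r"Link")))
print(extract_from_values(TABLE, "Link", r"Where (.*) learn"))
```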