web research

alphabeto Member Posts: 8 Contributor II
edited November 2018 in Help
Hi,
Can RapidMiner run an automated, regular (say, daily) search for a list of words across a list of URLs and return the link of each matching page?
I have a list of words, and I want to regularly collect every web link where any of these words appears on any of the sites in my predefined URL list.


E.g. word list: qwe, rty
URL list: www.asd.com, www.zxc.com

What is the process for getting, daily and automatically, each web link where the words "qwe" and/or "rty" appear on www.asd.com and/or www.zxc.com?


Many thanks
Dan

Answers

  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Dan,

    you can use the Get Pages operator to get the contents of a number of websites whose links you provide in a data table.
    You can then use the text processing extension to count the words that appear on the different sites. Our website provides some links to video tutorials for the text mining extension: http://rapid-i.com/content/view/189/212/lang,en/
    To focus on the contents of the websites and remove all html tags you can use the Extract Content operator.

    Finally, to execute the job regularly, you should use the RapidAnalytics server, also available on our website.

    Best regards,
    Marius
  • alphabeto Member Posts: 8 Contributor II
    Hi Marius,

    Thank you. I'm almost there. But to get the job done, after I extract the words with "Extract Content" as you suggest, I still need to get a document list or a folder containing the pages (the URL links in a document, or the HTML pages in a folder, etc.) for every word extracted. How can I do this?

    Thanks,
    Dan
  • alphabeto Member Posts: 8 Contributor II
    In other words, my job is to filter a predefined list of sites (the filter being a list of various words), AND THE RESULT must be the specific WEB LINKS to the pages on those predefined sites where the words appear.
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Dan,

    After the Process Documents operator you should have a table that contains the occurrences of each word (columns) in each document (rows), along with the URL of the page in the URL attribute.

    Now you can iterate over your target words and use Filter Examples to keep only those rows where the column for the current word contains a value greater than zero. Then you can write the URLs of the matching documents to the hard disk, e.g. with the Write Excel or Write CSV operator.

    Does that help? If you have any questions left, please attach the XML of your process so that we can use it as a basis for our answer.

    Best regards,
    Marius
  • alphabeto Member Posts: 8 Contributor II
    Hello Marius,

    I sent you the XML of my process by email, as you mentioned. Can I count on your answer to my email about making the process work from start to finish?

    Many thanks!
  • MariusHelf RapidMiner Certified Expert, Member Posts: 1,869 Unicorn
    Hi Dan,

    please post your process publicly to this thread - it may also be interesting for other users.

    Best regards,
    Marius
  • alphabeto Member Posts: 8 Contributor II

    Hi Marius,

    Below is the process, as far as I could get. Can I count on you to make it work and finish this job (that is, to finally get the URL list of the pages where the searched words appear on the predefined list of websites)?

    Thanks again and hoping for the best,
    Dan

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.013">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="5.3.013" expanded="true" height="60" name="Read Excel" width="90" x="45" y="30">
            <parameter key="excel_file" value="C:\xxx\Links.xls"/>
            <parameter key="imported_cell_range" value="A1:B6"/>
            <list key="annotations">
              <parameter key="0" value="Name"/>
            </list>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="links.true.file_path.attribute"/>
            </list>
          </operator>
          <operator activated="true" class="web:retrieve_webpages" compatibility="5.3.001" expanded="true" height="60" name="Get Pages" width="90" x="179" y="30">
            <parameter key="link_attribute" value="links"/>
          </operator>
          <operator activated="true" class="text:data_to_documents" compatibility="5.3.002" expanded="true" height="60" name="Data to Documents" width="90" x="313" y="30">
            <parameter key="select_attributes_and_weights" value="true"/>
            <list key="specify_weights">
              <parameter key="eurpoa" value="1.0"/>
            </list>
          </operator>
          <operator activated="true" class="write_excel" compatibility="5.3.013" expanded="true" height="76" name="Write Excel" width="90" x="380" y="210"/>
          <operator activated="true" class="text:process_documents" compatibility="5.3.002" expanded="true" height="94" name="Process Documents" width="90" x="380" y="210">
            <parameter key="keep_text" value="true"/>
            <process expanded="true">
              <operator activated="true" class="web:extract_html_text_content" compatibility="5.3.001" expanded="true" height="60" name="Extract Content" width="90" x="45" y="30"/>
              <operator activated="true" class="text:filter_documents_by_content" compatibility="5.3.002" expanded="true" height="76" name="Filter Documents (by Content)" width="90" x="246" y="30">
                <parameter key="string" value="europa"/>
              </operator>
              <connect from_port="document" to_op="Extract Content" to_port="document"/>
              <connect from_op="Extract Content" from_port="document" to_op="Filter Documents (by Content)" to_port="documents 1"/>
              <connect from_op="Filter Documents (by Content)" from_port="documents" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Get Pages" to_port="Example Set"/>
          <connect from_op="Get Pages" from_port="Example Set" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Process Documents" to_port="documents 1"/>
          <connect from_op="Process Documents" from_port="word list" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • alphabeto Member Posts: 8 Contributor II
    Hello, I really need to know whether I can count on support here on this matter.

    Best regards and thanks again,
  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    Just a friendly reminder: this is a community forum where members of the community can help each other out. Sometimes, when time allows, we do chip in and provide answers to some questions. However, there is never a guarantee that we will answer in this forum. If you do need support with fixed response times, please contact us and inquire about enterprise support.

    Regards,
    Marco
  • alphabeto Member Posts: 8 Contributor II
    OK, I am sorry if I was somewhat pushy or asked too much.
    However, if someone has an idea for this, it would be a great help, as I need it to finish some work.
    Regards and a very nice day!
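
For readers who land on this thread: the approach Marius outlines (strip the HTML from each fetched page, count the target words per page, keep the rows where a word's count is greater than zero, and write out the matching URLs) can be sketched outside RapidMiner as follows. This is only a minimal Python illustration of that logic, not a RapidMiner process and not a replacement for one. The page paths in `SAMPLE_PAGES`, the helper names, and the CSV layout are all hypothetical; only the word list and the two site names come from Dan's example.

```python
import csv
import io
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Rough stand-in for an 'extract content' step: drop tags, keep text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def count_words(text, words):
    """Return {word: occurrences} using case-insensitive whole-word matches."""
    tokens = re.findall(r"\w+", text.lower())
    return {word: tokens.count(word.lower()) for word in words}


def build_occurrence_table(pages, words):
    """pages maps URL -> raw HTML; the result has one row (dict) per page."""
    rows = []
    for url, html in pages.items():
        row = {"URL": url}
        row.update(count_words(html_to_text(html), words))
        rows.append(row)
    return rows


def urls_containing(rows, word):
    """Keep only the rows whose count for `word` is greater than zero."""
    return [row["URL"] for row in rows if row[word] > 0]


def write_matches_csv(rows, words, out):
    """Write one (word, URL) line per match to a file-like object."""
    writer = csv.writer(out)
    writer.writerow(["word", "URL"])
    for word in words:
        for url in urls_containing(rows, word):
            writer.writerow([word, url])


# Hypothetical sample data standing in for pages that were already fetched.
SAMPLE_PAGES = {
    "http://www.asd.com/a.html": "<html><body><p>qwe and more qwe</p></body></html>",
    "http://www.asd.com/b.html": "<html><body><script>var rty = 1;</script><p>nothing</p></body></html>",
    "http://www.zxc.com/c.html": "<html><body><h1>RTY</h1><p>qwe</p></body></html>",
}
WORDS = ["qwe", "rty"]

if __name__ == "__main__":
    table = build_occurrence_table(SAMPLE_PAGES, WORDS)
    buffer = io.StringIO()
    write_matches_csv(table, WORDS, buffer)
    print(buffer.getvalue())
```

In a real run the sample dictionary would be replaced by pages fetched from the URL list, and the script would be triggered once a day by whatever scheduler is available. Within RapidMiner, Marius points to the RapidAnalytics server for the scheduling part.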