Split preformatted text (of web page) on paragraphs

philipp25 Member Posts: 3 Contributor I
edited June 19 in Help
Hello,
I crawled multiple pages with the "Get Pages" operator (Web Mining extension). All of the text on each page is inside an HTML <pre> tag.
I want to split the preformatted text into paragraphs. The "Get Content" operator extracts the text perfectly, but destroys the formatting.

Any solution?

Thanks!

Answers

  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,257   Unicorn
    You could split the text into paragraphs with the Cut Document operator, then apply the Extract Content operator to each resulting paragraph, and finally join everything back together.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
  • philipp25 Member Posts: 3 Contributor I
    edited May 15
    Thanks! But first I have to extract the text from the <pre> tag, don't I? I would post a link, but my RM account seems to be too new :)

    <body>
        <pre>
                  preformatted text
                  paragraph
                  preformatted text
        </pre>
    </body>

    Okay, I got further. I extracted the text into an attribute, and it looks like this:

    text
    text

    text2
    text2

    How do I split the text at each empty line, i.e. at each paragraph break?



  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,257   Unicorn
    Cut Document should do the trick if you use a regex for line breaks. Alternatively, you could do the cut before you remove the HTML and cut on the HTML markup instead.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
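
To illustrate the "regex for line breaks" idea outside RapidMiner, here is a minimal Python sketch. The helper name `split_paragraphs` and the pattern `\n\s*\n` are assumptions for illustration, not something taken from the Cut Document operator itself; the pattern treats any empty or whitespace-only line as a paragraph boundary.

```python
import re


def split_paragraphs(text: str) -> list:
    """Split text into paragraphs at empty or whitespace-only lines."""
    # \n\s*\n: a line break, optional whitespace, then another line break.
    parts = re.split(r"\n\s*\n", text.strip())
    # Drop any empty pieces left over from runs of blank lines.
    return [part for part in parts if part.strip()]


# Same shape as the attribute value shown in the post above.
sample = "text\ntext\n\ntext2\ntext2\n\n"
paragraphs = split_paragraphs(sample)
print(paragraphs)
```

The line breaks inside each paragraph are preserved, so the preformatted layout within a paragraph survives the split.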
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,525  Community Manager
    @philipp25 I just boosted your user account, so you can post links now.

    Scott

  • philipp25 Member Posts: 3 Contributor I
    edited May 22
    It just does not work...

    Here is my process:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.2.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.2.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="text:create_document" compatibility="8.1.000" expanded="true" height="68" name="Create Document" width="90" x="112" y="34">
            <parameter key="text" value="text&#10;text&#10;&#10;text&#10;text&#10;&#10;"/>
            <parameter key="add label" value="false"/>
            <parameter key="label_type" value="nominal"/>
          </operator>
          <operator activated="true" class="text:cut_document" compatibility="8.1.000" expanded="true" height="68" name="Cut Document" width="90" x="313" y="34">
            <parameter key="query_type" value="Regular Expression"/>
            <list key="string_machting_queries"/>
            <parameter key="attribute_type" value="Nominal"/>
            <list key="regular_expression_queries">
              <parameter key="line_breaks" value="\n\s"/>
            </list>
            <list key="regular_region_queries"/>
            <list key="xpath_queries"/>
            <list key="namespaces"/>
            <parameter key="ignore_CDATA" value="true"/>
            <parameter key="assume_html" value="true"/>
            <list key="index_queries"/>
            <list key="jsonpath_queries"/>
            <process expanded="true">
              <connect from_port="segment" to_port="document 1"/>
              <portSpacing port="source_segment" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Create Document" from_port="output" to_op="Cut Document" to_port="document"/>
          <connect from_op="Cut Document" from_port="documents" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
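
One possible explanation, offered as an assumption to check against the operator help rather than a confirmed diagnosis: if the Regular Expression query in Cut Document keeps the text the pattern matches as the segment, then `\n\s` only ever matches the separator and never the paragraph. The Python sketch below shows that difference on the same text as the Create Document parameter. The paragraph-matching pattern is a hypothetical alternative written for this illustration, and RapidMiner uses Java regular expressions, whose flavor differs slightly from Python's.

```python
import re

# Same value as the "text" parameter of Create Document in the process above.
text = "text\ntext\n\ntext\ntext\n\n"

# The pattern from the process matches a line break followed by one
# whitespace character, i.e. only the blank-line separator itself.
separator_hits = re.findall(r"\n\s", text)

# Hypothetical alternative: start at a non-whitespace character and lazily
# take everything up to the next blank line or the end of the text.
paragraph_pattern = r"(?s)\S.*?(?=\n\s*\n|\s*\Z)"
paragraph_hits = re.findall(paragraph_pattern, text)

print(separator_hits)   # the separators, not the content
print(paragraph_hits)   # one entry per paragraph
```

If the assumption holds, a pattern that describes the paragraph itself is what would go into `regular_expression_queries` instead of one that describes the gap between paragraphs.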


  • SGolbert RapidMiner Certified Analyst, Member Posts: 341   Unicorn
    Cut Document with XPath is also a great option, as it takes advantage of the HTML structure. Both regex and XPath are kind of unruly tools, so you will have to choose your poison.

    Regards
    Sebastian
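
As a rough sketch of the XPath route outside RapidMiner, the Python snippet below pulls the text of the `<pre>` element with a path expression and only then splits it on blank lines. The page content is made up for illustration, and the snippet assumes the markup is well-formed enough to parse as XML, which real web pages often are not.

```python
import re
import xml.etree.ElementTree as ET

# Made-up page with the same structure as the snippet posted earlier.
page = """<body>
    <pre>
        first paragraph, line 1
        first paragraph, line 2

        second paragraph
    </pre>
</body>"""

root = ET.fromstring(page)

# ".//pre" selects the first <pre> element anywhere below the root.
pre_text = root.find(".//pre").text

# Split on blank lines, then trim the outer whitespace of each paragraph.
paragraphs = [p.strip() for p in re.split(r"\n\s*\n", pre_text) if p.strip()]
print(len(paragraphs))
```

Selecting the element first keeps the rest of the page out of the way, so the regex only has to deal with paragraph breaks.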