RapidMiner

Contributor I hbuggled

Split text into paragraphs

Hi guys,

I have an Excel file that contains an article from Wikipedia. I want to split the text into paragraphs. I tried the Tokenize operator, but there is no option to tokenize my text into paragraphs. I also tried the Cut Document operator with the XPath query type, using the query expression //h:p, but it doesn't work. Is there any way to tokenize/split my text into paragraphs?

 

Thank you in advance.

Community Manager Community Manager
Solution

Re: Split text into paragraphs

Hello @hbuggled, welcome to the community. I think you were on the right track with Tokenize, but I would choose the regular expression mode in the Parameters pane and try \n as the expression.

 

Scott

 

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
Guru

Re: Split text into paragraphs

Regular expressions are your friend indeed. It just depends on how your content is structured. A single line break (\n) could work, but it splits at every line break, so you may end up with individual lines or sentences rather than paragraphs.

Typically paragraphs are separated by two or more line breaks, so if you split on \n{2,} you may get them nicely by paragraph (in theory...)
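
To see the difference between the two patterns outside RapidMiner, here is a minimal Python sketch using the standard `re` module. The sample text and the helper name `split_paragraphs` are made up for illustration. RapidMiner's Tokenize operator uses Java regular expressions, but `\n` and `\n{2,}` should behave the same way there for a simple case like this.

```python
import re

def split_paragraphs(text, pattern=r"\n{2,}"):
    """Split text wherever the pattern matches and drop empty pieces."""
    return [part.strip() for part in re.split(pattern, text) if part.strip()]

# Hypothetical sample: three paragraphs, the first one wrapped over two lines.
sample = (
    "First paragraph, line one.\nStill the first paragraph.\n\n"
    "Second paragraph.\n\n\n"
    "Third paragraph."
)

by_blank_lines = split_paragraphs(sample)            # two or more line breaks
by_single_breaks = split_paragraphs(sample, r"\n")   # every line break

print(len(by_blank_lines))    # 3
print(len(by_single_breaks))  # 4
```

If the text only ever has one line break between paragraphs, the `\n{2,}` pattern finds nothing to split on, and a plain `\n` (or `\n+`) is the one to use.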

Community Manager Community Manager

Re: Split text into paragraphs

Ah yes, well said @kayman. Good catch. :)

Scott

 

Contributor I hbuggled

Re: Split text into paragraphs

Thank you very much for your help. My text was structured with single line breaks, so I could use the \n expression to tokenize it.

 

Now I only need the last paragraphs for tokenizing it in an non-letters structure. Unfortunately, I don't know how to do this. Is there maybe another expression to tokenize it, or can I filter out all paragraphs except the last one? Do you have any ideas?

 

I am not sure whether it's okay to post my next question here after you solved my first problem. Please let me know if I should open a new post. :)

 

Thank you in advance.

Community Manager Community Manager

Re: Split text into paragraphs

Hello @hbuggled, hmm, I am rather unclear about what you mean by "need the last paragraphs for tokenizing it in an non-letters structure". Could you please explain?

 

Scott

Contributor I hbuggled

Re: Split text into paragraphs


sgenzer wrote:

Hello @hbuggled, hmm, I am rather unclear about what you mean by "need the last paragraphs for tokenizing it in an non-letters structure". Could you please explain?

 

Scott



hi  

 

"RapidMiner uses a client/server model with the server offered as either on-premise, or in public or private cloud infrastructures.

According to Bloor Research, RapidMiner provides 99% of an advanced analytical solution through template-based frameworks that speed delivery and reduce errors by nearly eliminating the need to write code. RapidMiner provides data mining and machine learning procedures including: data loading and transformation (Extract, transform, load (ETL)), data preprocessing and visualization, predictive analytics and statistical modeling, evaluation, and deployment. RapidMiner is written in the Java programming language. RapidMiner provides a GUI to design and execute analytical workflows. Those workflows are called “Processes” in RapidMiner and they consist of multiple “Operators”. Each operator performs a single task within the process, and the output of each operator forms the input of the next one. Alternatively, the engine can be called from other programs or used as an API. Individual functions can be called from the command line. RapidMiner provides learning schemes, models and algorithms and can be extended using R and Python scripts.

 

RapidMiner functionality can be extended with additional plugins which are made available via RapidMiner Marketplace. The RapidMiner Marketplace provides a platform for developers to create data analysis algorithms and publish them to the community. With version 7.0, RapidMiner included updates to its getting started materials, an updated user interface, and improvements to its data preparation capabilities."

 


Community Manager Community Manager
Solution

Re: Split text into paragraphs

Hello @hbuggled, OK, I understand. This is likely not the most elegant solution, but it should do what you're looking for.

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="text:create_document" compatibility="7.5.000" expanded="true" height="68" name="Create Document" width="90" x="45" y="34">
        <parameter key="text" value="RapidMiner uses a client/server model with the server offered as either on-premise, or in public or private cloud infrastructures.&#10;According to Bloor Research, RapidMiner provides 99% of an advanced analytical solution through template-based frameworks that speed delivery and reduce errors by nearly eliminating the need to write code. RapidMiner provides data mining and machine learning procedures including: data loading and transformation (Extract, transform, load (ETL)), data preprocessing and visualization, predictive analytics and statistical modeling, evaluation, and deployment. RapidMiner is written in the Java programming language. RapidMiner provides a GUI to design and execute analytical workflows. Those workflows are called “Processes” in RapidMiner and they consist of multiple “Operators”. Each operator performs a single task within the process, and the output of each operator forms the input of the next one. Alternatively, the engine can be called from other programs or used as an API. Individual functions can be called from the command line. RapidMiner provides learning schemes, models and algorithms and can be extended using R and Python scripts.&#10;RapidMiner functionality can be extended with additional plugins which are made available via RapidMiner Marketplace. The RapidMiner Marketplace provides a platform for developers to create data analysis algorithms and publish them to the community. With version 7.0, RapidMiner included updates to its getting started materials, an updated user interface, and improvements to its data preparation capabilities."/>
      </operator>
      <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34">
        <parameter key="mode" value="regular expression"/>
        <parameter key="expression" value="\n+"/>
      </operator>
      <operator activated="true" class="text:extract_token_number" compatibility="7.5.000" expanded="true" height="68" name="Extract Token Number" width="90" x="313" y="34"/>
      <operator activated="true" class="text:documents_to_data" compatibility="7.5.000" expanded="true" height="82" name="Documents to Data" width="90" x="447" y="34">
        <parameter key="text_attribute" value="text"/>
      </operator>
      <operator activated="true" class="split" compatibility="7.6.001" expanded="true" height="82" name="Split" width="90" x="581" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="text"/>
        <parameter key="split_pattern" value="\n"/>
      </operator>
      <operator activated="true" class="extract_macro" compatibility="7.6.001" expanded="true" height="68" name="Extract Macro" width="90" x="715" y="34">
        <parameter key="macro" value="tokenNumber"/>
        <parameter key="macro_type" value="data_value"/>
        <parameter key="attribute_name" value="token_number"/>
        <parameter key="example_index" value="1"/>
        <list key="additional_macros"/>
      </operator>
      <operator activated="true" class="generate_macro" compatibility="7.6.001" expanded="true" height="82" name="Generate Macro" width="90" x="849" y="34">
        <list key="function_descriptions">
          <parameter key="att" value="concat(&quot;text_&quot;,%{tokenNumber})"/>
        </list>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes" width="90" x="983" y="34">
        <parameter key="attribute_filter_type" value="single"/>
        <parameter key="attribute" value="%{att}"/>
      </operator>
      <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="1117" y="34">
        <parameter key="select_attributes_and_weights" value="true"/>
        <list key="specify_weights">
          <parameter key="%{att}" value="1.0"/>
        </list>
      </operator>
      <operator activated="true" class="text:combine_documents" compatibility="7.5.000" expanded="true" height="82" name="Combine Documents" width="90" x="1251" y="34"/>
      <operator activated="true" class="text:tokenize" compatibility="7.5.000" expanded="true" height="68" name="Tokenize (2)" width="90" x="1385" y="34">
        <parameter key="expression" value="\n+"/>
      </operator>
      <connect from_op="Create Document" from_port="output" to_op="Tokenize" to_port="document"/>
      <connect from_op="Tokenize" from_port="document" to_op="Extract Token Number" to_port="document"/>
      <connect from_op="Extract Token Number" from_port="document" to_op="Documents to Data" to_port="documents 1"/>
      <connect from_op="Documents to Data" from_port="example set" to_op="Split" to_port="example set input"/>
      <connect from_op="Split" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
      <connect from_op="Extract Macro" from_port="example set" to_op="Generate Macro" to_port="through 1"/>
      <connect from_op="Generate Macro" from_port="through 1" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
      <connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
      <connect from_op="Combine Documents" from_port="document" to_op="Tokenize (2)" to_port="document"/>
      <connect from_op="Tokenize (2)" from_port="document" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
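
For readers who want to check the logic of the process above without opening RapidMiner, the same idea can be sketched in Python: split on line breaks, keep only the last piece, then split that piece on anything that is not a letter. This is only an approximation under stated assumptions. The function name `last_paragraph_tokens` and the sample text are invented here, and the pattern `[^\W\d_]+` is a stand-in for the Tokenize operator's non-letters mode, whose exact character handling may differ.

```python
import re

def last_paragraph_tokens(text):
    """Return the letter-only tokens of the last paragraph in text."""
    # Split on one or more line breaks, mirroring the \n+ expression
    # used by the first Tokenize operator in the process.
    paragraphs = [p for p in re.split(r"\n+", text) if p.strip()]
    if not paragraphs:
        return []
    last = paragraphs[-1]
    # [^\W\d_] matches a word character that is neither a digit nor an
    # underscore, i.e. a letter. This is an assumed equivalent of
    # "non letters" tokenization, not RapidMiner's own implementation.
    return re.findall(r"[^\W\d_]+", last)

# Hypothetical sample with single line breaks between paragraphs.
sample = (
    "First paragraph.\n"
    "Second paragraph.\n"
    "With version 7.0, RapidMiner included updates."
)

print(last_paragraph_tokens(sample))
# ['With', 'version', 'RapidMiner', 'included', 'updates']
```

Indexing with `[-1]` plays the role that the token count and the generated `text_N` attribute name play in the process: both are just ways of picking the final piece after the split.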

Scott

Contributor I hbuggled

Re: Split text into paragraphs

Thank you very much for your help. That helped me a lot.
