"Text Mining Questions"

MockingBird Member Posts: 2 Contributor I
edited June 2019 in Help
Hello,

I'm using RapidMiner for the first time and am currently struggling with the following task:
  • I have several texts that I want to split into sentences. Each text is stored in a single cell of one column in an Excel file.
  • After that, I want to extract frequently occurring terms from these sentences.
  • As a third step, I want to automatically categorize the sentences based on these terms or on combinations of them.
  • Finally, I want to be able to select a term, for example "colours", and then be shown all sentences containing it.
Can you tell me whether this is generally possible with RapidMiner and give me some pointers on how to proceed?

Thanks a lot,
Adrian
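
For illustration only, the logic of the four steps listed above can be sketched in a few lines of Python. This is not a RapidMiner process and not the poster's setup: the sample texts, the tiny stop-word list, the naive sentence splitter, and all function names are assumptions made up for this sketch. It only shows what each step is expected to produce.

```python
import re
from collections import Counter

# Hypothetical sample texts standing in for the Excel cells (not the poster's data).
TEXTS = [
    "The colours are bright. The scent is pleasant and the colours last.",
    "Packaging looks cheap! The scent fades quickly.",
]

# Small illustrative stop-word list; a real one would be much longer.
STOPWORDS = {"the", "and", "is", "are", "a", "of", "to", "in"}


def split_sentences(text):
    """Step 1: naive split after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(sentence):
    """Lower-case word tokens with stop words removed."""
    words = re.findall(r"[a-z]+", sentence.lower())
    return [w for w in words if w not in STOPWORDS]


def frequent_terms(sentences, min_count=2):
    """Step 2: terms occurring at least min_count times across all sentences."""
    counts = Counter(t for s in sentences for t in tokenize(s))
    return {term: n for term, n in counts.items() if n >= min_count}


def categorize(sentences, terms):
    """Steps 3 and 4: map each frequent term to the sentences containing it."""
    index = {term: [] for term in terms}
    for s in sentences:
        for term in set(tokenize(s)) & set(terms):
            index[term].append(s)
    return index


sentences = [s for text in TEXTS for s in split_sentences(text)]
terms = frequent_terms(sentences)
index = categorize(sentences, terms)
```

With the sample texts, `index["colours"]` holds the two sentences that mention "colours". Sentences matching a combination of terms can be found by intersecting two of these lists. The regular-expression splitter is deliberately simple and will break on abbreviations such as "e.g.", so a linguistic sentence tokenizer would be preferable for real reports.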

Answers

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Hello MockingBird,

    you might want to have a look at this tutorial. It should help you get started with RapidMiner:
    http://vancouverdata.blogspot.de/2011/02/how-to-web-scraping-xpath-html-google.html

    If you have further questions, feel free to ask.

    Cheers,
    Martin
    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • MockingBird Member Posts: 2 Contributor I
    Hi Martin,

    Thanks for the link.

    I'm currently trying to split the texts, which I imported from an Excel sheet, into sentences, and I have no idea what I'm doing wrong. I tried the "Tokenize" operator of the Text Processing add-on as well as the "SentenceTokenizer" of the Information Extraction add-on. Neither of them works. You can find the process XML below. I'm grateful for any hint.

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.3.013">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.3.013" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="5.3.013" expanded="true" height="60" name="Read Excel" width="90" x="45" y="75">
            <parameter key="excel_file" value="C:\Users\a.ressel\Desktop\RapidMiner Test\Input\1 - Reports\report_list_henkel_export-2014-05-07(all three tests).xlsx"/>
            <parameter key="imported_cell_range" value="BI2:BI163"/>
            <parameter key="first_row_as_names" value="false"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information">
              <parameter key="0" value="Reports.true.text."/>
            </list>
          </operator>
          <operator activated="true" class="multiply" compatibility="5.3.013" expanded="true" height="94" name="Multiply" width="90" x="179" y="75"/>
          <operator activated="true" class="information_extraction:sentence_tokenizer" compatibility="1.0.000" expanded="true" height="76" name="SentenceTokenizer" width="90" x="380" y="120">
            <parameter key="optionalAttribute" value="Reports"/>
            <parameter key="new token-name" value="Sentences"/>
          </operator>
          <operator activated="true" class="text:process_document_from_data" compatibility="5.3.002" expanded="true" height="76" name="Process Documents from Data" width="90" x="380" y="30">
            <parameter key="create_word_vector" value="false"/>
            <parameter key="add_meta_information" value="false"/>
            <list key="specify_weights"/>
            <process expanded="true">
              <operator activated="true" class="text:tokenize" compatibility="5.3.002" expanded="true" height="60" name="Tokenize" width="90" x="179" y="30">
                <parameter key="mode" value="linguistic sentences"/>
              </operator>
              <connect from_port="document" to_op="Tokenize" to_port="document"/>
              <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
              <portSpacing port="source_document" spacing="0"/>
              <portSpacing port="sink_document 1" spacing="0"/>
              <portSpacing port="sink_document 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Process Documents from Data" to_port="example set"/>
          <connect from_op="Multiply" from_port="output 2" to_op="SentenceTokenizer" to_port="example set input"/>
          <connect from_op="SentenceTokenizer" from_port="example set output" to_port="result 2"/>
          <connect from_op="Process Documents from Data" from_port="example set" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
        </process>
      </operator>
    </process>
    Thank you,
    Adrian

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    Hi again,

    usually, a Process Documents operator with Tokenize set to split on linguistic sentences should work fine, so it is hard to diagnose anything without the data.

    I will send you an email about this.

    Cheers,
    Martin
  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,507 RM Data Scientist
    For the record: the problem was most likely related to Excel specifics.

    We will take care of this.