"Survey analysys from Excel source"

hennessy · February 2012

Hi,

I hope you guys can help or direct me to a right direction. I just find RapidMiner and spent the last two days going through tutorials.
I have over 2000 surveys loaded into an excel file and I need to make a statistic for frequently mentioned phrases and brake it down by ID’s. The file contains few additional information, but the data and the ID columns are the most important. I was doing this breakdown in excel, but it’s not the ideal tool for this analysis. Can anybody help creating a solution?

Thanks a lot

MariusHelf · February 2012

Hi,

if I understand your problem correctly, you should have a look at the Text Processing Extension.

Best,
Marius

hennessy · February 2012

That's right I can load the excel file, just dont find the way to process the right column and break down the result by ID.I was able to work with the data transfered to text, but then i loose the advane of the excel record reference.

MariusHelf · February 2012

What exactly do you mean by breaking down by Id? Do you have several rows with the same Id in the excel file, and you want to combine the statistics for rows with similar Ids?

It would probably be helpful to post your current process setup and if possible some sample data. Post your process by clicking the XML tab in RapidMiner above the process view, and copy the XML code into your next post. Please use the "#" button above the input field of this forum.

Best,
Marius

hennessy · February 2012

Hi Marius,

Thanks for the prompt replies.

The excel file has a unique ID column and there are multiple rows belong with the same ID. Here is the header of the Excel file looks like:

ID (text); Comment (text); Description (text); Code (text); Reference# (integer); Code Description (text); Resolution Description (text)

My target is to analyze the frequent words, phrases in the comment column and display the result in a chart. Including and excluding word and phase list would be great, but not essential.

I was able to create a working model with manual pre-sorting the source in excel and transferring it to a txt file, but I’m completely newbie. I couldn’t find the basics of working with excel files. I’m also having difficulties to export the wordlist into an Excel or a txt file and creating a chart from the results.
Here is the xml code just in case:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
<process expanded="true" height="531" width="637">
<operator activated="true" class="text:read_document" compatibility="5.1.004" expanded="true" height="60" name="Read Document" width="90" x="45" y="30">
<parameter key="file" value="E:\Comment Analysis\Dev\Description light.txt"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.1.004" expanded="true" height="94" name="Process Documents" width="90" x="176" y="31">
<process expanded="true" height="531" width="655">
<operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="75"/>
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="255"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="45" y="390"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="179" y="165">
<parameter key="min_chars" value="3"/>
<parameter key="max_chars" value="999"/>
</operator>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="5.1.004" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="179" y="30">
<parameter key="max_length" value="3"/>
</operator>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>

I already can see that RapidMiner is a great tool for this purpose; I just have to find the way to get the things done.

Thank you in advance,
Peter

MariusHelf · February 2012

Hi Peter,

your process looks fine so far, but you still did not explain what you want to do with the IDs. Do you want to sum up the term frequencies of rows with identical IDs?

-Marius

hennessy · February 2012

Hi Marius,

Yes, that's the idea. Each ID belongs to a certain product. I need to find problematic areas mentioned in the surveys and create a chart with the most problematic products.

Best regards,
Peter

MariusHelf · February 2012

Heya,

try something like the process below. The main work is done by the aggregation operator. The Rename by Replace just removes the function names from the attribute names (set a breakpoint after the aggregation to see what I mean). I tested this process on a simple dataset like this:


id;text
1;cat dog fish
1;fish dog
2;cat cheese mouse
2;dog
2;cat dog
3;dog

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
    <process expanded="true" height="531" width="915">
      <operator activated="false" class="text:process_documents" compatibility="5.2.000" expanded="true" height="76" name="Process Documents" width="90" x="447" y="210">
        <process expanded="true" height="633" width="705">
          <operator activated="false" class="text:transform_cases" compatibility="5.2.000" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="30"/>
          <operator activated="false" class="text:tokenize" compatibility="5.2.000" expanded="true" height="60" name="Tokenize" width="90" x="180" y="30"/>
          <operator activated="false" class="text:filter_stopwords_english" compatibility="5.2.000" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="315" y="30"/>
          <operator activated="false" class="text:filter_by_length" compatibility="5.2.000" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="450" y="30">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="999"/>
          </operator>
          <operator activated="false" class="text:generate_n_grams_terms" compatibility="5.2.000" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="585" y="30">
            <parameter key="max_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
          <connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="read_csv" compatibility="5.1.017" expanded="true" height="60" name="Read CSV" width="90" x="45" y="30">
        <parameter key="csv_file" value="C:\Users\marius\Documents\development\test.csv"/>
        <parameter key="first_row_as_names" value="false"/>
        <list key="annotations">
          <parameter key="0" value="Name"/>
        </list>
        <parameter key="encoding" value="windows-1252"/>
        <list key="data_set_meta_data_information">
          <parameter key="0" value="id.true.integer.attribute"/>
          <parameter key="1" value="text.true.polynominal.attribute"/>
        </list>
      </operator>
      <operator activated="true" class="set_role" compatibility="5.1.017" expanded="true" height="76" name="Set Role" width="90" x="179" y="30">
        <parameter key="name" value="id"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="5.1.017" expanded="true" height="76" name="Nominal to Text" width="90" x="313" y="30"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="5.2.000" expanded="true" height="76" name="Process Documents from Data" width="90" x="447" y="30">
        <parameter key="vector_creation" value="Term Occurrences"/>
        <list key="specify_weights"/>
        <process expanded="true" height="633" width="705">
          <operator activated="true" class="text:transform_cases" compatibility="5.2.000" expanded="true" height="60" name="Transform Cases (2)" width="90" x="45" y="30"/>
          <operator activated="true" class="text:tokenize" compatibility="5.2.000" expanded="true" height="60" name="Tokenize (2)" width="90" x="180" y="30"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="5.2.000" expanded="true" height="60" name="Filter Stopwords (2)" width="90" x="315" y="30"/>
          <operator activated="true" class="text:filter_by_length" compatibility="5.2.000" expanded="true" height="60" name="Filter Tokens (2)" width="90" x="450" y="30">
            <parameter key="min_chars" value="3"/>
            <parameter key="max_chars" value="999"/>
          </operator>
          <operator activated="true" class="text:generate_n_grams_terms" compatibility="5.2.000" expanded="true" height="60" name="Generate n-Grams (2)" width="90" x="585" y="30">
            <parameter key="max_length" value="3"/>
          </operator>
          <connect from_port="document" to_op="Transform Cases (2)" to_port="document"/>
          <connect from_op="Transform Cases (2)" from_port="document" to_op="Tokenize (2)" to_port="document"/>
          <connect from_op="Tokenize (2)" from_port="document" to_op="Filter Stopwords (2)" to_port="document"/>
          <connect from_op="Filter Stopwords (2)" from_port="document" to_op="Filter Tokens (2)" to_port="document"/>
          <connect from_op="Filter Tokens (2)" from_port="document" to_op="Generate n-Grams (2)" to_port="document"/>
          <connect from_op="Generate n-Grams (2)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="aggregate" compatibility="5.1.017" expanded="true" height="76" name="Aggregate" width="90" x="581" y="30">
        <parameter key="use_default_aggregation" value="true"/>
        <parameter key="default_aggregation_function" value="sum"/>
        <list key="aggregation_attributes"/>
        <parameter key="group_by_attributes" value="|id"/>
      </operator>
      <operator activated="true" class="rename_by_replacing" compatibility="5.1.017" expanded="true" height="76" name="Rename by Replacing" width="90" x="715" y="30">
        <parameter key="replace_what" value="sum\((.*)\)"/>
        <parameter key="replace_by" value="$1"/>
      </operator>
      <connect from_op="Read CSV" from_port="output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Aggregate" to_port="example set input"/>
      <connect from_op="Aggregate" from_port="example set output" to_op="Rename by Replacing" to_port="example set input"/>
      <connect from_op="Rename by Replacing" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

hennessy · February 2012

Huh, you just lost me there.

I'm not familiar with XML programing (I used only Visual Basic and Delphi before).
Do you mind to breake it down to steps?

I've created a new process and with the "Read Exlel" object and I loaded the excel file. What's next?

Thanks in advance,
Peter

MariusHelf · February 2012

Just create a new process in RapidMiner, copy the XML code I posted above into the XML view in RapidMiner from where you copied your process and press the green check mark button. It's no magic about XML programming

hennessy · February 2012

OK Marius,

It's working. I've created the test.csv and loaded into the process. How can I save the wordlist from "process document from data" into excel?

Best regards,
Peter

MariusHelf · February 2012

You can use the Write Excel operator to save Example Sets to Excel files. Probably you want to connect the exa output of process documents to that operator. You can also convert the wor output to an Example Set with the Wordlist to Data operator and then store that dataset. Just do what fits your needs best

Best, Marius

hennessy · February 2012

It's working!

Thanks a lot Marius.
I will look at the details and i'll try to tweak it to the original plan. Probably I'll have more question to you in the next days.

Thanks again
Best regards,
Peter

PS: What do you think, what's the best solution to dispay the results in a chart?

MariusHelf · February 2012

hennessy wrote:
PS: What do you think, what's the best solution to dispay the results in a chart?

That depends totally on what you want to show

Did you already update RapidMiner to the latest version 5.2 beta? It ships a completely new implementation of the plotting and chart framework. See here for the blog post. The manual for RapidMiner's new Advanced Charts will be online soon.

Best,
Marius

MariusHelf · February 2012

The manual is available here: http://rapid-i.com/content/view/26/84/lang,en/

Howdy, Stranger!

Quick Links

Categories

Altair RapidMiner Community

GET HELP. LEARN BEST PRACTICES. NETWORK WITH YOUR PEERS.

"Survey analysys from Excel source"

Answers