Due to recent updates, all users are required to create an Altair One account to login to the RapidMiner community. Click the Register button to create your account using the same email that you have previously used to login to the RapidMiner community. This will ensure that any previously created content will be synced to your Altair One account. Once you login, you will be asked to provide a username that identifies you to other Community users. Email us at Community with questions.
"Survey analysys from Excel source"
Hi,
I hope you guys can help or direct me to a right direction. I just find RapidMiner and spent the last two days going through tutorials.
I have over 2000 surveys loaded into an excel file and I need to make a statistic for frequently mentioned phrases and brake it down by ID’s. The file contains few additional information, but the data and the ID columns are the most important. I was doing this breakdown in excel, but it’s not the ideal tool for this analysis. Can anybody help creating a solution?
Thanks a lot
I hope you guys can help or direct me to a right direction. I just find RapidMiner and spent the last two days going through tutorials.
I have over 2000 surveys loaded into an excel file and I need to make a statistic for frequently mentioned phrases and brake it down by ID’s. The file contains few additional information, but the data and the ID columns are the most important. I was doing this breakdown in excel, but it’s not the ideal tool for this analysis. Can anybody help creating a solution?
Thanks a lot
Tagged:
0
Answers
if I understand your problem correctly, you should have a look at the Text Processing Extension.
Best,
Marius
It would probably be helpful to post your current process setup and if possible some sample data. Post your process by clicking the XML tab in RapidMiner above the process view, and copy the XML code into your next post. Please use the "#" button above the input field of this forum.
Best,
Marius
Thanks for the prompt replies.
The excel file has a unique ID column and there are multiple rows belong with the same ID. Here is the header of the Excel file looks like:
ID (text); Comment (text); Description (text); Code (text); Reference# (integer); Code Description (text); Resolution Description (text)
My target is to analyze the frequent words, phrases in the comment column and display the result in a chart. Including and excluding word and phase list would be great, but not essential.
I was able to create a working model with manual pre-sorting the source in excel and transferring it to a txt file, but I’m completely newbie. I couldn’t find the basics of working with excel files. I’m also having difficulties to export the wordlist into an Excel or a txt file and creating a chart from the results.
Here is the xml code just in case:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<process version="5.1.017">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="5.1.017" expanded="true" name="Process">
<process expanded="true" height="531" width="637">
<operator activated="true" class="text:read_document" compatibility="5.1.004" expanded="true" height="60" name="Read Document" width="90" x="45" y="30">
<parameter key="file" value="E:\Comment Analysis\Dev\Description light.txt"/>
</operator>
<operator activated="true" class="text:process_documents" compatibility="5.1.004" expanded="true" height="94" name="Process Documents" width="90" x="176" y="31">
<process expanded="true" height="531" width="655">
<operator activated="true" class="text:transform_cases" compatibility="5.1.004" expanded="true" height="60" name="Transform Cases" width="90" x="45" y="75"/>
<operator activated="true" class="text:tokenize" compatibility="5.1.004" expanded="true" height="60" name="Tokenize" width="90" x="45" y="255"/>
<operator activated="true" class="text:filter_stopwords_english" compatibility="5.1.004" expanded="true" height="60" name="Filter Stopwords (English)" width="90" x="45" y="390"/>
<operator activated="true" class="text:filter_by_length" compatibility="5.1.004" expanded="true" height="60" name="Filter Tokens (by Length)" width="90" x="179" y="165">
<parameter key="min_chars" value="3"/>
<parameter key="max_chars" value="999"/>
</operator>
<operator activated="true" class="text:generate_n_grams_terms" compatibility="5.1.004" expanded="true" height="60" name="Generate n-Grams (Terms)" width="90" x="179" y="30">
<parameter key="max_length" value="3"/>
</operator>
<connect from_port="document" to_op="Transform Cases" to_port="document"/>
<connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Generate n-Grams (Terms)" to_port="document"/>
<connect from_op="Generate n-Grams (Terms)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<connect from_op="Read Document" from_port="output" to_op="Process Documents" to_port="documents 1"/>
<connect from_op="Process Documents" from_port="example set" to_port="result 1"/>
<connect from_op="Process Documents" from_port="word list" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
I already can see that RapidMiner is a great tool for this purpose; I just have to find the way to get the things done.
Thank you in advance,
Peter
your process looks fine so far, but you still did not explain what you want to do with the IDs. Do you want to sum up the term frequencies of rows with identical IDs?
-Marius
Yes, that's the idea. Each ID belongs to a certain product. I need to find problematic areas mentioned in the surveys and create a chart with the most problematic products.
Best regards,
Peter
try something like the process below. The main work is done by the aggregation operator. The Rename by Replace just removes the function names from the attribute names (set a breakpoint after the aggregation to see what I mean). I tested this process on a simple dataset like this:
I'm not familiar with XML programing (I used only Visual Basic and Delphi before).
Do you mind to breake it down to steps?
I've created a new process and with the "Read Exlel" object and I loaded the excel file. What's next?
Thanks in advance,
Peter
It's working. I've created the test.csv and loaded into the process. How can I save the wordlist from "process document from data" into excel?
Best regards,
Peter
Best, Marius
Thanks a lot Marius.
I will look at the details and i'll try to tweak it to the original plan. Probably I'll have more question to you in the next days.
Thanks again
Best regards,
Peter
PS: What do you think, what's the best solution to dispay the results in a chart?
Best,
Marius