Cutting reviews into phrases while still knowing which video game they belong to

Nick595 Member Posts: 2 Contributor I
edited November 2018 in Help
Hi all.

I have a dataset with multiple video game reviews, which I want to cut into phrases. However, I still need to know which video game each phrase belongs to. Let's say we have Game A and Game B. If the review of Game A has 4 phrases, I want to chop the document up into those 4 phrases, with a second column showing which game each sentence belongs to.

I have tried some methods, but unfortunately my experience with RapidMiner is too limited to get this done. :(

Answers

  • Marco_Boeck Administrator, Moderator, Employee, Member, University Professor Posts: 1,993 RM Engineering
    Hi,

    have a look at this process:

    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="6.6.000-SNAPSHOT">
     <context>
       <input/>
       <output/>
       <macros/>
     </context>
     <operator activated="true" class="process" compatibility="6.6.000-SNAPSHOT" expanded="true" name="Process">
       <process expanded="true">
         <operator activated="true" class="generate_data_user_specification" compatibility="6.6.000-SNAPSHOT" expanded="true" height="60" name="Empty Data" width="90" x="112" y="75">
           <list key="attribute_values"/>
           <list key="set_additional_roles"/>
         </operator>
         <operator activated="true" class="multiply" compatibility="6.6.000-SNAPSHOT" expanded="true" height="94" name="Multiply" width="90" x="246" y="75"/>
         <operator activated="true" class="generate_attributes" compatibility="6.6.000-SNAPSHOT" expanded="true" height="76" name="Generate Attributes (4)" width="90" x="447" y="120">
           <list key="function_descriptions">
             <parameter key="Review" value="&quot;Also a review! Not so good a game. Don't buy this!&quot;"/>
           </list>
         </operator>
         <operator activated="true" class="generate_attributes" compatibility="6.6.000-SNAPSHOT" expanded="true" height="76" name="Generate Attributes (2)" width="90" x="581" y="120">
           <list key="function_descriptions">
             <parameter key="Title" value="&quot;Bad Simulator&quot;"/>
           </list>
         </operator>
         <operator activated="true" class="generate_attributes" compatibility="6.6.000-SNAPSHOT" expanded="true" height="76" name="Generate Attributes (3)" width="90" x="447" y="30">
           <list key="function_descriptions">
             <parameter key="Review" value="&quot;This is a review. It is quite a good game. But I'm not really sure! Ask someone else.&quot;"/>
           </list>
         </operator>
         <operator activated="true" class="generate_attributes" compatibility="6.6.000-SNAPSHOT" expanded="true" height="76" name="Generate Attributes" width="90" x="581" y="30">
           <list key="function_descriptions">
             <parameter key="Title" value="&quot;Funny Game&quot;"/>
           </list>
         </operator>
         <operator activated="true" breakpoints="after" class="append" compatibility="6.6.000-SNAPSHOT" expanded="true" height="94" name="Append" width="90" x="715" y="75"/>
         <operator activated="true" class="split" compatibility="6.6.000-SNAPSHOT" expanded="true" height="76" name="Split" width="90" x="849" y="75">
           <parameter key="attribute_filter_type" value="single"/>
           <parameter key="attribute" value="Review"/>
           <parameter key="split_pattern" value="[.!]\s"/>
         </operator>
         <operator activated="true" class="de_pivot" compatibility="6.6.000-SNAPSHOT" expanded="true" height="76" name="De-Pivot" width="90" x="983" y="75">
           <list key="attribute_name">
             <parameter key="Reviews" value="Review.*"/>
           </list>
           <parameter key="index_attribute" value="SentenceNumber"/>
         </operator>
         <connect from_op="Empty Data" from_port="output" to_op="Multiply" to_port="input"/>
         <connect from_op="Multiply" from_port="output 1" to_op="Generate Attributes (3)" to_port="example set input"/>
         <connect from_op="Multiply" from_port="output 2" to_op="Generate Attributes (4)" to_port="example set input"/>
         <connect from_op="Generate Attributes (4)" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
         <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
         <connect from_op="Generate Attributes (3)" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
         <connect from_op="Generate Attributes" from_port="example set output" to_op="Append" to_port="example set 1"/>
         <connect from_op="Append" from_port="merged set" to_op="Split" to_port="example set input"/>
         <connect from_op="Split" from_port="example set output" to_op="De-Pivot" to_port="example set input"/>
         <connect from_op="De-Pivot" from_port="example set output" to_port="result 1"/>
         <portSpacing port="source_input 1" spacing="0"/>
         <portSpacing port="sink_result 1" spacing="0"/>
         <portSpacing port="sink_result 2" spacing="0"/>
       </process>
     </operator>
    </process>
    The first part only creates fake data for demo purposes. The real work begins with "Split", which splits each review on either . or ! followed by a whitespace character. "De-Pivot" then takes the resulting columns and converts them to rows, so each sentence ends up in its own row next to its Title.
    Note that I have added a breakpoint after "Append", i.e. right before "Split", so you can inspect the input data, which is probably somewhat similar to what you have. When you run the process, it pauses at the breakpoint. After looking at the data, you can press the (now green) run button again to finish the process.
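
For readers who want to see the split-then-de-pivot idea outside of RapidMiner, here is a minimal Python sketch of the same logic. It is not what the operators do internally, only an illustration under assumptions: the `reviews` list mirrors the demo data from the process, the function name `split_reviews` and the output key `Sentence` are made up for this example, and the pattern `[.!]\s` is one way to express "a period or exclamation mark followed by whitespace".

```python
import re

# Hypothetical stand-in for the demo data generated in the process above.
reviews = [
    {"Title": "Funny Game",
     "Review": "This is a review. It is quite a good game. "
               "But I'm not really sure! Ask someone else."},
    {"Title": "Bad Simulator",
     "Review": "Also a review! Not so good a game. Don't buy this!"},
]

# A period or exclamation mark followed by one whitespace character.
SPLIT_PATTERN = re.compile(r"[.!]\s")


def split_reviews(rows):
    """Return one row per sentence, keeping the title of the game."""
    result = []
    for row in rows:
        # "Split" step: cut the review text at every delimiter match.
        parts = SPLIT_PATTERN.split(row["Review"])
        # "De-Pivot" step: turn the parts into rows with a running index.
        for number, sentence in enumerate(parts, start=1):
            sentence = sentence.strip()
            if sentence:  # skip empty fragments
                result.append({
                    "Title": row["Title"],
                    "SentenceNumber": number,
                    "Sentence": sentence,
                })
    return result


sentences = split_reviews(reviews)
for s in sentences:
    print(s["Title"], s["SentenceNumber"], s["Sentence"], sep=" | ")
```

In this sketch the matched punctuation is consumed by the split, so only the last sentence of each review keeps its closing mark. If the punctuation matters for later analysis, a lookbehind such as `(?<=[.!])\s` would split on the whitespace alone.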

    Cheers,
    Marco