Performing Principal Component Analysis of a set of tweets.

rcamacho Member Posts: 3 Contributor I
edited November 2018 in Help

Hello! First and foremost, I apologize if this topic has already been covered elsewhere. I have spent a considerable amount of time looking for a method.

 

I have found two social science studies that ran PCA on text data using RapidMiner. They displayed in a table which words had the highest eigenvalue for a particular factor. I am interested in learning how to do this, but so far I have been frustrated by the lack of a documented process or steps. I also wonder whether it is so elementary that nobody has written the process up.

 

To be more specific, I am interested in analyzing an Excel file containing 2000 tweets (for starters). Thank you in advance for your assistance!

Best Answer

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn
    Solution Accepted

    Well, without reading the whole thing, it's kind of hard to figure out exactly what they did.

     

    I suspect it must be something like this:

     

    <?xml version="1.0" encoding="UTF-8"?><process version="7.2.003">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="7.2.003" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="social_media:search_twitter" compatibility="7.2.000" expanded="true" height="68" name="Search Twitter" width="90" x="112" y="34">
    <parameter key="connection" value="Twitter Connection"/>
    <parameter key="query" value="iphone"/>
    <parameter key="language" value="en"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.2.003" expanded="true" height="82" name="Select Attributes" width="90" x="246" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Text"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.2.003" expanded="true" height="82" name="Generate Attributes" width="90" x="380" y="34">
    <list key="function_descriptions">
    <parameter key="label" value="&quot;iPhone&quot;"/>
    </list>
    </operator>
    <operator activated="true" class="social_media:search_twitter" compatibility="7.2.000" expanded="true" height="68" name="Search Twitter (2)" width="90" x="112" y="136">
    <parameter key="connection" value="Twitter Connection"/>
    <parameter key="query" value="samsung"/>
    <parameter key="language" value="en"/>
    </operator>
    <operator activated="true" class="select_attributes" compatibility="7.2.003" expanded="true" height="82" name="Select Attributes (2)" width="90" x="246" y="136">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Text"/>
    </operator>
    <operator activated="true" class="generate_attributes" compatibility="7.2.003" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="380" y="136">
    <list key="function_descriptions">
    <parameter key="label" value="&quot;samsung&quot;"/>
    </list>
    </operator>
    <operator activated="true" class="append" compatibility="7.2.003" expanded="true" height="103" name="Append" width="90" x="581" y="34"/>
    <operator activated="true" class="set_role" compatibility="7.2.003" expanded="true" height="82" name="Set Role" width="90" x="715" y="34">
    <parameter key="attribute_name" value="label"/>
    <parameter key="target_role" value="label"/>
    <list key="set_additional_roles"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="7.2.003" expanded="true" height="82" name="Nominal to Text" width="90" x="849" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="Text"/>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="7.2.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="983" y="34">
    <parameter key="prune_method" value="percentual"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="7.2.001" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="principal_component_analysis" compatibility="7.2.003" expanded="true" height="103" name="PCA" width="90" x="1117" y="34"/>
    <connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
    <connect from_op="Select Attributes" from_port="example set output" to_op="Generate Attributes" to_port="example set input"/>
    <connect from_op="Generate Attributes" from_port="example set output" to_op="Append" to_port="example set 1"/>
    <connect from_op="Search Twitter (2)" from_port="output" to_op="Select Attributes (2)" to_port="example set input"/>
    <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
    <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Append" to_port="example set 2"/>
    <connect from_op="Append" from_port="merged set" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="PCA" to_port="example set input"/>
    <connect from_op="PCA" from_port="example set output" to_port="result 2"/>
    <connect from_op="PCA" from_port="preprocessing model" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>

     

    That said, I'm a bit cautious about the 100% accuracy of their model. :)
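For the table of words per factor that the question asks about, the step after PCA is to rank the words by their weight in each component. Below is a minimal sketch in plain Python, not RapidMiner's implementation. It assumes the studies listed the words with the largest absolute loadings (eigenvector entries) on each component, which is a guess. The toy term matrix, the vocabulary, and every function name are hypothetical, made up for illustration.

```python
import math
import random

def center(rows):
    """Subtract each column mean so PCA works on deviations from the mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def leading_eigenpair(cov, iters=200, seed=0):
    """Largest eigenvalue and its eigenvector, found by power iteration."""
    d = len(cov)
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    v = [rng.random() for _ in range(d)]
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return eigenvalue, v

def pca(rows, n_components=2):
    """Return [(eigenvalue, loadings), ...], largest eigenvalue first."""
    cov = covariance(center(rows))
    d = len(cov)
    components = []
    for _ in range(n_components):
        eigenvalue, v = leading_eigenpair(cov)
        components.append((eigenvalue, v))
        # Deflation: remove the component just found before searching again.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return components

def top_words(vocab, loadings, k=2):
    """The k words with the largest absolute loading on one component."""
    ranked = sorted(zip(vocab, loadings), key=lambda p: abs(p[1]), reverse=True)
    return ranked[:k]

# Hypothetical toy term matrix: one row per tweet, one column per word.
vocab = ["battery", "camera", "price", "screen"]
rows = [
    [2.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [2.0, 2.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
components = pca(rows, n_components=2)
for number, (eigenvalue, loadings) in enumerate(components, start=1):
    words = ", ".join(f"{w} ({x:+.2f})" for w, x in top_words(vocab, loadings))
    print(f"PC{number}: eigenvalue={eigenvalue:.3f}  top words: {words}")
```

Note that the eigenvalue belongs to the component as a whole (how much variance it explains), while each word gets a loading on that component. Power iteration with deflation is used here only to keep the sketch dependency-free; with 2000 tweets and a large vocabulary, a proper linear algebra library would be the more practical choice.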

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Can you provide a link to where this was done? My initial thought is that the text was transformed into word vectors using TF-IDF or something similar.

  • rcamacho Member Posts: 3 Contributor I

    Hello! Here is one article that claims to do it. I apologize that I cannot provide the whole article, but to quote the specific portion:

     

    "We separated China from Philippine news reports, then extracted principal components from our two separate sets-of-words. This procedure is intuitively similar to what principal components analysis does to quantified variables." (Montiel et al., 2014)

  • rcamacho Member Posts: 3 Contributor I

    Thank you! I will attempt to make sense of this.
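For reference, the TF-IDF word-vector step suggested earlier in the thread can be sketched in plain Python as follows. This is a textbook variant (term frequency times the log of inverse document frequency) and not necessarily the exact weighting or normalization RapidMiner applies. The tokenizer, the function names, and the example tweets are all hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split on anything that is not a letter."""
    return [t for t in re.split(r"[^a-z]+", text.lower()) if t]

def tfidf_matrix(docs):
    """Return (vocab, rows) where rows[i][j] is the TF-IDF of word j in doc i."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each word occurs.
    df = Counter(w for toks in tokenized for w in set(toks))
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid dividing by zero for empty tweets
        rows.append([(counts[w] / total) * math.log(n_docs / df[w])
                     for w in vocab])
    return vocab, rows

# Hypothetical example tweets, made up for illustration.
tweets = [
    "new iphone camera is great",
    "samsung screen is great",
    "iphone battery life",
]
vocab, rows = tfidf_matrix(tweets)
```

Each row of the resulting matrix is one tweet and each column is one word, which is the shape PCA expects as input.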
