How to use LSI Operator

khazan Member Posts: 23
edited December 2018 in Help

Hello,
How do I use the LSI (latent sentiment analysis) operator to cluster texts on the basis of meaning? How can I use this method in RapidMiner to find related words based on meaning, and also to find keywords?
Thanks

Answers

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist

    Hi @khazan, do you mean latent semantic analysis?

    https://en.wikipedia.org/wiki/Latent_semantic_analysis

  • khazan Member Posts: 23

    I want to use LSA on my NASA data for clustering, but I get an error. Is the order of the operators wrong? Please tell me how to use it for clustering. Thanks.

    lsa.JPG

  • khazan Member Posts: 23

    Hi,

    I need help.

    Please help me.

     

  • yyhuang Administrator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 364 RM Data Scientist

    Hi @khazan,

     

    I do not have your NASA data. If possible, could you please share the data that was used for your text processing?

    <?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="subprocess" compatibility="8.1.001" expanded="true" height="103" name="Get News Feeds" width="90" x="45" y="34">
    <process expanded="true">
    <operator activated="true" class="web:read_rss" compatibility="7.3.000" expanded="true" height="68" name="BBC Top Stories" width="90" x="45" y="34">
    <parameter key="url" value="http://feeds.bbci.co.uk/news/rss.xml"/>
    </operator>
    <operator activated="true" class="web:read_rss" compatibility="7.3.000" expanded="true" height="68" name="BBC Asia" width="90" x="45" y="85">
    <parameter key="url" value="http://feeds.bbci.co.uk/news/world/asia/rss.xml"/>
    </operator>
    <operator activated="true" class="web:read_rss" compatibility="7.3.000" expanded="true" height="68" name="BBC Business" width="90" x="45" y="136">
    <parameter key="url" value="http://feeds.bbci.co.uk/news/business/rss.xml"/>
    </operator>
    <operator activated="true" class="web:read_rss" compatibility="7.3.000" expanded="true" height="68" name="BBC Entertainment" width="90" x="45" y="187">
    <parameter key="url" value="http://feeds.bbci.co.uk/news/entertainment_and_arts/rss.xml"/>
    </operator>
    <operator activated="true" class="append" compatibility="8.1.001" expanded="true" height="145" name="Append" width="90" x="179" y="34"/>
    <operator activated="true" class="generate_copy" compatibility="8.1.001" expanded="true" height="82" name="Generate Copy" width="90" x="313" y="34">
    <parameter key="attribute_name" value="Title"/>
    <parameter key="new_name" value="Title2"/>
    </operator>
    <operator activated="true" class="text_to_nominal" compatibility="8.1.001" expanded="true" height="82" name="Text to Nominal" width="90" x="447" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Link|Title2"/>
    <description align="center" color="transparent" colored="false" width="126">Don't convert article link to document text.</description>
    </operator>
    <connect from_op="BBC Top Stories" from_port="output" to_op="Append" to_port="example set 1"/>
    <connect from_op="BBC Asia" from_port="output" to_op="Append" to_port="example set 2"/>
    <connect from_op="BBC Business" from_port="output" to_op="Append" to_port="example set 3"/>
    <connect from_op="BBC Entertainment" from_port="output" to_op="Append" to_port="example set 4"/>
    <connect from_op="Append" from_port="merged set" to_op="Generate Copy" to_port="example set input"/>
    <connect from_op="Generate Copy" from_port="example set output" to_op="Text to Nominal" to_port="example set input"/>
    <connect from_op="Text to Nominal" from_port="example set output" to_port="out 1"/>
    <portSpacing port="source_in 1" spacing="0"/>
    <portSpacing port="sink_out 1" spacing="0"/>
    <portSpacing port="sink_out 2" spacing="0"/>
    <portSpacing port="sink_out 3" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="subset"/>
    <parameter key="attributes" value="Content|Title"/>
    </operator>
    <operator activated="true" class="set_role" compatibility="8.1.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
    <parameter key="attribute_name" value="Id"/>
    <parameter key="target_role" value="id"/>
    <list key="set_additional_roles">
    <parameter key="Author" value="author"/>
    <parameter key="Link" value="link"/>
    <parameter key="Published" value="date"/>
    <parameter key="Title2" value="tittle2"/>
    </list>
    </operator>
    <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
    <parameter key="prune_method" value="absolute"/>
    <parameter key="prune_below_absolute" value="2"/>
    <parameter key="prune_above_absolute" value="50"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34"/>
    <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="34"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="514" y="34"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="singular_value_decomposition" compatibility="8.1.001" expanded="true" height="103" name="SVD" width="90" x="581" y="34">
    <parameter key="dimensionality_reduction" value="keep percentage"/>
    <parameter key="percentage_threshold" value="0.6"/>
    </operator>
    <connect from_op="Get News Feeds" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Set Role" to_port="example set input"/>
    <connect from_op="Set Role" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="SVD" to_port="example set input"/>
    <connect from_op="SVD" from_port="example set output" to_port="result 1"/>
    <connect from_op="SVD" from_port="original" to_port="result 2"/>
    <connect from_op="SVD" from_port="preprocessing model" to_port="result 3"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    <portSpacing port="sink_result 4" spacing="0"/>
    </process>
    </operator>
    </process>

    In my attached process, I run tf-idf on the news feeds and then apply SVD to the resulting document-term matrix.

    Why SVD?

    Latent Semantic Analysis is a technique for creating a vector representation of a document; it takes tf-idf one step further. "Latent Semantic Analysis (LSA)" and "Latent Semantic Indexing (LSI)" are the same thing, with the latter name sometimes used when referring specifically to indexing a collection of documents for search (information retrieval).

    LSA is quite simple: you use SVD to perform dimensionality reduction on the tf-idf vectors. That's really all there is to it!
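    The tf-idf + SVD pipeline described above can also be sketched outside RapidMiner, e.g. with scikit-learn. This is only an illustrative sketch: the toy corpus and the component count are placeholder assumptions, not data from this thread.

    ```python
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder documents standing in for the news feed titles/content.
    docs = [
        "stock markets rise on trade hopes",
        "markets fall as trade talks stall",
        "new film wins the festival award",
        "award season opens with festival hits",
    ]

    # Step 1: tf-idf document-term matrix. Lowercasing and English stop-word
    # removal mirror the Transform Cases / Filter Stopwords operators.
    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(docs)

    # Step 2: SVD-based dimensionality reduction of the tf-idf vectors.
    # This combination (tf-idf + truncated SVD) is exactly LSA.
    svd = TruncatedSVD(n_components=2, random_state=42)
    lsa_vectors = svd.fit_transform(X)

    print(lsa_vectors.shape)  # one low-dimensional "semantic" vector per document
    ```

    Each row of `lsa_vectors` is a document expressed in the latent semantic space, analogous to the SVD_1, SVD_2, ... columns the RapidMiner SVD operator produces.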

     

    You can inspect the LSA results (tf-idf + SVD) for the news feed data by checking the very first component (SVD_1) of the SVD matrix and looking at the terms that are given the highest weight (absolute value of SVD Vector 1) by this component.
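    To go from inspecting components to the clustering asked about in the original question, here is a small self-contained sketch in scikit-learn. The corpus, component count, and cluster count are illustrative assumptions only.

    ```python
    from sklearn.cluster import KMeans
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder corpus (two finance-like and two entertainment-like texts).
    docs = [
        "stock markets rise on trade hopes",
        "markets fall as trade talks stall",
        "new film wins the festival award",
        "award season opens with festival hits",
    ]

    tfidf = TfidfVectorizer(stop_words="english")
    X = tfidf.fit_transform(docs)
    svd = TruncatedSVD(n_components=2, random_state=0)
    lsa = svd.fit_transform(X)

    # Top terms of the first component (SVD_1), ranked by absolute weight,
    # as suggested above for inspecting what the component "means".
    terms = tfidf.get_feature_names_out()
    weights = svd.components_[0]
    top = sorted(zip(terms, weights), key=lambda tw: abs(tw[1]), reverse=True)[:5]
    print(top)

    # Clustering "based on meaning": cluster the LSA vectors, not raw tf-idf.
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(lsa)
    print(labels)
    ```

    The same idea applies in RapidMiner: feed the SVD output (the example set with the SVD_n attributes) into a clustering operator such as k-Means, rather than clustering the raw tf-idf word vector.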

     

    SVD1_WORDS.PNG

     

    HTH

    YY

  • khazan Member Posts: 23

    Thanks a lot!
    Could you please send me a screenshot of how the operators are arranged instead? I would really appreciate it, because I do not know how to use the XML code.
    Thank you.

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,503 RM Data Scientist

    Dear @khazan,

     

    you got a full-blown solution from @yyhuang above. The only thing you need to do is google something like "RapidMiner XML import" to find the right article. We are happy to help, but you also need to do a bit of the walking yourself.

     

    Please have a look at this post by Ingo: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Import-XML-code-to-process/m-p/32606 It explains how to read in XML processes. Images of processes are far inferior to the XML, which contains the full details.

     

    Best,

    Martin

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
  • khazan Member Posts: 23

    Thank you very much.
    Is this correct?
    What should I get now?
    What do the column numbers in the output mean?
    And is this method correct for clustering based on meaning?

    svd1.JPG svd2.JPG
