How to use LSI Operator

khazan · March 2018

Hello
How do I use LSI (latent sentiment analysis) to cluster texts on the basis of meaning? How can I use this method in RapidMiner to find related words based on their meaning, and also to find keywords?
Thanks

yyhuang · March 2018

Hi @khazan, do you mean latent semantic analysis?

https://en.wikipedia.org/wiki/Latent_semantic_analysis

khazan · April 2018

I want to use nasa for clustering, but I get an error. Is the order of the operators wrong? Please tell me how to use it for clustering. Thanks

khazan · April 2018

Hi,

I need help. Please help me.

yyhuang · April 2018

Hi @khazan,

I do not have your NASA data. If possible can you please share the data that was used for your text processing?

<?xml version="1.0" encoding="UTF-8"?><process version="8.1.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.1.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="subprocess" compatibility="8.1.001" expanded="true" height="103" name="Get News Feeds" width="90" x="45" y="34">
        <process expanded="true">
          <operator activated="true" class="web:read_rss" compatibility="7.3.000" expanded="true" height="68" name="BBC Top Stories" width="90" x="45" y="34">
            <parameter key="url" value="http://feeds.bbci.co.uk/news/rss.xml"/>
          </operator>
          <operator activated="true" class="web:read_rss" compatibility="7.3.000" expanded="true" height="68" name="BBC Asia" width="90" x="45" y="85">
            <parameter key="url" value="http://feeds.bbci.co.uk/news/world/asia/rss.xml"/>
          </operator>
          <operator activated="true" class="web:read_rss" compatibility="7.3.000" expanded="true" height="68" name="BBC Business" width="90" x="45" y="136">
            <parameter key="url" value="http://feeds.bbci.co.uk/news/business/rss.xml"/>
          </operator>
          <operator activated="true" class="web:read_rss" compatibility="7.3.000" expanded="true" height="68" name="BBC Entertainment" width="90" x="45" y="187">
            <parameter key="url" value="http://feeds.bbci.co.uk/news/entertainment_and_arts/rss.xml"/>
          </operator>
          <operator activated="true" class="append" compatibility="8.1.001" expanded="true" height="145" name="Append" width="90" x="179" y="34"/>
          <operator activated="true" class="generate_copy" compatibility="8.1.001" expanded="true" height="82" name="Generate Copy" width="90" x="313" y="34">
            <parameter key="attribute_name" value="Title"/>
            <parameter key="new_name" value="Title2"/>
          </operator>
          <operator activated="true" class="text_to_nominal" compatibility="8.1.001" expanded="true" height="82" name="Text to Nominal" width="90" x="447" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attributes" value="Link|Title2"/>
            <description align="center" color="transparent" colored="false" width="126">Don't convert article link to document text.</description>
          </operator>
          <connect from_op="BBC Top Stories" from_port="output" to_op="Append" to_port="example set 1"/>
          <connect from_op="BBC Asia" from_port="output" to_op="Append" to_port="example set 2"/>
          <connect from_op="BBC Business" from_port="output" to_op="Append" to_port="example set 3"/>
          <connect from_op="BBC Entertainment" from_port="output" to_op="Append" to_port="example set 4"/>
          <connect from_op="Append" from_port="merged set" to_op="Generate Copy" to_port="example set input"/>
          <connect from_op="Generate Copy" from_port="example set output" to_op="Text to Nominal" to_port="example set input"/>
          <connect from_op="Text to Nominal" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
          <portSpacing port="sink_out 3" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="8.1.001" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Content|Title"/>
      </operator>
      <operator activated="true" class="set_role" compatibility="8.1.001" expanded="true" height="82" name="Set Role" width="90" x="313" y="34">
        <parameter key="attribute_name" value="Id"/>
        <parameter key="target_role" value="id"/>
        <list key="set_additional_roles">
          <parameter key="Author" value="author"/>
          <parameter key="Link" value="link"/>
          <parameter key="Published" value="date"/>
          <parameter key="Title2" value="title2"/>
        </list>
      </operator>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
        <parameter key="prune_method" value="absolute"/>
        <parameter key="prune_below_absolute" value="2"/>
        <parameter key="prune_above_absolute" value="50"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="112" y="34"/>
          <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="313" y="34"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="514" y="34"/>
          <connect from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="singular_value_decomposition" compatibility="8.1.001" expanded="true" height="103" name="SVD" width="90" x="581" y="34">
        <parameter key="dimensionality_reduction" value="keep percentage"/>
        <parameter key="percentage_threshold" value="0.6"/>
      </operator>
      <connect from_op="Get News Feeds" from_port="out 1" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Set Role" to_port="example set input"/>
      <connect from_op="Set Role" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="SVD" to_port="example set input"/>
      <connect from_op="SVD" from_port="example set output" to_port="result 1"/>
      <connect from_op="SVD" from_port="original" to_port="result 2"/>
      <connect from_op="SVD" from_port="preprocessing model" to_port="result 3"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <portSpacing port="sink_result 3" spacing="0"/>
      <portSpacing port="sink_result 4" spacing="0"/>
    </process>
  </operator>
</process>

In my attached process, I run tf-idf on the news feeds and then apply SVD to the resulting document-term matrix.

Why SVD?

Latent Semantic Analysis is a technique for creating a vector representation of a document. Latent Semantic Analysis takes tf-idf one step further. "Latent Semantic Analysis (LSA)" and "Latent Semantic Indexing (LSI)" are the same thing, with the latter name being used sometimes when referring specifically to indexing a collection of documents for search ("Information Retrieval").

LSA is quite simple: you just use SVD to perform dimensionality reduction on the tf-idf vectors. That's really all there is to it!
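For readers who want to see the same idea outside RapidMiner, here is a minimal Python/NumPy sketch of tf-idf followed by SVD. The toy documents, the helper names `tfidf_matrix`, `lsa` and `cosine`, and the choice of k = 2 are assumptions made for illustration. The tf-idf weighting used here (relative term frequency times log(N/df)) may differ from what Process Documents from Data computes.

```python
import numpy as np

def tfidf_matrix(docs):
    """Return (vocab, matrix): one row per document, one column per term."""
    tokens = [d.lower().split() for d in docs]
    vocab = sorted({w for t in tokens for w in t})
    counts = np.array([[t.count(w) for w in vocab] for t in tokens], dtype=float)
    tf = counts / counts.sum(axis=1, keepdims=True)  # relative term frequency
    df = (counts > 0).sum(axis=0)                    # documents containing each term
    idf = np.log(len(docs) / df)                     # rarer terms get more weight
    return vocab, tf * idf

def lsa(matrix, k):
    """Project the documents onto the first k SVD components."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k] * s[:k]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy corpus, not the BBC feeds used in the process above.
docs = [
    "stock market shares fall",
    "market shares rise stock",
    "film actor wins award",
    "actor film award night show",
]

vocab, tfidf = tfidf_matrix(docs)
doc_vectors = lsa(tfidf, k=2)  # each document is now a 2-dimensional vector

print("finance vs finance:", cosine(doc_vectors[0], doc_vectors[1]))
print("finance vs film:   ", cosine(doc_vectors[0], doc_vectors[2]))
```

In this toy corpus the two finance documents end up pointing in almost the same direction, while finance and film documents are nearly orthogonal. That separation is what you would rely on when clustering the reduced vectors.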

You can inspect the LSA results (tf-idf + SVD) for the news feed data by checking the very first component (SVD_1) of the SVD matrix and looking at the terms that this component gives the highest weight (by absolute value in SVD Vector 1).
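The same inspection can be sketched in Python/NumPy. This is only an illustration: the small matrix and term list below are made up, and it assumes that SVD Vector 1 corresponds to the first right singular vector (one weight per term).

```python
import numpy as np

# Hypothetical tf-idf values: rows are documents, columns are terms.
terms = ["award", "film", "market", "shares", "stock"]
tfidf = np.array([
    [0.0, 0.0, 0.5, 0.4, 0.6],
    [0.0, 0.1, 0.6, 0.5, 0.5],
    [0.7, 0.6, 0.0, 0.0, 0.0],
])

# Rows of Vt are the components; each row holds one weight per term.
U, s, Vt = np.linalg.svd(tfidf, full_matrices=False)
first_component = Vt[0]

# Rank terms by absolute weight, because the sign of a singular vector is arbitrary.
order = np.argsort(-np.abs(first_component))
top_terms = [(terms[i], float(first_component[i])) for i in order[:3]]

for term, weight in top_terms:
    print(f"{term}: {weight:+.3f}")
```

The terms at the top of this list are the ones that define the first latent "topic". Repeating the ranking for `Vt[1]`, `Vt[2]` and so on shows what the later components capture.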

HTH

YY

khazan · April 2018

Thanks a lot.
Could you send me a photo showing how to arrange the operators? I do not know how to use the XML code.
Thank you.

MartinLiebig · April 2018

Dear @khazan,

you got a full-blown solution from @yyhuang up there. The only thing you need to do is google for something like "RapidMiner xml import" to find the right article. We are happy to help, but you also need to do a bit of the walking.

Please have a look at this post by Ingo, which explains how to read in XML processes: https://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Import-XML-code-to-process/m-p/32606

Images of processes are far inferior to XML processes, which carry the full details.

Best,

Martin

khazan · April 2018

Thank you very much.
Is that correct?
What should I get now?
What is meant by the numbers in the columns of the result?
And is this method correct for clustering based on meaning?

Altair RapidMiner Community
