Set of unique strings - ways to organize, structure, and group related elements?

far_in_out Member Posts: 3 Contributor I
edited December 2018 in Help

Hi. I need to automate a task.

I have a list of strings (where each line is a keyword, i.e. a search query) that goes like this:

 

cord to connect laptop to tv

how do i connect my laptop to my tv

cable to connect laptop to tv

how to connect laptop to smart tv

connect laptop to tv hdmi windows 10

...

 

Each of these strings is unique, as in none of them is an exact match to any other, but most of them can be grouped by topic, and most of the topics can be further split into subtopics, and so on. That's what I want to do. I also want as many different ways of grouping as possible, so I can see all the ways these elements relate to each other. In short, I'd like to extract any information that would help organize and structure this data set.

 

I already know how to calculate word frequency for my lists in RM. That's a start: I can use the most frequent relevant tokens as topic candidates for manual grouping. But I'm not sure where to go from there if I want to do it automatically. The problem is that all the examples of clustering that I find online deal with documents rather than lists of short strings, and I don't think any of those techniques can be used in my case.
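For concreteness, that frequency-counting starting point can be sketched outside RM in a few lines of plain Python. The stopword list here is an ad-hoc assumption, just for illustration:

```python
from collections import Counter

queries = [
    "cord to connect laptop to tv",
    "how do i connect my laptop to my tv",
    "cable to connect laptop to tv",
    "how to connect laptop to smart tv",
    "connect laptop to tv hdmi windows 10",
]

# Tiny ad-hoc stopword list, just for illustration.
STOPWORDS = {"to", "do", "i", "my", "how"}

# Lowercase, tokenize on whitespace, drop stopwords, count across all queries.
freq = Counter(
    t for q in queries for t in q.lower().split() if t not in STOPWORDS
)
print(freq.most_common(3))  # → [('connect', 5), ('laptop', 5), ('tv', 5)]
```

The most frequent surviving tokens are exactly the topic candidates mentioned above.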

Answers

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,127  RM Data Scientist

    Hi,

     

    Well, you can just do clustering or topic detection on the bag of words of your short search strings. There is absolutely no reason not to do this. The only question is how you define similarity between two bags of words.
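    To make that similarity question concrete, here is a minimal pure-Python sketch of cosine similarity between two bag-of-words vectors. The whitespace tokenizer is a simplification, not what the RapidMiner operators do internally:

```python
import math
from collections import Counter

def bow(text):
    """Bag of words: token -> occurrence count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) \
         * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

q1 = bow("cable to connect laptop to tv")
q2 = bow("how to connect laptop to smart tv")
q3 = bow("best pizza near me")

print(cosine(q1, q2))  # high: the two queries share most of their tokens
print(cosine(q1, q3))  # → 0.0, no tokens in common
```

    Any clustering operator that accepts a numerical similarity measure (k-medoids with cosine similarity, as in the attached process) can then work on such vectors.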


    Attached is an example doing both clustering and topic detection on your example data. It needs the Operator Toolbox extension to run.

     

    BR,

    Martin

     

    <?xml version="1.0" encoding="UTF-8"?><process version="9.0.002">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.0.002" expanded="true" name="Process">
    <process expanded="true">
    <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.5.000-SNAPSHOT" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34">
    <parameter key="generator_type" value="comma_separated_text"/>
    <list key="function_descriptions"/>
    <list key="numeric_series_configuration"/>
    <list key="date_series_configuration"/>
    <list key="date_series_configuration (interval)"/>
    <parameter key="input_csv_text" value="text&#10;cord to connect laptop to tv&#10;how do i connect my laptop to my tv&#10;cable to connect laptop to tv&#10;how to connect laptop to smart tv&#10;connect laptop to tv hdmi windows 10"/>
    </operator>
    <operator activated="true" class="nominal_to_text" compatibility="9.0.002" expanded="true" height="82" name="Nominal to Text" width="90" x="179" y="34">
    <parameter key="attribute_filter_type" value="single"/>
    <parameter key="attribute" value="text"/>
    </operator>
    <operator activated="true" class="multiply" compatibility="9.0.002" expanded="true" height="103" name="Multiply (2)" width="90" x="313" y="34"/>
    <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="514" y="34">
    <parameter key="vector_creation" value="Term Occurrences"/>
    <parameter key="prune_method" value="percentual"/>
    <parameter key="prune_below_percent" value="1.0"/>
    <parameter key="prune_below_absolute" value="5"/>
    <parameter key="prune_above_absolute" value="50"/>
    <list key="specify_weights"/>
    <process expanded="true">
    <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="45" y="34"/>
    <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="179" y="34"/>
    <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="34"/>
    <connect from_port="document" to_op="Tokenize" to_port="document"/>
    <connect from_op="Tokenize" from_port="document" to_op="Transform Cases" to_port="document"/>
    <connect from_op="Transform Cases" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
    <connect from_op="Filter Stopwords (English)" from_port="document" to_port="document 1"/>
    <portSpacing port="source_document" spacing="0"/>
    <portSpacing port="sink_document 1" spacing="0"/>
    <portSpacing port="sink_document 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="text:data_to_documents" compatibility="8.1.000" expanded="true" height="68" name="Data to Documents" width="90" x="514" y="238">
    <list key="specify_weights"/>
    </operator>
    <operator activated="true" class="operator_toolbox:lda" compatibility="1.5.000-SNAPSHOT" expanded="true" height="124" name="Extract Topics from Document (LDA)" width="90" x="648" y="238">
    <parameter key="number_of_topics" value="2"/>
    <parameter key="iterations" value="500"/>
    </operator>
    <operator activated="true" class="k_medoids" compatibility="9.0.002" expanded="true" height="82" name="Clustering" width="90" x="648" y="34">
    <parameter key="measure_types" value="NumericalMeasures"/>
    <parameter key="numerical_measure" value="CosineSimilarity"/>
    </operator>
    <connect from_op="Create ExampleSet" from_port="output" to_op="Nominal to Text" to_port="example set input"/>
    <connect from_op="Nominal to Text" from_port="example set output" to_op="Multiply (2)" to_port="input"/>
    <connect from_op="Multiply (2)" from_port="output 1" to_op="Process Documents from Data" to_port="example set"/>
    <connect from_op="Multiply (2)" from_port="output 2" to_op="Data to Documents" to_port="example set"/>
    <connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
    <connect from_op="Data to Documents" from_port="documents" to_op="Extract Topics from Document (LDA)" to_port="col"/>
    <connect from_op="Extract Topics from Document (LDA)" from_port="exa" to_port="result 2"/>
    <connect from_op="Clustering" from_port="clustered set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    <portSpacing port="sink_result 3" spacing="0"/>
    </process>
    </operator>
    </process>
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • far_in_out Member Posts: 3 Contributor I

    Ok, Thanks. That looks like something I'm after.

    So, as a result I now have the list of IDs of all strings, with a cluster assigned to every ID.

    Now, how do I get a column with the actual string into that results table? IDs are not very helpful. Or am I missing something?

    Thanks again for your help.

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,127  RM Data Scientist

    Hi @far_in_out,

    I think you just need to tick "keep text" in Process Documents from Data. Or just join the result back to the original table using the Join operator.

     

    BR,

    Martin

  • far_in_out Member Posts: 3 Contributor I

    Ok, thanks. That worked.

    Are there clustering algorithms in RapidMiner that can put one element into multiple clusters?

  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,127  RM Data Scientist

    Hi,

     

    I think Expectation Maximization does this. But honestly, I would rather consider LDA instead. It's not exactly clustering, but it also assigns documents to multiple topics.
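    The soft-assignment idea can be illustrated with a toy sketch. The topic word sets below are hypothetical and hand-picked for the example; EM and LDA learn their assignments probabilistically rather than by raw word overlap:

```python
# Hypothetical topic word sets, hand-picked for illustration (not learned).
topics = {
    "cables": {"cord", "cable", "hdmi"},
    "how-to": {"how", "connect", "smart"},
}

def memberships(query):
    """Split a query's membership across every overlapping topic, proportionally."""
    tokens = set(query.lower().split())
    overlap = {name: len(tokens & words) for name, words in topics.items()}
    total = sum(overlap.values())
    return {name: n / total for name, n in overlap.items() if n} if total else {}

# One query can belong to several "clusters" at once, with weights summing to 1.
print(memberships("how to connect laptop to tv hdmi"))
# "hdmi" pulls it toward 'cables'; "how" and "connect" pull it toward 'how-to'
```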

     

    BR,

    Martin
