
Classification - comparison of one attribute to other attributes

Serek91 Member Posts: 22 Contributor II
edited June 2019 in Help
Hi. I'm trying to classify the authors of texts. I have 4 attributes containing the most commonly used words: attributes A, B, C and D. Attribute A is compared against A in the rest of the data, B against B in the rest of the data, etc.

But I want to check whether the value of attribute A occurs in any of the attributes A, B, C and D. For example:
1) row X has A with the value "example" and B with the value "test"
2) row Y has A with the value "test" and B with the value "qwerty"
3) the value "test" exists in both X and Y, so the check should return true, which means there is a higher chance that the author of X is the same as the author of Y

How can I do that? I want to use it together with operators like Decision Tree, k-NN, etc.
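
A minimal Python sketch of the check described above, outside RapidMiner. The function name `shared_words` and the dictionary layout of a row are assumptions made for illustration; only the values "example", "test" and "qwerty" come from the question.

```python
def shared_words(row_x, row_y, columns=("A", "B", "C", "D")):
    """Return the words that occur in any of the given columns of both rows."""
    words_x = {row_x[c] for c in columns}
    words_y = {row_y[c] for c in columns}
    return words_x & words_y


# The two rows from the example; only A and B are given there.
row_x = {"A": "example", "B": "test"}
row_y = {"A": "test", "B": "qwerty"}

common = shared_words(row_x, row_y, columns=("A", "B"))
print(common)        # {'test'}
print(bool(common))  # True
```

Comparing sets of values rather than column against column is what makes the position of the word (A, B, C or D) irrelevant.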

Answers

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    Hi @Serek91,

    What does your data look like? Would you mind sharing a small example?

    There can be many ways to do this, but it all depends on what your data looks like.

    Here is a picture of what I'm thinking:



    ...and here is the XML code for that operation.

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.000">

      <context>

        <input/>

        <output/>

        <macros/>

      </context>

      <operator activated="true" class="process" compatibility="9.3.000" expanded="true" name="Process">

        <parameter key="logverbosity" value="init"/>

        <parameter key="random_seed" value="2001"/>

        <parameter key="send_mail" value="never"/>

        <parameter key="notification_email" value=""/>

        <parameter key="process_duration_for_mail" value="30"/>

        <parameter key="encoding" value="UTF-8"/>

        <process expanded="true">

          <operator activated="true" class="utility:create_exampleset" compatibility="9.3.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34">

            <parameter key="generator_type" value="comma separated text"/>

            <parameter key="number_of_examples" value="100"/>

            <parameter key="use_stepsize" value="false"/>

            <list key="function_descriptions"/>

            <parameter key="add_id_attribute" value="false"/>

            <list key="numeric_series_configuration"/>

            <list key="date_series_configuration"/>

            <list key="date_series_configuration (interval)"/>

            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>

            <parameter key="time_zone" value="SYSTEM"/>

            <parameter key="input_csv_text" value="Author,A,B,C,D&#10;Tolstoi,word1,word2,word3,word4&#10;Chejov,word4,word5,word2,word6&#10;Dostoievski,word7,word8,word9,word6&#10;Solzhenitsyn,word10,word11,word3,word12"/>

            <parameter key="column_separator" value=","/>

            <parameter key="parse_all_as_nominal" value="false"/>

            <parameter key="decimal_point_character" value="."/>

            <parameter key="trim_attribute_names" value="true"/>

          </operator>

          <operator activated="true" class="de_pivot" compatibility="9.3.000" expanded="true" height="82" name="De-Pivot" width="90" x="179" y="34">

            <list key="attribute_name">

              <parameter key="Word" value="\w"/>

            </list>

            <parameter key="index_attribute" value="Index"/>

            <parameter key="create_nominal_index" value="true"/>

            <parameter key="keep_missings" value="false"/>

            <description align="center" color="transparent" colored="false" width="126">With the De-Pivot operator, we obtain a list of words, each with a nominal index naming the attribute the word came from.</description>

          </operator>

          <operator activated="true" class="multiply" compatibility="9.3.000" expanded="true" height="103" name="Multiply" width="90" x="313" y="34">

            <description align="center" color="transparent" colored="false" width="126">We use the Multiply operator to get two copies of the data, so that it can be joined with itself.</description>

          </operator>

          <operator activated="true" class="concurrency:join" compatibility="9.3.000" expanded="true" height="82" name="Join" width="90" x="447" y="34">

            <parameter key="remove_double_attributes" value="false"/>

            <parameter key="join_type" value="inner"/>

            <parameter key="use_id_attribute_as_key" value="false"/>

            <list key="key_attributes">

              <parameter key="Word" value="Word"/>

            </list>

            <parameter key="keep_both_join_attributes" value="false"/>

            <description align="center" color="transparent" colored="false" width="126">A simple inner join by words can show us what words are common among authors.</description>

          </operator>

          <operator activated="true" class="generate_attributes" compatibility="9.3.000" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34">

            <list key="function_descriptions">

              <parameter key="Same?" value="Author == Author_from_ES2"/>

            </list>

            <parameter key="keep_all" value="true"/>

            <description align="center" color="transparent" colored="false" width="126">The Join gave us that author A is the same as author A. We will compare each attribute and mark it as &amp;quot;Same&amp;quot;...</description>

          </operator>

          <operator activated="true" class="filter_examples" compatibility="9.3.000" expanded="true" height="103" name="Filter Examples" width="90" x="715" y="34">

            <parameter key="parameter_expression" value=""/>

            <parameter key="condition_class" value="custom_filters"/>

            <parameter key="invert_filter" value="false"/>

            <list key="filters_list">

              <parameter key="filters_entry_key" value="Same?.equals.false"/>

            </list>

            <parameter key="filters_logic_and" value="true"/>

            <parameter key="filters_check_metadata" value="true"/>

            <description align="center" color="transparent" colored="false" width="126">...so that we can filter these repeated similarities.</description>

          </operator>

          <operator activated="true" class="select_attributes" compatibility="9.3.000" expanded="true" height="82" name="Select Attributes" width="90" x="849" y="34">

            <parameter key="attribute_filter_type" value="subset"/>

            <parameter key="attribute" value=""/>

            <parameter key="attributes" value="Author|Author_from_ES2|Index|Index_from_ES2|Word"/>

            <parameter key="use_except_expression" value="false"/>

            <parameter key="value_type" value="attribute_value"/>

            <parameter key="use_value_type_exception" value="false"/>

            <parameter key="except_value_type" value="time"/>

            <parameter key="block_type" value="attribute_block"/>

            <parameter key="use_block_type_exception" value="false"/>

            <parameter key="except_block_type" value="value_matrix_row_start"/>

            <parameter key="invert_selection" value="false"/>

            <parameter key="include_special_attributes" value="false"/>

            <description align="center" color="transparent" colored="false" width="126">Finally, we select only the attributes we need.</description>

          </operator>

          <connect from_op="Create ExampleSet" from_port="output" to_op="De-Pivot" to_port="example set input"/>

          <connect from_op="De-Pivot" from_port="example set output" to_op="Multiply" to_port="input"/>

          <connect from_op="Multiply" from_port="output 1" to_op="Join" to_port="left"/>

          <connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>

          <connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>

          <connect from_op="Generate Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>

          <connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>

          <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>

          <portSpacing port="source_input 1" spacing="0"/>

          <portSpacing port="sink_result 1" spacing="0"/>

          <portSpacing port="sink_result 2" spacing="0"/>

        </process>

      </operator>

    </process>

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn
    edited June 2019
    Hi @Serek91,

    This process has a problem, though: the Join returns every match twice, once in each direction:

    Chéjov == Dostoievski
    Dostoievski == Chéjov

    You can eliminate those duplicate rows. I used Generate Attributes to create an attribute that says KEEP if the first author is less than the second (Chéjov is less than Dostoievski, because it begins with C and C < D) and DELETE if the first author is greater than the second (Dostoievski is greater than Chéjov because D > C). This is the corrected process:

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.000">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.000" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="UTF-8"/>
        <process expanded="true">
          <operator activated="true" class="utility:create_exampleset" compatibility="9.3.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="45" y="34">
            <parameter key="generator_type" value="comma separated text"/>
            <parameter key="number_of_examples" value="100"/>
            <parameter key="use_stepsize" value="false"/>
            <list key="function_descriptions"/>
            <parameter key="add_id_attribute" value="false"/>
            <list key="numeric_series_configuration"/>
            <list key="date_series_configuration"/>
            <list key="date_series_configuration (interval)"/>
            <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss"/>
            <parameter key="time_zone" value="SYSTEM"/>
            <parameter key="input_csv_text" value="Author,A,B,C,D&#10;Tolstoi,word1,word2,word3,word4&#10;Chejov,word4,word5,word2,word6&#10;Dostoievski,word7,word8,word9,word6&#10;Solzhenitsyn,word10,word11,word3,word12"/>
            <parameter key="column_separator" value=","/>
            <parameter key="parse_all_as_nominal" value="false"/>
            <parameter key="decimal_point_character" value="."/>
            <parameter key="trim_attribute_names" value="true"/>
          </operator>
          <operator activated="true" class="de_pivot" compatibility="9.3.000" expanded="true" height="82" name="De-Pivot" width="90" x="179" y="34">
            <list key="attribute_name">
              <parameter key="Word" value="\w"/>
            </list>
            <parameter key="index_attribute" value="Index"/>
            <parameter key="create_nominal_index" value="true"/>
            <parameter key="keep_missings" value="false"/>
            <description align="center" color="transparent" colored="false" width="126">With the De-Pivot operator, we obtain a list of words, each with a nominal index naming the attribute the word came from.</description>
          </operator>
          <operator activated="true" class="multiply" compatibility="9.3.000" expanded="true" height="103" name="Multiply" width="90" x="313" y="34">
            <description align="center" color="transparent" colored="false" width="126">We use the Multiply operator to get two copies of the data, so that it can be joined with itself.</description>
          </operator>
          <operator activated="true" class="concurrency:join" compatibility="9.3.000" expanded="true" height="82" name="Join" width="90" x="447" y="34">
            <parameter key="remove_double_attributes" value="false"/>
            <parameter key="join_type" value="inner"/>
            <parameter key="use_id_attribute_as_key" value="false"/>
            <list key="key_attributes">
              <parameter key="Word" value="Word"/>
            </list>
            <parameter key="keep_both_join_attributes" value="false"/>
            <description align="center" color="transparent" colored="false" width="126">A simple inner join by words can show us what words are common among authors.</description>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="9.3.000" expanded="true" height="82" name="Generate Attributes" width="90" x="581" y="34">
            <list key="function_descriptions">
              <parameter key="Same?" value="Author == Author_from_ES2"/>
              <parameter key="Repeated?" value="if(Author&lt;Author_from_ES2, &quot;KEEP&quot;, &quot;DELETE&quot;)"/>
            </list>
            <parameter key="keep_all" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">The Join gave us that author A is the same as author A. We will compare each attribute and mark it as &amp;quot;Same&amp;quot;...</description>
          </operator>
          <operator activated="true" class="filter_examples" compatibility="9.3.000" expanded="true" height="103" name="Filter Examples" width="90" x="715" y="34">
            <parameter key="parameter_expression" value=""/>
            <parameter key="condition_class" value="custom_filters"/>
            <parameter key="invert_filter" value="false"/>
            <list key="filters_list">
              <parameter key="filters_entry_key" value="Same?.equals.false"/>
              <parameter key="filters_entry_key" value="Repeated?.equals.KEEP"/>
            </list>
            <parameter key="filters_logic_and" value="true"/>
            <parameter key="filters_check_metadata" value="true"/>
            <description align="center" color="transparent" colored="false" width="126">...so that we can filter these repeated similarities.</description>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="9.3.000" expanded="true" height="82" name="Select Attributes" width="90" x="849" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value=""/>
            <parameter key="attributes" value="Author|Author_from_ES2|Index|Index_from_ES2|Word"/>
            <parameter key="use_except_expression" value="false"/>
            <parameter key="value_type" value="attribute_value"/>
            <parameter key="use_value_type_exception" value="false"/>
            <parameter key="except_value_type" value="time"/>
            <parameter key="block_type" value="attribute_block"/>
            <parameter key="use_block_type_exception" value="false"/>
            <parameter key="except_block_type" value="value_matrix_row_start"/>
            <parameter key="invert_selection" value="false"/>
            <parameter key="include_special_attributes" value="false"/>
            <description align="center" color="transparent" colored="false" width="126">Finally, we select only the attributes we need.</description>
          </operator>
          <connect from_op="Create ExampleSet" from_port="output" to_op="De-Pivot" to_port="example set input"/>
          <connect from_op="De-Pivot" from_port="example set output" to_op="Multiply" to_port="input"/>
          <connect from_op="Multiply" from_port="output 1" to_op="Join" to_port="left"/>
          <connect from_op="Multiply" from_port="output 2" to_op="Join" to_port="right"/>
          <connect from_op="Join" from_port="join" to_op="Generate Attributes" to_port="example set input"/>
          <connect from_op="Generate Attributes" from_port="example set output" to_op="Filter Examples" to_port="example set input"/>
          <connect from_op="Filter Examples" from_port="example set output" to_op="Select Attributes" to_port="example set input"/>
          <connect from_op="Select Attributes" from_port="example set output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
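
    For readers without RapidMiner at hand, the same idea (de-pivot, join the word list with itself, keep each author pair once) can be sketched in plain Python. This is only an illustration of the logic, not a translation of the operators; the variable names are assumptions, and the data is the sample from the Create ExampleSet operator.

```python
rows = [
    {"Author": "Tolstoi", "A": "word1", "B": "word2", "C": "word3", "D": "word4"},
    {"Author": "Chejov", "A": "word4", "B": "word5", "C": "word2", "D": "word6"},
    {"Author": "Dostoievski", "A": "word7", "B": "word8", "C": "word9", "D": "word6"},
    {"Author": "Solzhenitsyn", "A": "word10", "B": "word11", "C": "word3", "D": "word12"},
]

# De-pivot: one (author, index, word) record per word column.
long_rows = [(r["Author"], col, r[col]) for r in rows for col in ("A", "B", "C", "D")]

# Self-join on the word. The condition author_l < author_r drops both the
# self-matches (same author) and the mirrored duplicates in a single step.
pairs = sorted(
    (word_l, author_l, index_l, author_r, index_r)
    for author_l, index_l, word_l in long_rows
    for author_r, index_r, word_r in long_rows
    if word_l == word_r and author_l < author_r
)

for pair in pairs:
    print(pair)
```

    On the sample data this leaves four rows, one per shared word (word2, word3, word4 and word6). Note that Python compares strings by code point, which may order accented names differently from other tools.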


    Hope this helps,

    Rodrigo.
  • Serek91 Member Posts: 22 Contributor II
    edited June 2019
    Hi, my model is in the attachment (I can't add images).


    The content of my CSV looks like this:
    id, author_id, characters_number, words_number, average_sentence_length, average_word_length, unique_words_ratio, most_used_word_1, most_used_word_2, most_used_word_3, most_used_word_4
    "100395", "1000866", "1640", "318", "44", "6", "0,6006289", "anyway", "really", "decided", "write"
    "104212", "1000866", "1155", "230", "57", "6", "0,6173913", "we're", "almost", "scrub", "really"
    "108960", "1000866", "1774", "336", "59", "6", "0,5119048", "because", "chris", "about", "people"
    "111351", "1000866", "1034", "192", "47", "6", "0,6666667", "really", "peter", "because", "happy"


    EDIT: A few words about the purpose of this:
    I'm writing my master's thesis. I want to check the impact of each attribute on the end result: does it improve accuracy or not? I also want to know which attribute, used alone for training (without the others), gives the best accuracy. And I'm checking this for different operators (k-NN, Decision Tree, etc.).

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager
    @Serek91 I have boosted your profile. Now you can post images.

    Scott

  • Serek91 Member Posts: 22 Contributor II
    edited June 2019
    Thanks, but it is probably not exactly what I'm looking for. Still, such additional knowledge is always helpful.

    My process looks like:


    Inside each Cross Validation operator I have:



    The training operator differs each time: it can be Naive Bayes, Naive Bayes Kernel, Decision Tree or k-NN. The rest is the same.

    Example of my CSV:
    id, author_id, characters_number, words_number, average_sentence_length, average_word_length, unique_words_ratio, most_used_word_1, most_used_word_2, most_used_word_3, most_used_word_4
    "100395", "1000866", "1640", "318", "44", "6", "0,6006289", "anyway", "really", "decided", "write"
    "108960", "1000866", "1774", "336", "59", "6", "0,5119048", "decided", "chris", "really", "people"
    "111351", "1000866", "1034", "192", "47", "6", "0,6666667", "really", "peter", "because", "happy"
    "110248", "1011289", "3938", "723", "78", "6", "0,4979253", "there", "cordy", "another", "hours"
    "114290", "1011289", "1777", "328", "77", "6", "0,6128049", "jacen", "talking", "about", "they"
    "116160", "1011289", "1777", "348", "93", "6", "0,5545977", "about", "really", "write", "ending"
    "100209", "1011311", "3135", "598", "111", "6", "0,4598662", "remember", "really", "about", "think"
    "104488", "1011311", "1027", "196", "79", "6", "0,6479592", "lives", "worry", "control", "melody"
    "105743", "1011311", "1261", "243", "97", "6", "0,5884774", "little", "right", "think", "drivers"

    Each post has a unique ID, and author_id is the label. I don't want to train my model using conditions like most_used_word_1 === most_used_word_1_from_another_row, most_used_word_2 === most_used_word_2_from_another_row, etc.
    For words I want to have something like:
    1) Take the test row with ID 100395
    2) For each of its words, check how many times the word appears for each author in the remaining rows (no matter whether in column 1, 2, 3 or 4)
    2a) word "anyway"
    - no match
    0 probability for each author
    2b) word "really"
    - used in ID 108960 (the same author)
    - used in ID 111351 (the same author)
    - used in ID 116160 (author 1011289)
    - used in ID 100209 (author 1011311)
    50% probability for the same author (1000866), 25% each for 1011289 and 1011311
    2c) word "decided"
    - used in ID 108960 (the same author)
    100% probability that it is the author with ID 1000866
    2d) word "write"
    - used in ID 116160 (author 1011289)
    100% probability that it is the author with ID 1011289


    And this additional check should work together with the operator inside Cross Validation.

    But I'm not sure whether it makes any sense to check it this way ^^
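
    The counting scheme above can be sketched in Python as follows. This is an illustration of the steps only, not a RapidMiner operator; the function name `author_shares` and the tuple layout are assumptions, and the rows are the word columns of the example CSV.

```python
from collections import Counter

# (id, author_id, most_used_word_1 .. most_used_word_4)
posts = [
    ("100395", "1000866", "anyway", "really", "decided", "write"),
    ("108960", "1000866", "decided", "chris", "really", "people"),
    ("111351", "1000866", "really", "peter", "because", "happy"),
    ("110248", "1011289", "there", "cordy", "another", "hours"),
    ("114290", "1011289", "jacen", "talking", "about", "they"),
    ("116160", "1011289", "about", "really", "write", "ending"),
    ("100209", "1011311", "remember", "really", "about", "think"),
    ("104488", "1011311", "lives", "worry", "control", "melody"),
    ("105743", "1011311", "little", "right", "think", "drivers"),
]


def author_shares(word, test_id, posts):
    """Per author, the share of the other posts that contain `word` in any column."""
    hits = Counter(
        author
        for post_id, author, *words in posts
        if post_id != test_id and word in words
    )
    total = sum(hits.values())
    return {author: count / total for author, count in hits.items()} if total else {}


test_id, _, *test_words = posts[0]
for word in test_words:
    print(word, author_shares(word, test_id, posts))
```

    For the test row 100395 this gives no match for "anyway", 0.5 / 0.25 / 0.25 for "really", 1.0 for author 1000866 on "decided" and 1.0 for author 1011289 on "write". Inside a cross validation, the counts would have to come from the training rows only; otherwise the labels of the test rows leak into the features.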






  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Before trying to answer your question, I want to ask: do you only have these 4 attributes, or do you also have access to the word vectors or the raw texts? I think you are predicting under the assumption that these attributes have good predictive power, which may well not be the case.

    I would definitely try to get the word vectors and try out different supervised classification algorithms (best with Auto Model).

    Regards,
    Sebastian
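
    As a rough illustration of what "word vectors" means here, the following Python sketch turns raw texts into term-count vectors over a shared vocabulary. The tokenizer, the function names and the two sample sentences are assumptions made up for the example; they are not taken from the blog corpus.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_vectors(texts):
    """Return the sorted vocabulary and one term-count vector per text."""
    counts = [Counter(tokenize(text)) for text in texts]
    vocabulary = sorted(set().union(*counts))
    return vocabulary, [[c[word] for word in vocabulary] for c in counts]


# Hypothetical sample sentences, for illustration only.
texts = ["I really decided to write", "We really, really write"]
vocabulary, vectors = word_vectors(texts)
print(vocabulary)
print(vectors)
```

    Each text becomes one row with one numeric attribute per word, which is a form that learners such as k-NN or Naive Bayes can use directly, instead of only the four most used words.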

  • Serek91 Member Posts: 22 Contributor II
    edited June 2019
    I have no idea what I'm doing ^^ And I don't know RapidMiner well. I'm just trying to use different text properties (number of words per sentence, sentence length, percentage of unique words, etc.) that may improve the chance of finding the correct author. All properties are calculated in C#, then I generate a CSV to use in RapidMiner.
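
    A small Python sketch of how such text properties could be computed. The original calculation was done in C# and its exact definitions are not given in the thread, so everything here is an assumption: sentences are split on ".", "!" and "?", sentence length is measured in words, and the sample text is made up. The dictionary keys reuse the column names of the CSV.

```python
import re


def text_features(text):
    """Compute simple stylometric features for one text (assumed definitions)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    n_words = len(words)
    return {
        "characters_number": len(text),
        "words_number": n_words,
        # Assumption: sentence length is counted in words, not characters.
        "average_sentence_length": n_words / len(sentences) if sentences else 0.0,
        "average_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
        "unique_words_ratio": len(set(words)) / n_words if n_words else 0.0,
    }


# Hypothetical sample text, for illustration only.
features = text_features("I really like this. I really do!")
print(features)
```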

    I have raw texts from this set: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm

    But maybe checking the most used words and comparing them in the way I described is too hard for me. I just want to pass this master's thesis ^^