Classification - comparison of one attribute to other attributes

Serek91 · June 2019

Hi. I'm trying to classify the authors of texts. I have 4 attributes containing the most commonly used words: attributes A, B, C and D. Attribute A is compared against A in the rest of the data, B against B in the rest of the data, etc.

But I want to check whether the value of attribute A exists in any of the attributes A, B, C and D of another row. For example:

1) row X has A with the value "example" and B with the value "test"

2) row Y has A with the value "test" and B with the value "qwerty"

3) the value "test" exists in both X and Y, so the check should return true, meaning there is a bigger chance that the author of X is the same as the author of Y

How can I do that? I want to use it together with operators like Decision Tree, k-NN, etc.

rfuentealba · June 2019

Hi @Serek91,

What does your data look like? Would you mind sharing a small example?

There can be many ways to do this, but it all depends on what your data looks like.

Here is a picture of what I'm thinking:

...and here is the XML code for that operation.

<?xml version="1.0" encoding="UTF-8"?><process version="9.3.000">

</context>

</operator>

</list>

<description align="center" color="transparent" colored="false" width="126">With the De-Pivot operator, a list of words is obtained together with its nominal index from where was the word obtained.</description>

</operator>

<description align="center" color="transparent" colored="false" width="126">We use the Multiply operator so that we can prepare the case.</description>

</operator>

</list>

<description align="center" color="transparent" colored="false" width="126">A simple inner join by words can show us what words are common among authors.</description>

</operator>

</list>

<description align="center" color="transparent" colored="false" width="126">The Join gave us that author A is the same as author A. We will compare each attribute and mark it as &quot;Same&quot;...</description>

</operator>

</list>

<description align="center" color="transparent" colored="false" width="126">...so that we can filter these repeated similarities.</description>

</operator>

<description align="center" color="transparent" colored="false" width="126">Finally, we select only the attributes we need.</description>

</operator>

</process>

</operator>

</process>

rfuentealba · June 2019

Hi @Serek91,

This process has a problem, though. Since the Join gave us this:

Chéjov == Dostoievski
Dostoievski == Chéjov.

You can do something to eliminate those mirrored duplicates. I used Generate Attributes to create an attribute that says KEEP if the first author is less than the second (so Chéjov is less than Dostoievski, because it begins with C and C < D) and DELETE if the first author is greater than the second (Dostoievski is greater than Chéjov because D > C). This is the corrected process:

<?xml version="1.0" encoding="UTF-8"?><process version="9.3.000">

</context>

</operator>

</list>

<description align="center" color="transparent" colored="false" width="126">With the De-Pivot operator, a list of words is obtained together with its nominal index from where was the word obtained.</description>

</operator>

<description align="center" color="transparent" colored="false" width="126">We use the Multiply operator so that we can prepare the case.</description>

</operator>

</list>

<description align="center" color="transparent" colored="false" width="126">A simple inner join by words can show us what words are common among authors.</description>

</operator>

</list>

<description align="center" color="transparent" colored="false" width="126">The Join gave us that author A is the same as author A. We will compare each attribute and mark it as &quot;Same&quot;...</description>

</operator>

</list>

<description align="center" color="transparent" colored="false" width="126">...so that we can filter these repeated similarities.</description>

</operator>

<description align="center" color="transparent" colored="false" width="126">Finally, we select only the attributes we need.</description>

</operator>

</process>

</operator>

</process>
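
Outside RapidMiner, the same idea (de-pivot the word columns, join the result with itself on the word, and keep only the pairs where the first author sorts before the second) might be sketched in Python like this. It is only an illustration under assumptions: the toy data, the third author "Tolstoi", and the function names are made up for the example and are not part of the process above.

```python
# Hypothetical toy data: author -> most-used words (not taken from a real file).
rows = {
    "Chéjov": ["test", "example"],
    "Dostoievski": ["qwerty", "test"],
    "Tolstoi": ["example", "other"],
}


def depivot(rows):
    """Turn the wide rows into (author, word) pairs, like a De-Pivot step."""
    return [(author, word) for author, words in rows.items() for word in words]


def shared_words(rows):
    """Self-join the (author, word) pairs on the word and keep each pair once."""
    pairs = depivot(rows)
    result = set()
    for left_author, left_word in pairs:
        for right_author, right_word in pairs:
            # Inner join on the word. The "left < right" test is the KEEP rule:
            # it drops self-matches (A == A) and mirrored duplicates (B == A).
            if left_word == right_word and left_author < right_author:
                result.add((left_author, right_author, left_word))
    return sorted(result)


matches = shared_words(rows)
for first, second, word in matches:
    print(f"{first} == {second} (shared word: {word})")
```

The comparison is plain string ordering, so it only serves to pick one of the two mirrored rows; it says nothing about the authors themselves.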

Hope this helps,

Rodrigo.

Serek91 · June 2019

Hi, my model is attached (I can't add images).

The content of my CSV looks like this:

id, author_id, characters_number, words_number, average_sentence_length, average_word_length, unique_words_ratio, most_used_word_1, most_used_word_2, most_used_word_3, most_used_word_4
"100395", "1000866", "1640", "318", "44", "6", "0,6006289", "anyway", "really", "decided", "write"
"104212", "1000866", "1155", "230", "57", "6", "0,6173913", "we're", "almost", "scrub", "really"
"108960", "1000866", "1774", "336", "59", "6", "0,5119048", "because", "chris", "about", "people"
"111351", "1000866", "1034", "192", "47", "6", "0,6666667", "really", "peter", "because", "happy"

EDIT: A few words about the purpose of this:

I'm writing my master's thesis. I want to check the impact of each attribute on the end result: does it improve accuracy or not? I also want to know which attribute, used alone for training (without the others), gives the best accuracy. And I'm checking this for different operators (k-NN, Decision Tree, etc.).

sgenzer · June 2019

@Serek91 I have boosted your profile. Now you can post images.

Scott

Serek91 · June 2019

Thanks, but it is probably not exactly what I'm looking for. Still, this kind of additional knowledge can always be helpful.

My process looks like:

Inside each Cross Validation operator I have:

The training operator differs each time: it can be Naive Bayes, Naive Bayes (Kernel), Decision Tree or k-NN. The rest is the same.

Example of my CSV:

id, author_id, characters_number, words_number, average_sentence_length, average_word_length, unique_words_ratio, most_used_word_1, most_used_word_2, most_used_word_3, most_used_word_4

"100395", "1000866", "1640", "318", "44", "6", "0,6006289", "anyway", "really", "decided", "write"
"108960", "1000866", "1774", "336", "59", "6", "0,5119048", "decided", "chris", "really", "people"

"111351", "1000866", "1034", "192", "47", "6", "0,6666667", "really", "peter", "because", "happy"

"110248", "1011289", "3938", "723", "78", "6", "0,4979253", "there", "cordy", "another", "hours"
"114290", "1011289", "1777", "328", "77", "6", "0,6128049", "jacen", "talking", "about", "they"
"116160", "1011289", "1777", "348", "93", "6", "0,5545977", "about", "really", "write", "ending"

"100209", "1011311", "3135", "598", "111", "6", "0,4598662", "remember", "really", "about", "think"
"104488", "1011311", "1027", "196", "79", "6", "0,6479592", "lives", "worry", "control", "melody"
"105743", "1011311", "1261", "243", "97", "6", "0,5884774", "little", "right", "think", "drivers"

Each post has a unique ID, and author_id is the label. I don't want to train my model using only conditions like most_used_word_1 === most_used_word_1_from_another_row, most_used_word_2 === most_used_word_2_from_another_row, etc.

For the words I want to have something like this:

1) Take the test row with ID 100395

2) For each of its words, check how many times the word appears for each author in the remaining rows (no matter whether it is in column 1, 2, 3 or 4)

2a) word "anyway"

- no match

0 probability for each author

2b) word "really"

- used in ID 108960 (the same author)

- used in ID 111351 (the same author)

- used in ID 116160 (author 1011289)

- used in ID 100209 (author 1011311)

50% probability for the same author (1000866), 25% each for 1011289 and 1011311

2c) word "decided"

- used in ID 108960 (the same author)

100% probability that it is the author with ID 1000866

2d) word "write"

- used in ID 116160 (author 1011289)

100% probability that it is the author with ID 1011289

And this additional check should run together with the training operator inside Cross Validation.

But I'm not sure whether it makes any sense to check it this way ^^
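
The per-word check described in the steps above could be sketched in plain Python roughly as follows. This is only a rough illustration and not a RapidMiner operator: the rows are copied from the CSV sample in this post, while the names `ROWS` and `word_votes` are made up for the example.

```python
from collections import Counter

# (post id, author id, most used words), copied from the CSV sample above.
ROWS = [
    ("100395", "1000866", ["anyway", "really", "decided", "write"]),
    ("108960", "1000866", ["decided", "chris", "really", "people"]),
    ("111351", "1000866", ["really", "peter", "because", "happy"]),
    ("110248", "1011289", ["there", "cordy", "another", "hours"]),
    ("114290", "1011289", ["jacen", "talking", "about", "they"]),
    ("116160", "1011289", ["about", "really", "write", "ending"]),
    ("100209", "1011311", ["remember", "really", "about", "think"]),
    ("104488", "1011311", ["lives", "worry", "control", "melody"]),
    ("105743", "1011311", ["little", "right", "think", "drivers"]),
]


def word_votes(test_id, rows):
    """For each word of the test row, return the share of matching rows per author."""
    test_words = next(words for post_id, _, words in rows if post_id == test_id)
    # Compare against every other row, ignoring which column the word is in.
    others = [(author, set(words)) for post_id, author, words in rows
              if post_id != test_id]
    votes = {}
    for word in test_words:
        counts = Counter(author for author, words in others if word in words)
        total = sum(counts.values())
        # No match gives an empty dict, i.e. zero probability for every author.
        votes[word] = {author: n / total for author, n in counts.items()}
    return votes


votes = word_votes("100395", ROWS)
for word, shares in votes.items():
    print(word, shares)
```

Inside a cross validation, the rows compared against would have to come from the training fold only; otherwise the test rows leak into the vote.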

SGolbert · June 2019

Hi @Serek91

Before trying to answer your question, I want to ask: do you only have these 4 attributes, or do you also have access to the word vectors or the raw texts? I think you are trying to predict under the assumption that these attributes have good predictive power, which may well not be the case.

I would definitely try to get the word vectors and try out different supervised classification algorithms (best with Auto Model).

Regards,

Sebastian

Serek91 · June 2019

I have no idea what I'm doing ^^ And I don't have much experience with RapidMiner. I'm just trying to use different text properties (number of words in a sentence, sentence length, total % of unique words, etc.) that could increase the chance of finding the correct author. All properties are calculated in C#, then I generate a CSV to use in RapidMiner.
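
For illustration, properties of this kind might be computed roughly like this in Python. It is a sketch under assumptions and not the actual C# code: the tokenization, the sentence splitting, measuring sentence length in words, and the `min_word_length` cut-off for the most used words are all guesses.

```python
import re
from collections import Counter


def text_features(text, top_n=4, min_word_length=4):
    """Compute simple stylometric properties of one text (assumed definitions)."""
    # Assumption: a word is a run of letters, optionally with inner apostrophes.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    # Assumption: sentences end with '.', '!' or '?'.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Assumption: very short words are skipped when ranking the most used words.
    long_words = [w for w in words if len(w) >= min_word_length]
    most_used = [w for w, _ in Counter(long_words).most_common(top_n)]
    return {
        "characters_number": len(text),
        "words_number": len(words),
        "average_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "average_word_length": sum(map(len, words)) / len(words) if words else 0.0,
        "unique_words_ratio": len(set(words)) / len(words) if words else 0.0,
        "most_used_words": most_used,
    }


features = text_features("Really good. Really really nice day!")
print(features)
```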

I have raw texts from this set: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm

But maybe checking the most used words and comparing them in the way I described is too hard for me. I just want to pass this master's thesis ^^

Altair RapidMiner Community
