How can I do the following using RapidMiner Studio?
I have a dataset with several columns (all text). Each row must be compared to every other row of the same dataset, and I need the similarity between some of the text fields. One of the columns (column x) is the information I'm trying to "predict" through text similarity. That is, if rows 1 and 2 of my dataset are very similar, they should have the same value in column x. And if rows 1 and 3 are not similar, they should have different values in column x. How can I, then:
- once I get the similarity score, relate this similarity to column x content?
- get accuracy (and other metrics) of this relation?
Thank you very very much!
I'm not completely clear on what you need based on your explanation, but I think the Cross-Distances operator does what you are describing. You may need to convert your text data into nominal data type first. Take a look at that operator and its tutorial process and see if it is what you want.
Yes, I believe this can help. But how can I extract validation measures (accuracy, for example) when I relate the distances found (% similarity) to column x?
Assuming column x of my dataset contains color information, part of the resulting similarity table is:
I can generate the table above using Data to Similarity Data or Cross Distances, except that the COLORS information is shown as IDs for FirstRow and SecondRow.
What I want now is this: given that the color information is in my dataset and I have another table with the similarities (like the one produced by the Data to Similarity operator), how can I relate the similarities to the 'colors' column of the dataset and validate this relation?
From this example, we can assume that when the colors are the same, the similarity tends to be higher. But there are some outliers, such as Grey and Grey, which got a low similarity score, and Pink and Black, which are different but got a high similarity score.
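One way to make the question above concrete, sketched in plain Python rather than as a RapidMiner process: look up column x for both rows of every pair, treat "similarity at or above some threshold" as a prediction that the two rows share the same value, and count how often that prediction is right. Everything below is an assumption for illustration: the row IDs, the colors, the similarity values and the 0.5 threshold are made up, and `evaluate` is a hypothetical helper. In RapidMiner the equivalent would presumably be joining the similarity table back to the dataset on the row ID for both FirstRow and SecondRow, then comparing the two color attributes.

```python
# Hypothetical toy data: row id -> value of column x (a color).
colors = {1: "Grey", 2: "Grey", 3: "Pink", 4: "Black", 5: "Pink"}

# Hypothetical pairwise similarities (first row id, second row id, similarity),
# shaped like the table a similarity operator returns. The values are made up.
pairs = [
    (1, 2, 0.20), (1, 3, 0.10), (1, 4, 0.15), (1, 5, 0.05), (2, 3, 0.10),
    (2, 4, 0.30), (2, 5, 0.20), (3, 4, 0.85), (3, 5, 0.90), (4, 5, 0.25),
]


def evaluate(pairs, labels, threshold):
    """Treat 'similarity >= threshold' as predicting that two rows share column x."""
    tp = fp = tn = fn = 0
    for first, second, similarity in pairs:
        actual_same = labels[first] == labels[second]
        predicted_same = similarity >= threshold
        if predicted_same and actual_same:
            tp += 1  # similar and same value in column x
        elif predicted_same:
            fp += 1  # similar but different values (like Pink vs Black)
        elif actual_same:
            fn += 1  # not similar but same value (like Grey vs Grey)
        else:
            tn += 1  # not similar and different values
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "confusion": {"tp": tp, "fp": fp, "tn": tn, "fn": fn},
    }


result = evaluate(pairs, colors, threshold=0.5)
print(result)
```

The threshold is a free choice here. Running `evaluate` over a range of thresholds shows how sensitive the metrics are to it, and because most pairs usually have different values in column x, precision and recall tend to say more than accuracy alone.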
Sorry if this is confusing :s
There are many different distance metrics that are available for these operators. Are you sure you have selected a suitable one for text string comparisons? Perhaps the operator "Generate Levenshtein Distance" (part of the free Operator Toolbox extension, you might need to add it first) would be more suitable? There the distance metric is based on the number of character edits (insertions, deletions and substitutions) required to turn one string value into another. If you post a sample data file and your xml process it would be easier to investigate.
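For reference, the Levenshtein idea can be sketched in a few lines of Python. This is a generic textbook version for illustration only, not the code behind the Operator Toolbox operator; `levenshtein` and `normalized_similarity` are hypothetical helper names, and dividing by the longer string's length is just one common way to turn the distance into a 0 to 1 similarity.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if the characters match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a: str, b: str) -> float:
    """Scale the distance to a 0..1 similarity (1.0 means identical strings)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


print(levenshtein("kitten", "sitting"))
print(normalized_similarity("grey", "gray"))
```

Because the distance works character by character, it suits short strings such as names or codes; for longer texts a token-based measure is often a better fit.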
I guess Levenshtein isn't suitable because my text strings are longer. Right now my question is not about which distance metric I should use; I'm more concerned with relating one column of my dataset to the similarity results in the ResultSet I get from running the similarity algorithms, so that I can validate whatever algorithm I apply.
There is a column in the dataset that I'm trying to predict, but in this solution the "prediction" depends entirely on text similarity. That is what I'm trying to validate: whether high text similarity gives me the same content in column x, while low text similarity gives me different content in column x, considering that column x is text too.
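Read this way, the setup resembles a nearest-neighbour check: for every row, take column x from the most similar other row as the "prediction" and compare it with the row's real value. Below is a minimal Python sketch under loudly stated assumptions: the texts and colors are invented, the similarity is a simple Jaccard overlap of word sets (swapped in only because it handles longer strings; any similarity measure could take its place), and `tokens`, `jaccard` and `leave_one_out_predictions` are hypothetical helpers.

```python
# Hypothetical rows: (text field, column x). The content is made up.
rows = [
    ("the sky is grey and cloudy today", "Grey"),
    ("grey clouds cover the sky today", "Grey"),
    ("pink flowers bloom in the garden", "Pink"),
    ("the garden has pink roses in bloom", "Pink"),
    ("a black cat sat on the mat", "Black"),
    ("the black cat sleeps on a mat", "Black"),
]


def tokens(text):
    """Lower-case the text and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Share of words two token sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leave_one_out_predictions(rows):
    """For each row, predict column x from its most similar other row."""
    token_sets = [tokens(text) for text, _ in rows]
    predictions = []
    for i in range(len(rows)):
        # Needs at least two rows; the row itself is excluded from the search.
        best = max(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: jaccard(token_sets[i], token_sets[j]),
        )
        predictions.append(rows[best][1])
    return predictions


predictions = leave_one_out_predictions(rows)
actual = [label for _, label in rows]
accuracy = sum(p == a for p, a in zip(predictions, actual)) / len(rows)
print(predictions, accuracy)
```

The accuracy computed this way measures how well "most similar text" recovers column x, which is a direct way to compare one similarity measure against another on the same dataset.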
Thank you very much,