Distance Computation

07-23-2008 08:43 AM

7 REPLIES


07-23-2008 01:01 PM


Hello and welcome to RapidMiner

**Short answer:**

Mixed Euclidean Distance. Why? It is the commonly used metric.

**Long answer:**

Selecting a metric means you define whether two Examples are similar(*) or not. Which metric is "right" is almost a philosophical discussion. The metric strongly influences the subsequent learning operations, so choosing the right one is crucial. This picture illustrates the similarity problem:

*You know what similar is when you see it!* But how do you define it mathematically?

Okay, seriously:

All the available metrics have different properties, and choosing the right one depends on the data the metric is for. So, given the current state of information, we cannot make a wise suggestion, and listing all properties of all metrics is something I do not think I can or will do ;D. But the Mixed Euclidean Distance works for the general case.

greetings

Steffen

(*) Strictly speaking, "similarity" is not the same as "metric" in the literature. I use the term here to ease the explanation.
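To make the idea of a "mixed" Euclidean distance concrete, here is a minimal Java sketch. It assumes one common definition: numeric attributes contribute their squared difference, and nominal attributes contribute 0 when the values match and 1 when they differ. The class `MixedEuclideanSketch` is a hypothetical name for illustration; it is not the RapidMiner `MixedEuclideanDistance` class and may differ from it in details such as missing-value handling.

```java
import java.util.Objects;

// Illustrative sketch only: NOT RapidMiner's MixedEuclideanDistance.
final class MixedEuclideanSketch {

    // Assumed definition: numeric attributes add the squared difference,
    // nominal attributes add 0 if equal and 1 if different.
    static double distance(Object[] x, Object[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Examples must have the same number of attributes");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            if (x[i] instanceof Number && y[i] instanceof Number) {
                double d = ((Number) x[i]).doubleValue() - ((Number) y[i]).doubleValue();
                sum += d * d;
            } else if (!Objects.equals(x[i], y[i])) {
                sum += 1.0;
            }
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        Object[] a = {1.0, 2.0, "red"};
        Object[] b = {4.0, 6.0, "blue"};
        // Two numeric attributes and one nominal mismatch: sqrt(9 + 16 + 1)
        System.out.println(distance(a, b));
    }
}
```

Because a nominal mismatch always counts as 1, numeric attributes on a large scale can dominate the result, which is one reason the choice of metric depends so much on the data.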


07-23-2008 04:34 PM


07-24-2008 01:33 AM


Oh, sorry.

Take a look at the class *com.rapidminer.operator.similarity.SimilarityUtil*.

There it is possible to create a *SimilarityMeasure* using *resolveSimilarityMeasure*. The distance/similarity is then calculated using the method *similarity(String x, String y)* in the class *SimilarityMeasure*.

Hope this is the information you are looking for.

greetings

Steffen


07-24-2008 03:40 AM


It's not my thread, but I would like to ask something along the way.

In similarity(String x, String y), x and y are IDs of the examples. But the examples must be in the same exampleSet: the exampleSet passed to the similarity init method.

The only option I see for computing the similarity between two examples from two different exampleSets is to merge both exampleSets. Is there any other way?

Thanks in advance.

F.J. Cuberos


07-24-2008 07:26 AM


Hello,

In cases where the similarity measure extends "AbstractValueBasedSimilarity", you can cast to this class and then access methods like:

`public double similarity(Example x, Example y);`

This method does not rely on the init example set and on IDs at all.
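The cast described above can be sketched in self-contained Java. All types here (`IdBasedMeasure`, `ValueBasedMeasure`, `InverseEuclidean`) are hypothetical stand-ins that only mirror the shape described in this thread; they are not the RapidMiner classes, and the similarity formula 1 / (1 + distance) is an arbitrary choice for the demo.

```java
import java.util.Map;

// Hypothetical stand-ins only; these are NOT the RapidMiner classes.
final class CastSketch {
    public static void main(String[] args) {
        // The measure is initialized with a single example set.
        IdBasedMeasure measure = new InverseEuclidean(Map.of("a", new double[] {0.0, 0.0}));

        // An example from a different set is unknown to the ID-based method,
        // but the value-based overload accepts it after a cast.
        double[] fromOtherSet = {3.0, 4.0};
        if (measure instanceof ValueBasedMeasure) {
            ValueBasedMeasure valueBased = (ValueBasedMeasure) measure;
            System.out.println(valueBased.similarity(new double[] {0.0, 0.0}, fromOtherSet));
        }
    }
}

interface IdBasedMeasure {
    double similarity(String idX, String idY);
}

abstract class ValueBasedMeasure implements IdBasedMeasure {
    private final Map<String, double[]> initSet;

    ValueBasedMeasure(Map<String, double[]> initSet) {
        this.initSet = initSet;
    }

    // ID-based lookup: both IDs must belong to the set given at construction.
    @Override
    public double similarity(String idX, String idY) {
        double[] x = initSet.get(idX);
        double[] y = initSet.get(idY);
        if (x == null || y == null) {
            throw new IllegalArgumentException("Unknown example ID");
        }
        return similarity(x, y);
    }

    // Value-based overload: independent of the init set and of IDs.
    abstract double similarity(double[] x, double[] y);
}

final class InverseEuclidean extends ValueBasedMeasure {
    InverseEuclidean(Map<String, double[]> initSet) {
        super(initSet);
    }

    @Override
    double similarity(double[] x, double[] y) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return 1.0 / (1.0 + Math.sqrt(sum)); // demo formula, an assumption
    }
}
```

The point of the pattern is that the value-based overload takes the examples themselves, so no merge of the two example sets is needed.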

About the IDs (this also connects to the other thread): the original author intended the IDs to make the measures easier to access and to avoid recalculations. However, it turned out that this does not hold, at least not for larger data sets, and that there are other constraints, such as the problem with different example sets that you mentioned. For that reason, we decided to revise the similarity calculations, and we have already started by using the KNN learner as an example. This revision will definitely be finished by the next release, and then it will be easier to access the similarity measures than it is now (although the method above should work...)

Cheers,

Ingo


07-30-2008 03:19 AM


Hello,

The Euclidean distance in WEKA

weka.clusterers.forOPTICSAndDBScan.DataObjects.EuclidianDataObject

and the Euclidean distance in Rapid-I

com.rapidminer.operator.similarity.attributebased.MixedEuclideanDistance

do not give the same distance value. I reviewed them, and they are implemented in the same way.

Is the way the dataset is represented (in Rapid-I as a double array [datamanagement = double_array]) related to this difference?

--

Motaz K. Saad


07-31-2008 12:48 PM


Hi,

hmm, the data representation could be an explanation, but I do not actually believe it. In Weka, data is always represented as double. Missing values could also make a difference, of course, but you probably checked that. It could also be a bug in one of the implementations.

However, I just tried it inside RapidMiner on a very simple example:

    <operator name="Root" class="Process" expanded="yes">
        <operator name="ExampleSetGenerator" class="ExampleSetGenerator">
            <parameter key="number_examples" value="2"/>
            <parameter key="number_of_attributes" value="2"/>
            <parameter key="target_function" value="sum"/>
        </operator>
        <operator name="ExampleSet2Similarity" class="ExampleSet2Similarity">
        </operator>
    </operator>

containing only the two examples

| Att1 | Att2 |
| --- | --- |
| 2.467612009982549 | 7.2671269538811885 |
| 1.2924127751628518 | -8.624734314791924 |

This process delivered the correct Euclidean distance of 15.935254871644613.

Cheers,

Ingo
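The reported value can be checked independently of either tool. The following is a minimal plain-Java sketch (the class name `EuclideanCheck` is made up for illustration) that computes the Euclidean distance between the two examples listed above. The printed value should agree with 15.935254871644613 up to floating-point rounding.

```java
// Stand-alone check of the distance between the two generated examples.
final class EuclideanCheck {

    // Plain Euclidean distance: square root of the summed squared differences.
    static double euclidean(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            double d = x[i] - y[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] first = {2.467612009982549, 7.2671269538811885};
        double[] second = {1.2924127751628518, -8.624734314791924};
        System.out.println(euclidean(first, second));
    }
}
```

Running the same two vectors through the other implementation is a quick way to see which side deviates, for example because of normalization or missing-value handling.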