document management and P2P

IngoRM · May 2008

Original messages from SourceForge forum at http://sourceforge.net/forum/forum.php?thread_id=1691354&forum_id=390413

Hi All, hi rrudolph,

(This thread follows the discussion begun in 'P2P proposal')

Now, concerning document management: I am very keen on these features, and I was a groupware manager for a year. If you look at the state-of-the-art techniques, you can distinguish "ready-to-comment" systems from "rigorous-procedures" systems. In the first category you have blogs, forums, wikis, peer-to-peer, etc. In the second category you have groupware, social bookmarking, databases, workflows, etc. The choice depends mainly on the "reference" value one wants to give to one's data: the more often you have to refer to a set of data, the more that data has to be guaranteed (quality/duration), and the more time you spend sorting it out and publishing it.
Managing '.xml' examples seems a very good idea, and it should not be limited to beginners: if such a repository is kept well up to date, it can become useful to experts too, but who knows? As they are examples, they have value but cannot be considered a reference. So I would 'map' example management somewhere between 'ready-to-comment' and 'rigorous-procedure' systems, that is to say a Wiki, Peer-to-Peer, or social bookmarking.

It would be interesting to have an information retrieval mechanism, a bit like in StumbleUpon (http://www.stumbleupon.com). For instance, you design a Tree, and the machine tells you: "Okay, this guy whose name is X is closest to what you are doing" or "the tree/block you have been designing looks like example N°Y: want to see it?"

Any comments ?
Cheers,
Jean-Charles.

Answer by Ingo Mierswa:

Hello Jean-Charles,

setting up a Wiki has been on our todo-list for some months now. Unfortunately, I have never had the time to set it up myself. I am away for the next few days, but after I am back I will ask one of our students if he can set up a Wiki for YALE. Managing example .xml files in a Wiki is certainly a good first step and would lead to something like the NSIS code base:

http://nsis.sourceforge.net/Category:Code_Examples

> It would be interesting to have an information retrieval mechanism, a bit like in StumbleUpon
> (http://www.stumbleupon.com). For instance, you design a Tree, and the machine tells you: "Okay, this guy whose name is
> X is closest to what you are doing" or "the tree/block you have been designing looks like example N°Y: want
> to see it?"

This would indeed be fascinating and also a great service. However, this goes far beyond a Wiki and would need some sort of collaborative working component inside of YALE in order to be user-friendly enough. Our todo-list is completely full until the next release, but maybe we will find enough time after the release. Of course, we also highly appreciate any community help on these points.

Cheers,
Ingo

Answer by Jean-Charles:

Hello Robert, Hello Ingo,

I dare not imagine the problem of a distance/similarity measure between trees. I must admit I am a bit of a dreamer... ;-)
If you know of any theoretical background on how to compare two or more trees, send me the URLs and I will have a look...

Cheers,
Jean-Charles.

Answer by Ingo:

Hello Jean-Charles,

searching for "tree distance" in Google brings up some interesting papers:

http://www.google.de/search?hl=de&q=tree+distance&btnG=Google-Suche&meta=

In that case, however, I would suggest replacing the concrete operator at each node either by its group or by a description of its input and output types.
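A minimal sketch of what such a comparison could look like, assuming each node is already labelled with its operator group rather than its concrete operator. The `Node` class, the function names and the group labels used in the example are illustrative assumptions, not part of YALE, and the distance shown is a restricted top-down form of tree edit distance (relabel a node, insert or delete whole subtrees), not the general algorithm from the literature.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str  # operator group (assumed), not the concrete operator class
    children: list = field(default_factory=list)


def size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(size(child) for child in node.children)


def tree_distance(a, b):
    """Top-down ordered tree distance.

    Cost 1 to relabel a node; inserting or deleting a subtree costs its size.
    Children are aligned with a sequence edit distance over subtrees.
    """
    cost = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    # d[i][j]: cost of turning the first i children of a into the first j of b
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(a.children[i - 1]),  # delete subtree
                d[i][j - 1] + size(b.children[j - 1]),  # insert subtree
                d[i - 1][j - 1]
                + tree_distance(a.children[i - 1], b.children[j - 1]),
            )
    return cost + d[m][n]


def validation(learner_group):
    """Hypothetical validation subtree: a learner plus a performance step."""
    return Node("Validation", [Node(learner_group), Node("Validation")])


# Group labels below are assumptions made for the example.
a = Node("Process", [Node("IO.examples"), validation("Learner.supervised.lazy")])
b = Node("Process", [Node("IO.examples"),
                     Node("Preprocessing.data.filter"),
                     validation("Learner.supervised.lazy")])
c = Node("Process", [Node("IO.examples"), validation("Learner.supervised.trees")])

print(tree_distance(a, a), tree_distance(a, b), tree_distance(a, c))
```

Because the labels are groups, two experiments that differ only in which operator of the same group they use end up at distance 0, which is the effect of the substitution suggested above.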

Cheers,
Ingo

Answer by Jean-Charles:

Hi Ingo, Hi All,

I am reviving this thread since I have found something simpler than tree distances, while another user spoke of P2P... The idea, taking into account what has been said in this thread, is to generate an "experiment vector" to be compared with other experiment vectors, rather than comparing the trees themselves. The information will be poorer, maybe, but it could be simpler to test for a first run.

Thus, for each experiment, there is a vector model with a few attributes, in two groups:
- The first one, dealing with the content of the experiment
- The second one, dealing with the "allure" of the tree design

---------

For the second group, use "software quality metrics" to get an estimate of the experiment's shape. For example, one attribute of the second group could be "cyclomatic complexity", see http://en.wikipedia.org/wiki/Cyclomatic_complexity
The idea is to find other "pattern metrics" by digging into the "software quality" literature...
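As a rough illustration of what "shape" attributes could look like, here is a sketch that computes a few simple structural measures from an experiment's XML. These measures (operator count, leaf count, nesting depth, maximum branching) are hypothetical stand-ins chosen for the example; they are not cyclomatic complexity, and the function name `shape_metrics` is an assumption. The XML is condensed from the 04_XVal sample quoted in this thread, with parameters left out.

```python
import xml.etree.ElementTree as ET

# Condensed from the 04_XVal sample in this thread (parameters omitted).
SAMPLE = """<operator name="Root" class="Process">
  <operator name="ExampleSource" class="ExampleSource"/>
  <operator name="MissingValueReplenishment" class="MissingValueReplenishment"/>
  <operator name="XValidation" class="XValidation">
    <operator name="NearestNeighbors" class="NearestNeighbors"/>
    <operator name="OperatorChain" class="OperatorChain">
      <operator name="ModelApplier" class="ModelApplier"/>
      <operator name="ClassificationPerformance" class="ClassificationPerformance"/>
    </operator>
  </operator>
</operator>"""


def shape_metrics(op):
    """Return simple shape measures for an <operator> element and its subtree."""
    children = op.findall("operator")  # direct child operators only
    if not children:
        return {"operators": 1, "leaves": 1, "depth": 1, "max_branching": 0}
    sub = [shape_metrics(child) for child in children]
    return {
        "operators": 1 + sum(s["operators"] for s in sub),
        "leaves": sum(s["leaves"] for s in sub),
        "depth": 1 + max(s["depth"] for s in sub),
        "max_branching": max(len(children),
                             max(s["max_branching"] for s in sub)),
    }


print(shape_metrics(ET.fromstring(SAMPLE)))
```

Each measure could become one attribute of the second group; which ones are actually discriminative would have to be tested on real experiments.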

---------

For the first group, the idea is to reuse the "operators' taxonomy", flatten it, and for each item record the number of operators the experiment uses from that subgroup. For instance, loading an example set and running a KNN on it through XVal would give the following XML (from sample 04_XVal):
<operator name="Root" class="Process" expanded="yes">
    <description text="see_samples"/>
    <operator name="ExampleSource" class="ExampleSource">
        <parameter key="attributes" value="../data/labor-negotiations.aml"/>
    </operator>
    <operator name="MissingValueReplenishment" class="MissingValueReplenishment">
        <list key="columns">
        </list>
    </operator>
    <operator name="XValidation" class="XValidation" expanded="yes">
        <parameter key="number_of_validations" value="5"/>
        <operator name="NearestNeighbors" class="NearestNeighbors">
        </operator>
        <operator name="OperatorChain" class="OperatorChain" expanded="yes">
            <operator name="ModelApplier" class="ModelApplier">
                <list key="application_parameters">
                </list>
            </operator>
            <operator name="ClassificationPerformance" class="ClassificationPerformance">
                <list key="class_weights">
                </list>
                <parameter key="classification_error" value="true"/>
            </operator>
        </operator>
    </operator>
</operator>

As a vector, this would give:
IO.examples = 1
Preprocessing.data.filter = 1
Validation = 2
Learner.supervised.lazy = 1
This kind of array should probably be stored in a "sparse" format...
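The counting step can be sketched as follows. The class-to-group table is an assumption made to reproduce the vector above (in particular, it guesses that the two "Validation" entries are XValidation and ClassificationPerformance); a real implementation would take the groups from the operator taxonomy itself. Operators missing from the table (Process, OperatorChain, ModelApplier) are simply skipped, and the dictionary result is already a sparse representation since absent groups are implicit zeros.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical class -> group table, chosen to match the vector in the post.
OPERATOR_GROUPS = {
    "ExampleSource": "IO.examples",
    "MissingValueReplenishment": "Preprocessing.data.filter",
    "XValidation": "Validation",
    "ClassificationPerformance": "Validation",
    "NearestNeighbors": "Learner.supervised.lazy",
}

# Condensed from the 04_XVal sample in this thread (parameters omitted).
SAMPLE = """<operator name="Root" class="Process">
  <operator name="ExampleSource" class="ExampleSource"/>
  <operator name="MissingValueReplenishment" class="MissingValueReplenishment"/>
  <operator name="XValidation" class="XValidation">
    <operator name="NearestNeighbors" class="NearestNeighbors"/>
    <operator name="OperatorChain" class="OperatorChain">
      <operator name="ModelApplier" class="ModelApplier"/>
      <operator name="ClassificationPerformance" class="ClassificationPerformance"/>
    </operator>
  </operator>
</operator>"""


def experiment_vector(xml_text, groups=OPERATOR_GROUPS):
    """Count the operators of an experiment per group, as a sparse dict."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for op in root.iter("operator"):  # root element and all nested operators
        group = groups.get(op.get("class"))
        if group is not None:  # operators without a known group are skipped
            counts[group] += 1
    return dict(counts)


print(experiment_vector(SAMPLE))
```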

--------

In the end, such an "experiment vector" should be:
- processed like a text matrix (since XML is a formal grammar and an experiment is a text), so that experiments of different sizes can be compared (a kind of normalization): OperatorGroup frequencies, occurrences, TF-IDF, etc.
- stored in a sparse format
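A small sketch of the text-matrix treatment, assuming experiment vectors are kept as sparse dictionaries of group counts. It uses one common TF-IDF variant (term frequency normalized by the experiment's size, inverse document frequency as log(N/df)) and cosine similarity; other weightings would work as well. The function names and the second and third vectors, including the "Clustering" group, are made up for the example.

```python
import math


def tf_idf(vectors):
    """Weight sparse count vectors: tf = count / total, idf = log(N / df)."""
    n = len(vectors)
    df = {}
    for vec in vectors:
        for group in vec:
            df[group] = df.get(group, 0) + 1
    weighted = []
    for vec in vectors:
        total = sum(vec.values())
        weighted.append({
            group: (count / total) * math.log(n / df[group])
            for group, count in vec.items()
        })
    return weighted


def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b[g] for g, w in a.items() if g in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# First vector: the one from the post. The other two are hypothetical.
experiments = [
    {"IO.examples": 1, "Preprocessing.data.filter": 1,
     "Validation": 2, "Learner.supervised.lazy": 1},
    {"IO.examples": 1, "Validation": 2, "Learner.supervised.lazy": 1},
    {"IO.examples": 1, "Clustering": 1},
]
weights = tf_idf(experiments)
print(cosine(weights[0], weights[1]), cosine(weights[0], weights[2]))
```

With this weighting, a group used by every experiment (here IO.examples) gets weight zero, so similarity is driven by the groups that actually distinguish experiments.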

Was it clear enough ?
Cheers,
Jean-Charles.

Answer by Ingo:

Hi Jean-Charles,

> Was it clear enough ?

yes, it was! And a great idea indeed. For RapidMiner 5, we have several goals, and we will actually start working on them in a few weeks. Besides an optimized data core that better supports views and on-the-fly transformations, the major issue for RapidMiner 5 will be usability and user support. So bringing back this thread fits perfectly into our plans.

Here is a small excerpt of the things we plan to do first:

- Workspace views for managing projects / sets of processes
- Data repositories inside of the workspace so that data sources can easily be connected to processes without re-defining the data source parameters every time
- Improving the comment facility (following the discussion some time ago)
- Allowing own operator groups, group structures and operator tags
- Using this information to allow several views (for example "My Time Series Prediction Operators" vs. "Operators for Text Clustering")
- Giving more information about the data flows: the tree structure is great if you know what you are doing (and much more efficient to work with for the experienced analyst) but it makes things really hard for beginners...

Seeing the amount of work that has to be done, I can currently hardly imagine that we will manage to also add a collaborative plugin for RapidMiner right now. On the other hand, we are currently planning some bigger projects, and one of those is actually about support for collaborative analysis work, so we will see.

Besides that (you really like to dream, hum? :-)), the idea is still appealing, and I think the solution you described might actually work. So maybe there is a student out there who would like to develop something like this? We (Rapid-I) would really like to support this work, but right now we are full of plans and project work (hey, after all we are still a small company...).

So, is there anybody out there?

Cheers,
Ingo

Answer by Jean-Charles:

> - Workspace views for managing projects / sets of processes
> - Data repositories inside of the workspace so that data sources can easily be connected to processes without re-defining the data source parameters every time
> - Improving the comment facility (following the discussion some time ago)
> - Allowing own operator groups, group structures and operator tags
> - Using this information to allow several views (for example "My Time Series Prediction Operators" vs. "Operators for Text Clustering")
> - Giving more information about the data flows: the tree structure is great if you know what you are doing (and much more efficient to work with for the experienced analyst) but it makes things really hard for beginners...

Great, indeed !!
For the collaborative plugin, I cannot help that much, because all the attributes of an experiment vector have to be computed for each experiment. Once that is done and we have a collection of XP-vectors, running the analysis on it would be feasible... But I am not in a hurry, indeed.

Cheers,
Jean-Charles.

Edit by Jean-Charles:

Hi Ingo, Hi All,

About workspace, I have found this : http://www.teamwpc.co.uk/products/wps/features/workbench

Hope it helps,
Jean-Charles.

IngoRM · May 2008

Hi,

> About workspace, I have found this : http://www.teamwpc.co.uk/products/wps/features/workbench

thanks. Looks interesting, although I personally would prefer something on a more visual (and less textual) basis. But nothing is completely decided right now.

Thanks for sending in this hint and cheers,
Ingo
