meta data synchronization?

holger Member Posts: 42 Contributor II
edited November 2018 in Help
Hi,

in RM5 there's a check-box in the process menu, "Synchronize meta-data with real data". But even when it is checked, the meta-data is incorrect even in simple scenarios. E.g. I'm using a "Read from DB" operator to read 60,000 rows. There's a nominal attribute, and when opening the result set all values of this attribute are shown correctly. But when showing the tooltip by hovering over the output port, only 3 values of the attribute are shown.

Does the meta data system take the complete dataset into account or just the first N rows?

What's the effect of "Synchronize meta-data with real data"? Is meta-data processing completely switched off when this option is disabled?

What are these meta-data relations?

When writing operators, what do I have to do to update/extend meta-data information? Maybe there are some examples in the codebase which highlight the most typical use cases (especially with respect to meta-data transformation rules).
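
To make the question concrete: from what I can see in existing operators, the usual pattern seems to be to register a transformation rule on the ports in the operator's constructor. Here is a rough sketch from memory (class names and signatures are my best guess and may be off):

    import com.rapidminer.example.ExampleSet;
    import com.rapidminer.operator.Operator;
    import com.rapidminer.operator.OperatorDescription;
    import com.rapidminer.operator.ports.InputPort;
    import com.rapidminer.operator.ports.OutputPort;
    import com.rapidminer.operator.ports.metadata.AttributeMetaData;
    import com.rapidminer.operator.ports.metadata.ExampleSetMetaData;
    import com.rapidminer.operator.ports.metadata.ExampleSetPassThroughRule;
    import com.rapidminer.operator.ports.metadata.SetRelation;
    import com.rapidminer.tools.Ontology;

    public class MyOperator extends Operator {

        private final InputPort exampleSetInput = getInputPorts().createPort("example set", ExampleSet.class);
        private final OutputPort exampleSetOutput = getOutputPorts().createPort("example set");

        public MyOperator(OperatorDescription description) {
            super(description);
            // Pass the input meta data through to the output port and describe
            // the attribute this operator would add at execution time, so that
            // downstream operators can see it at design time already.
            getTransformer().addRule(new ExampleSetPassThroughRule(exampleSetInput, exampleSetOutput, SetRelation.EQUAL) {
                @Override
                public ExampleSetMetaData modifyExampleSet(ExampleSetMetaData metaData) {
                    metaData.addAttribute(new AttributeMetaData("my_new_attribute", Ontology.REAL));
                    return metaData;
                }
            });
        }
    }

Is this the intended way, or is there a simpler default I can fall back on?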

Best, Holger

Answers

  • holger Member Posts: 42 Contributor II
    I've already figured out the source-code reason for the problem of missing nominal values in the meta-data: it is line 218 in AbstractDataReader, which limits the number of values to 3. But I don't understand why this happens/needs to happen. The meta-data model of RM gets more and more mysterious to me the longer I look at it. Can someone of you enlighten me?

    -Holger
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Holger,
    if the number of nominal values inside the meta data is too high, the meta data transformation might take too long. Why exactly three was chosen I don't know; I have changed this so that it now uses the property that limits the number of nominal values in the meta data transformation.

    The synchronization should already work. Could you please post a simple sample process where it does not?

    Greetings,
    Sebastian
  • holger Member Posts: 42 Contributor II
    I'm sorry, but I still don't understand this. If there's an ExampleSet instance with a nominal attribute, you just have to get its mapping to obtain the set of all possible values. There's no calculation necessary to use the key set of the mapping as values for the meta-data. Why just 3 values?

    To give you a better understanding of my problem, here's my use case: I've written an operator which allows the user to select a value of a nominal attribute in the operator UI. To populate the list of options, my first guess was to use the meta-data of the input. But as the meta-data is incomplete without any (at least obvious) reason, this list always contains just 3 values. Is there a better solution for my problem?
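
    For illustration, this is roughly what the operator does to populate the list (a sketch against the meta-data API as I understand it; the method names are from memory and may be off):

        import java.util.Collections;
        import java.util.Set;

        import com.rapidminer.operator.ports.InputPort;
        import com.rapidminer.operator.ports.metadata.AttributeMetaData;
        import com.rapidminer.operator.ports.metadata.ExampleSetMetaData;
        import com.rapidminer.operator.ports.metadata.MetaData;

        public final class NominalOptionHelper {

            // Collect the nominal values the meta data knows for one attribute.
            // This is where the problem shows up: the returned set never holds
            // more than the 3 values kept during meta data propagation.
            public static Set<String> getNominalOptions(InputPort inPort, String attributeName) {
                MetaData md = inPort.getMetaData();
                if (md instanceof ExampleSetMetaData) {
                    AttributeMetaData amd = ((ExampleSetMetaData) md).getAttributeByName(attributeName);
                    if (amd != null && amd.isNominal()) {
                        return amd.getValueSet();
                    }
                }
                return Collections.emptySet();
            }
        }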

    -Holger
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    the reason for the restriction in the meta data is simple: if the number of nominal values gets too large, the meta data transformation often becomes as slow as performing the actual process (try Nominal to Binominal for large sets of values, or Nominal to Date - the parsing for millions of values becomes really slow...). The initial idea was that meta data is only available for data sets stored in the repository and will be propagated throughout the process. In your case, however, you seem to use an AbstractDataReader, which is used by the import operators, and hence the meta data - which would only be fully available in the repository - is simply not complete, for performance reasons. There is no guarantee (and there never will be one) that reading the meta data in import operators is possible at all, let alone efficient; the only case where this is guaranteed is loading the data from the repository. This is one of the main reasons why we introduced the repository in the first place, by the way.

    And the solution is as simple as the reason: don't use the import operators for anything other than importing data into the repository. From there, all meta data will be available. At least in principle, because for nominal values there is a property defining the number of values used within the meta data propagation. You can of course set this to something very large, but be aware of long waits until the propagation finishes...

    If there's an ExampleSet instance with a nominal attribute, you just have to get its mapping to obtain the set of all possible values. There's no calculation necessary to use the key set of the mapping as values for the meta-data.
    Yes, if the data set is available. In this case, you could use the mapping. And if not? Then you have to use the meta data, which, as I tried to explain above, is only guaranteed to be fully available for data sets in the repository.
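
    Roughly sketched, the two situations look like this (from memory and untested, so take the exact calls with a grain of salt):

        import java.util.Collections;
        import java.util.List;
        import java.util.Set;

        import com.rapidminer.example.Attribute;
        import com.rapidminer.example.ExampleSet;
        import com.rapidminer.operator.ports.metadata.AttributeMetaData;
        import com.rapidminer.operator.ports.metadata.ExampleSetMetaData;

        public final class ValueSources {

            // With the real data at hand, the nominal mapping is complete -
            // but it requires the data to be loaded first.
            public static List<String> valuesFromData(ExampleSet exampleSet, String attributeName) {
                Attribute attribute = exampleSet.getAttributes().get(attributeName);
                return attribute.getMapping().getValues();
            }

            // At design time there is only meta data, and its value set is
            // only guaranteed to be complete for data sets from the repository.
            public static Set<String> valuesFromMetaData(ExampleSetMetaData emd, String attributeName) {
                AttributeMetaData amd = emd.getAttributeByName(attributeName);
                return amd == null ? Collections.<String>emptySet() : amd.getValueSet();
            }
        }

    The first variant needs the data in memory; the second works at design time but only sees whatever value set the propagation kept.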


    I hope this makes things a bit clearer.

    Cheers,
    Ingo
  • holger Member Posts: 42 Contributor II
    Hi,

    thanks for your detailed explanation. However, the purpose of the meta-data is still not clear to me. If I cannot rely on it when implementing an operator (because I never know how the user has read her data), the only solution seems to be to NOT use meta-data at all.

    Putting everything into a repository first looks like an ugly workaround to me, for two reasons:
    1) RM doesn't enforce it.
    2) Data always emerges outside of RM, so getting it in is a crucial step when doing data mining. So every data-reader implementation should be written in such a way that the resulting example set is fully working/compatible with any RM operator. For instance, if I have my data in a database, I want to read from it directly (that's why I have it) without first dumping the table into some special data repository.
    Currently there are first-class data readers (the repository) and second-class readers with incorrect meta-data (AbstractReader subclasses), which is not clear to the user.

    BTW, it's not that I don't like the concept of putting data into a repository. It's a great tool which keeps data and process definitions well separated. :-) I just want to understand this meta-stuff better.

    If I remember correctly an earlier posting of yours, the idea of meta-data was to ease inter-operator communication. But as operators cannot rely on the meta-data, it seems better to ignore it, which is complicated because it needs to be updated in case of data transformations. By adding a second pathway (data, and now also meta-data), operator implementations seem to require much more effort. Or is there an easy way to adapt the meta-data according to what happens when applying an operator? Is there a meaningful default behavior for meta-data updates, or do I have to take care of this whenever implementing a new operator?

    Is it possible to disable meta-data processing completely, or do some operators rely on it?


    -Holger
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hi,

    If I cannot rely on it when implementing an operator (because I never know how the user has read her data), the only solution seems to be to NOT use meta-data at all.
    I am not sure whether you have ever worked with data sets containing several hundreds of millions of tuples? The fact that you said in http://rapid-i.com/rapidforum/index.php/topic,1635.0.html that you and your colleagues always work by adding a few operators, executing them, adding additional ones, etc. makes me believe that the data sets you usually work on are comparatively small. That's of course fine for you, and I would even go as far as to say that you are really lucky.

    Why am I asking this? Well, for smaller data sets the best solution certainly is to simply load the data and rely on the actual data for providing the meta data. For larger sets, or frequently changing sets derived from databases on a regular basis, or... this approach is not feasible at all. So welcome to the world of large-scale data mining  ;)

    Putting everything into a repository first looks like an ugly workaround to me, for two reasons:
    1) RM doesn't enforce it.
    Right. And why should it? RapidMiner works fine with directly loading from files or databases, and in fact this is necessary for ETL processes anyway (your second point). The meta data is only for supporting GUI elements, checks, quick fixes, etc. Nice things, but not necessary. However, in the new RapidMiner manual, which is about to be finished, using the repositories is urgently recommended. The same is already true for our video tutorials: the repositories are always used there.

    Data always emerges outside of RM, so getting it in is a crucial step when doing data mining. So every data-reader implementation should be written in such a way that the resulting example set is fully working/compatible with any RM operator. For instance, if I have my data in a database, I want to read from it directly (that's why I have it) without first dumping the table into some special data repository.
    Right. But this is the ETL setting I described above.

    Currently there are first-class data readers (the repository) and second-class readers with incorrect meta-data (AbstractReader subclasses), which is not clear to the user.
    The reasons should be clear from my post above. And as I said there: you do not have to use the repository; it only guarantees the nice features made possible by the meta data propagation. For really small data sets there is often no difference between the two options, and for ETL processes you might only get a subset of the features. Where is the problem?

    If I remember correctly an earlier posting of yours, the idea of meta-data was to ease inter-operator communication. But as operators cannot rely on the meta-data, it seems better to ignore it, which is complicated because it needs to be updated in case of data transformations. By adding a second pathway (data, and now also meta-data), operator implementations seem to require much more effort.
    That's exactly why we rely on meta data only and not on the data itself. OK, after reading that paragraph I am almost certain that you have never designed a process for, let's say, 1 billion tuples with lots of different nominal values. Do you really want to wait several hours before the process breaks on your client due to memory restrictions, simply because you wanted to load the data to access correct meta data in order to support your GUI? Certainly not. Relying on the data for this crucial aspect, like the Weka Knowledge Flow or KNIME do, is the wrong design decision in my opinion.

    Is it possible to disable meta-data processing completely, or do some operators rely on it?
    Yes, of course (look into the Process menu). But then you will lose the information in the elements of the graphical user interface, like attribute names etc.


    I don't want to sound harsh, but I am out of this discussion again, sorry. I am rather busy during the next weeks (although I really do like this kind of discussion), and I probably cannot say too much on the more technical questions here anyway. So maybe somebody else can jump in for those.

    Cheers,
    Ingo
  • holger Member Posts: 42 Contributor II
    Hi Ingo,
    I am not sure whether you have ever worked with data sets containing several hundreds of millions of tuples?
    Not yet. I thought it was a lot, but compared to hundreds of millions it's nothing. :-)
    works fine with directly loading from files or databases... The meta data is only for supporting GUI elements, checks, quick fixes, etc. Nice things, but not necessary.
    This is the information I was looking for; it makes the meta-data model much clearer to me, and I'm more confident now when writing operators. :-)


    Thank you very much for your enlightening explanation. It really helped me a lot to get a much better understanding of the meta-data concept.

    Best, Holger