
"Stream Database operator: metadata?"

camielcoenen Member Posts: 4 Contributor I
edited May 2019 in Help
Hi,

I am working with a large dataset (approx. 250,000 rows and 300+ columns) that is stored in a MySQL database table, and I would like to use the Stream Database operator to bring this dataset into a process. However, unlike the Read Database operator, the Stream Database operator doesn't output any metadata, which makes it impossible to configure operators such as Select Attributes in the steps that follow Stream Database. I am using RapidMiner 5.1.

Answers

  • Matthias Member Posts: 13 Contributor II
    Hi,

    I think none of the data import operators can provide the metadata in advance, because RM can only read the metadata once you start the process.
    The easiest way is to save the dataset to the repository with the Store operator. You then have fast access to the dataset through the Retrieve operator, and the metadata is always available.

    Greetings

    Matthias
  • camielcoenen Member Posts: 4 Contributor I
    Well, the "Read Database" operator does prepare the metadata, even when a process has not been started or run yet. The "Stream Database" operator does not. So why this difference? Yes, I can use the Store operator, but that is basically the same as using the "Read Database" operator. The "Stream Database" operator has the caching features I need.

    Greetings,

    Camiel
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    let me put it this way: do you use the Community Edition?

    Greetings,
      Sebastian
  • camielcoenen Member Posts: 4 Contributor I
    Yes, I do use the Community Edition. Does it make a difference in the case of the Stream Database operator?

    Thanks,

    Camiel
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    currently not, but as a Community Edition user you simply have to wait until someone has idle time to fix it. As an enterprise customer, your wishes would have a "little" bit more importance to us. Not to mention that we could hire more people to help us with the coding if you became an enterprise customer.
    Anyway, I think that handling large amounts of data will become an enterprise feature sooner or later. So I wouldn't bet that improvements to Stream Database will make it into the Community Edition.

    Greetings,
      Sebastian

  • camielcoenen Member Posts: 4 Contributor I
    Thanks,

    Is it a JDBC connection issue that needs to be fixed? The "Read Database" operator, on the other hand, is working fine.

    Nevertheless, I would like to know how to handle a large dataset in RapidMiner Community Edition. What kind of operators can be used to make the dataset more manageable? Are there tutorials or samples on how to do this?

    Greetings,

    Camiel
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    aggregate the data before loading it. Split the data set before loading it. Try to cluster on samples first where possible, then apply the result in batches...

    Well, everything depends on your problem. But the basic idea is to use only samples or batches where possible, or to compress the data even before loading it.

    Greetings,
      Sebastian
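
The advice above (aggregate, sample, or batch the data before it reaches the analysis tool) can be sketched outside RapidMiner in a few lines of Python. This is only an illustration under stated assumptions: the standard-library `sqlite3` module stands in for the MySQL database, and the table `measurements`, its columns, and all function names are hypothetical, not anything from the thread or from RapidMiner itself.

```python
import sqlite3
from typing import Iterator, List, Tuple


def make_demo_db(n_rows: int = 1000) -> sqlite3.Connection:
    """Build an in-memory table that stands in for the large MySQL table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurements (id INTEGER PRIMARY KEY, grp TEXT, value REAL)"
    )
    rows = [(i, "g%d" % (i % 4), float(i)) for i in range(n_rows)]
    conn.executemany("INSERT INTO measurements VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def aggregate_in_db(conn: sqlite3.Connection) -> List[Tuple[str, int, float]]:
    """Aggregate before loading: the database returns one row per group."""
    cur = conn.execute(
        "SELECT grp, COUNT(*), AVG(value) FROM measurements GROUP BY grp ORDER BY grp"
    )
    return cur.fetchall()


def sample_every_nth(conn: sqlite3.Connection, n: int) -> List[tuple]:
    """Load only a systematic sample: every n-th row by id."""
    cur = conn.execute(
        "SELECT id, grp, value FROM measurements WHERE id % ? = 0 ORDER BY id",
        (n,),
    )
    return cur.fetchall()


def iter_batches(conn: sqlite3.Connection, batch_size: int) -> Iterator[List[tuple]]:
    """Process in batches: yield at most batch_size rows at a time."""
    cur = conn.execute("SELECT id, grp, value FROM measurements ORDER BY id")
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        yield batch


if __name__ == "__main__":
    conn = make_demo_db()
    print(aggregate_in_db(conn))
    print(len(sample_every_nth(conn, 10)), "sampled rows")
    for batch in iter_batches(conn, 300):
        print("batch of", len(batch), "rows")
```

With a real MySQL table the same SQL ideas would apply through whichever driver is in use; the aggregated or sampled result could then be written to a smaller table and read with the Read Database operator, which does expose metadata.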