
scoring and storing very large datasets in RM - any hint?

dan_agape Member Posts: 106 Maven
edited November 2018 in Help

The Stream Database operator is one of the essential operators for handling very large datasets in RM (if not the most important one).

I did the following experiment: using this operator, I sampled a large dataset stored in a database, such that the sample fits and can be handled in main memory in order to learn a model (data preprocessing included). So far so good: I obtained and evaluated the model, was happy with its performance measures, and saved it. Then I applied the saved model to the whole dataset, which, logically, was accessed via the same Stream Database operator, with the intention of saving the result in a new table of the database.

The process failed - with the suggestion of materialising the dataset in memory first (!!), which is not a solution given the size of the dataset.

Although I find it obvious how to implement this in a consecrated Data Mining suite such as SPSS Clementine/Modeler or SAS Enterprise Miner, I cannot see another approach to scoring and storing the whole (large) dataset with RM. I assume it should be possible. Many thanks to those who would like to share their experience or provide a useful hint. (A rough sketch of what I am trying to achieve is appended below my signature.)

Best
Dan
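
PS To make the intended workflow concrete, here is a rough sketch in plain Python of what I am trying to achieve. It is not RapidMiner code and not my actual process: the connection string, the sampling SQL and the "churn" label column are placeholders, and preprocessing is skipped by assuming numeric features.

# Rough sketch only - all names are placeholders, not RapidMiner code.
import pandas as pd
from joblib import dump
from sqlalchemy import create_engine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

engine = create_engine("postgresql://user:password@host/db")   # placeholder connection

# 1) Pull a random sample that fits into main memory
#    (ORDER BY random() is PostgreSQL-style; the sampling syntax is database-specific).
sample = pd.read_sql("SELECT * FROM telecomchurn ORDER BY random() LIMIT 100000", engine)

# 2) Train and evaluate on the in-memory sample (2/3 training, 1/3 test), then save the model.
X, y = sample.drop(columns=["churn"]), sample["churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=2001)
model = DecisionTreeClassifier().fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
dump(model, "churn_scoring.joblib")

# 3) The open question: apply the saved model to the FULL telecomchurn table,
#    which does not fit into memory, and save the scores in a new database table.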

Answers

  • haddock Member Posts: 849 Maven
    Hi,

    As with my response to your last post, I work on databases and do not experience the issues you describe, so it would be helpful to see your process XML and to know your configuration.
  • dan_agape Member Posts: 106 Maven
    So let's keep things simple: you have a dataset in a database, you have a supervised learning model produced on a compatible training dataset, and you want to apply it to score the first dataset, which is so large that it does not fit into main memory (and obviously you want the scored dataset saved back in the database).

    1) How would you score the dataset using your approach?
    2) How would you correct the following simplified process so that it works? Assume you have the appropriate model, the appropriate connection details, and the appropriate dataset in the database. You get the message "Process failed..." together with the suggestion "to transform the dataset into a memory based data table first or materialize the data table in memory"!...

    In conclusion, since you say you work on databases, how do you score large datasets?

    Thanks for your input,
    Dan

    Here is the simplified process XML (saving the scored dataset to the database is omitted here):
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
        <parameter key="logverbosity" value="3"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="1"/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <parameter key="parallelize_main_process" value="false"/>
        <process expanded="true" height="417" width="614">
          <operator activated="true" class="stream_database" compatibility="5.0.8" expanded="true" height="60" name="Stream Database" width="90" x="45" y="120">
            <parameter key="define_connection" value="0"/>
            <parameter key="connection" value="blabla"/>
            <parameter key="database_system" value="0"/>
            <parameter key="table_name" value="telecomchurn"/>
            <parameter key="recreate_index" value="false"/>
          </operator>
          <operator activated="true" class="read_model" compatibility="5.0.8" expanded="true" height="60" name="Read Model" width="90" x="45" y="30">
            <parameter key="model_file" value="c:\churn_scoring.mod"/>
          </operator>
          <operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="Apply Model" width="90" x="179" y="30">
            <list key="application_parameters"/>
            <parameter key="create_view" value="false"/>
          </operator>
          <connect from_op="Stream Database" from_port="output" to_op="Apply Model" to_port="unlabelled data"/>
          <connect from_op="Read Model" from_port="output" to_op="Apply Model" to_port="model"/>
          <connect from_op="Apply Model" from_port="labelled data" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    the general contract is that you cannot write into the data table you are currently reading from. Hence you have to materialize your data first, as the exception suggests. Since you cannot materialize the complete dataset, you have to do this in chunks, and the chunks can be appended to a new table after being classified (a plain-Python sketch of the same idea is appended below my signature).
    An example process would look like this:
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="5.0.8" expanded="true" name="Process">
        <process expanded="true" height="417" width="614">
          <operator activated="true" class="read_model" compatibility="5.0.8" expanded="true" height="60" name="Read Model" width="90" x="45" y="30">
            <parameter key="model_file" value="c:\churn_scoring.mod"/>
          </operator>
          <operator activated="true" class="remember" compatibility="5.0.8" expanded="true" height="60" name="Remember" width="90" x="179" y="30">
            <parameter key="name" value="model"/>
            <parameter key="io_object" value="Model"/>
          </operator>
          <operator activated="true" class="stream_database" compatibility="5.0.8" expanded="true" height="60" name="Stream Database" width="90" x="45" y="120">
            <parameter key="connection" value="blabla"/>
            <parameter key="table_name" value="telecomchurn"/>
          </operator>
          <operator activated="true" class="loop_batches" compatibility="5.0.8" expanded="true" height="60" name="Loop Batches" width="90" x="246" y="120">
            <process expanded="true" height="423" width="854">
              <operator activated="true" class="recall" compatibility="5.0.8" expanded="true" height="60" name="Recall" width="90" x="45" y="165">
                <parameter key="name" value="model"/>
                <parameter key="io_object" value="Model"/>
              </operator>
              <operator activated="true" class="materialize_data" compatibility="5.0.8" expanded="true" height="76" name="Materialize Data" width="90" x="45" y="30"/>
              <operator activated="true" class="apply_model" compatibility="5.0.8" expanded="true" height="76" name="Apply Model" width="90" x="179" y="30">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="write_database" compatibility="5.0.8" expanded="true" height="60" name="Write Database" width="90" x="313" y="30">
                <parameter key="connection" value="Bla"/>
                <parameter key="table_name" value="new_table_name"/>
                <parameter key="overwrite_mode" value="overwrite first, append then"/>
              </operator>
              <connect from_port="exampleSet" to_op="Materialize Data" to_port="example set input"/>
              <connect from_op="Recall" from_port="result" to_op="Apply Model" to_port="model"/>
              <connect from_op="Materialize Data" from_port="example set output" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Write Database" to_port="input"/>
              <portSpacing port="source_exampleSet" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Model" from_port="output" to_op="Remember" to_port="store"/>
          <connect from_op="Stream Database" from_port="output" to_op="Loop Batches" to_port="example set"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
        </process>
      </operator>
    </process>
    Please tell me if you run into any problems with that.


    Greetings,
      Sebastian
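
    PS If it helps to see the idea outside of RapidMiner: the loop above is conceptually just "read a chunk, score it, append it to a new table". Below is a purely illustrative plain-Python sketch of that pattern; the connection string, table and column names are made up, and I assume the model was saved with joblib and the features are numeric.

    # Chunked scoring sketch - not RapidMiner code, all names are placeholders.
    import pandas as pd
    from joblib import load
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://user:password@host/db")   # placeholder connection
    model = load("churn_scoring.joblib")                           # model trained on the in-memory sample

    # Read the large table in memory-sized chunks, score each chunk and append it
    # to a NEW table (never the table that is currently being read).
    reader = pd.read_sql("SELECT * FROM telecomchurn", engine, chunksize=10_000)
    for i, chunk in enumerate(reader):
        features = chunk.drop(columns=["churn"], errors="ignore")
        chunk["prediction"] = model.predict(features)
        # "overwrite first, append then", like the Write Database operator above
        chunk.to_sql("telecomchurn_scored", engine,
                     if_exists="replace" if i == 0 else "append", index=False)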
  • haddock Member Posts: 849 Maven
    Or you can appeal to higher powers, as closed sourcers do!
    dan_agape wrote:

    Although I find it obvious how to implement this in a consecrated Data Mining suite such as SPSS Clementine/Modeler or SAS Enterprise Miner, I cannot see another approach to scoring and storing the whole (large) dataset with RM.
    consecrate verb (consecrated, consecrating) 1 to set something apart for a holy use; to make sacred; to dedicate something to God. 2 Christianity to sanctify (bread and wine) for the Eucharist. 3 to devote something to a special use. consecration noun.
    ETYMOLOGY: 15c: from Latin consecrare, consecratum to make sacred, from sacer sacred.
    Chambers
  • dan_agape Member Posts: 106 Maven
    Sebastian, thanks, that's been useful.
  • dan_agape Member Posts: 106 Maven
    m_r_nour wrote:

    hi

    haddock , Here is a forum to ask questions and doubts
    so if you do not want help me please don't disturb me  >:(

    haddock:

    Most non-native English speakers on this forum may not have expressed themselves perfectly when posting.

    While language mistakes are tolerated by everybody here, rudeness is not. Other people have complained about your behaviour on this forum. It seems you are consistently rude (perhaps out of frustration?) when interacting with other users. Please ignore my postings; I will certainly be ignoring yours.
  • haddock Member Posts: 849 Maven
    Cheer up, and stop being so defensive. I was making a joke  :D

    PS It also occurs to me that your training set must be smaller than your test set, unusual.
  • dan_agape Member Posts: 106 Maven
    haddock wrote:

    Cheer up, and stop being so defensive. I was making a joke  :D
    It seems your jokes are not properly understood on this forum.
    haddock wrote:

    PS It also occurs to me that your training set must be smaller than your test set, unusual.
    "Must" rather shows overconfidence here. As I said, the model could not be applied to score the whole dataset, so it was certainly not tested on it, and it is therefore a mistake to assume that the whole dataset was the test set. But, as mentioned, for this experiment the model had been evaluated before any attempt to score the whole dataset. How? Both the training and the test datasets were parts of the data sample, chosen with the usual 2/3 and 1/3 split. The matter is closed, and I will not respond to any of your postings any more.
  • haddock Member Posts: 849 Maven
    Listen to yourself.
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Ok,
    that's enough, guys, please calm down. This is a forum for helping each other, not for battling!
    This is no competition for the most exquisite language skills or the best data miner on earth.

    Maybe Haddock should not have made fun of your mistake, but in fact it did have an aspect of humour, especially if you think of asking SPSS for help with a problem when you have not paid for their software: it's like praying to god... it simply won't help you with your software problem.

    And let me add this, as another non-native speaker: Haddock has to live with the fact that we mess up his mother language. At least with my German mother tongue, most of my English sentences will sound either rude or simply confusing. That is probably a reason to get sarcastic sometimes.
    Last but not least, Haddock is the most active community member and has helped many of our users with valuable tips. It's definitely a good idea to listen to what he has to say.

    Please continue this discussion as professionals; I don't want to have to clean up the mess if this escalates.

    Greetings,
      Sebastian

  • dan_agape Member Posts: 106 Maven
    Sebastian Land wrote:

    Ok,
    that's enough, guys, please calm down. This is a forum for helping each other, not for battling!
    This is no competition for the most exquisite language skills or the best data miner on earth.
    Indeed, Sebastian, this is a learning environment. Most of us are here to learn and thus to ask questions. If at some point a user does not respond to a posting with the genuine intention of providing the help that was requested, perhaps s/he should refrain from replying that time. Anyway, replying with just 'read the documentation' without a useful hint, or just being sarcastic, is not helpful and may rather discourage users from this forum. It is not a requirement to answer questions, even for the most active forum member. Moreover, whenever possible, we have the kind and patient (and absolutely never rude or sarcastic, despite his proficiency in data mining) Sebastian, or another member of the team, around with a useful answer (which is ultimately and indirectly beneficial to RM too).
    Sebastian Land wrote:

    And let me add this, as another non-native speaker: Haddock has to live with the fact that we mess up his mother language. At least with my German mother tongue, most of my English sentences will sound either rude or simply confusing. That is probably a reason to get sarcastic sometimes.
    Last but not least, Haddock is the most active community member and has helped many of our users with valuable tips. It's definitely a good idea to listen to what he has to say.
    On the positive side, I am sure that most native English speakers rather appreciate a non-native speaker for his/her efforts in learning and using the language. At least my friends and colleagues do. Actually they appreciate even more that I am fluent in two languages - mother tongue not included. So small mistakes are forgiven - perhaps because they speak just one? :D  If I am allowed to make a joke in more statistical terms, since we do data mining: it's true, though, that they are a very representative sample of the population from this point of view. So perhaps I will collect a small dataset from the workplace and learn a decision tree on it with respect to the ability to speak a foreign language. I wonder what the machine would say :)

    Anyway, let's just deal with RM and data mining, in a professional manner; that's why we are on this forum.

    Cheers,
    Dan
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Dan,
    "Read the fucking manual" may sound rude, and I try to avoid these four words wherever possible, but sometimes, when I am having a bad day and someone asks an idiotic question that could have been answered by taking a single look at the documentation or by switching on one's own brain, I would like to be allowed to write them.
    You have to see it this way: whenever someone asks a question here and nobody answers it, we feel obliged to help. For this we have to sacrifice some minutes of our working time. On the one hand this is good, because users need to get over the steep learning curve; on the other, it consumes time we could use to improve the program (and the documentation). And if someone asks questions even before having reached the start of the learning curve, this just costs unnecessary time - time that we could use to answer more important questions, like the one you originally asked in this thread. Please keep in mind that we are not paid for maintaining this forum, so stupid questions do not increase our human resources, they just consume them...
    So I personally find some of these questions just impolite, because they are asked before people start thinking on their own. Of course I won't answer in the same impolite way...

    But I cannot remember any participant of this discussion ever asking such a question. Nor do I believe that haddock's joke was made to insult either you or your language skills. I think we can settle the whole matter NOW.

    Keep calm and carry on.

    Sebastian

  • THW_Mark Member Posts: 11 Contributor II
    Bumping a bit of an old topic (but it occurred to me that it is better to keep the documentation somewhat grouped, before we have to data mine the forum :P).

    The method Sebastian describes indeed looks very promising; at first I didn't realize that the Stream Database operator should be combined with Loop Batches. So that has been cleared up, thanks!

    I'm wondering, does this method also work for large datasets when training the model? For example, I want to use an SVM on a large text database. Can I just use Loop Batches to train, or will this interfere with the iterative inner workings of the SVM algorithm?
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I don't think this will work, since the SVM needs all training examples at once, as far as I know. And be sure: since the runtime grows with the third power of the number of examples, you simply don't want to train an SVM on that many examples anyway...
    Nevertheless, what you could do is group the examples into chunks with a reasonably consistent label distribution and learn a separate SVM on each chunk. Later you can combine them into one voting model; this might even improve performance further (a rough sketch of the idea is appended below my signature)...

    Greetings,
      Sebastian
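
    PS A rough sketch of that chunk-and-vote idea, in plain Python with scikit-learn on synthetic data (purely illustrative, not a RapidMiner process and not the only way to combine the models):

    # Train one SVM per stratified chunk and combine the chunk models by majority vote.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import StratifiedKFold
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=30_000, n_features=20, random_state=0)

    # StratifiedKFold yields index chunks with a consistent label distribution,
    # each small enough for the roughly cubic-time SVM training.
    chunks = [fold for _, fold in StratifiedKFold(n_splits=10).split(X, y)]
    models = [SVC().fit(X[idx], y[idx]) for idx in chunks]

    def vote(models, X_new):
        # Majority vote over the per-chunk SVMs (binary 0/1 labels assumed here).
        preds = np.stack([m.predict(X_new) for m in models])
        return (preds.mean(axis=0) >= 0.5).astype(int)

    print(vote(models, X[:5]), y[:5])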