Questions about Time Series Example Sets

fhamkens Member Posts: 9 Learner I
edited July 2019 in Help
Hi,

I'm relatively new to RM, and to machine learning in general. I have tried to solve this problem myself over the last week, but I have to admit that I am not making any progress. So I hope I can at least get a clue about how to solve it. Every time I think I am on my way to the solution, I reach a point where I get stuck or learn that I was simply wrong.

My first question is about time series. I have several datasets in the form of time series. I want to train a model using them, but I am not sure what I need to do to achieve this. My understanding is that all of those datasets should be one example set, where each row is one time series. That would mean I need something like a nested example set, which I don't think exists, right? So I tried to loop each example set (that is, each time series) through the training. But the result I get is one model per example set.
Is there a way to train (and test) one single model for all example sets?
In the end I plan to use a Random Forest, but I would like to try some other models too.

And my second question: how do I add some kind of "global", time-independent attribute to an example set? If I add another attribute as a column, wouldn't that imply that this attribute is time-dependent?

I hope this is not too much to ask. I would already be grateful if someone could point me in the right direction, since I am at a point where I have absolutely no idea how to solve these problems.

Best regards,
Joe  :)
Answers

  • fhamkens Member Posts: 9 Learner I
    Hi Varun,

    thank you for your answer!
    "If you loop every file individually then you will get separate models for each file"

    Yes, I already experienced this, and this is not what I want.

    "Did the data files follow any naming convention or are they just named randomly?"

    The files are named after what they represent, but are not numbered or anything like that, though I think I could change that. They are .csv files. The timeframe of all files is the same, so are the attributes.

    By merging, do you mean using the "Join" operator, so that I have every file side by side? In that case I should rename the attributes so I can distinguish them, shouldn't I? But wouldn't I then need to tell RM somehow which attributes are related (come from the same file) and which are not?

    Because the example sets that I have are (I guess) mostly unrelated, or at least I don't want RM to think they could be related. There should be behaviours of each set though, that are comparable and should be predictable (I hope it is clear, what I am writing. If not, feel free to ask).

  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited July 2019
    Hello @fhamkens

    By merging I mean appending the data from one file after another into a single table. Since, as you said, all the attribute names are the same, the "Append" operator can do it. The XML below shows how to append these files if they have the same attribute names. To use it, click Show, copy the code, paste it into the XML panel of RM, and click the green tick mark. If you can't see the XML panel, go to View --> Show Panel --> XML.
    "The timeframe of all files is the same, so are the attributes."

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
    <context>
    <input/>
    <output/>
    <macros/>
    </context>
    <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
    <parameter key="logverbosity" value="init"/>
    <parameter key="random_seed" value="2001"/>
    <parameter key="send_mail" value="never"/>
    <parameter key="notification_email" value=""/>
    <parameter key="process_duration_for_mail" value="30"/>
    <parameter key="encoding" value="SYSTEM"/>
    <process expanded="true">
    <operator activated="true" class="concurrency:loop_files" compatibility="9.3.001" expanded="true" height="82" name="Loop Files" width="90" x="179" y="136">
    <parameter key="directory" value="F:\RM\Union"/>
    <parameter key="filter_type" value="glob"/>
    <parameter key="recursive" value="false"/>
    <parameter key="enable_macros" value="false"/>
    <parameter key="macro_for_file_name" value="file_name"/>
    <parameter key="macro_for_file_type" value="file_type"/>
    <parameter key="macro_for_folder_name" value="folder_name"/>
    <parameter key="reuse_results" value="false"/>
    <parameter key="enable_parallel_execution" value="true"/>
    <process expanded="true">
    <operator activated="true" class="read_csv" compatibility="9.3.001" expanded="true" height="68" name="Read CSV" width="90" x="380" y="34">
    <parameter key="column_separators" value=","/>
    <parameter key="trim_lines" value="false"/>
    <parameter key="use_quotes" value="true"/>
    <parameter key="quotes_character" value="&quot;"/>
    <parameter key="escape_character" value="\"/>
    <parameter key="skip_comments" value="false"/>
    <parameter key="comment_characters" value="#"/>
    <parameter key="starting_row" value="1"/>
    <parameter key="parse_numbers" value="true"/>
    <parameter key="decimal_character" value="."/>
    <parameter key="grouped_digits" value="false"/>
    <parameter key="grouping_character" value=","/>
    <parameter key="infinity_representation" value=""/>
    <parameter key="date_format" value=""/>
    <parameter key="first_row_as_names" value="true"/>
    <list key="annotations"/>
    <parameter key="time_zone" value="SYSTEM"/>
    <parameter key="locale" value="English (United States)"/>
    <parameter key="encoding" value="SYSTEM"/>
    <parameter key="read_all_values_as_polynominal" value="false"/>
    <list key="data_set_meta_data_information"/>
    <parameter key="read_not_matching_values_as_missings" value="true"/>
    <parameter key="datamanagement" value="double_array"/>
    <parameter key="data_management" value="auto"/>
    </operator>
    <connect from_port="file object" to_op="Read CSV" to_port="file"/>
    <connect from_op="Read CSV" from_port="output" to_port="output 1"/>
    <portSpacing port="source_file object" spacing="0"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_output 1" spacing="0"/>
    <portSpacing port="sink_output 2" spacing="0"/>
    </process>
    </operator>
    <operator activated="true" class="append" compatibility="9.3.001" expanded="true" height="82" name="Append" width="90" x="447" y="136">
    <parameter key="datamanagement" value="double_array"/>
    <parameter key="data_management" value="auto"/>
    <parameter key="merge_type" value="all"/>
    </operator>
    <connect from_op="Loop Files" from_port="output 1" to_op="Append" to_port="example set 1"/>
    <connect from_op="Append" from_port="merged set" to_port="result 1"/>
    <portSpacing port="source_input 1" spacing="0"/>
    <portSpacing port="sink_result 1" spacing="0"/>
    <portSpacing port="sink_result 2" spacing="0"/>
    </process>
    </operator>
    </process>
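
    For readers who want to see the append idea outside RapidMiner, here is a minimal sketch in plain Python. It is not the RapidMiner process above, only an illustration of the same row-wise stacking. The function name `append_csv_tables` and the sample values are made up for this example; the column names Date, Close and Volume are borrowed from later in this thread.

```python
import csv
import io


def append_csv_tables(csv_texts):
    """Stack the rows of several CSV tables that share the same header."""
    header = None
    rows = []
    for text in csv_texts:
        reader = csv.DictReader(io.StringIO(text))
        if header is None:
            header = reader.fieldnames
        elif reader.fieldnames != header:
            # A plain append only makes sense if the attribute names match.
            raise ValueError(f"column mismatch: {reader.fieldnames} != {header}")
        rows.extend(reader)
    return header, rows


# Two tiny made-up tables standing in for two CSV files in one directory.
file_a = "Date,Close,Volume\n2015-01-01 00:00,10.0,100\n2015-01-01 01:00,10.5,120\n"
file_b = "Date,Close,Volume\n2015-01-01 00:00,55.0,900\n"

header, rows = append_csv_tables([file_a, file_b])
print(header)
print(len(rows), "rows after appending")
```

    The result has the same columns as each input and simply more rows, which is the difference from a join.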

    I am also a bit confused by the statement below. Based on your earlier comment that the attributes are the same, I take it that you have similar data in all files, perhaps each relating to a different subject.

    "Because the example sets that I have are (I guess) mostly unrelated, or at least I don't want RM to think they could be related."

    Ask me questions and I can say more if needed. If the attribute names are different, a simple Append operator won't work for you.

    Regards,
    Varun
    https://www.varunmandalapu.com/

    Be Safe. Follow precautions and Maintain Social Distancing

  • fhamkens Member Posts: 9 Learner I
    Hi Varun,

    oh, I see what you are aiming for, but I think I wasn't clear enough.

    I have example sets from different sources. Let's say I want to look at different financial assets. All datasets cover the same timeframe, for example the years 2015 to 2018. I am trying to forecast a signal. That signal appears in all these assets and behaves more or less similarly. It should be prefigured by slight changes in the data a few hours earlier. Sadly, the signal doesn't appear very often, which is why I want to train the model with several example sets.

    This is more or less an exercise for me, I want to use that knowledge for other things in the future.

    I hope this makes everything clearer now. Sorry for any inconvenience, English isn't my first language. I will try harder in future posts.
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited July 2019
    Thanks, this is clearer.

    I think you need to join the datasets to train a single model on data from different sources, since the time frame is the same. I am not sure whether the data can be synchronized, as I don't know what your data from the different sources looks like. Are these datasets time-synchronized? My preliminary idea is to synchronize the datasets and join them, keeping the values from each source as separate attributes. You can then remove highly correlated attributes using "Remove Correlated Attributes" and train your model on this dataset. If you still want to drop some attributes, you can use "Select Attributes" to handpick the required ones.

    @tftemme or @hughesfleming68, any suggestions here?
    Regards,
    Varun

  • fhamkens Member Posts: 9 Learner I
    Hi Varun,

    by "join", do you mean using the Join operator? Could that even work, given that the example sets are from different sources?

    And by "time synchronized", do you mean that the time values are all the same? If so, then yes: the rows in the example sets are at full hours. So in theory I could join all example sets and the examples would match in time.

    But I am not sure whether this would confuse RM somewhat. Would it?
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    The Join operator relies on a primary key and joins the examples of both sets according to the chosen join type (inner, outer, left, right).

    If you have comparable data from different sources, my suggestion would be to create a join. If you don't have a specific column that can be used as a primary key, you can use Generate ID to create an ID for the examples in each dataset and then join on that ID. Be careful to cross-check that the rows you join belong to the same day, the same year and the same time. This must not get mixed up, because your analysis will be wrong if you join examples from different times.
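
    To make the join concrete, here is a small sketch in plain Python of an inner join on a timestamp key. It is an illustration only, not the RapidMiner Join operator. The helper `inner_join_on_key`, the suffixes and the sample rows are invented for this example, and it assumes the key is unique within each table.

```python
def inner_join_on_key(left, right, key, left_suffix, right_suffix):
    """Inner-join two lists of dict rows on `key`, suffixing the other columns."""
    right_by_key = {row[key]: row for row in right}  # assumes unique keys
    joined = []
    for lrow in left:
        rrow = right_by_key.get(lrow[key])
        if rrow is None:
            continue  # inner join: drop timestamps missing on the right side
        out = {key: lrow[key]}
        out.update({f"{c}{left_suffix}": v for c, v in lrow.items() if c != key})
        out.update({f"{c}{right_suffix}": v for c, v in rrow.items() if c != key})
        joined.append(out)
    return joined


# Made-up hourly rows for two assets; only 01:00 exists in both.
asset_a = [
    {"Date": "2015-01-01 00:00", "Close": 10.0},
    {"Date": "2015-01-01 01:00", "Close": 10.5},
]
asset_b = [
    {"Date": "2015-01-01 01:00", "Close": 55.0},
    {"Date": "2015-01-01 02:00", "Close": 54.0},
]

joined = inner_join_on_key(asset_a, asset_b, "Date", "_A", "_B")
print(joined)
# Cross-check: how many rows survived the join?
print(len(joined), "of", len(asset_a), "rows from asset A kept")
```

    Comparing the row count before and after the join is a cheap way to notice timestamps that do not line up between sources.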

    If you could provide us with a couple of sample datasets, we can give more info. It is really tough, as everything I say here is based on how I imagine your datasets.

    I also don't understand what you mean by "confuse RM". Can you elaborate on that? What kind of confusion would RM have with the data?
    Regards,
    Varun

  • fhamkens Member Posts: 9 Learner I
    Hi Varun,

    I could use the datetime as the primary key when joining the data. There should be no problem with that.

    I will attach a few examples, so you can see, what I am working with.

    What I mean by "confusing RM": as far as I understand it, each row is one example (in this case a snapshot of a chart). If I join the data, each row will hold values from different assets. For example, there will be values for "High" from two (or more) assets, so the row isn't one example anymore. But to my understanding, RM will treat this data as if each row were one example whose attributes are dependent, not necessarily on each other, but at least on the asset the values came from. And if there are several assets in one table, RM can't know which value belongs to which asset (let alone that there are several assets at all).

    Let me try to explain it with the Golf sample provided by RM: each row/example is one specific day (so you have something like a time series). The attributes are for one golf course. I could measure the same attributes on another golf course on the same days and join these two tables. Now I have one table, and each row holds the values for, e.g., the temperature on golf course A and golf course B. I can't imagine how RM would handle this. And thinking one step further, I wouldn't know how to set a label, since I am looking for the signal in each asset (though I think there is something like multi-label, IIRC?).

    Do you now understand what I mean by this? It could be that I am misunderstanding how RM handles the data.

    I have attached three example files, so you can have a look and maybe get a better idea.

    Best regards,
    Joe :)


  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    edited July 2019
    Hello Joe,

    Thanks for your clarifications. This is a good question for me as well. I have a couple of ideas as well as a couple of questions based on your data.

    1. What is your target? Do you want to see which of these data sources does well, or do you want to combine the sources and predict the outputs of the different sources from the whole dataset (this is possible with multi-label modeling)?
    2. Do you want the model to learn from all examples and make predictions for all of them? For this, you can use a simple Append operator (not Join), which I mentioned earlier, and run your models.
    3. Is the target label in your case tied to a single entity? For example, take data coming from two thermostats in a single room: the data comes from different sources, but the measured quantity (room temperature) is the same, and the values differ only through measurement error and the like.

    If you want to see which source does well, you need to create independent models and compare their performance. This makes sense if the target labels belong to similar but different entities. Taking the same example, golf course A and golf course B: the data is similar, but it belongs to different entities (course A and course B).

    If you want to use all the inputs but predict one label attribute at a time, then multi-label modeling does the trick. We can join the datasets (Join operator) and then set the three Volume columns as labels (RapidMiner 9.4 has this option); otherwise we can do the same in RM 9.3 with Loop Attributes. I also want to clear up a misconception about multi-label prediction: it doesn't mean that the model predicts all the outputs in a single go. Instead, it creates one model per label and generates the outputs from those models, because most predictive models can only work with one label at a time. Vector Linear Regression is an exception and can predict multiple labels at the same time.
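
    The "one model per label" idea can be sketched in a few lines of plain Python. This is only an illustration of the looping pattern, not how RapidMiner implements multi-label modeling. The single-input least-squares fit, the column names `x`, `Volume_A`, `Volume_B` and the numbers are all made up for this example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x with one input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_one_model_per_label(rows, feature, labels):
    """Train a separate model for every label column, as a multi-label wrapper does."""
    xs = [row[feature] for row in rows]
    return {label: fit_line(xs, [row[label] for row in rows]) for label in labels}


# Joined table with one shared input and one label column per source.
rows = [
    {"x": 0.0, "Volume_A": 1.0, "Volume_B": 10.0},
    {"x": 1.0, "Volume_A": 3.0, "Volume_B": 9.0},
    {"x": 2.0, "Volume_A": 5.0, "Volume_B": 8.0},
]

models = fit_one_model_per_label(rows, "x", ["Volume_A", "Volume_B"])
for label, (intercept, slope) in models.items():
    print(label, "intercept:", intercept, "slope:", slope)
```

    Each label ends up with its own fitted model; nothing is shared between them apart from the inputs.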

    If you want to use all examples from all the sources, you can just append the datasets. This doesn't create new columns; it concatenates the data under the matching attributes. We would generate an ID for each dataset stating which source it belongs to, and then run a model. In this setup the model trains on all example sets and learns from all the sources. So, if we have data from golf course A and golf course B, the model predicts the label for each example (which might belong to A or B) based on its training on data from both A and B. With the help of this ID column, we can also use cross-validation with a split on the batch attribute, so that it creates multiple models internally based on the golf course each example belongs to, aggregates the performance of all of them, and returns a single model trained on all the datasets.

    I understand that this information is a bit overwhelming. I am also reasoning from a machine learning point of view. You should consider whether these solutions make sense from a domain perspective as well.
    Regards,
    Varun

  • fhamkens Member Posts: 9 Learner I
    edited July 2019
    Hi Varun,

    thanks for your extensive answer, I took my time to read through it and will try to answer thoroughly.

    To answer your three questions:

    1.: By "different sources from the whole dataset" do you mean "Golfcourse C"? In that case, yes, I think that would be appropriate: since the signal I am looking for is so rare, it is not uncommon for it to appear in assets I am not training the model on. And I don't think it would be wise to include data from assets where this signal never appeared in the past, right?

    2.: Using all examples was the plan in the first place, yes. But yesterday I had another idea that might solve the problem altogether: I construct a few key figures for each asset, for example price change, volatility, volume and so on over the last 24h/48h/week... I would also add an attribute for "Signal y/n". So now I would have a table for each asset with said attributes and one hour per row. In the next step I would try to aggrevate the datasets, so I have only one table at the end (not sure how, though; it is important that the appearance of the signal is still prominent, so I have to look into that), where multiple similar attributes are no longer needed. Then I think it would be enough to take a few examples where there was a signal and some where there was none. Since each example would carry information about the last x hours, I wouldn't have a time series anymore, but rather a "normal" example set. Do you get the idea? What do you think about this? If you want, I can try to construct a simple example.
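
    A rough sketch of this "key figures per hour" idea in plain Python, under stated assumptions: the features `change` and `range` are stand-ins for price change and volatility, the window length is arbitrary, and the prices and signal flags are invented. It is meant to show the shape of the resulting table, not a recommended feature set.

```python
def build_window_features(closes, signals, window):
    """One row per hour t >= window, summarising the previous `window` hours."""
    rows = []
    for t in range(window, len(closes)):
        past = closes[t - window:t]
        rows.append({
            "change": past[-1] - past[0],    # price change over the window
            "range": max(past) - min(past),  # crude stand-in for volatility
            "signal": signals[t],            # label: did the signal occur at hour t?
        })
    return rows


# Made-up hourly closing prices and signal flags for one asset.
closes = [10.0, 10.0, 10.0, 10.5, 11.5, 11.0]
signals = ["n", "n", "n", "n", "n", "y"]

rows = build_window_features(closes, signals, window=3)
for row in rows:
    print(row)
```

    Because every row already carries its own history, rows built this way from different assets can be stacked into one table without the ordering between assets mattering.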

    But I find your last paragraph very interesting and, for what I want, the most usable. But one thing at a time.

    3.: Well, my target label is, if I am not mistaken, only relevant for one entity. But since the signal is so rare, if there is a positive forecast for one entity, it is not very likely that there would be other positives.

    Like I said, your last paragraph is very interesting (well, all of them are, but the last might be the most relevant here). I wasn't aware that something like this is possible. So, what would the ID look like? Would it be just another (regular?) attribute? It wouldn't be an ID attribute, right, since that would be the datetime, if I am correct? I also didn't know that cross-validation could split the data based on that ID. I guess it is important that the data is split in such a way that there aren't sets from different sources in a single table, because then windowing wouldn't work correctly, since it is time-based.
    Splitting into different datasets based on ID/source, training different models on each of those datasets, and then aggrevating them into a single model sounds pretty much, if not exactly, like what I am trying to achieve.

    Regards,
    Joe 
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello Joe,

    I will explain the last example in a bit more detail.

    1. Generating ID: as you have datasets from different sources in different files, I would create an ID for each dataset separately. It is like adding an extra ID column in which all the values are the same: if the first dataset has 500 examples, there will be a new column with the value 1 in all 500 rows. This can be done with the Generate Attributes operator in RapidMiner: I provide a column name, enter the number 1 as the function expression, and it creates this column. The same goes for all the other datasets: 2 for the next one, 3 for the one after, and so on.
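
    In plain Python, the same tagging step could look like the sketch below. The column name `source_id` and the sample rows are assumptions made for this illustration; in RapidMiner the constant column would come from Generate Attributes as described above.

```python
def append_with_source_id(datasets):
    """Append datasets and add a constant source_id (1, 2, 3, ...) per dataset."""
    combined = []
    for source_id, rows in enumerate(datasets, start=1):
        for row in rows:
            combined.append({**row, "source_id": source_id})
    return combined


# Two made-up datasets with the same attributes.
first = [{"Close": 10.0}, {"Close": 10.5}]
second = [{"Close": 55.0}]

combined = append_with_source_id([first, second])
for row in combined:
    print(row)
```

    The ID is an ordinary column here; it only records which file a row came from and is separate from the datetime.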

    "I guess it is important, that the data is split, so there aren't sets from different sources, because than windowing wouldn't be working correctly, since it is time-based"

    But the catch with using a batch split inside cross-validation is this: if you have, say, 10 batches (10 assets), the first model trains on the first 9 batches, tests on the 10th batch and records the performance. Similarly, another model trains on the last 9 batches and tests on the first batch, and this continues until every batch has been tested once. The final performance is an aggregate over the tests on the individual batches. Make sure you understand how batch split validation works before applying it; this is a crucial step.

    If you want to create an individual model for each dataset, you can loop over the datasets, create the models, and finally aggregate the performance of all of them. You don't need batch validation for this. Batch validation doesn't create a model on each dataset; it trains on n-1 batches (assets/datasets) and tests the trained model on the left-out batch (asset/dataset).
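
    The train-on-n-1, test-on-1 pattern can be written down as a short plain-Python sketch. It only produces the splits and is not RapidMiner's Cross Validation operator; the function name `leave_one_batch_out` and the column `source_id` are assumptions for this illustration.

```python
def leave_one_batch_out(rows, batch_key):
    """Yield (held_out_batch, train_rows, test_rows), one fold per batch value."""
    batches = sorted({row[batch_key] for row in rows})
    for held_out in batches:
        train = [r for r in rows if r[batch_key] != held_out]
        test = [r for r in rows if r[batch_key] == held_out]
        yield held_out, train, test


# Made-up rows from three assets, tagged with a source_id column.
rows = [
    {"source_id": 1, "Close": 10.0},
    {"source_id": 1, "Close": 10.5},
    {"source_id": 2, "Close": 55.0},
    {"source_id": 3, "Close": 7.0},
]

folds = list(leave_one_batch_out(rows, "source_id"))
for held_out, train, test in folds:
    print("test on batch", held_out, "| train rows:", len(train), "| test rows:", len(test))
```

    With three assets there are three folds, and no fold ever trains on the asset it is tested on.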

    "In the next step I would try to aggrevate the datasets, so I have only one table at the end"

    I think by "aggrevate" you mean aggregate (average). This is tricky: one issue with aggregation is that it can lose important artifacts in the dataset that might be crucial for a model to detect. I use aggregation when I have a large sample and I know that the dataset doesn't contain a highly time-sensitive signal. A time-sensitive signal, in my view, is one that appears only for a very short amount of time. The problem is that such a crucial change might be lost in aggregation. Give this some thought based on your data. If the signal is not highly time-sensitive, this idea looks fine to me.
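
    A small worked example shows how averaging can hide a short spike. The numbers are invented: 24 hourly values at a baseline of 100 with a single one-hour spike to 160.

```python
# 24 made-up hourly values: flat baseline with a one-hour spike at 13:00.
hourly = [100.0] * 24
hourly[13] = 160.0

daily_mean = sum(hourly) / len(hourly)
daily_max = max(hourly)

spike_vs_baseline = (hourly[13] - 100.0) / 100.0  # size of the spike itself
mean_vs_baseline = (daily_mean - 100.0) / 100.0   # what is left after averaging

print("daily mean:", daily_mean)
print("daily max:", daily_max)
print("spike above baseline:", spike_vs_baseline)
print("mean above baseline:", mean_vs_baseline)
```

    The spike is 60% above the baseline, but the daily mean moves by only 2.5%. An aggregation such as the maximum keeps the spike visible, so which aggregation function is used matters a lot for short-lived signals.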

    Regards,
    Varun

  • fhamkens Member Posts: 9 Learner I
    Hi Varun,

    I love your answers, thank you very much for taking your time  :)

    I looked into the batch splitting of the cross-validation. As you said, I observed that it trains the model on n-1 assets. So maybe that isn't what I was looking for in the end.

    "If you want to create an individual model for individual datasets then you can loop the datasets and create models and finally, you can also aggregate the performance of all models. You don't need batch validation for this. Batch doesn't create a model on each dataset but it trains on n-1 batches (assets/datasets) and tests the trained model on the left out batch (asset/dataset)."
    I think it's better if I try what you wrote in your third paragraph. I will try to build a process out of this. I am not 100% sure which operators I will need, but I guess a Loop, Windowing and Validation, and for the model I think I'll use Random Forest. I will look into it. I don't know if I'll have time during the weekend, but as soon as I have created something I will post it here.

    "I think by 'aggrevate' you mean aggregate (average), this is tricky, one issue with aggregation is that it loses some important artifacts in the dataset which might be crucial for a model to detect. I use aggregation when I have a huge sampling size and I am aware that the dataset doesn't have a high time-sensitive signal. A time-sensitive signal in my view is the one that just appears for a very short amount of time. The problem for this is this highly crucial change might be lost in aggregation. Give this a thought based on your data. If this is not highly sensitive, this idea looks fine for me."

    Yes, "aggregate" was what I meant. I looked into it this morning and I think I had misunderstood something. I thought the Aggregate operator would help with fusing the respective columns of each asset, but instead it produces one value out of one column. But as I am writing this, I discovered the "Generate Aggregation" operator. Otherwise I could do this with the "Generate Attributes" operator (or, if that won't do, in Excel). The problem, though, is that it actually is a time-sensitive signal, a very short one, and it's rare. If I used the average of the datasets, the signal wouldn't be visible anymore, and at first glance the aggregation functions of the operator don't look helpful.

    I will focus on your idea first: loop over the data, train a model for each dataset, and try some kind of aggregation at the end. Maybe this will bring me closer to my goal.

    Best regards,
    Joe :)
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Great. Try these things; you can also try deep learning with an LSTM, which captures time-series-related signals very well.

    Regards,
    Varun

  • fhamkens Member Posts: 9 Learner I
    edited August 2019
    Hi,

    so I just did a few things in RM, as you can see in the spoiler. I had to use example sets with more examples, which I will attach to this post, too.

    So far so good. I imagine the models would be aggregated as you mentioned, but I am not sure how to do this. I know there are a few operators for ensembling several models, but I am not sure whether these are the ones I need, or did you have something else in mind?

    Also, the model is pretty bad, but that is because there is no correlation in the example sets I used here.
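
    On the question of combining per-dataset models: one common and simple reading of "aggregating models" is to average their predictions. The plain-Python sketch below shows only that idea. It is not a specific RapidMiner ensemble operator and not necessarily what was meant earlier in the thread; the models are hypothetical (intercept, slope) pairs with made-up values.

```python
def predict(model, x):
    """Predict with a simple linear model given as (intercept, slope)."""
    intercept, slope = model
    return intercept + slope * x


def average_prediction(models, x):
    """Combine several models by averaging their individual predictions."""
    predictions = [predict(model, x) for model in models]
    return sum(predictions) / len(predictions)


# Hypothetical models, one trained per dataset.
models = [(1.0, 2.0), (2.0, 1.0)]

combined = average_prediction(models, 2.0)
print("individual predictions:", [predict(m, 2.0) for m in models])
print("averaged prediction:", combined)
```

    For a classification signal such as "Signal y/n", the analogous step would be a vote across the models instead of a numeric average.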

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve CUT_more" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//Varun/Data/CUT_more"/>
          </operator>
          <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve WOOD_more" width="90" x="45" y="136">
            <parameter key="repository_entry" value="../Data/WOOD_more"/>
          </operator>
          <operator activated="true" class="loop_data_sets" compatibility="9.3.001" expanded="true" height="103" name="Loop Data Sets" width="90" x="380" y="34">
            <parameter key="only_best" value="false"/>
            <process expanded="true">
              <operator activated="true" class="select_attributes" compatibility="9.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="34">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="Close"/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="true"/>
                <parameter key="include_special_attributes" value="false"/>
              </operator>
              <operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="34">
                <parameter key="attribute_name" value="Date"/>
                <parameter key="target_role" value="id"/>
                <list key="set_additional_roles">
                  <parameter key="Volume" value="label"/>
                </list>
              </operator>
              <operator activated="true" class="series:windowing" compatibility="7.4.000" expanded="true" height="82" name="Windowing" width="90" x="313" y="34">
                <parameter key="series_representation" value="encode_series_by_examples"/>
                <parameter key="window_size" value="10"/>
                <parameter key="step_size" value="1"/>
                <parameter key="create_single_attributes" value="true"/>
                <parameter key="create_label" value="true"/>
                <parameter key="select_label_by_dimension" value="false"/>
                <parameter key="label_attribute" value="Volume"/>
                <parameter key="horizon" value="1"/>
                <parameter key="add_incomplete_windows" value="false"/>
                <parameter key="stop_on_too_small_dataset" value="true"/>
              </operator>
              <operator activated="true" class="series:sliding_window_validation" compatibility="7.4.000" expanded="true" height="124" name="Validation" width="90" x="447" y="34">
                <parameter key="create_complete_model" value="false"/>
                <parameter key="training_window_width" value="100"/>
                <parameter key="training_window_step_size" value="-1"/>
                <parameter key="test_window_width" value="100"/>
                <parameter key="horizon" value="1"/>
                <parameter key="cumulative_training" value="false"/>
                <parameter key="average_performances_only" value="true"/>
                <process expanded="true">
                  <operator activated="true" class="concurrency:parallel_random_forest" compatibility="9.3.001" expanded="true" height="103" name="Random Forest" width="90" x="112" y="34">
                    <parameter key="number_of_trees" value="100"/>
                    <parameter key="criterion" value="least_square"/>
                    <parameter key="maximal_depth" value="10"/>
                    <parameter key="apply_pruning" value="false"/>
                    <parameter key="confidence" value="0.1"/>
                    <parameter key="apply_prepruning" value="false"/>
                    <parameter key="minimal_gain" value="0.01"/>
                    <parameter key="minimal_leaf_size" value="2"/>
                    <parameter key="minimal_size_for_split" value="4"/>
                    <parameter key="number_of_prepruning_alternatives" value="3"/>
                    <parameter key="random_splits" value="false"/>
                    <parameter key="guess_subset_ratio" value="true"/>
                    <parameter key="subset_ratio" value="0.2"/>
                    <parameter key="voting_strategy" value="confidence vote"/>
                    <parameter key="use_local_random_seed" value="false"/>
                    <parameter key="local_random_seed" value="1992"/>
                    <parameter key="enable_parallel_execution" value="true"/>
                  </operator>
                  <connect from_port="training" to_op="Random Forest" to_port="training set"/>
                  <connect from_op="Random Forest" from_port="model" to_port="model"/>
                  <portSpacing port="source_training" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                </process>
                <process expanded="true">
                  <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
                    <list key="application_parameters"/>
                    <parameter key="create_view" value="false"/>
                  </operator>
                  <operator activated="true" class="performance_regression" compatibility="9.3.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34">
                    <parameter key="main_criterion" value="first"/>
                    <parameter key="root_mean_squared_error" value="true"/>
                    <parameter key="absolute_error" value="true"/>
                    <parameter key="relative_error" value="true"/>
                    <parameter key="relative_error_lenient" value="false"/>
                    <parameter key="relative_error_strict" value="false"/>
                    <parameter key="normalized_absolute_error" value="false"/>
                    <parameter key="root_relative_squared_error" value="false"/>
                    <parameter key="squared_error" value="false"/>
                    <parameter key="correlation" value="false"/>
                    <parameter key="squared_correlation" value="false"/>
                    <parameter key="prediction_average" value="false"/>
                    <parameter key="spearman_rho" value="false"/>
                    <parameter key="kendall_tau" value="false"/>
                    <parameter key="skip_undefined_labels" value="true"/>
                    <parameter key="use_example_weights" value="true"/>
                  </operator>
                  <connect from_port="model" to_op="Apply Model" to_port="model"/>
                  <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
                  <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
                  <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="sink_averagable 1" spacing="0"/>
                  <portSpacing port="sink_averagable 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="store" compatibility="9.3.001" expanded="true" height="68" name="Store" width="90" x="581" y="34">
                <parameter key="repository_entry" value="../Data/RF Model"/>
              </operator>
              <connect from_port="example set" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
              <connect from_op="Set Role" from_port="example set output" to_op="Windowing" to_port="example set input"/>
              <connect from_op="Windowing" from_port="example set output" to_op="Validation" to_port="training"/>
              <connect from_op="Validation" from_port="model" to_op="Store" to_port="input"/>
              <connect from_op="Store" from_port="through" to_port="output 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve CUT_more" from_port="output" to_op="Loop Data Sets" to_port="example set 1"/>
          <connect from_op="Retrieve WOOD_more" from_port="output" to_op="Loop Data Sets" to_port="example set 2"/>
          <connect from_op="Loop Data Sets" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>


    Best regards,
    Joe :)
  • varunm1 Moderator, Member Posts: 1,207 Unicorn
    Hello,

    I noticed one issue: the Store operator inside the loop overwrites the model on every iteration. You connected two data sets, so the first iteration stores the model created for the first data set, and the second iteration then stores the model for the second data set under the same name, replacing the first one. To keep them separate, use a macro in the repository entry: RF_Model_%{execution_count} instead of RF Model. The models are then stored individually as RF_Model_1, RF_Model_2, ..., one per looped data set.
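    As a rough illustration of the overwrite (plain Python, not RapidMiner code): the dictionary below is only a hypothetical stand-in for the repository, and store_models is a made-up helper, but it shows why a fixed entry name keeps just the last model while a name containing the iteration count keeps all of them.

```python
def store_models(models, name_template):
    """Store each model under a name built from the 1-based iteration count.

    The returned dict is a hypothetical stand-in for a repository:
    entry name -> stored object.
    """
    repository = {}
    for execution_count, model in enumerate(models, start=1):
        entry = name_template.format(execution_count=execution_count)
        # Writing to an existing entry name replaces what was stored before.
        repository[entry] = model
    return repository


# Placeholder strings instead of real models, one per looped data set.
models = ["model for CUT_more", "model for WOOD_more"]

fixed = store_models(models, "RF Model")                       # same name each time
numbered = store_models(models, "RF_Model_{execution_count}")  # one name per iteration

print(fixed)
print(numbered)
```

    With the fixed name only the model from the last iteration survives; with the numbered name both are kept.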

    To combine the models, you can use the "Group Models" operator in RapidMiner. Check the link below to see whether it meets your needs.
    https://docs.rapidminer.com/latest/studio/operators/modeling/predictive/group_models.html

    One more thing: if you want to average the performance over all the models created in the loop, connect the Average operator to the loop output that receives the performance from the Validation operator.
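    To make the idea of averaging concrete, here is a minimal sketch in plain Python. It assumes one performance result per looped data set and takes a simple unweighted mean per criterion; the numbers are made up, average_performance is a hypothetical helper, and the Average operator in RapidMiner may combine results differently.

```python
from statistics import mean


def average_performance(per_dataset):
    """Average each criterion over all data sets (unweighted mean)."""
    criteria = per_dataset[0].keys()
    return {c: mean(p[c] for p in per_dataset) for c in criteria}


# Made-up values for illustration only, one dict per looped data set.
per_dataset = [
    {"root_mean_squared_error": 2.0, "absolute_error": 1.0},
    {"root_mean_squared_error": 4.0, "absolute_error": 3.0},
]

summary = average_performance(per_dataset)
print(summary)
```

    The result is a single set of numbers describing how the same learner settings perform across all data sets, which can be handy when comparing one configuration against another.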


    Regards,
    Varun
    https://www.varunmandalapu.com/

  • fhamkens Member Posts: 9 Learner I
    Hi Varun,

    Sorry that I didn't respond earlier; sadly, I didn't have much time last week.

    Thank you for looking at my process. I changed the Store operator according to your suggestion.

    About the "Group Models" operator: I knew about that one, but I think it would be better to somehow combine the models into one model. As far as I understand its usage, I would group the models if I had different models for one data set. So I want to try averaging the models instead. But I don't get the Average operator. What do I gain if I average the performances of the models, as you suggested? Or maybe the better question is: what do I do with the average performance?

    <?xml version="1.0" encoding="UTF-8"?><process version="9.3.001">
      <context>
        <input/>
        <output/>
        <macros/>
      </context>
      <operator activated="true" class="process" compatibility="9.3.001" expanded="true" name="Process">
        <parameter key="logverbosity" value="init"/>
        <parameter key="random_seed" value="2001"/>
        <parameter key="send_mail" value="never"/>
        <parameter key="notification_email" value=""/>
        <parameter key="process_duration_for_mail" value="30"/>
        <parameter key="encoding" value="SYSTEM"/>
        <process expanded="true">
          <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve CUT_more" width="90" x="45" y="34">
            <parameter key="repository_entry" value="//Varun/Data/CUT_more"/>
          </operator>
          <operator activated="true" class="retrieve" compatibility="9.3.001" expanded="true" height="68" name="Retrieve WOOD_more" width="90" x="45" y="136">
            <parameter key="repository_entry" value="../Data/WOOD_more"/>
          </operator>
          <operator activated="true" class="loop_data_sets" compatibility="9.3.001" expanded="true" height="103" name="Loop Data Sets" width="90" x="380" y="34">
            <parameter key="only_best" value="false"/>
            <process expanded="true">
              <operator activated="true" class="select_attributes" compatibility="9.3.001" expanded="true" height="82" name="Select Attributes" width="90" x="45" y="34">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="Close"/>
                <parameter key="attributes" value=""/>
                <parameter key="use_except_expression" value="false"/>
                <parameter key="value_type" value="attribute_value"/>
                <parameter key="use_value_type_exception" value="false"/>
                <parameter key="except_value_type" value="time"/>
                <parameter key="block_type" value="attribute_block"/>
                <parameter key="use_block_type_exception" value="false"/>
                <parameter key="except_block_type" value="value_matrix_row_start"/>
                <parameter key="invert_selection" value="true"/>
                <parameter key="include_special_attributes" value="false"/>
              </operator>
              <operator activated="true" class="set_role" compatibility="9.3.001" expanded="true" height="82" name="Set Role" width="90" x="179" y="34">
                <parameter key="attribute_name" value="Date"/>
                <parameter key="target_role" value="id"/>
                <list key="set_additional_roles">
                  <parameter key="Volume" value="label"/>
                </list>
              </operator>
              <operator activated="true" class="series:windowing" compatibility="7.4.000" expanded="true" height="82" name="Windowing" width="90" x="313" y="34">
                <parameter key="series_representation" value="encode_series_by_examples"/>
                <parameter key="window_size" value="10"/>
                <parameter key="step_size" value="1"/>
                <parameter key="create_single_attributes" value="true"/>
                <parameter key="create_label" value="true"/>
                <parameter key="select_label_by_dimension" value="false"/>
                <parameter key="label_attribute" value="Volume"/>
                <parameter key="horizon" value="1"/>
                <parameter key="add_incomplete_windows" value="false"/>
                <parameter key="stop_on_too_small_dataset" value="true"/>
              </operator>
              <operator activated="true" class="series:sliding_window_validation" compatibility="7.4.000" expanded="true" height="124" name="Validation" width="90" x="447" y="34">
                <parameter key="create_complete_model" value="false"/>
                <parameter key="training_window_width" value="100"/>
                <parameter key="training_window_step_size" value="-1"/>
                <parameter key="test_window_width" value="100"/>
                <parameter key="horizon" value="1"/>
                <parameter key="cumulative_training" value="false"/>
                <parameter key="average_performances_only" value="true"/>
                <process expanded="true">
                  <operator activated="true" class="concurrency:parallel_random_forest" compatibility="9.3.001" expanded="true" height="103" name="Random Forest" width="90" x="112" y="34">
                    <parameter key="number_of_trees" value="100"/>
                    <parameter key="criterion" value="least_square"/>
                    <parameter key="maximal_depth" value="10"/>
                    <parameter key="apply_pruning" value="false"/>
                    <parameter key="confidence" value="0.1"/>
                    <parameter key="apply_prepruning" value="false"/>
                    <parameter key="minimal_gain" value="0.01"/>
                    <parameter key="minimal_leaf_size" value="2"/>
                    <parameter key="minimal_size_for_split" value="4"/>
                    <parameter key="number_of_prepruning_alternatives" value="3"/>
                    <parameter key="random_splits" value="false"/>
                    <parameter key="guess_subset_ratio" value="true"/>
                    <parameter key="subset_ratio" value="0.2"/>
                    <parameter key="voting_strategy" value="confidence vote"/>
                    <parameter key="use_local_random_seed" value="false"/>
                    <parameter key="local_random_seed" value="1992"/>
                    <parameter key="enable_parallel_execution" value="true"/>
                  </operator>
                  <connect from_port="training" to_op="Random Forest" to_port="training set"/>
                  <connect from_op="Random Forest" from_port="model" to_port="model"/>
                  <portSpacing port="source_training" spacing="0"/>
                  <portSpacing port="sink_model" spacing="0"/>
                  <portSpacing port="sink_through 1" spacing="0"/>
                </process>
                <process expanded="true">
                  <operator activated="true" class="apply_model" compatibility="9.3.001" expanded="true" height="82" name="Apply Model" width="90" x="112" y="34">
                    <list key="application_parameters"/>
                    <parameter key="create_view" value="false"/>
                  </operator>
                  <operator activated="true" class="performance_regression" compatibility="9.3.001" expanded="true" height="82" name="Performance" width="90" x="246" y="34">
                    <parameter key="main_criterion" value="first"/>
                    <parameter key="root_mean_squared_error" value="true"/>
                    <parameter key="absolute_error" value="true"/>
                    <parameter key="relative_error" value="true"/>
                    <parameter key="relative_error_lenient" value="false"/>
                    <parameter key="relative_error_strict" value="false"/>
                    <parameter key="normalized_absolute_error" value="false"/>
                    <parameter key="root_relative_squared_error" value="false"/>
                    <parameter key="squared_error" value="false"/>
                    <parameter key="correlation" value="false"/>
                    <parameter key="squared_correlation" value="false"/>
                    <parameter key="prediction_average" value="false"/>
                    <parameter key="spearman_rho" value="false"/>
                    <parameter key="kendall_tau" value="false"/>
                    <parameter key="skip_undefined_labels" value="true"/>
                    <parameter key="use_example_weights" value="true"/>
                  </operator>
                  <connect from_port="model" to_op="Apply Model" to_port="model"/>
                  <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
                  <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
                  <connect from_op="Performance" from_port="performance" to_port="averagable 1"/>
                  <portSpacing port="source_model" spacing="0"/>
                  <portSpacing port="source_test set" spacing="0"/>
                  <portSpacing port="source_through 1" spacing="0"/>
                  <portSpacing port="sink_averagable 1" spacing="0"/>
                  <portSpacing port="sink_averagable 2" spacing="0"/>
                </process>
              </operator>
              <operator activated="true" class="average" compatibility="9.3.001" expanded="true" height="82" name="Average" width="90" x="648" y="136"/>
              <operator activated="true" class="store" compatibility="9.3.001" expanded="true" height="68" name="Store" width="90" x="648" y="34">
                <parameter key="repository_entry" value="../Data/RF_Model_%{execution_count}"/>
              </operator>
              <connect from_port="example set" to_op="Select Attributes" to_port="example set input"/>
              <connect from_op="Select Attributes" from_port="example set output" to_op="Set Role" to_port="example set input"/>
              <connect from_op="Set Role" from_port="example set output" to_op="Windowing" to_port="example set input"/>
              <connect from_op="Windowing" from_port="example set output" to_op="Validation" to_port="training"/>
              <connect from_op="Validation" from_port="model" to_op="Store" to_port="input"/>
              <connect from_op="Validation" from_port="averagable 1" to_op="Average" to_port="averagable 1"/>
              <connect from_op="Average" from_port="average" to_port="output 2"/>
              <connect from_op="Store" from_port="through" to_port="output 1"/>
              <portSpacing port="source_example set" spacing="0"/>
              <portSpacing port="sink_performance" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
              <portSpacing port="sink_output 3" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Retrieve CUT_more" from_port="output" to_op="Loop Data Sets" to_port="example set 1"/>
          <connect from_op="Retrieve WOOD_more" from_port="output" to_op="Loop Data Sets" to_port="example set 2"/>
          <connect from_op="Loop Data Sets" from_port="output 1" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>

    Best regards,
    Joe