No worries - we all started at some point
The wordlist is actually the final result of the text processing operators, i.e. after you did all the necessary text processing like tokenization etc. All those steps happen "inside" of the text processing operator (do you see the little icon in the bottom right corner of the operator? This indicates that this is an operator in which you can go "inside" with a double click).
I think it is probably easier if you follow along one of the following videos (there are tons more if you search on Google):
So what is the point of the wordlist then? It makes sure that you use exactly the same words (and only those) for scoring as for training. This is something which is actually kind of annoying to get right in R, for example, which is why I really prefer to do text analytics in RapidMiner...
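To see why reusing the wordlist matters, here is a rough Python/scikit-learn sketch of the same idea (this is an analogy, not what RapidMiner does internally): the vocabulary is built once from the training texts and then reused for scoring, so both sets end up with exactly the same columns.

```python
# Sketch of the "wordlist" idea using scikit-learn as an analogy:
# fit the vocabulary on training data, reuse it for scoring.
from sklearn.feature_extraction.text import CountVectorizer

train_texts = ["the cat sat", "the dog barked"]
score_texts = ["a new cat appeared"]   # contains words unseen in training

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)   # builds the "wordlist"
X_score = vectorizer.transform(score_texts)       # reuses it: same columns

print(X_train.shape[1] == X_score.shape[1])       # identical feature space
```

Words that only appear at scoring time ("new", "appeared") are simply dropped, which is exactly what keeps the feature spaces compatible.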
Actually, it sounds like you're doing it right. What I tend to do before building the model is split my data into a Training set and a Test set.
So you have 3 datasets: your original data, the training set, and the test set.
Try this now and look at the results. Great, right?
However, how can you be really sure you can 'trust' your model? You've only tested it once; maybe it just got 'lucky' and in reality it won't perform as expected.
There are various ways to ensure you can trust your tested model performance, so after you've tried out the Split Validation I'd like you to read this series of 4 blog posts by @IngoRM and download the repository with sample processes.
Let us know here how you get on!
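The two validation styles discussed above can be sketched in a few lines of Python/scikit-learn (again as an analogy to RapidMiner's Split Validation and Cross Validation operators, not their implementation): a single hold-out split gives one performance number that might be lucky, while cross-validation averages over several splits.

```python
# Sketch: single hold-out split vs. cross-validation.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# "Split Validation": one train/test split -> one performance number
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("single hold-out accuracy:", model.score(X_te, y_te))

# "Cross Validation": 10 splits -> 10 scores, report the average
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("10-fold mean accuracy:", scores.mean())
```

The spread of the 10 fold scores also tells you how much the estimate varies, which a single split cannot.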
There are several ways to handle this situation, but perhaps the easiest thing to do would be to save your first process as "data ETL" or something similar.
Then create a separate process for doing data ETL on your test data, and from that process you simply load the test data (however that is done, via files or db connection) and then call the original ETL process from your repository using the "Execute Process" operator. As long as the test data starts in the same raw format as your original data, this will work fine. And you can also use that same ETL process in the future to transform unlabeled data.
Under this approach, you will only have to maintain the one version of your ETL process, so if you add to it or update it in the future, you don't need to worry about replicating those changes elsewhere. The "Execute Process" operator will always retrieve the most current version of that process to apply.
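If it helps, the "Execute Process" pattern is the same idea as factoring shared logic into one function that several pipelines call. A minimal Python analogy (the function and column names here are made up for illustration):

```python
# Analogy to the "Execute Process" pattern: one ETL routine lives in a
# single place, and the training, test, and future scoring pipelines
# all call it, so any fix or addition is made exactly once.

def etl(raw_rows):
    """Shared ETL: identical cleaning for every dataset."""
    return [{"value": float(r["value"]), "name": r["name"].strip().lower()}
            for r in raw_rows]

train_raw = [{"value": "1.5", "name": " Alice "}]
test_raw  = [{"value": "2.0", "name": "BOB"}]

train = etl(train_raw)   # training pipeline calls the shared routine
test  = etl(test_raw)    # scoring pipeline calls the very same routine

print(train[0]["name"], test[0]["name"])
```

Just as with "Execute Process", updating `etl` once updates every caller.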
Could you attach a sample process? When I try to log inside the CV operator, it works just fine. There was a minor patch for performance criteria in 7.5.3, so maybe that is the root cause. Which version are you using?
I do not think that this has anything to do with PC vs. Mac, to be honest. The way this works is that RapidMiner keeps things in memory as long as this memory is not needed for anything else. As long as something is still in memory, you can get the intermediate result from the port. But if the memory is needed for something else, then those intermediate results are freed up. Only those results which are connected to the result ports on the right are guaranteed to stay.
And I think this is what happened: on the Mac you checked one of the ports after the intermediate result had been deleted, while on the PC it was still there...
By the way, handling caching this way ensures that processes are not slowed down by constantly writing all intermediate results (which you might never look at) to disk.
Hope this helps,
You should see this if you are using version 6+
So a couple of things, that video of mine is available on my YouTube channel here: https://www.youtube.com/watch?v=UmGIGEJMmN8&t=2s
and with respect to power consumption, you might want to check out this paper on using RapidMiner and SVMs to forecast electricity consumption. It starts on page 46 or thereabouts. I would add your weather data as an attribute and then use your power consumption as the label.
From there, build a process like this that loads and ETLs your data, and use a Windowing and a Sliding Window Validation operator. Insert an SVM set to the RBF kernel and then optimize the gamma and C parameters.
For a sample process you can try this process in this thread: http://community.rapidminer.com/t5/RapidMiner-Studio-Forum/Financial-Time-Series-Prediction/m-p/3345...
Like I said in my above reply, just set system_load to a label in the (first) set role. Then it all works.
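The windowing + sliding validation + RBF-SVM recipe above can be sketched in Python/scikit-learn (the series here is synthetic and stands in for your `system_load` label; this is an illustration of the technique, not the RapidMiner process itself):

```python
# Sketch: turn a series into fixed-size windows, validate on
# time-ordered splits, and grid-search the RBF-SVM's gamma and C.
import numpy as np
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit
from sklearn.svm import SVR

rng = np.random.default_rng(0)
load = np.sin(np.linspace(0, 20, 300)) + rng.normal(0, 0.1, 300)  # fake load series

# "Windowing": each example = the last 5 values, label = the next value
w = 5
X = np.array([load[i:i + w] for i in range(len(load) - w)])
y = load[w:]

# Time-ordered validation splits instead of shuffled folds
grid = GridSearchCV(SVR(kernel="rbf"),
                    {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1]},
                    cv=TimeSeriesSplit(n_splits=4))
grid.fit(X, y)
print("best params:", grid.best_params_)
```

Using `TimeSeriesSplit` matters for forecasting: it always trains on the past and tests on the future, which is the point of sliding validation.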
Thank you for your fast response. I didn't know it was possible to populate a process from XML.
About the syntax provided: I am not getting the desired output. I'm struggling to construct the third argument, whereby I would like the generated attribute (dot_zero) to take the value of [temp] for that particular example if ".0" is not found within the [temp] attribute value.
Could you please tell me if this is even possible with the if function expression in RapidMiner?
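If I understand the goal, the "else" branch should simply carry the original [temp] value through. Here is the logic as a small Python sketch (the "then" branch below is just a placeholder transformation, since the question only specifies the else case):

```python
# Sketch of the desired Generate Attributes logic: dot_zero takes a
# transformed value when ".0" is found, otherwise it keeps [temp] as-is.
def generate_dot_zero(temp: str) -> str:
    if ".0" in temp:                      # "then" branch: substring found
        return temp.replace(".0", "")     # placeholder transformation
    return temp                           # "else" branch: keep [temp]

print(generate_dot_zero("21.0"))  # -> "21"
print(generate_dot_zero("21.5"))  # -> "21.5"
```

In RapidMiner's expression language this shape is (if memory serves) `if(contains([temp], ".0"), <then-expression>, [temp])`, i.e. the third argument is just `[temp]` itself.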
I believe RapidMiner will randomly sample the 10k rows from your 20k data file. To test this, I would add a sequentially numbered column (e.g. 1, 2, 3, ... 20k) in the spreadsheet, load that in, set it as an ID, and then see if the rows are randomly selected.
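The check described above can be sketched in pandas (the `id` column name and the sampling itself are illustrative, not what RapidMiner does): if only the first 10k rows were kept, the largest surviving id would be 10000, whereas a random sample almost surely contains ids above that.

```python
# Sketch of the ID-column check: which ids survive a random 10k sample?
import pandas as pd

df = pd.DataFrame({"id": range(1, 20001)})     # 20k sequentially numbered rows
sample = df.sample(n=10000, random_state=42)   # a random 10k draw

# "First 10k rows" would cap the max id at 10000;
# a random sample will include ids well above it.
print(sample["id"].max() > 10000)
```

So after loading in RapidMiner, just look at the min/max of the ID column in the result to tell the two behaviours apart.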