Limits on rows - Truncated data set - Missing values despite non-missing data

Radostina Member Posts: 5 Learner I
Hello, 

I am a newbie. I downloaded RapidMiner Studio Free (my understanding is that it comes with one month of Studio Enterprise) and signed up for RapidMiner Go on Saturday (April 3rd). I have watched all the Academy videos (in case you were going to send me there :)). I started playing around and ran into several problems.

Disclaimer - at the moment I am evaluating the head-to-head performance of RapidMiner, Weka, and Orange. Weka and Orange are free; RM Go at $10 a month is comparable.

My understanding is that a file under 50 MB with fewer than 500 attributes should be good to go on RapidMiner Go.

So, I have a file under 50 MB with 11 attributes and 542,919 rows. There is no missing data in the dataset - not a single cell.
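
For what it's worth, a check along these lines confirms the shape and the absence of missing values (a quick pandas sketch; the file name is a placeholder for my actual CSV):

```python
import pandas as pd

# Placeholder file name - the dataset I upload to RapidMiner Go.
df = pd.read_csv("dataset.csv")

# Expected shape: 542,919 rows and 11 attributes.
print(df.shape)

# Missing cells per column - every count comes back as zero.
print(df.isna().sum())

# Single overall check: True only if no cell anywhere is missing.
print(df.notna().all().all())
```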



I've run a classification on it, but unfortunately the data got truncated down to 120,000 rows.
I did not get any notification about this at any point.



I also did some predictive modeling, and I get a lot of MISSING values in the output (especially obvious when I predict on a testing set).

So far, this was all in Go.
For the first month of free RM Studio, I get Enterprise Studio, where there is no limit on anything (in theory). My computer has 16 GB of RAM (in case anybody asks - not bad, not great either).

When I execute the same modeling with AutoModel in Studio (I have a free Educational License, which comes with a month of Enterprise Studio), I get even more heavily truncated results - it all stops at 100,000 rows.



Even worse - there is not a single missing value in the dataset, but there are missing values in the prediction.
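
A quick way to double-check both issues would be to export the scored results to a CSV and inspect them the same way (a rough sketch; the file name below is a placeholder, and it assumes the results have been saved out of Studio):

```python
import pandas as pd

# Placeholder file name - the scored results exported from AutoModel.
results = pd.read_csv("automodel_results.csv")

# How many rows actually came back, versus the 542,919 rows in the source data?
print(len(results))

# Which columns contain missing values, and how many in each?
missing_per_column = results.isna().sum()
print(missing_per_column[missing_per_column > 0])
```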

So, my questions are: is the data truncated in Go? Is the data truncated in Studio? Why are there missing values in the predictions if there is not a single missing or NA cell in the dataset? How can I overcome this if I am to use RM in the future?

Thank you so much, 

Radostina

Answers

  • ceaperez Member Posts: 517 Unicorn
    Hi @Radostina
    To understand what is happening with the dataset in AutoModel, you can open the generated model and dive into the operators.
    It is possible that AutoModel performs some ETL operations on the dataset, depending on the selected models and parameters.

    Regards

  • Radostina Member Posts: 5 Learner I
    Thank you @ceaperez

    I understand that the ETL steps are the reason for the missing values being assigned, but the truncation - I have no explanation for that. I get the first 125,000 rows and the rest are gone.

    Also, when I apply the model to a new dataset (a testing dataset) in Go, the prediction results I get have nothing to do with the input test set I submit. It is like a mix and match of columns and rows, with lots of missing values spread around. This is unfortunate, because I like the concept behind supporting and contradicting predictions - that alone is the best feature in RM from what I have seen so far. (A sketch of the alignment check I have in mind is at the end of this post.)

    I definitely cannot trust the RM results for the time being - I will spend two more days exploring (until the Go trial expires), but it seems I will not commit to the platform for now. There is better control in WEKA/Orange/Genie.
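
    To illustrate the misalignment, here is roughly the kind of sanity check one can run outside RM (a rough sketch; the file names and the "id" column are placeholders for whatever unique key the test set actually has):

    ```python
    import pandas as pd

    # Placeholder file names - the test set I upload and the predictions I download from Go.
    test = pd.read_csv("test_set.csv")
    preds = pd.read_csv("go_predictions.csv")

    # Row counts should match if every submitted row was scored.
    print(len(test), len(preds))

    # Join the predictions back to the test rows on a unique key ("id" is a placeholder).
    # Rows marked "left_only" were never scored, or were scored against a row
    # other than the one that was submitted.
    merged = test.merge(preds, on="id", how="left", indicator=True)
    print(merged["_merge"].value_counts())
    ```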
  • ceaperez Member Posts: 517 Unicorn
    Hi @Radostina
    I understand your frustration, but in RM it is possible to track all the steps and operators used. My recommendation is to open the process proposed by the AutoModel feature and put a breakpoint after each operator in order to figure out what happened to your data and fix it.

    Can you share your model?

    Regards
  • Radostina Member Posts: 5 Learner I
    Thank you again, @ceaperez.

    Apologies for the stupid question, but how do I share the model here? 

    This is the process - see attached. The process should equal the model. 

    Best, 

    Radostina
  • ceaperez Member Posts: 517 Unicorn
    Hi @Radostina
    It's OK, I was referring to the process. I don't have access to the dataset, but I can see several data preprocessing operators in it. My recommendation is to investigate what happens in each operator by using breakpoints and reviewing the outputs.

    Regards
  • Radostina Member Posts: 5 Learner I
    Thank you so much, @ceaperez, for taking the time and suggesting how to deal with this - it is a steep learning curve for me right now.

    I heard you loud and clear in the previous comment when you suggested the breakpoints, and I am doing that now - hopefully I will figure out where the data gets mishandled.

    I still do not understand the truncation in Go and the mix and match of results, but for the time being I will play around. The more I understand the process/model, the better I will understand the modeling.

    Thanks for your input.

    best, 

    Radostina
  • ceaperez Member Posts: 517 Unicorn
    Hi @Radostina
    It's a pleasure; we are here to help each other. AutoModel is a great tool, but remember that you are the analyst and the one with the business knowledge. It's important to understand the dataset preprocessing and validate it.

    Good luck. 


  • Radostina Member Posts: 5 Learner I
    Thank you, @ceaperez :)