Considering Text Mining with RapidMiner Community Edition: Performance
I am considering doing a text mining project for my master's thesis, and I am looking at open-source software that will help me accomplish this.
RapidMiner looks like a good bet, as I am somewhat familiar with it from doing some predictive modelling already. My main concern at this point is the performance of the Community Edition, which is what I would probably be using, when text mining large datasets. I think the datasets will have millions or perhaps tens of millions of rows, each with a large free-text field of maybe a few thousand characters. My knowledge of text mining is fairly minimal right now, so I can't yet say much more about the exact preprocessing and text analysis I will be doing.
So I have some questions about performance:
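For a rough sense of scale, here is a back-of-envelope sketch of the raw text volume involved. The figures of 3,000 characters per row and 1 byte per character are my own guesses, not measurements, and the function name is just for illustration. It covers only the raw text, so the in-memory footprint inside any tool (tokens, word vectors, object overhead) would likely be larger.

```python
# Back-of-envelope size of the raw text alone (no tokens, no word vectors).
# Assumptions (guesses, not measured): 1 byte per character, 3,000 characters per row.
BYTES_PER_CHAR = 1
CHARS_PER_ROW = 3_000


def raw_text_gib(rows: int, chars_per_row: int = CHARS_PER_ROW) -> float:
    """Return the approximate raw text size in GiB for the given row count."""
    return rows * chars_per_row * BYTES_PER_CHAR / 1024**3


# Print the estimate for the low and high ends of the expected row counts.
for rows in (1_000_000, 10_000_000):
    print(f"{rows:>11,} rows -> ~{raw_text_gib(rows):.1f} GiB of raw text")
```

Under those assumptions the raw text alone is on the order of a few GB at one million rows and tens of GB at ten million rows, which is why I am asking about memory below.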
1. Is it feasible to expect the Community Edition to text mine data of that size?
2. I may have the opportunity to use a Solaris or Linux server at work with multiple CPUs and a large memory allocation. But will the Community Edition's algorithm implementations scale with the extra computing power?
3. I may also have the opportunity to connect to Vectorwise fast databases. But it looks like it is RapidAnalytics Enterprise Edition that takes advantage of Vectorwise, and I don't think I would have access to that on this project. So does the Community Edition deal with data in memory only? If so, is memory the major constraint?
4. I've seen some discussion threads about using RapidMiner in the Amazon cloud, but the discussion was inconclusive. Does anyone have thoughts on how RapidMiner might scale on a big-memory instance in the cloud? It is something I may look into. I couldn't use it for the main part of my project, as the data must stay within the organisation, but it might be useful for giving me some benchmarks. If it is much faster, that would be an interesting finding.
5. Are there any optimisation tips that might be useful?
I realise this is a long message! Any thoughts on these questions would be much appreciated!