RapidMiner in Amazon EC2

earmijo Member Posts: 270 Unicorn
edited August 2019 in Help
I've recently been running R programs in Amazon EC2. It is just fantastic. I love this model in which all you need is a terminal.

I saw a question here about RapidMiner and EC2 a while ago. Has anybody experimented with using both since then?

(I'm new to the whole cloud thing, but it would be nice if somebody made available an AMI capable of running RapidMiner.)

Regards,

\E.
Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    I heard from someone who already did this. In fact it was a talk on the last OSBI...
    If you find or create such an AMI, please inform us! Would be quite useful for everybody here.

    Greetings,
      Sebastian
  • earmijo Member Posts: 270 Unicorn
    I will, Sebastian. I've been playing with it for the last few days. At the beginning I went at a very slow pace because I'm a Windows creature and I was using Ubuntu (which is new territory for me). Last night I was able to put together a Windows machine (Windows 2008) and things went smoothly from there.

    Here's a quick sketch of what I did (I don't include the steps for getting an account with Amazon, getting the credentials, etc.):

    1) Launch a Windows 2008 Server image (for instance, Basic 64-bit Microsoft Windows Server 2008 (AMI Id: ami-d9e40db0) ). Make sure that you open the RDP port (3389) because that's how you are going to communicate with the Virtual Machine.

    2) Install Java (www.java.com/downloads)

    3) Install RapidMiner

    The whole process takes about 20-25 minutes depending on how quickly the instance is available. If you guys want a step-by-step guide, you can download the instructions at https://s3.amazonaws.com/mirlitus/RapidMinerAmazonEC2.pdf. The only things that I've left out are the creation of an account and of initial security credentials (I have had an account with AWS for 3 years now. I've forgotten what I did, but it is fairly easy.)
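
Steps 1 to 3 above are done by hand in the AWS console. As a rough sketch only, the launch request and the RDP rule from step 1 could also be written down as plain Python dictionaries. The key pair name, security group ID, and source IP range below are hypothetical placeholders, not values from this thread, and the AMI ID is simply the one quoted in step 1 (it may no longer be available).

```python
# Sketch only: builds the request parameters for step 1 above.
# The key pair, security group ID, and CIDR block are hypothetical placeholders.

RDP_PORT = 3389  # port opened in step 1 so Remote Desktop can reach the VM


def rdp_ingress_rule(cidr):
    """Security-group rule that allows RDP traffic from the given CIDR block."""
    return {
        "IpProtocol": "tcp",
        "FromPort": RDP_PORT,
        "ToPort": RDP_PORT,
        "IpRanges": [{"CidrIp": cidr}],
    }


def launch_params(ami_id, instance_type, key_name, security_group_id):
    """Keyword arguments for a request that launches exactly one instance."""
    return {
        "ImageId": ami_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "SecurityGroupIds": [security_group_id],
        "MinCount": 1,
        "MaxCount": 1,
    }


# AMI ID from step 1; everything else is made up for illustration.
params = launch_params("ami-d9e40db0", "m1.large", "my-key-pair", "sg-00000000")
rule = rdp_ingress_rule("203.0.113.7/32")

print(params)
print(rule)
```

Assuming the boto3 library is installed and credentials are configured (neither is covered in this thread), these dictionaries are shaped so they could be passed as `boto3.client("ec2").run_instances(**params)` and as an entry in the `IpPermissions` list of `authorize_security_group_ingress`. Java and RapidMiner (steps 2 and 3) would still be installed over Remote Desktop.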

    For benchmarking purposes, I ran a small program on the following machines:

    - Lenovo X201, 8 GB RAM, Windows 7, dual-core
    - Dell Precision T5400, 4 GB RAM, Windows XP, quad-core

    The program was simple: finding the best subset of variables to estimate a logistic regression using the operator (Optimization Brute Force Parallel).

    I mounted the image on the best machine available (26 "cores", 68 GB RAM). By the way, the download speeds are awesome (I downloaded Java and RapidMiner in a few seconds).

    Times:

    Lenovo (without Parallel): 28 min
    Lenovo (with Parallel): 14 min
    Dell (with Parallel): 11.5 min
    Amazon: 2.5 min
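
To make the comparison easier to read, the times above can be turned into speedup factors relative to the non-parallel Lenovo run. This is just arithmetic on the numbers in the list; choosing the Lenovo run without Parallel as the baseline is an assumption made here for illustration.

```python
# Run times in minutes, copied from the list above.
times_min = {
    "Lenovo (without Parallel)": 28.0,
    "Lenovo (with Parallel)": 14.0,
    "Dell (with Parallel)": 11.5,
    "Amazon": 2.5,
}

# Baseline chosen for illustration: the slowest, non-parallel run.
baseline = times_min["Lenovo (without Parallel)"]

# Speedup = baseline time / observed time.
speedups = {name: baseline / minutes for name, minutes in times_min.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

On these numbers the Amazon instance finishes roughly 11 times faster than the non-parallel laptop run and roughly 4.6 times faster than the quad-core Dell.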

    In the next few days, I'll try to put together the image and I will let you know. You have to understand that I'm still learning the whole thing, but it looks promising.

    Another possibility for you guys at RapidMiner is to talk to the guys at Bitnami (http://bitnami.org/). What they do is create images for open-source programs. One of the first examples I tried was mounting a Moodle server (this is a Course Management System). All I did was select the Bitnami image, and I had the server working in minutes.


  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    thank you for the information. Sounds really promising...

    Greetings,
      Sebastian
  • earmijo Member Posts: 270 Unicorn
    As promised, I put together an AMI with RapidMiner 5.0 installed and I just made it public. If anybody wants to test-drive it, here's the info you need (actually you only need the AMI ID, the rest are the details about the image):

    AMI ID: ami-c31bf0aa
    Name: windows2008-rapidminer5.0
    Description: Windows Server 2008 with Rapid Miner 5.0 Installed
    Source: 618748120321/windows2008-rapidminer5.0
    Owner: 618748120321
    Visibility: Public
    Architecture: x86_64
    Platform: Windows
    Root Device Type: ebs
    Root Device: /dev/sda1
    Image Size: 30 GiB
    Virtualization: hvm

    One last thing: The password to access the Windows Server is 'mirlitus'. Change it as soon as you log in.

    As you can see, it is a Windows 2008 version (these images are a little more expensive because Microsoft has to be paid). If there is enough interest, I could prepare an image on Ubuntu, where hourly charges drop significantly.

    I installed Java, Firefox, and OpenOffice too. I'm planning a second image that will have other programs installed, namely R, text editors, etc.

    You can run this machine on instances from m1.large ($0.48/hour regular price, or about $0.22 in the spot market) to m2.4xlarge ($2.88/hour regular price, or about $1.20 in the spot market).
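
As a quick back-of-the-envelope check, the prices quoted above can be compared directly. The sketch below only uses the four hourly prices from this post; the 8-hour session length is a made-up example, and actual billing rules are not covered here.

```python
# Hourly prices in USD as quoted above (regular vs. approximate spot price).
prices = {
    "m1.large": {"regular": 0.48, "spot": 0.22},
    "m2.4xlarge": {"regular": 2.88, "spot": 1.20},
}


def spot_saving(regular, spot):
    """Fraction of the regular hourly price saved by paying the spot price."""
    return (regular - spot) / regular


def session_cost(hourly_price, hours):
    """Cost of keeping one instance running for the given number of hours."""
    return hourly_price * hours


HOURS = 8  # hypothetical session length, for illustration only

for name, p in prices.items():
    saving = spot_saving(p["regular"], p["spot"])
    regular_total = session_cost(p["regular"], HOURS)
    spot_total = session_cost(p["spot"], HOURS)
    print(f"{name}: spot saves about {saving:.0%}; "
          f"{HOURS} h costs ${regular_total:.2f} regular vs ${spot_total:.2f} spot")
```

With these figures the spot price works out to a bit less than half of the regular price for both instance types.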

    This is as far as my skills can take me. There is one type of instance I haven't been able to work with (cc1.4xlarge). It seems to be the most promising one, since you can cluster them at will. Amazon recently put together a cluster of 880 units that placed 145th among the Top 500 supercomputers in the world (see http://www.zdnet.com/blog/btl/amazon-web-services-tackles-high-performance-computing-instances/36632). But I'm not a computer scientist... :-(

    Again, if anybody is successful playing with those instances, please share the info with us. I'm very interested. Likewise, if I can be of any help to any of you with the lesser instances, I'll be glad to help.

    I'm curious about RapidAnalytics and the possibility of running it in Amazon EC2. How can I get my hands on the Community version?
  • pop Member Posts: 21 Maven
    This is fantastic!! Thank you very much for sharing.
  • wessel Member Posts: 537 Maven
    Screenshot?
  • clemens Member Posts: 2 Contributor I
    https://sites.google.com/site/rapidminerscreenshots/textclustering
    Better late than never ;)
    Does anyone have a use case along the lines of: Mac OS X remote - EC2 Windows 2008 Server, RM5, DisPaRe & GridGain & Amazon Elastic MapReduce?
    Please share your experience.
  • wessel Member Posts: 537 Maven
    I had hoped you would take a screenshot of something that shows it running really fast.
    My experience is negative, I guess. I could not get RapidMiner to run faster on Amazon than on my home PC.
    My home PC is only a 2.667 GHz i7.
  • clemens Member Posts: 2 Contributor I
    It depends on the algorithm used to solve the problem, its complexity class, and whether a reduction in complexity is possible.

    The paper "Distributed Pattern Recognition in RapidMiner" is imho very helpful:
    "However, it has only limited support for parallelization and it lacks functionality to spread long-running computations over multiple machines. A solution to this is distributed computing with paradigms like MapReduce. In this paper, we present a system called DisPaRe, which integrates distributed computing frameworks into RapidMiner." (quoted from the paper)

    e.g. k-means could be "easily" scaled and "run faster" ...

    cheers
  • earmijo Member Posts: 270 Unicorn
    Wessel:

    I agree with Clemens. It depends on the type of job you are running and on the type of Amazon instance you choose.

    Jobs that can take advantage of parallelization (like the cross-validation operators or feature selection, both in their Parallel versions of course) will run much faster in the cloud.

    Jobs that consume a lot of memory would also benefit from the cloud.

    Now, Amazon offers different types of instances. The really cheap ones are not going to beat your laptop; others will. The best machine I've used is the equivalent of an 8-core with 68 GB of memory, but it costs $2.8/hour.
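
One common way to reason about why the type of job matters is Amdahl's law. Nobody in this thread cites it by name; it is brought in here purely as an illustration. The parallel fractions below are made-up values, not measurements from RapidMiner, and the worker count of 8 simply mirrors the "8-core equivalent" mentioned above.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when only part of a job can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


WORKERS = 8  # assumption: mirrors the 8-core equivalent instance

# Hypothetical parallel fractions, for illustration only.
for fraction in (0.50, 0.90, 0.99):
    bound = amdahl_speedup(fraction, WORKERS)
    print(f"{fraction:.0%} parallel on {WORKERS} workers: at most {bound:.2f}x")
```

Under this simple model, a job that is only half parallel gains less than 2x even on 8 workers, which is one possible reading of why some processes do not run noticeably faster on a bigger instance.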

    \E
  • rakirk Member Posts: 29 Contributor II
    What you guys have done is really cool. I've been thinking about doing the same thing and, as is always the case, someone else has already figured it out for me ;D

    Has anyone tried running RapidAnalytics on an Amazon-type instance? It would be great to have an instance so real-time applications could be developed for web applications, etc.
  • earmijo Member Posts: 270 Unicorn
    Yes, I have. I have an image with RapidAnalytics installed on CentOS 5.5, but it's not a public image (it has my databases installed).

    But I was surprised by how easy it is to configure (believe me, I'm not an expert in computers by any standard).

    Steps:

    1) Spin up a Linux machine of your choice (I used CentOS, as mentioned above, because I wanted the fastest machines with the largest memory)
    2) Install Java
    3) Install MySQL
    4) Follow, step by step, the instructions that come as documentation for RapidAnalytics
    5) That's it.

    One trick is needed: you have to change the machine's hostname. See this post: http://rapid-i.com/rapidforum/index.php/topic,2930.0.html.

  • rakirk Member Posts: 29 Contributor II
    @earmijo: interesting. I've been thinking about making the switch to Ubuntu; perhaps I'll have to reconsider, since speed is everything.

    Does anyone know if the community license for RapidAnalytics allows for commercial use? So far I've just been doing a lot of research.
  • earmijo Member Posts: 270 Unicorn
    Rakirk: You can also spin up a Windows machine, but they are more expensive. If you want to have it running for long periods of time, switching to Linux is advised.

    About commercial use: My understanding is that you can use the community version for commercial purposes. However, you don't get the technical support you would get from Rapid-I. If you want that support (and I can imagine that it has to be superb, since the folks at Rapid-I are so nice to those of us who use the community version), the enterprise version is a good option. The enterprise version also gives you additional functionality not present in the community version.
  • wessel Member Posts: 537 Maven
    If you work for a commercial company, money should not be a problem.
    They will make you an offer precisely tailored to your needs.
    So you pay only for what you need.
  • kh83 Member Posts: 1 Contributor I
    So RapidMiner holds the database in memory to give the algorithms faster runtimes, as I understand it. Let's say my database is much larger than the memory of any commercial desktop. Is the Amazon cloud a solution to this? I understand that not all algorithms scale well, but what I want to know is whether it can at least get the job done.

    Even if the (virtual) internal memory is large enough to load the database, could it be that the data can't be processed unless the RapidMiner algorithms are updated for parallelization? Or will my BIG database get loaded and processed by any of RapidMiner's algorithms, assuming that the algorithms really see the cloud's internal memory as one big chunk of memory?

    Anyone? My knowledge of parallel computation is limited.