150+ data source connectors to RapidMiner - made simple with DataVirtuality
Like many RapidMiners, I have been on the hunt for a "one size fits all" tool to bring in a wide array of 3rd party data sources, quickly and simply. Yes, we do have a few really slick custom operators such as Twitter, Zapier, Dropbox, Aylien, Rosette, and several others. But there are so many more!
Well, I recently discovered a NICE solution for users willing to spend $$ to make this easy: DataVirtuality. Their bread and butter is being the middleman between 3rd party data sources and BI tools. They have over 150 "connectors" out of the box, such as Google Analytics, Google AdWords, LinkedIn, Facebook, Salesforce Pardot, and many more. So I reached out to them to see if we could connect their service to RapidMiner. Well yes, you can, and it is rather slick.
The first thing you need to do is create an account with DV. I will not publish their pricing here, but suffice it to say that it is not cheap. If you are on a budget or have no budget at all, I'd recommend stopping your reading here. You'll see how easy this whole thing is, and then cry when you realize you likely cannot afford it. But if you have a decent budget, read on...
After you sign up, you download their local client, called "DVStudio", and connect your data sources to their system. [What is "their system"? Basically it's your own AWS RDS SQL instance. DV creates "pipes" that connect these 3rd party sources to your DV-provisioned RDS instance, maintains it, and so on. You can also do this on-premises but I don't think it's any cheaper.] Most data sources have a wizard in DVStudio to walk you through the setup. The wizard generates a SQL query and once you press go, you should be set.
I tested this by connecting RapidMiner's own Google Analytics account and I will say that everything was very simple from my end EXCEPT for dealing with Google Cloud IAM nonsense. When you're done, it looks like this:
DVStudio's main screen
Note that if you click on a schema, all you're really seeing is a pre-made SQL script. Here's one they have done for Atlassian JIRA to see all unresolved support tickets:
DV Studio's "virtual schema" for loading Atlassian JIRA unresolved tickets
That's all you need to do from the DVStudio side.
RapidMiner Studio (or Server)
Once you have set up DVStudio to your liking, you move over to RapidMiner. DV does not have custom operators (yet...) for RapidMiner; it connects with RapidMiner via their own JDBC driver. Put the DV JDBC driver in the RapidMiner jdbc folder, boot up RapidMiner Studio, and you're almost there.
Next, go to Connections -> Manage Database Drivers and then Manage Database Connections to set up the driver and credentials according to their instructions (see attachment below). It should all look like this:
RapidMiner database connection for DataVirtuality
If your test is OK, you should be able to open the database in the Repository window:
DataVirtuality's "pipes" in RapidMiner Studio
Then it's simply a matter of dragging and dropping any of those data sources into your process window, and away you go!
A sample DV YouTube "pipe" in RapidMiner Studio...
...and the data in the Results panel
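If you want to sanity-check the same JDBC connection outside RapidMiner, something like the sketch below may help. It is only an illustration under several assumptions: the URL pattern, the default port 31000, the database name "datavirtuality", and the driver class name are my best understanding of DV's JDBC conventions and should be checked against the instructions DV gives you. The names `DvConnection` and `fetch_rows` are made up for this example, and `jaydebeapi` is a third-party Python package (it needs a local JVM), not something DV or RapidMiner ships.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DvConnection:
    """Hypothetical holder for the fields you type into the connection dialog."""
    host: str
    port: int = 31000                  # assumed default port; confirm with DV
    database: str = "datavirtuality"   # assumed virtual database name
    use_ssl: bool = False

    def jdbc_url(self) -> str:
        # Assumed pattern: jdbc:datavirtuality:<database>@mm[s]://<host>:<port>
        protocol = "mms" if self.use_ssl else "mm"
        return (
            f"jdbc:datavirtuality:{self.database}"
            f"@{protocol}://{self.host}:{self.port}"
        )


def fetch_rows(conn_info, user, password, driver_jar, sql, limit=5):
    """Run `sql` through the DV JDBC driver and return up to `limit` rows."""
    # Imported here so the URL helper above works without the package installed.
    import jaydebeapi  # third-party: pip install jaydebeapi

    connection = jaydebeapi.connect(
        "com.datavirtuality.dv.jdbc.Driver",  # assumed driver class name
        conn_info.jdbc_url(),
        [user, password],
        driver_jar,  # path to the same driver jar you gave RapidMiner
    )
    try:
        cursor = connection.cursor()
        try:
            cursor.execute(sql)
            return cursor.fetchmany(limit)
        finally:
            cursor.close()
    finally:
        connection.close()
```

Calling `DvConnection(host="dv.example.com").jdbc_url()` gives you the string to compare against the URL field in RapidMiner's connection dialog. `fetch_rows` would then take that object, your DV credentials, the jar path, and any SELECT against one of the schemas you see in the Repository window.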
I'd like to give special thanks to DV engineers Amber Beebe and Matthias Korn, and of course DV CEO Nick Golovin for taking the time to create special RapidMiner instructions and showing me the ropes. Happy DV'ing!