If you are an experienced user of RapidMiner, chances are that you already noticed that RapidMiner is a golden hammer. Whatever kind of problem in the area of Data Science you are facing, the probability is high that you can create a solution with RapidMiner from the top of your head. And often enough the harsh project reality shows that a golden hammer is direly needed. The only alternative to a golden hammer would be to use hundreds of different tools, but, figuratively speaking, they would make your tool belt so heavy that you will only be able to crouch through your project.
And sometimes you not only need one golden hammer, but many of them: Many standard projects nowadays involve a amounts of data that does not yet justify the usage and overhead of real Big Data technologies, but still requires a special approach to be executable on commodity hardware in memory. This is exactly where our extension fits in: It will turn your golden hammer RapidMiner into an easy to use, golden Jackhammer, hammering away a lot of your daily grieves.
In the following sections we will outline a few areas where the extension will help you. We will close this article with an exhaustive list of operators. From time to time we want to come back, update the list as we are constantly improving the extension.
In RapidMiner you organize your projects in repositories that you are free to structure as you like. From years of experience with seeing a lot of bad ideas, we at Old World Computing have developed a standardized best-practice for repository design. We believe that avoiding unnecessary diversity helps focusing on the real problems and passing a standardized project structure to a fellow colleague also makes it much easier for him to understand what is where...
Another advantage of this standardized layout is that you can automate several tasks that become important in productive projects:
However, beside the standard structure you need tool support to implement that. The Jackhammer extension brings new operators that allow you to solve these tasks. We found, that another important point for larger scale projects, especially if involving a full team of data scientists, is to ensure the re-usability and control of processes. So we added specific operators to handle
Ever copied a set of operators to the beginning of a process? Ever sweared about having to shift the entire process AND all notes to have a nice, readable process design?
Seems to be tiny but once you got used to it, you will never want to live without the Shift Operator action:
Insert or paste a set of operators, mark it, select Shift Operators or press the hot key, and the remaining process will be shifted automatically to clear the space for the new operators. This includes all notes added to the process!
Of course there are also multiple operators included making life easier in many situations. A very common pattern is to generate data randomly and then discarding all generated data just to get an empty example set to initiate loops, etc. Our extension now provides an easier to read, one operator solution. Same for generating number sequences, group index sequences, etc.
Other operators allow the same operation as already present in core RapidMiner, but now for multiple attributes at the same time, saving the complexity of loops.
The Jackhammer's caching functionality can be useful in many situations, as well during development as deployment. Let's imagine you have external resources that take a long time to load like for example large databases being queried by your process. Especially during the modelling phase of a project you will need to load the data very often while it doesn't change at all. With the new operators you can simply put a cache operator around the part of the process loading and transforming the data. The cache operator will store the results of the subprocess in memory and if too big on a local disk. If the process is re-executed the cache operator will check if the cache is still valid or if the generating process has been changed. In case it is still valid and not too old, the cached version is returned immediately and the process continues.
While this allows you to concentrate on the modelling by eliminating unnecessary waiting times, it becomes even more valuable in a deployment scenario. Let's assume that you are doing two queries from different databases for a process monitoring some machines. The one database contains the new signals being sent from the machines, while the other one contains meta information for this particular machine that is relatively stable over time. You can avoid the additional runtime caused by the second query by putting it into a cache. The cache can be parameterized to be sensitive to process variables / macros, so that it will store different results for different machines. And if the process time is not important to you, well, it's most probably important to your database admin.
Especially in a web service deployment that can be called thousand of times per minute, you do want to eliminate anything that delays the execution. That not only includes database access but also any access on the repository (and hence into the underlying database). If you have a scoring process with a large model like a k-NN or SVM, you save a lot of time and database load with placing it within a cache operator. As it is time restricted, any update of the model will be reflected after a limited time or you can use one of the cache control operators to empty the cache when a new model is trained.
If you are creating a web application with RapidMiner server, the caching becomes invaluable to improve responsiveness, too. For that particular use case it is possible to access the cache filled in one process from also different processes while ensuring user rights being obeyed.
The Jackhammer extension contains an additional set of operators for flexible loops. In contrast to the ones from the RapidMiner core they do not suffer from compatibility requirements to older versions. Hence they not only all share the same design principles so that all have the same set of ports, but they also contain every necessary port to avoid any complex setup. Additionally they can be controlled from within to prematurely terminate the loop or skip single iterations. They also all support parallel processing, speeding up the calculations especially for larger data sets. They also allow to process more data in memory by processing it in chunks.
Every loop operator has input ports on the outside. What is delivered there will be forwarded to the inner input ports of the subprocess and remains there as copy at every iteration. If a loop continuously works on one object and changes it at every iteration, this object can be forwarded to one of the loop output ports. Their results will be forwarded to the loop input ports in the next iteration. As this usually requires an initialization, e.g. passing an empty example set, the loop input ports are initialized with the input ports' objects in the first iteration. Results are collected for the inner output ports and forwarded for the inner loop results. Both are available on the outside.
Additionally the loops may have an optional port for their specific purpose. E.g. the Loop Files (Advanced) Operator can receive an ExampleSet containing file names. It will loop over them, instead of searching a parameter specified directory.
The following looping operators are available. All Advanced versions duplicate the core functionality and add the features as described above:
Collections in RapidMiner are very useful in situations where you repeat a particular task several times. In a monitoring scenario we might have hundreds of machines in dozens of factories, in a churn related application we might have thousands of customers, ...
It is a standard procedure to loop over the particular entities we are analyzing. Usually you either store the results for each entity in RapidMiner's repository or if you output data, you append the resulting collection. Both approaches have draw backs in very large settings with thousands of entities: If you store too many entries in the repository, it will become slow and as well management as deployment becomes a hell due to a large per entry overhead. If you combine data in one large table and you just want to access the data for a particular entity, you need to load all and filter all, which is quite inefficient unless you go through external infrastructure like a database. The later approach unfortunately only applied to data and not to complex objects like prediction models.
Our indexed collections solve both tasks: like in the standard collections you can simply collect all objects with them, but instead of linearly chaining them, you need to assign one more more keys to each added objects. These values can be used to later retrieve the object again. So if your entities have identifiers like factory place and machine id or customer number, you simply get the entry for these values. This way you only need to store one object which is way more efficient. If loading is eliminated with a Cache operator the access is instantaneous so that you don't need to filter again.
The extension also contains a special operator to loop over these collections with the key's being made available as process variable / macro. This can be nicely used to report on model performance for every single entity, etc. Beside the loop there are also single operators to access the index for single objects, which can be helpful in web service or web app deployment.
The Jackhammer extension contains a huge set of operators, that address different parts of the daily life of a data scientist. Here's an extensive list of operators:
Professional consulting for your Data Science problems