RapidMiner

RM Certified Expert
‎05-18-2017 02:16 PM
153 Views
1 Comment

If you happen to work at WeWork Tysons in VA or in the general Washington DC area, come down and say hi! I'm going to be speaking at the Spark DC Meetup group this coming Tuesday (5/23) at 6PM. You can get more Meetup details here.

  • Meetup
  • radoop
  • Spark
RM Certified Expert
‎05-15-2017 10:53 AM
258 Views
0 Comments

Last week I started the process of hiding a few Easter Eggs inside the community. I can't tell you what they are but there are instructions for you to execute when you find them. Good luck!

  • Easter Eggs
RM Staff
‎05-12-2017 10:48 AM
303 Views
0 Comments

People often think a given model can just be put into deployment forever. In fact, the opposite is true. You need to maintain your models like you maintain a machine. Machine learning models can drift or break over time. Does this sound odd because they have no moving parts? Then you might want to take a close look at change and drift of concept.

Change of Concept

Let’s start off with an example. If you try to build a predictive maintenance model for an airplane, you often create columns like

Error5_occured_last_5_mins

as an input for your model. But what happens if error number 5 is not error number 5 anymore? Software updates can drastically change the data you have: they fix known issues, but they may also encode your data in a different way. If you feed post-update data into your pre-update model, it will do something, but not what you expect. This phenomenon is called change of concept.

Drift of Concept

A very similar phenomenon is drift of concept. This happens when the change is not drastic but emerges slowly. An industrial example is the encrustment of a sensor: it builds up over time, and a measured 100 degrees is no longer really 100 degrees. An example in customer analytics is the adoption of new technology. People did not all start using iPhones at once, but adopted them slowly. A column like “HasAnIphone” indicated a very tech-savvy person in 2007; today it indicates an average person.

What Can I Do?

An example of window-based relearning: the pattern to detect (circles) moves over time, and only recent data points are included to build the model.
 

A common approach to overcoming concept drift is window-based relearning (a minimal sketch follows). ... Read more on my medium.com page
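To make this concrete, here is a minimal sketch of window-based relearning in Python, assuming scikit-learn; the class name and window size are illustrative, not from the original post:

    from collections import deque
    from sklearn.linear_model import LogisticRegression

    class WindowedModel:
        """Relearn on the most recent `window` examples so drifted data ages out."""
        def __init__(self, window=1000):
            self.buffer = deque(maxlen=window)  # oldest examples fall off automatically
            self.model = LogisticRegression()

        def observe(self, x, y):
            self.buffer.append((x, y))
            X, Y = zip(*self.buffer)
            if len(set(Y)) > 1:                 # need both classes before fitting
                self.model.fit(list(X), list(Y))

        def predict(self, x):
            return self.model.predict([x])[0]

In practice you would refit periodically rather than after every single example, and tune the window size to how fast the concept drifts.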

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
RM Certified Expert
‎05-11-2017 08:25 AM
271 Views
0 Comments

Welcome to another edition of the Data Science Link Roundup! In today's post we talk GPS Coordinates, JSON, and much more!

 

From the Community 

  1. CausalityvsCorr starts a great thread on Clustering GPS Coordinate Data!
  2. Uenge-san gets a solution to their Join question with multiple timestamps with the help of some Groovy Script!
  3. Mschmitz works with wirtcal to solve a RapidMiner and JSON table problem.
  4. Did you know you can see the results of each fold in a Cross Validation?
  5. Can’t download from the Marketplace? Could it be your proxy? Check out this article for more info!

Out on the Interwebz

  1. Martin muses how a Human-AI interaction can beat deep learning!
  2. How does the TensorFlow team handle open source support?
  3. Visit KD Nuggets and vote!
  4. Do you like free programming books? I sure do. O’Reilly is giving away a bunch!
  5. Will Data Science eliminate Data Science? Good question, great read.

As always, take it to the limit!


  • Data Science
  • links
  • Roundup
RM Certified Expert
‎05-08-2017 08:43 AM
181 Views
0 Comments

RapidMiner needs your help! Every year KD Nuggets runs a poll asking which analytics software you've used over the past year, and we'd love it if you voted for us!
 
It takes only a minute and you can check off other software that you used too!
 
 
Note: they'll send you a confirmation email to make sure you're not gaming the poll, so keep an eye out for it.
 
Thanks so much!
  • 2017
  • KD Nuggets
  • Poll
RM Certified Expert
‎05-04-2017 09:57 AM
308 Views
0 Comments

RapidMiner's Educational group just released 3 awesome videos on Model Selection, Optimizing Models, and Auto Model Selection and Optimization! These are a must-watch and should get your mind racing with possibilities.

 

 

 

  • Automation
  • Model Selection
  • Optimization
  • ROC
RM Certified Expert
‎05-01-2017 08:34 AM
726 Views
0 Comments

By: Edwin Yaqub, PhD

 

At RapidMiner Research, we are addressing problems that are becoming increasingly pertinent to businesses. As part of the German research project DS4DM (http://ds4dm.de), we have now released the ‘Data Search for Data Mining’ extension, which provides data enrichment capabilities in RapidMiner.

 

Motivation:

 

Data analysts are increasingly confronted with situations where the data they need for a data mining project exists somewhere on the web or in an organization’s intranet, but they are not able to find it. On the web, data is generally searched for via search engines using keywords or text; this is unstructured search. Where structured data exists, e.g. in the form of a table, structured and contextualized search is possible. The objective is to enrich an existing table with additional data by harnessing diverse data sources in an efficient manner. In the literature, this topic is often referred to as Entity Augmentation or Search-Join [1,2]. Search-Joins are useful within a wide range of application scenarios. For example, given a dataset containing attributes like the name, GDP, and region of a country, we would like to enrich the dataset by:

 

  • Searching for relevant datasets that contain an attribute of interest, e.g. the language spoken in a country or the currency used there.
  • Integrating the new attribute into our original dataset, either by automatically filtering it out of potentially large candidate datasets or by allowing a human to manually refine the integration.

 

The ‘Data Search’ extension implements both of these capabilities and thus brings the Search-Join data enrichment method to RapidMiner.
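As a toy illustration of the Search-Join idea (plain pandas, not the extension's API; all table contents below are made up for the example):

    import pandas as pd

    # The query table we want to enrich
    query = pd.DataFrame({"country": ["Germany", "France", "Japan"],
                          "gdp": [3.4, 2.4, 4.9]})

    # A candidate table found by the search step (subject column: country)
    candidate = pd.DataFrame({"country": ["Germany", "France", "Japan"],
                              "language": ["German", "French", "Japanese"]})

    # The join step integrates the new attribute into the original dataset
    enriched = query.merge(candidate, on="country", how="left")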

 

Besides the subject matter, this post also shows that Java developers can reuse RapidMiner libraries to customize visualizations and add GUI panels and controls to their extensions to suit their needs.


Data enrichment through the Search-Join method

 

The Backend: For the search function, the extension uses a Search-Join data server at the backend. This is developed by our project partner, the University of Mannheim (Data and Web Science group). The backend comprises a corpus of heterogeneous data tables, which are indexed and stored after extraction from data sources. The current implementation uses a subset of Wikipedia as a source, but more sources will be added in the future. The extension (frontend) interfaces with the backend through a web service, which uses algorithms to discover candidate tables. The discovery is based on schema (column-level) and instance (row-level) matches between the provided query and the tabular corpus.

 

The Frontend: The extension is composed of three operators, Data Search, Translate, and Fuse, which work together in an operator chain as seen in Fig. 1.

 

Fig. 1: RapidMiner process for data search and integration

 

Data Search operator: This operator queries the web service for relevant tables by submitting an entity query. The entity query comprises an existing dataset, one attribute of which is recognized as the subject identifier (the primary identifier of a row), plus a keyword for the additional attribute to be discovered. The server returns a collection of relevant tables. The schema-level and instance-level matches are also made available at the output ports.

 

If you select the checkbox ‘apply manual refinements’ in the operator parameter panel, the process execution is halted in real time and you are taken to a Control Panel graphical view. Here you see the discovered data tables matching your query, as shown in Fig. 2. The customized tree view lists candidate tables which can contribute values for your new attribute. The red legend indicates that the table (shown as a named node in the tree panel) has an attribute (column) match to your original table. Similarly, the blue legend indicates a match at the instance (row) level, and both legends together indicate both matches, which is the ideal case.

 

The panel shows the distribution of two statistics over the collection, giving a high-level view at a glance (a rough sketch of both follows the list):

  • Coverage: the number of examples that matched between the query (your original) table and the fetched (candidate) table, divided by the number of examples in the query table.
  • Ratio: the number of examples that matched between the query table and the fetched table divided by the number of examples in the fetched table.
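In pandas terms, the two statistics could be computed roughly like this (a sketch assuming both tables expose a shared subject column; this is illustrative, not the extension's code):

    import pandas as pd

    def coverage_and_ratio(query: pd.DataFrame, fetched: pd.DataFrame, key: str):
        matched = query[key].isin(fetched[key]).sum()  # examples present in both tables
        coverage = matched / len(query)                # matched / size of query table
        ratio = matched / len(fetched)                 # matched / size of fetched table
        return coverage, ratio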

 

Fig. 2: Results of the Data Search operator

 

Noise Removal: Data search is inherently susceptible to noise. If the analyst deems a discovered table to be noisy, it should be deleted before process execution is resumed.

 

A noisy table can be removed by selecting it in the tree, right-clicking, and choosing the Delete menu item. This changes the data model of the operator, so the changes need to be committed in memory by clicking the ‘Commit Updates’ button before resuming the process execution. If you accidentally delete a node, the original collection can be restored at any time with the ‘Restore Original’ button. These controls are shown in Fig. 3. Notice that the example sets at the operator's output ports, i.e. the schema and instance match tables, are updated accordingly. The idea is that only refined output reaches the next (Translate) operator in the chain.

Fig. 3: Delete from list and commit changes in memory

Visual aids

Care must be taken when deleting tables to prevent the loss of potentially valuable ones. To assist the data analyst in this exploratory task, two visualizations are provided.

 

Interactive Document Map: RapidMiner provides a Self-Organizing Map (SOM) visualization which can be used to expose patterns in data. We reuse and customize it to tag the dots (points shown on the map) with text showing key properties of each table, i.e. its full name and the counts of schema and instance matches. The map also provides a drill-down mechanism: each dot is implemented as a hyperlink which, when clicked, opens the associated table in the tree-tabular view. This eases localization and filtering.

 

The document map helps you understand how the candidate space of discovered tables shows up in a landscape-like layout. For example, tables with higher schema or instance matches might be (but are not necessarily) stronger candidates; you may not want to delete these tables, while others may be less interesting. The map can also reveal neighbourhoods based on (dis)similarities among table properties, which are fed internally to the underlying neural network. Fig. 4 shows a document map for the results of a sample query.

 

 

Fig. 4: Interactive Document Map showing discovered tables

 

3D Labelled Scatter Plot: While the interactive map provides a landscape view of the search space, the 3D scatter plot shows the tables as points along the x-y-z axes, labelled with the table names. This visualization is intended to show how (and whether) the tables cluster along individual axes and whether a Pareto frontier exists. If so, the Pareto-efficient tables are stronger trade-off candidates which you may want to keep. Fig. 5 shows such a plot for the results of a sample query.

 

Fig. 5: Labelled 3D scatter plot showing discovered tables

 

Translate operator:

The outputs of the Data Search operator are passed on to the Translate operator. This is where data integration, the Join step in Search-Join, starts. Translate processes the candidate tables using the schema and instance matches. As a result, a new collection of tables in the image of your original dataset is created. This collection of 'translated' tables is composed only of those candidate tables that have at least one cell value to contribute to your new attribute. Here again, the 'apply manual refinements' checkbox can be selected to stop unwanted tables from reaching the Fuse operator. Interested readers are referred to [3] for conceptual details.

 

Fuse operator:

The last operator in the Search-Join process is the Fuse operator. Fuse takes the outputs of the Translate operator as input. It then selects a particular cell value for the new attribute from the collection of translated tables. The decision of which value to choose from which table is made by a fusion policy, which uses criteria provided by the user in the operator parameter panel. At this stage, we provide a default fusion policy. Finally, the chosen cell values are fused into the corresponding instances (rows) of your original dataset, producing an enriched dataset with the new attribute. This concludes the data integration (Join) step. Fig. 6 shows the enriched dataset, where new attributes ‘language’ and ‘currency’ have been added to the original dataset.
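For intuition, here is a sketch of one simple fusion policy in pandas: per row, take the first non-missing value in table order. This is only an illustration, not necessarily the extension's default policy.

    import pandas as pd

    def fuse_first_non_null(contributions: list) -> pd.Series:
        """contributions: one Series per translated table, aligned by row index."""
        fused = contributions[0].copy()
        for series in contributions[1:]:
            fused = fused.fillna(series)  # later tables only fill remaining gaps
        return fused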

Fig. 6: Dataset enriched with 'language' and 'currency' attributes

 

Conclusion

In this blog post, you learned about the ‘Data Search for Data Mining’ extension, which can be used to enrich an existing dataset with relevant new attributes. The GUI features shown here reuse RapidMiner source code to achieve the necessary customizations; if you perform similar customizations, just ensure that the RapidMiner security guidelines [4] are respected. The DS4DM project [5] is under active development, and new features being developed at the backend and the frontend will be rolled out in subsequent releases. I will stop here and urge you to go ahead, install the extension from the Marketplace, and simply execute the sample process (attached and below) for a first-hand experience.

 

Acknowledgments

The Data Search extension is developed as part of the Data Search for Data Mining project (DS4DM, http://ds4dm.de), sponsored by the German Federal Ministry of Education and Research (BMBF).

 

References

[1] Bizer, Christian et al., 'The Mannheim Search Join Engine,' Web Semantics: Science, Services and Agents on the World Wide Web, Vol. 35, Part 3, Dec. 2015.

[2] Bizer, Christian, Tom Heath, and Tim Berners-Lee, 'Linked Data - The Story So Far,' Semantic Services, Interoperability and Web Applications: Emerging Concepts (2009): 205-227.

[3] Bizer, Christian, 'Schema Mapping and Data Translation,' lecture notes: http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/Lehre/WebDataIntegration/HWS2015/WDI0...

[4] RapidMiner documentation on Security and Restrictions: http://docs.rapidminer.com/developers/security

[5] Data Search for Data Mining (DS4DM) project: http://ds4dm.de

  • Data
  • Extension
  • search
RM Certified Expert
‎04-26-2017 01:53 PM
487 Views
0 Comments

As promised in the first announcement of the Write for Us program, here is Phase II. In this phase we want to start giving people cold hard cash for writing and sharing interesting Building Blocks and Knowledge Base articles. I'll update the Community News section with all the particulars, but below is the gist of it.

 

Building Blocks

If you have an interesting Building Block that you'd like to share, send me a paragraph about it along with a sample process. We'll review it, and if it's accepted, go ahead and write up a detailed explanation of what the Building Block does, along with the Building Block itself and a sample process. We'll pay you $20 (USD) per Building Block once we've received your write-up, reviewed it, and posted it.

 

Knowledge Base Articles

Knowledge Base (KB) articles are also a great way to show off how to do something with RapidMiner Studio, Server, or Radoop. The same process applies as with the Building Blocks above: if you have an idea for a great Knowledge Base article, send me a short paragraph with your idea. If it's accepted, go ahead and start writing.


When writing a KB article, make sure to:

  1. include images;
  2. include detailed explanations of each step; and
  3. include a sample process that can easily be reproduced.

 

For a short KB article we'll pay you $25 (USD), and for a long KB article we'll pay $50 (USD). What's the difference between a short and a long KB article? That's easy.

 

A short KB article covers how to do one simple thing, like passing a JSON file to RapidMiner Server or saving an R or Python model in RapidMiner. A long KB article would include a full use case and the application of RapidMiner through the entire development cycle. It could include a mashup of scripts too (Groovy, Python, R, etc.). Of course, you should provide a sample process and data (anonymized if required).

 

Disclaimers apply, of course. Partners and employees are not allowed to take part in the program, and we reserve the right to change anything at any time.

  • Building Block
  • Cash
  • Knowledge Base
  • Write for Us
RM Certified Expert
‎04-24-2017 08:52 AM
380 Views
0 Comments

Today I read an interesting article about a speech by Jack Ma, founder of Alibaba. In it, he warned that the Internet and robots will continue to disrupt the world we know. This disruption will cause social conflicts, and we should get ready for 30 years of pain. He also said that if companies don't get on board with new technologies (e.g. cloud computing), they will die.
 
The most interesting comment he made was this:
 
He also warned that longer lifespans and better artificial intelligence were likely to lead to both aging labor forces and fewer jobs. “Machines should only do what humans cannot,” he said. “Only in this way can we have the opportunities to keep machines as working partners with humans, rather than as replacements.”
 
My colleagues and I do talk about how machine learning and this new startup space are disruptive. It's usually "water cooler" banter, but we talk about a living wage and how automation, in theory, will make society better.
 
I take a similar view, based on my experiences working with customers. For example, one customer of mine managed a small group of R and Python programmers who did a lot of data science for their company. These data scientists, programmers, and engineers were all busy, but managing everything got hard.
 
They created one-off scripts and tried to write systems to do ETL and modeling work, but there was never enough time to get all the projects done. They wanted a better way to automate, encapsulate R and Python scripts, and collaborate across the team.
 
How did they solve this problem? They ended up buying a few seats of Studio and a Server.
 
It took a few months to migrate a few of their processes to Studio and Server. They ended up dropping a lot of Python and R code into Building Blocks and automated a lot of ETL work. The teams started tossing models and processes back and forth to each other on the Server. REST API's were exposed quickly and put into use on their website.
 
In the end, they started clearing out the backlog of data science projects they had. They got more productive and did more with the same staff.
 
If I took Jack Ma's view of the world, I would say the company should automate more with RapidMiner and "let go" a few R and Python programmers. Fortunately I do not, and neither did that company.
 
They kept the same number of R/Python programmers but cross-trained a few to be RapidMiner rock stars. Then they tackled new projects and got back to doing more data science, something they really enjoyed doing.
 
So will there be 30 years of pain? Maybe. Will automation kill a lot of jobs? Probably. Will the smart companies keep highly skilled workers in the next 30 years? Yes. In this new knowledge economy, brain power is key. You would be remiss to waste your talent on things that can be automated; automating them frees up their most valuable resource: time.
 
  • Automation
  • Jobs
  • machine learning
  • Python
  • R
  • server
  • Startup
  • Studio
RM Certified Expert
‎04-24-2017 12:21 PM
258 Views
0 Comments

I've been at this startup thing for a few years now, and I've seen a thing or two. If you read KDnuggets, you'll stumble across the Gartner Hype Cycle. Right now Big Data is entering the trough of disillusionment. While that sounds sad, it kind of makes sense.
 
For years we've been hearing how Big Data will unlock all kinds of insights in a corporation's data. Everyone raced to stand up clusters, jammed all kinds of data into them, and then stumbled when extracting insight. The cluster became hard to tame, hard to use, and seemed like a big waste of money.
 
Of course RapidMiner Radoop came along and actually delivered on this promise, but many companies decided to use a single tool to extract their insight. Maybe it was PySpark or Pig script? Maybe something else completely. They married themselves to one or two ways of getting insight.
 
Now many companies are realizing they're not just an R shop, they're an R, Python, and Spark shop. Now they need to use all three or more tools in the Data Science toolkit to get anything done. Now they're looking around for a platform to bring all these tools together.
 
Imagine their surprise when they find RapidMiner. We've been a Data Science platform from day 1. Ninety percent of the time you can do all your data science and model building right in the Studio platform. The rest of the time you might need some esoteric algorithm to finish your work. So, if you married yourself to one tool and that esoteric algorithm wasn't available, you were SOL.
 
With RapidMiner it's always been different. Need that Tweedie algorithm in R? Use the R Scripting extension and pull it in. Need to do some PySpark on your cluster? Put that script right inside Radoop's Spark Script operator.
 
It's that easy. After all, isn't that what a real Data Science platform is supposed to do?
  • big data
  • Data Science
  • Platform
RM Certified Expert
‎04-17-2017 08:08 AM
264 Views
0 Comments

One of the cool things about RapidMiner is the extension ecosystem. The default installation of RapidMiner Studio has a complete suite to do 90% of the ETL, modeling, and testing that you need to do on a daily basis. Sometimes you'll need that extra 10% to do something special, like text mining!
 
This is where extensions come in. Extensions are just that: they extend the capability of RapidMiner Studio and Server in some way. Want to do time series? Download the Series extension. Want to do R or Python scripting? Download the R or Python Scripting extension.
 
There are close to 100 different extensions available at the RapidMiner Marketplace. Each one is easy to install from the RapidMiner Studio interface. Look for the "Extensions" pull down menu.
 
RapidMiner Supported
 
There are many extensions on the Marketplace. Some are made by RapidMiner, others by third parties. RapidMiner supports (i.e. fixes problems with) a few of these extensions. To find out which, look for the blue check mark and the word "supported" next to the extension name. We'll fix bugs and make updates to those extensions.
 
Blue check mark means supported!

Then there are extensions without a blue check mark. Those were built by third parties such as Basis Tech or Old World Computing, and are usually supported by those third parties.
 
I bring this up because sometimes Community members have an issue with a third-party extension. The best way to resolve such an issue is to make a post in the Community and ping the third-party developer. This way both the Community and the developer can learn about the problem and try to resolve it quickly.
 
  • extensions
  • Marketplace
RM Certified Expert
‎04-13-2017 08:15 AM
503 Views
0 Comments

We're getting ready to unveil a new program at the RapidMiner Community where our members can take part in the growth and influence of this place. We'll be rolling out, in phases, a new "Write for Us" program where Community members can submit guest blog articles, knowledge base articles, and building blocks for cash and swag.

 

The first phase of the Write for Us program is Guest Blogging. There isn't any cash payout for writing a guest blog post, but you do get a link back to your blog/site and lots of Community kudos. Guest blogging is open to anyone who's using RapidMiner to do some really cool data science. This includes tips and tricks, or even some neat Groovy Script, Python, or R hacks with RapidMiner. Think of anything cool you do with RapidMiner and share it!

 

The second phase of the Write for Us program is where you can earn cold hard cash and swag. There is a wealth of knowledge stored in our Community members' heads; we get a glimpse of it when you post in the forums and come up with novel solutions. Why not take what you've worked hard to solve and earn some $$$ with it? Created a Building Block that does something neat and cool? Submit it and get $$$. Have a great idea for a Knowledge Base article? Submit it and get $$$! We'll come up with an extra swag contest for the biggest contributor to the Community, too.

 

Of course, there will be terms and conditions for both of these phases, so check out the Community News section as we roll out this program.

  • Cash
  • Community
  • Swag
  • Write for Us
RM Certified Expert
‎04-10-2017 12:16 PM
1990 Views
0 Comments


By Jesus Puente, PhD.

 

Let’s start from the beginning: what is a data core?

 

The data core is the component that manages the data inside any RapidMiner process. When you “ingest” data into a process from any data source (database, Excel file, Twitter, etc.), it is always converted into what we call an ExampleSet. No matter which format it had before, inside RapidMiner data always has a tabular form, with Attributes as columns and Examples as rows. Because anything can be added to an ExampleSet, from integers to text or documents, the way this table is handled internally is very important, and it has a lot of impact on how much data one can process and how fast. Well, that is exactly what the Data Core does: it keeps the data in memory, taking types and characteristics into account and making sure memory is used effectively.

 

Fig. 1: An ExampleSet

Fig. 2: Another representation of the same ExampleSet

 

Yes, but how does it affect me?

 

Well, the more efficiently the Data Core manages memory, the larger the ExampleSets you can use in your processes. And, as an additional consequence, some processes can get much faster by improving access to elements of the ExampleSet.

 

Can you give an example?

 

Sure! There are different use cases; one of them is sparse data. By that, we mean data which is mostly zeros with only a few meaningful numbers here and there. Let’s imagine you run a market basket analysis for a supermarket chain. You have lots of customer receipts on one hand and lots of products and brands on your shelves on the other. If you represent that in a table, you end up with a matrix of mostly zeros, because most people only buy a few products from you, so most buyer-product combinations have a zero in the table. That doesn’t mean your table is useless; on the contrary, it contains all the information you need.

Another example is text processing. Sometimes you end up with a table whose columns (i.e. Attributes) are the words that appear in the texts and whose rows (i.e. Examples) are the sentences. Obviously, each sentence only contains a few words, so, again, most word-sentence combinations have a zero in their cells.

 

Fig. 3: Sparse data

Well, RapidMiner’s new Data Core automatically detects sparse data and greatly decreases the memory footprint of those tables. A much more compressed internal representation is used, the ExampleSets become easier to handle, and processes are sped up.
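The effect is easy to demonstrate outside RapidMiner too. The following sketch uses SciPy, as an illustration of the principle only, not of the Data Core's actual implementation:

    import numpy as np
    from scipy import sparse

    rng = np.random.default_rng(0)
    dense = (rng.random((1000, 1000)) < 0.001).astype(np.float64)  # ~0.1% non-zeros

    compressed = sparse.csr_matrix(dense)  # stores only the non-zero entries
    sparse_bytes = (compressed.data.nbytes + compressed.indices.nbytes
                    + compressed.indptr.nbytes)
    print(dense.nbytes, sparse_bytes)      # 8 MB dense vs. a few KB sparse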

 

Another use case is categorical (nominal) data in general. Even in the “dense” (non-sparse) case, data sizes within the cells of a table can vary a lot: integers are small in terms of memory use, while text can be much bigger. The new Data Core also optimizes the representation of this kind of data, allowing for very heterogeneous ExampleSets without unnecessarily wasting memory.

 

Tell me more!

 

As often in life, there is in some cases a tradeoff between speed and memory usage. Operators like Read CSV and Materialize now have an option to be speed-optimized, auto, or memory-optimized. These options allow the user to choose between faster but potentially more memory-intensive data management, or a more compact but probably slower representation. Auto, of course, decides automatically based on the properties of the data; this is the default and recommended option.

  

Columnar representation

 

The representation of data within the new data core is based on columns (Attributes) instead of rows (Examples). This improves performance, especially when a data transformation is based on columns, which is the most common case in data science. Some examples are:

 

Data generation

 

In many processes, it’s necessary to generate new data from the existing columns; the Generate Data operator does that. Also, loops and optimization operators often create temporary attributes. The new data core provides a nice optimization of these use cases by handling the new data in a much more performant way.

Loop attributes, then values

 

In many data preparation processes, attributes are changed, recalculated, or used in various ways. The columnar representation is ideal for this.
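A small sketch of why column orientation helps here (plain Python, purely illustrative):

    # Column store: each attribute is one contiguous array
    columns = {
        "age":    [25, 31, 58],
        "income": [48000, 52000, 61000],
    }

    # Deriving a new attribute touches exactly one array, end to end
    columns["income_k"] = [v / 1000 for v in columns["income"]]

    # A row store instead has to walk every record and pick the field out of each row
    rows = [{"age": 25, "income": 48000}, {"age": 31, "income": 52000}]
    incomes = [r["income"] for r in rows]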

 

Extensions

 

We have already mentioned text use cases. It’s worth mentioning that Text Processing and other extensions already benefit from the new core. Moreover, we have published the data core API so that any extension developer in our community can adapt their existing extensions, or create new ones, to use the improved mechanism.

 

Time for some numbers: how big is the improvement?

 

As should be clear from the paragraphs above, the degree of improvement depends heavily on the use case. Some processes will benefit a lot and others not so much.

 

As a benchmarking case, we have chosen a web clickstream use case. We start with a table that contains user web activity: each row is composed of a user ID, a ‘click’ (a URL), and a timestamp. One of the typical transformations is to move from an event-based table to a user-based table. As an example, we’ll transform the data to get a table with all users and the maximum duration of their sessions. This is a process that needs a lot of data shuffling and looping over values, and even for a relatively small data set it can take a lot of time.
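For readers who want to picture the transformation, here is a rough pandas equivalent, assuming a session ends after 30 minutes of inactivity (the session rule and sample rows are assumptions, not the benchmark's exact definition):

    import pandas as pd

    clicks = pd.DataFrame({
        "user": ["a", "a", "a", "b"],
        "time": pd.to_datetime(["2017-01-01 10:00", "2017-01-01 10:05",
                                "2017-01-01 12:00", "2017-01-01 09:00"]),
    }).sort_values(["user", "time"])

    # A gap of more than 30 minutes starts a new session
    gap = clicks.groupby("user")["time"].diff() > pd.Timedelta(minutes=30)
    clicks["session"] = gap.groupby(clicks["user"]).cumsum()

    # Event-based table -> user-based table with the maximum session duration
    duration = (clicks.groupby(["user", "session"])["time"]
                      .agg(lambda t: t.max() - t.min()))
    max_duration = duration.groupby("user").max()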

 

Let’s start with a small amount of data: 10,000 examples. I ran the process on my 8-core laptop with 32 GB of RAM. These are the results (runtimes in seconds) by the number of threads used for parallelization.

 

 

Fig. 4: Benchmark results

With a single core (what’s available in the Free license), the new Data Core already provides 2x performance. As more cores are used, the times get smaller and smaller. See the numbers below: with the old core using one thread, the job took more than 2 minutes to complete; with parallelization and the new data core, it only takes 10 seconds!

Fig. 5: Benchmark data

In this case, the new data core helped improve performance. However, the data core is all about memory, and we’ll see that in the next example. Let’s run the same process, but with a five times larger data set (50,000 rows). Take a look at the numbers:

 

Fig. 6: Benchmark data (larger data set)

 

This time the runtimes are in minutes. As you can see, the new data core’s pattern is similar to that of the previous example: it’s more data, so it takes more time, but the times are reasonable. With the old data core, however, the times simply blow up. And here’s the reason:

 

 


Fig. 7

Very soon, my 32 GB of main memory is fully used and everything gets extremely slow. The same process with the new data core looks like this:

Fig. 8

It never goes beyond 65%. The new data core therefore allows you to work with data set sizes which were unmanageable before with a given amount of memory.

 

Conclusion

 

RapidMiner’s new data core is a big thing. It improves data and memory management, and it allows you to work with much bigger data sets while keeping your memory demands at bay.

 

It’s already available as a beta. Try it NOW!

  • Beta
  • RapidMiner 7.5
RM Certified Expert
‎04-06-2017 09:51 AM
291 Views
0 Comments

Greetings Community! Here's a quick interesting link roundup for your Data Science needs!

 

From the Community

  • RapidMiner's new Web Table Extraction extension is pretty popular. Check out this blog post.
  • Can you cluster text documents? Yes, of course you can!
  • To do Excel like pivoting you'll need an Aggregate and Pivot operator in RapidMiner. Check it out here.
  • How to score an entire CSV file through a RapidMiner Server REST API.
  • Want to send XML data to a RapidMiner Server REST API? Here's how you do it.

Interesting Links from the Interwebz

 

  • Community
  • Data Science
  • links
  • Roundup
RM Certified Expert
‎03-31-2017 06:28 AM
795 Views
0 Comments

By: Edwin Yaqub, PhD

 

In my last post, I introduced the ‘Web Table Extraction’ extension, which provides a convenient way to retrieve data tables from Wiki-like HTML pages. In this post, I will introduce ‘PDF Table Extraction’, another extension developed at RapidMiner Research as part of the Data Search for Data Mining project (DS4DM, http://ds4dm.de) and released today. Let us see how this extension adds value to RapidMiner processes.

 

Problem: You may have faced a situation where you wanted to use data tables from PDF documents. PDF has become a de facto standard for read-only documents. It is certainly possible, and sometimes unavoidable, to extract data tables out of PDF using fine-grained scraping techniques, but parsing content this way is a meticulous activity. In the worst case, your efforts might not be reusable if tables in other documents use a different header structure. The problem is to raise the level of abstraction so that data tables (with arbitrary header structures) can be extracted from a PDF document in an easy way.

Solution: The ‘Read PDF Table’ operator solves this problem. It provides a generic solution to automatically detect and extract data tables from a PDF document as RapidMiner example sets. Simply provide it the path to your PDF file (or its URL if the file resides on the web) and execute the process. The output is a collection, as the operator tries to calibrate the detection of tables in the document; one of these example sets is highly likely to be the most accurate representation of your table. Let’s try some examples, along with a few hints you might find useful when dealing with tables whose headers are complex.
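If you want to cross-check extraction results outside RapidMiner, the tabula-py package offers a comparable "path in, DataFrames out" workflow (an aside, assuming tabula-py and a Java runtime are installed; the file name is a placeholder):

    import tabula

    # Returns one pandas DataFrame per table detected in the document
    tables = tabula.read_pdf("report.pdf", pages="all", multiple_tables=True)
    for i, df in enumerate(tables):
        print(f"table {i}: {df.shape}")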


Examples: The first example is rather simple. We use a document whose tables have a clear single-layer header, available here [1]. The operator accurately detects and extracts the tables, as seen below.

 

Read PDF Table operator

Read PDF Table results

 

In the second example, the document [2] contains a table with a three-layer header. The operator uses the first layer to construct the example set attributes. We can imagine that the second row serves as a more descriptive table header; the ‘Rename by Example Values’ operator easily resolves this.

 

Renaming!

 

The Rename Process

 

 

The Renamed Results

 

 

Now that we have the ability to extract data tables from a PDF document, let’s make use of some interesting statistics from the European Commission (Eurostat). Eurostat offers many datasets [3] downloadable as PDF files. One such dataset, stored at [4], shows the percentage of individuals who obtain information from public authorities’ websites (per year, 2008-16). Governments use websites to educate the public on a variety of issues, such as health awareness, political canvassing, travel warnings, development plans, etc. The question is whether certain countries pay more attention to this information, and how much. If so, spending could be optimized, and different means could be used to expand the audience in specific groups of countries. As we have no means to classify the data, we turn to RapidMiner clustering to discover groupings. Here we go:

 

Read PDF Table and Cluster Data

 

After reading the PDF document from this URL [4], we realize that the example set has a spurious attribute in second place, which shifts the rest of the attributes one step to the right. We can easily fix this by using the Data Editor view from the Text Processing extension to rename the attributes and delete the last, redundant attribute. Owing to my programmer instincts, I wrote a short Groovy script that automates this and renames the first column. RapidMiner does not require you to code, but if you have small scripts that do big things, you can of course use the Execute operators.
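A pandas stand-in for that little cleanup script might look like this (the column layout and the "Country" name are assumptions for illustration, not the script itself):

    import pandas as pd

    def clean_extracted_table(df: pd.DataFrame) -> pd.DataFrame:
        df = df.rename(columns={df.columns[0]: "Country"})  # name the first column
        return df.drop(columns=df.columns[-1])              # drop the redundant last attribute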

 

Next, some pre-processing is performed. We remove the redundant attribute, trailing whitespace, and useless examples from the top and bottom; clean alphanumeric values to keep only the numeric parts; filter out examples with missing values; type the data; convert nominal to numeric; and perform k-means clustering. Now we face the moment of truth: what value to set for k? As we are clueless, here is the good thing about RapidMiner: situations like these are ideal for leveraging its Wisdom of Crowds [5], a guidance feature that suggests parameter values based on how community members used the same operator. Empowered with this knowledge, we quickly try k at 4 and 5, and it becomes clear that 5 provides the better inflection point in reducing the error rate, also considering the output of the Cluster Performance operator (for average within-cluster distance as well as the Davies-Bouldin index).
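Outside RapidMiner, the same k selection step could be sketched with scikit-learn; X below is only a random placeholder for the cleaned country-by-year matrix:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score

    X = np.random.default_rng(0).random((30, 9))  # placeholder for 9 yearly usage columns

    for k in (2, 3, 4, 5, 6):
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        print(k, davies_bouldin_score(X, labels))  # lower is better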

 

Although our dataset was relatively small, it was not easy to draw conclusions manually. Clustering allowed us to identify five groups of countries. The Centroid Table view of the cluster model provides more details on the attributes (country, usage data for the years 2008-16) in each cluster. A simpler way to interpret the clusters in this case is to use the overall mean value of the attributes (2008-16).

 

Results - Davies-Bouldin index

We find that the individuals of cluster 2 (Croatia and Poland) obtained the least information from public authorities’ websites, while those of cluster 4 (Netherlands, Sweden, and Norway) obtained the most.

Conclusion: In this post, the RapidMiner extension for PDF data table extraction was introduced. It can boost your productivity by expanding your reach to data tables inside PDF, the universal read-only data format. Feel free to reuse the example process (attached), extend the dataset by joining more PDF data tables (from Eurostat or another source) that interest you, and hand the complexity over to RapidMiner clustering. Have fun discovering more insights!

 

References:

[1] https://bitbucket.org/ds4dm/repository-of-pdf-documents/raw/3fcdfcf2ff3b3f61b38bc2a93fb8354f7beb0d95...

[2] https://bitbucket.org/ds4dm/repository-of-pdf-documents/raw/b1631df4542b0a9a73fedb12a9477473ec8ee001...

[3] http://ec.europa.eu/eurostat/web/digital-economy-and-society/data/main-tables

[4] https://bitbucket.org/ds4dm/repository-of-pdf-documents/raw/b1631df4542b0a9a73fedb12a9477473ec8ee001...

[5] https://rapidminer.com/wisdom-crowds-guiding-light/

  • Extension
  • PDF
  • Table
RM Certified Expert
‎03-24-2017 09:07 AM
1429 Views
4 Comments

By: Edwin Yaqub, PhD

 

Within the RapidMiner Research team, I’m developing extensions that target data enrichment and extraction as part of my work on the research project DS4DM (Data Search for Data Mining, http://ds4dm.de), so that data mining processes produce improved results. Today we have released the ‘Web Table Extraction’ extension on the Marketplace, and here is an introduction to it.

 

Problem: Data scientists are often confronted with situations where data must be read from web pages. For instance, there are a lot of data tables available on Wikipedia which could be utilized, but fine-grained data scraping approaches get complicated for ordinary users, as they often require regular-expression-based parsing and extraction of data from a web page’s content.

 

Solution: To ease this task, the ‘Web Table Extraction’ extension offers a convenient alternative: it extracts data tables from Wiki-like websites and converts them to RapidMiner example sets.

 

You simply provide a URL of the web page, e.g. [1], to the ‘Read HTML Table’ operator and execute the process. Bingo! The operator extracted 9 data tables as example sets in the blink of an eye.

Read HTML Table results

 

Example: Now that we have an encyclopedia at our disposal, let us use a simple example. One of the tables on [1] gives GDP (Gross Domestic Product) values for past years and projections for the future; GDP is a measure of a country’s economic activity. Another table on the same page gives GDP per capita, which can be interpreted as the productivity of a country’s work force, or their affluence. I’d like to see how these values change between 2015 and 2020. I’m also curious to see whether affluence relates to obesity levels; for the latter, we can use the BMI data at this web page [2].

 

Thanks to the ‘Read HTML Table’ operator, we get the tables as example sets. Next, we apply an inner join on the GDP, GDP per capita, and BMI tables using the Country attribute. Here is a snapshot of the RapidMiner process (the process file is attached as well):

 

Extract HTML Table process

 

 

We perform basic pre-processing: we rename the numeric attributes to be descriptive and remove the commas from attribute values before applying the Guess Types operator, which assigns integer and real data types to our attributes so we can process them. Finally, we filter down to the six attributes of interest.
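A rough pandas equivalent of these steps, for comparison (the table indices and column names are assumptions, not the actual page layout):

    import pandas as pd

    tables = pd.read_html("https://en.wikipedia.org/wiki/BRIC")  # one DataFrame per HTML table

    gdp, per_capita = tables[0], tables[1]
    joined = gdp.merge(per_capita, on="Country", how="inner")    # inner join on Country

    # Strip thousands separators so the column can be typed as numeric
    joined["2015_gdp"] = (joined["2015_gdp"].astype(str)
                          .str.replace(",", "", regex=False)
                          .astype(float))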

 

A picture is worth a thousand words

The Results view of RapidMiner Studio provides an Advanced Charts module, which is excellent for visualizing our dataset. We drag the attribute 2015_gdp onto the domain dimension (the x-axis); the attributes 2015_per_capita and 2020_per_capita are dragged onto a numerical axis, where they appear on the left vertical axis. Next, we drag the 2020_gdp attribute as a new numerical axis, making it appear on the right vertical axis. We use Country as the color dimension and, yes, you guessed it, Obesity as the size dimension: the higher the obesity percentage, the bigger the marker.
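A comparable multi-series plot can be sketched in matplotlib (the numbers below are placeholders, not the real GDP figures; the marker area encodes obesity):

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame({"gdp_2015": [18.0, 4.4], "gdp_2020": [22.1, 4.7],
                       "per_capita_2015": [56000, 34500],
                       "per_capita_2020": [64000, 39000],
                       "obesity": [33.7, 3.3]})

    fig, ax1 = plt.subplots()
    ax1.scatter(df["gdp_2015"], df["per_capita_2015"], s=df["obesity"] * 20,
                marker="^", label="2015 per capita")
    ax1.scatter(df["gdp_2015"], df["per_capita_2020"], s=df["obesity"] * 20,
                marker="o", label="2020 per capita")
    ax2 = ax1.twinx()  # right-hand axis for the 2020 GDP series
    ax2.scatter(df["gdp_2015"], df["gdp_2020"], s=df["obesity"] * 20, marker="s")
    ax1.set_xlabel("2015 GDP")
    ax1.legend()
    plt.show()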

 

This multi-series plot provides insights at a glance. The squares show how the GDP of countries compares between 2015 and 2020. The vertical lift between the triangles and the circles shows how per capita income will increase from 2015 to 2020; Japan’s growth is the highest among the industrialized nations. Assuming obesity levels stay the same, we see that a highly affluent nation like the US has the highest obesity (33.7%), but again Japan provides a counterexample (3.3%). We also see that less affluent nations can have high obesity. Based on these quick data-driven insights, we could now consider other attributes, perhaps related to culture, eating, or work habits, to understand the causes of obesity.

 

Obesity chart

 

Conclusion

 

In this post, you learned how the new ‘Web Table Extraction’ extension can help you conveniently extract data tables from Wiki-like pages. You also learned how the originally disparate data can be unified in RapidMiner and displayed as a multi-series visualization using the Advanced Charts module. To try it out yourself, go ahead and download the extension from the Marketplace and then run the attached process below. Have fun!

 

References

[1] https://en.wikipedia.org/wiki/BRIC

[2] https://en.wikipedia.org/wiki/List_of_countries_by_Body_Mass_Index_(BMI)

RM Certified Expert
‎03-29-2017 03:39 PM
1134 Views
7 Comments

Today Old World Computing is happy to announce the Advanced Reporting Extension for RapidMiner. With its three operators, it looks tiny in comparison to some of the more bulky extensions out there, but it fills a blind spot in RapidMiner and is designed to take some worries away from the common data scientist.

The idea of the Advanced Reporting Extension published by Old World Computing is to use the capabilities of RapidMiner to automate any regular reporting task that results in an Excel sheet. Many projects and data science departments simply drown in this kind of request, which consumes all resources before you can get to the really fun part of data science. Now you can simply start at the beginning and create nearly zero-overhead reporting, even if you don't have, or can't use, real business intelligence tools like Tableau or Qlik.

How does that work?

 

 

Step 1: Create a template in Excel

First we create a dummy sheet and add all of the desired layout components, diagrams, texts, and of course areas for data.

We can use any formatting, chart type, or conditional coloring that we like, including the nice sparklines. Just one thing is important: we need to reserve space for inserting the data. What will happen later is that we overwrite parts of the table's content with data from RapidMiner. So if we have more than three employees, we would need to either leave more space between the table and the diagram, or put the data into a separate sheet and reference it in the diagram. But if you are used to Excel reporting, you probably know all these tricks...

Insert some dummy values so that you can see the charts in action.

Don't forget to save the file. We will need it later.

 

 

 

Step 2: Create a process in RapidMiner to load the data

RapidMiner is very versatile at getting the data into the shape you want. It can read and combine many different formats and sources and then aggregate, join, pivot, and process the data into the shape you need.

On the right you see a process combining data from four different sources with multiple joins and preprocessing steps to match the data. Such a process could deliver the data we want to put into our nice Worktime sheet.

Of course it could be much simpler, containing just a single SQL query, or very much more complex, involving calls to web services, Big Data analytics on Hadoop, some machine learning, or whatever. The trick is that we can leverage the entire flexibility of RapidMiner to get the data we want into an Excel sheet.

 

 

Step 3: Open Report

Once we have the data in the desired format, we add an Open Report (Excel) operator from our extension; you see it on the right-hand side in the operator tree. We need to point the operator at two files. The first is the template file we created and saved in Step 1; you can either use the template file parameter or the tem input port. The second file can be specified via the target file parameter or by using the tar output port.

Why are there ports for the files? Because it allows you to handle the files conveniently in scenarios where you want to do things with them later in the process. You could even create a template file in a RapidMiner process or, less fancy and more realistic, store the file in the repository of a RapidMiner Server to share among many users. The output file port is most useful if you want to zip the result or return it as a web service result in a RapidMiner Server web service or web application.

Any data we want to insert into the Excel file needs to be forwarded to the input ports of the Open Report (Excel) operator. Don't worry, there will always be another input port once you connect the last one. We will use the data delivered to these ports in the inner subprocess to do the actual insertion.

 

 

Step 4: Insert Tabular Data

Once inside the inner process of Open Report (Excel), we can add the Write Data Entry (Excel) operator to insert an ExampleSet into the Excel file. We have done so with the first ExampleSet in the screenshot on the right. The operator lets you select which attributes to use and where to place them. You specify the sheet where the data will be inserted by its index, then point the operator to a fill range. A range can be either open-ended, specifying only the upper-left cell of the area, or closed, if followed by a colon and the lower-right cell. So B2 would start in the second column, second row, and B2:D4 would define a closed range of 3 rows and 3 columns.

For our little employee table from Step 1, we set it to B11:C13. Unless we select fit to range, the process will fail if our data does not fit into this range.

We will add another operator of this type to output the second table.
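Outside the extension, the same template-fill idea can be sketched in Python with openpyxl (illustrative only; file names are placeholders, and how well existing charts survive depends on the openpyxl version):

    from openpyxl import load_workbook

    wb = load_workbook("template.xlsx")           # the template from Step 1
    sheet = wb.worksheets[0]                      # sheets addressed by index

    data = [["Alice", 38.5], ["Bob", 40.0], ["Carol", 37.0]]
    for r, row in enumerate(data, start=11):      # fill range B11:C13
        for c, value in enumerate(row, start=2):  # column B has index 2
            sheet.cell(row=r, column=c, value=value)

    wb.save("report.xlsx")                        # charts keep referencing the filled range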

 

 

Step 5: Insert Data

The only thing missing is the version tag, so that people know what this report was about when they open it at some point later.

We first use a Generate Macro operator from RapidMiner's core functionality to create a process variable (or macro, as they call it) containing the current date and time. We then add a Write Cell (Excel) operator from the Advanced Reporting Extension and connect the ports. Although no data flows from the Generate Macro operator to the Write Cell (Excel) operator, the connection makes sure that Generate Macro is executed first and sets the process variable before it is read.

Then we just need to point the Write Cell (Excel) operator to the right fill position, which is F5 in our case. Set the value and type correctly, and we are good to go.

A short note on dates: there is an unlimited number of date formats out there. If you want to write a date to Excel, you first need to parse the date format that the value has in RapidMiner. So if you enter something like 2017-03-29 23:59:59 as the value, you should enter "yyyy-MM-dd HH:mm:ss" in the date format parameter of the Write Cell (Excel) operator. Once it knows the date, it will automatically transform it into the correct format of the Excel template sheet, where you set it with the cell format.
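In Python terms, the parsing half of this looks like the following (the operator itself uses Java-style patterns such as "yyyy-MM-dd HH:mm:ss"; strptime is the rough equivalent):

    from datetime import datetime

    parsed = datetime.strptime("2017-03-29 23:59:59", "%Y-%m-%d %H:%M:%S")
    # Once parsed, the value can be re-rendered in whatever format the target cell uses
    print(parsed.isoformat())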

 

 

Once the subprocess is finished, the target file will be written, and you just need to mail it to someone and be done with it.

We recommend automating just about everything right from the beginning. There is no such thing as "I just need to do this once": in 90% of all cases you will need to do it twice, and then the additional overhead of the automation will already have paid off. So please feel free to download the extension, order a license, and ask any questions you might have. In case you are not convinced yet, the free version lets you access the full functionality and only limits the number of Write operators to one within each subprocess.

Download it here.

Old World Computing - Establishing the Future

Professional consulting for your Data Science problems

RM Certified Expert
‎03-28-2017 07:05 AM
372 Views
0 Comments

This Spring, join us for the new online training season. We are introducing new options for our Data Science courses RapidMiner Basics Part 1 and RapidMiner Basics Part 2. For the first time, you can also enhance your data science skills in Text and Web mining with RapidMiner online. 

Editor's note: As a former live and in-person trainer, these online training courses are a great way to go, IMHO! See the classes and the link to more details below.

Can’t attend the live sessions? We've got you covered! We provide access to the live session recordings for 60 days after the class has taken place, as well as access to the instructor via a message board during that period. So don’t let the time zone or your calendar stop you from joining.


Weekly Lecture

Each course is delivered in a four-week program that runs on Mondays. It entails 2 hours of online, instructor-led training and requires an additional 2 hours of offline lab and self-study time each week.
  • RM Basics Part 1 and Text and Web Mining with RM starting Apr 3rd
  • RM Basics Part 2 starting May 22nd


2 Day Classes

This is a 2-day program that runs on Mondays & Tuesdays or Wednesdays & Thursdays, plus an optional Q&A session on the Friday of the same week. For each course you will attend two 4-hour sessions of live, instructor-led training and spend up to 4 hours on offline lab and self-study time after each live session.
  • RM Basics Part 1: May 15 & 16
  • RM Basics Part 2: May 17 & 18

The content covered in the weekly lectures and the 2-day classes is of course equivalent, so you can mix and match both delivery options as needed.


Analyst Bootcamp

The Analyst Bootcamp is a value bundle for people attending both 2-day classes during one season. Sign up for this bundle at the same rate as the individual 2-day classes and receive a complimentary seat on our RapidMiner Analyst Certification worth $250.

 

For a more detailed schedule of ALL events, please visit our Training page

  • RapidMiner Basics
  • text mining
  • Training
RM Certified Expert
‎04-24-2017 09:02 AM
512 Views
0 Comments

A few weeks ago the RapidMiner Research Team published two new extensions to the Marketplace that are making a splash: the Operator Toolbox and Converters! We didn't stop there! Today I'm happy to announce the release of version 0.2.0 for both extensions! Here's a quick preview of the new enhancements you'll find!


By: Fabian Temme, PhD

 


 

New in the Operator Toolbox!

 

Introducing the Get Decision Tree Path Operator

 

Do you want to know why a Decision Tree classifies a specific Example the way it does? With this Operator, you can find out. It works much like the 'Apply Model' Operator: it takes a trained Decision Tree and an ExampleSet at its input ports. But instead of calculating confidences for the examples, the Operator creates a new Attribute holding the path the corresponding Example takes through the Decision Tree.
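To make the idea concrete, here is a rough scikit-learn analogue (a sketch on the Iris data, not the Toolbox implementation): it annotates each example with the sequence of splits it passes on its way to a leaf.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# Sparse indicator matrix: which tree nodes each example visits.
node_indicator = tree.decision_path(X)
feature, threshold = tree.tree_.feature, tree.tree_.threshold

def path_string(i):
    """Render the split decisions example i takes through the tree."""
    nodes = node_indicator.indices[
        node_indicator.indptr[i]:node_indicator.indptr[i + 1]]
    steps = []
    for node in nodes[:-1]:  # the last visited node is the leaf
        op = "<=" if X[i, feature[node]] <= threshold[node] else ">"
        steps.append(f"x[{feature[node]}] {op} {threshold[node]:.2f}")
    return " -> ".join(steps)

print(path_string(0))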

 

This example process applies the Operator on a Decision Tree trained on the Golf data sample.

Get Decision Tree Path Process

 Once the process executes, here are the results:

 

Get Decision Tree Path Result

 

 Introducing the Generate Date Series Operator

 

 

Do you need some date-time series data covering a specific time range and interval? Now you can get that from the new Generate Date Series Operator. You can specify the start and end date and the interval, from years down to milliseconds!
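For comparison, the equivalent series is a one-liner in pandas (illustration only):

import pandas as pd

# Daily timestamps covering all of 2012, endpoints inclusive.
dates = pd.date_range(start="2012-01-01", end="2012-12-31", freq="D")
print(len(dates))  # 366; 2012 is a leap year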

 

Check out the statistics overview for a daily date series for the year 2012, created with the new Operator:

Get Date Series Operator Results

 

Introducing the Get Parameters Operator

 

Sometimes you want to retrieve all Parameters of an Operator in your Process; for example, to store all Parameters of the model trained inside an Optimize Operator, not only the ones that were optimized.

 

The new Get Parameters Operator enables you to do so. You specify the name of the Operator whose Parameters you want to extract, and the Operator creates a new Parameter Set containing all Parameters of the specified Operator.
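The closest everyday analogue outside RapidMiner is probably scikit-learn's get_params(), shown here as a sketch (not the Toolbox operator):

from sklearn.tree import DecisionTreeClassifier

model = DecisionTreeClassifier(max_depth=5)

# Full parameter set, including every default a search never touched.
for name, value in sorted(model.get_params().items()):
    print(f"{name} = {value}")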

 

Here is an example of the Parameters of a Decision Tree Operator:

 

Parameter Set Operator results

 

The Parameter Set can now be stored in the Repository, written to a file with the Write Parameters Operator, or used by the Set Parameters Operator. You can even convert it into an ExampleSet using the new Parameter Set to ExampleSet Operator in version 0.2.0 of the Converters Extension.

 

New in the Converters! 

 

Introducing the Parameter Set to ExampleSet Operator

 

This new Operator converts a given Parameter Set (coming from an Optimize Operator or, for example, from the Get Parameters Operator in the Operator Toolbox Extension) into an ExampleSet.

 

The resulting ExampleSet contains one Attribute for each Parameter in the Parameter Set. You can even let the Operator try to estimate the type of each Parameter (Integer, Real, Nominal).
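The type estimation can be pictured with this naive Python sketch (a hypothetical helper, not the Converters code): try integer, then real, then fall back to nominal.

import pandas as pd

def guess_type(value):
    """Coerce a parameter string to int or float, else keep it nominal."""
    for caster in (int, float):
        try:
            return caster(value)
        except ValueError:
            pass
    return value

parameter_set = {"maximal_depth": "20", "criterion": "gain_ratio",
                 "confidence": "0.25"}
example_set = pd.DataFrame([{k: guess_type(v) for k, v in parameter_set.items()}])
print(example_set.dtypes)  # int64, object, float64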

 

This is the result of converting the Parameter Set of an optimized Decision Tree:

 

Parameter Set to ExampleSet Operator

 

And finally...

 

Introducing the Normalization to ExampleSet Operator

 

Have you ever wanted to get a Normalization model as an ExampleSet? Now you can! This Operator takes the preprocessing model and creates an ExampleSet with the corresponding Attributes out of it.
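The idea translates directly to, say, scikit-learn's StandardScaler (an illustrative sketch of the Z-transformation case, with made-up Golf-style numbers):

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({"Temperature": [85, 80, 83, 70],
                     "Humidity": [85, 90, 78, 96]})
scaler = StandardScaler().fit(data)

# One row per attribute: the statistics the fitted model would apply.
table = pd.DataFrame({"attribute": data.columns,
                      "mean": scaler.mean_,
                      "std": scaler.scale_})
print(table)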

 

Here are the results for each of the four Normalization methods (Z-transformation, range transformation, proportion transformation and interquartile range, respectively) on the Golf dataset:

 

Normalization to ExampleSet Operator

Thanks to everyone in the community for downloading our Operator Toolbox and Converters Extensions. We welcome any comments or questions you have; just post them in the community forums! If you have any ideas for new operators for the Toolbox or Converters, visit our Product Ideas forum and post your wish-list items there.

  • Converters
  • Decision Tree
  • extensions
  • parameters
  • Toolbox
RM Certified Expert
‎03-23-2017 08:40 AM
255 Views
0 Comments

Just some interesting community and data science links I've come across this week. Enjoy!


 

From the Community

Interesting links from the Interwebz

  • links
  • Roundup
RM Certified Expert
‎03-17-2017 09:12 AM
388 Views
0 Comments

The General Online Research Conference is annually organized by the German Society for Online Research in cooperation with a local partner. In 2017 the GOR conference will take place in Berlin, Germany, with the HTW Hochschule für Technik und Wirtschaft Berlin/University of Applied Sciences being the local organizer.


By: David Arnu, M.Sc.

 

On behalf of the RapidMiner Research Team, Edwin Yaqub, PhD, and I went to GOR to present our work on Automated Mechanisms to Discover and Integrate Data from Web-based Tabular Collections, which is part of the results of our DS4DM research project (http://ds4dm.de/en/). Our poster explains the concept of extending the value of your data by automatically finding and adding additional attributes (see the attached PDF of the poster).

 

The conference is a great mix of researchers from very different fields. There are a bunch of data scientists like us who show the value of analytics and how to use big data techniques for applied social science. Besides that, there are many people from market research and political science who analyse, for example, how social media influenced the latest election and how to build better prediction models for the next polls.

 

If you want to find out more about GOR, just check out #GOR17 on Twitter.

 

Photos!

The Conference in full swing!

David showing off how RapidMiner works!

Edwin photobombing us!

A proper photo of Edwin showing off RapidMiner!

 

RM Staff
‎03-14-2017 05:44 PM
287 Views
0 Comments

This is Ingo, the founder of RapidMiner.  Today we are here to look for a new team member for our Boston office.  Are you a data scientist with some experience in RapidMiner?  Then the following might be of interest to you!


Hey,

 


we-want-you.png

We are looking for a presales engineer. Sounds fancy, but what this means is that you are working as a data scientist, learning about the analytical problems our users want to share, introducing our products to people, and even creating proofs of concept. If you mix this with a good amount of communication (including the problem-understanding part!), then this is a pretty exciting role. Well, it is what I did for many years myself, so I might be a bit biased here ;-)

 

The job requires a physical presence in Boston and you need to be eligible to work in the US.  If you like data science, know RapidMiner, and have fun working with great people - then you should consider this!

 

Here is more information and some guidance about how to apply: https://rapidminer.workable.com/jobs/440004

 

Looking forward to welcoming one of you to our team soon!

 

Cheers,

Ingo


How to load processes in XML from the forum into RapidMiner: Read this!
RM Certified Expert
‎03-10-2017 08:18 AM
749 Views
0 Comments

Last week I attended the Gartner Data Analytics (DA) summit in Grapevine Texas. It was quite an event, filled with great exhibition booths and presentations. I did booth duty, along with my colleagues, but managed to attend a few great presentations.

 
For three days we staffed our RapidMiner booth and interacted with some fantastic people. Some came from the BI space and were curious about what RapidMiner did: was it BI or something else? Others knew us and wanted to find out more about our Data Science platform.
 
I love these shows, not because I demo the product, but because of the people I have deep conversations with. Roughly 10% of the people I met had some hard implementation problem to overcome. They all understood that data science would solve their problems, but getting their team up and running was hard.
 
One guy was a data scientist who had moved to another company and into a managerial position. His task was to get a DS team up and running at the new firm. He knew the tools out there (i.e. Python, R, etc.) but was looking for something he could use (like RapidMiner) to get his new team productive, and quickly.
 
Photos!
 

 

Ingo holding court!

The RapidMiner Booth was hopping!

 

 PerlMonky Wins!

  

On the last day of exhibitions, RapidMiner held a drawing to give away the orange spectacles. We created a RapidMiner process "on the fly" to select a random person on Twitter who tweeted with the hashtags #GartnerDA and #RapidMiner. Twitter user PerlMonky won the spectacles, and here's a post of him wearing them!

 

And in case you want to see the process we used to run the random drawing, it's below and only three operators long. Now that's Lightning Fast Data Science!

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
        <parameter key="connection" value="ThomasOtt"/>
        <parameter key="query" value="#gartnerda #rapidminer"/>
        <parameter key="limit" value="1000"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.4.000" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
        <parameter key="invert_filter" value="true"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="From-User.equals.Ingo Mierswa"/>
          <parameter key="filters_entry_key" value="From-User.equals.Thomas Ott"/>
          <parameter key="filters_entry_key" value="From-User.equals.Tom Wentworth"/>
          <parameter key="filters_entry_key" value="From-User.equals.RapidMiner"/>
        </list>
        <parameter key="filters_logic_and" value="false"/>
      </operator>
      <operator activated="true" class="sample" compatibility="7.4.000" expanded="true" height="82" name="Sample" width="90" x="313" y="34">
        <parameter key="sample_size" value="1"/>
        <list key="sample_size_per_class"/>
        <list key="sample_ratio_per_class"/>
        <list key="sample_probability_per_class"/>
        <parameter key="use_local_random_seed" value="true"/>
        <parameter key="local_random_seed" value="2000"/>
      </operator>
      <connect from_op="Search Twitter" from_port="output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Sample" to_port="example set input"/>
      <connect from_op="Sample" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

 

 

 

 
  • GartnerDA
RM Certified Expert
‎03-01-2017 07:20 AM
1154 Views
0 Comments


By: Fabian Temme, PhD

 

Editor's Note: Fabian shares some of his experiences with creating Model Management Applications in RapidMiner Server as part of his daily Data Science work.

 

I recently reached a point in my daily work for the PRESED project (Predictive Sensor Data Mining for Product Quality Improvement, www.presed.eu) within the funded R&D Research Team at RapidMiner where I needed to put something into production. This probably sounds familiar to many data scientists after they have found their insight.

 

It all started the usual way: I did some complex preprocessing that transformed my data into a nice table. Once I had hundreds to thousands of examples with their attributes and assigned labels, I started training models and validating their performance, with the intent of bringing them into production.

 

Here's the problem. Which model do I choose to put into production?

 

RapidMiner offers a great variety of different models, and also the possibility to combine them (for example by grouping, stacking or boosting). But I still had to answer the question, which one? 

 

I decided to test several models and needed an easy way to compare and visualize them. I wanted to do a "model bake-off", and here's how I did it.

 

For this example we'll use the Sonar data sample provided in RapidMiner and start with a typical standard classification process: 

 

 

Classification Process.png

 

Here I retrieved the input data and trained a Random Forest inside a Cross Validation operator to extract a final model together with an averaged performance from the cross validation. I then stored the model and the performance vector inside a 'Results' folder in my repository (see below). I used a macro (a process/global variable) to define a name for the model (in this case 'Random Forest') and stored the results in a subfolder with this name.

 

For another model I could simply copy the process, exchange the algorithm, let the macro automatically name the model, and hit run. I repeated this for each algorithm I wanted to test.
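If you want to picture the same bake-off outside RapidMiner, here is a scikit-learn sketch of the loop (with assumed synthetic data standing in for my real project data):

import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=208, n_features=60, random_state=0)

candidates = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
}

# One cross-validated score per algorithm, collected into a single table.
results = pd.DataFrame([
    {"model": name, "accuracy": cross_val_score(est, X, y, cv=10).mean()}
    for name, est in candidates.items()
])
print(results.sort_values("accuracy", ascending=False))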

 

This is what my 'Results' folder looked like:

Repository.png

 

 

But how easy is it to compare the different methods and do it automatically?

 

For that I designed a simple Web App on the RapidMiner Server I was working on. First I needed a process that loops over the 'Results' folder, automatically retrieves the performance vectors, and transforms them into one ExampleSet. With the 'Publish to App' Operator I made the ExampleSet accessible to the new Web App.

 

Process Chain.png

 

 

Switching back to the App Designer, I added two visualization components, both subscribing to the ExampleSet published by my process.

The first component uses the 'Chart (HTML5)' format with the chart type 'series' to show a graph of the results; the other uses the 'table' format to show the results directly.

A button that reruns the publishing process finalized the Web App.

 

This is the resulting view.

 

WebApp.png

 

Done! These processes can easily be adapted to test more algorithms and to visually display whatever performance vector results you want to see.

 

PS: Check out the attached zip file for process examples!

 

 

  • Model Management
RM Certified Expert
‎02-27-2017 09:32 AM
1263 Views
3 Comments


By: Martin Schmitz, PhD

 

As RapidMiner users we are used to one-operator solutions. Want to add a PCA? Add the operator. Want to do an ensemble? Add the operator. Over time the RapidMiner ecosystem has evolved so that most tasks are easy to handle like this. However, doing data science every day, I have run into a few things for which RapidMiner has no one-operator solution. How do we solve that?

 

In this case you can use the scripting interfaces, build a building block, or write your own extension. The extension might be the slowest way, but it has the clear benefit of making your results easily usable for others. Recently, I joined forces with the RapidMiner Research Team, and we want to share our tools with you, the community. The result is two new extensions packed with new tools that make your life easier.

 

I am happy to introduce the Operator Toolbox and Converters Extensions!

 

Generate Levenshtein Distance
In text analytics you often face the problem of misspelled words. One of the most common ways to find misspelled words is to use a distance between two words. The most frequently used distance measure is the Levenshtein distance, defined as the minimum number of single-character edits needed to transform one string into another.


Levenshtein.png
This can be used to generate a replacement dictionary.
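For reference, here is the textbook dynamic-programming version of that distance in plain Python (a sketch, not the extension's code):

def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3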

 

Generate Phonetic Encoding


During text processing you might encounter the problem that words are spelled differently but pronounced the same way. Often you want to map these words to the same string. A good example is names like Jennie, Jenny and Jenni. Algorithms doing this kind of encoding are called phonetic encoders. Scott Genzer posted a building block on our Community Portal to generate the Daitch-Mokotoff Soundex encoding. Driven by this, we created an operator that can use various algorithms to do this kind of encoding.

 

Phonetic - 1.png

 

A typical result is depicted above. The current version of the operator supports a broad range of algorithms, namely: BeiderMorse, Caverphone2, Cologne Phonetic, Double Metaphone, Metaphone, NYSIIS, Refined Soundex, and Soundex.
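As a taste of how these encoders work, here is classic American Soundex, the simplest algorithm in that list, sketched in plain Python (not the operator's implementation):

def soundex(word):
    """Classic American Soundex: first letter plus three digits."""
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    encoded, last = word[0].upper(), codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != last:
            encoded += code
        if ch not in "hw":  # h and w do not separate equal codes
            last = code
    return (encoded + "000")[:4]

for name in ("Jennie", "Jenny", "Jenni"):
    print(name, soundex(name))  # all map to J500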

 

Tukey Test


"When is a value an outlier?" is one of the most frequently asked questions in anomaly detection. No matter whether you do univariate outlier detection on single attributes or use RapidMiner's Anomaly Detection extension to generate a multivariate score, you still need to define a threshold. A common technique for this is the Tukey test (or criterion). It results in an outlier flag as well as a confidence for each example, and it can be applied to several attributes at a time.
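A minimal numpy sketch of the criterion (flag only; the operator additionally reports a confidence per example):

import numpy as np

def tukey_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

print(tukey_outliers([10, 12, 11, 13, 12, 95]))  # only 95 is flagged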

 

Tukey.png

 

Group Into Collection
This operator enables you to split an ExampleSet into several ExampleSets using a group-by. The result is a collection of ExampleSets. In combination with a Loop Collection operator, this lets you apply arbitrary functions with a group-by statement. A possible example would be finding the last 3 transactions for each customer in transactional data.
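The "last 3 transactions per customer" example looks like this as a pandas sketch (made-up data, illustration only):

import pandas as pd

transactions = pd.DataFrame({
    "customer": ["A", "A", "A", "A", "B", "B"],
    "date": pd.to_datetime(["2017-01-01", "2017-01-05", "2017-01-09",
                            "2017-01-12", "2017-01-02", "2017-01-08"]),
    "amount": [10, 20, 15, 30, 40, 25],
})

# Split by customer (the "collection"), keep each group's last 3 rows.
last_three = transactions.sort_values("date").groupby("customer").tail(3)
print(last_three.sort_values(["customer", "date"]))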

 

 

 Loop Collection_cropped.png

 

  

 

Get Last Modifying Operator

If you dive a bit deeper into modelling, you might want to try different feature selection techniques and treat the choice as a parameter of your modelling process. This can be achieved using a Select Subprocess inside an Optimize Parameters Operator. But to figure out which feature selection technique won, you would need to add at least one additional operator per method. To overcome this, it is now possible to extract the last modifying Operator for every object. This way you can more easily annotate which feature selection technique was the best.

 

Extracting PCA, Association Rules, ROC

 

The Converters extension lets you do a lot of things that our users have asked for. Want to extract those Association Rules? You can do that now. Want to extract PCA results into an ExampleSet table? You can do that now. Just check out the extension on the Marketplace to see all the neat things you can do.

  • extensions
RM Certified Expert
‎02-20-2017 08:31 AM
1399 Views
0 Comments


By: Jesus Puente

 

SparkRM is a new Radoop operator - but not just any new operator to be added to the 70+ collection that the Radoop extension includes - it’s an operator that opens a wealth of new use cases for exploiting and analyzing Hadoop data with RapidMiner.

 

SparkRM is a meta-operator, which means that you can double-click on it and a new canvas opens where you can design a new process (similar to what you would find in the "Split Validation", for instance). What's special about SparkRM is that, even though it is a Radoop operator, the inner process has to be designed using non-Radoop, regular RapidMiner operators. And whatever operator or subprocess you place inside SparkRM will be packaged and pushed to Hadoop for execution in a parallel way.

 

Here's an example. Let's imagine you have a lot of text data in your Hadoop environment and you want to analyze it using RapidMiner's Text Processing Extension. Well, now you can: you read the documents and feed them into the SparkRM operator.

The data is passed on to the non-Radoop subprocess inside. You can process, tokenize, create word lists, find expressions, n-grams, etc., and everything runs within the Hadoop cluster.

 

A typical process would look like this:

SparkRM Process

 

And this is what you would have inside the SparkRM operator:

 

SparkRM Text Mining Extension

 

Some typical parameters of SparkRM include the file format (textfile or parquet) and the partitioning mode.

 

SparkRM Parameters

 

Once the task is finished, the result is returned as usual through the output ports. The first output port is for data sets, and its contents can be merged. If the data coming from the different partitions is consistent (same metadata), the operator simply appends everything together. If not, there is an option to "resolve schema conflicts" and add the necessary missing values so that the full dataset contains all the information from all the partitions. This is especially useful when analyzing text, because the word list of one text will probably not be the same as that of another.
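The "resolve schema conflicts" behavior is roughly what an outer concatenation does in pandas (illustration only): the columns are unioned and the gaps become missing values.

import pandas as pd

# Two partitions whose word lists only partly overlap.
part1 = pd.DataFrame({"doc": ["d1"], "spark": [3], "hadoop": [1]})
part2 = pd.DataFrame({"doc": ["d2"], "spark": [1], "rapidminer": [2]})

merged = pd.concat([part1, part2], ignore_index=True)
print(merged)  # 'hadoop' and 'rapidminer' are NaN where absent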

I have described an example for text processing, but you can imagine any other extension or algorithm that’s not in Radoop: Series Forecasting, Deep Learning, Neural Networks, Process Mining, etc.

 

 

 

  • radoop
  • SparkRM
Maven
‎09-06-2016 11:34 AM
417 Views
0 Comments

IMG_8990.JPG

 

Getting those data scientists young. 


 

... but they only stay together for the sake of the kids, or so the old joke goes. 

 

https://www.linkedin.com/pulse/goats-monogamous-steve-farr?trk=hp-feed-article-title-like

 

 

RM Staff
‎07-28-2016 03:03 PM
589 Views
0 Comments

As most of you are already aware, RapidMiner is a kick-ass platform offering pretty much everything you need for doing data science in a very efficient way.  But what you don’t know is that …

 

RapidMiner Studio just got even more awesome!

 

Wait… is this even possible?  Well, it was no easy task – but we have done it: Introducing RapidMiner Studio 7.2. Let’s take a look at some of the new features.


New Machine Learning Algorithms

We’ve added 4 new algorithms for machine learning, and I am still having a hard time figuring out which one I like the most:

  • Gradient Boosted Trees
  • Deep Learning
  • Generalized Linear Models
  • A brand-new implementation of Logistic Regression



Naturally, I gave them a test run on some data sets and was pretty freakin' impressed with the prediction accuracy, automatic tuning capabilities, and runtimes. On the well-known Sonar data set, for example, I consistently achieved performance results of 78% to 80% without any parameter tuning. This is a nice bump over other algorithms, which only get up to 70% to 75% after heavy optimization cycles.

 

This lift in performance can in part be attributed to the fact that these algorithms tune themselves: they are designed to find the best parameter settings for optimizing prediction accuracy. This not only delivers better accuracy, but also reduces some of the effort required for tuning these bad boys.

 

You can find more on the RapidMiner blog at https://rapidminer.com/gradient-boosted-trees-deep-learning-less-5-minutes-bet/

 

Cheers,

Ingo


How to load processes in XML from the forum into RapidMiner: Read this!
RM Staff
‎07-28-2016 05:49 AM
760 Views
1 Comment

For my recent blog post I needed to filter out all attributes having at least one value above a threshold. Traditionally I did this with Transpose, Filter Examples, Transpose again.

I then realized that there is a much nicer way, which I would like to share with you.


If you have a look at Select Attributes, you can choose the attribute filter type "numeric_value_filter". It can be used like this:

 

For example, the numeric condition '> 6' will keep all nominal attributes and all numeric attributes that have a value greater than 6 in every example.

This is nice, but not exactly what I wanted. To filter not on all values of the attributes but on the overall minimum, we can use an Aggregate operator to get the minimum of every attribute. On this result we can use Select Attributes with the numeric_value_filter option. After removing the minimum(...) prefix with a Rename by Replacing operator, we have the schema we wanted.

The trick is now to use Data to Weights to get a weight vector of all remaining attributes. Applying this weight vector to the original data with Select by Weights yields the desired result.
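For comparison, the whole construction collapses to a single expression in pandas (illustration only): keep the columns whose overall minimum exceeds the threshold.

import pandas as pd

df = pd.DataFrame({"a": [7, 9, 8], "b": [2, 9, 8], "c": [10, 11, 12]})
kept = df.loc[:, df.min() > 6]
print(list(kept.columns))  # ['a', 'c']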

The complete process looks like this:

 

Replace.png

 

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
Maven
‎07-27-2016 07:22 AM
513 Views
0 Comments

We have moved some things - to help you!

 

rapidminer_logoC5_RGB_v1.png


Hi all

 

Well, we are 6 weeks into the new community website, and I have to say a very big thank you to the thousands of you who have joined (notice the counter now standing at over 150,000!) and to all of you who have contributed in some way. You may have noticed that you have been awarded badges, or that your rank in the community has changed. That's called 'gamification', and it's there to make things more fun and to give all our contributors recognition for their contributions.

 

So, what about these changes?

We took a good look at the first 6 weeks of data (what you searched for, what you accessed, and where you posted) and we felt that the structure of the menus was too complicated. So here's what we have done:

 

  1. All the Product Help forums are now in one place: on the Product Help menu.
  2. Knowledgebases (more structured articles about the products and data science in general) are in Learning.
  3. Networking is unchanged.
  4. Ideas get their own area and menu.
  5. The Community menu is gone, as a large number of posts about product issues were cropping up there. Instead, there is now a direct link from the home page to all things about the administration of the Community.

New Menus.PNG

We hope that these changes are helpful to you. Please give us feedback here.

 

 Steve Farr

Community Director