If you happen to work at WeWork Tysons in VA or in the general Washington DC area, come down and say hi! I'm going to be speaking at the Spark DC Meetup group this coming Tuesday (5/23) at 6PM. You can get more Meetup details here.
People often think a given model can be put into deployment and left there forever. In fact, the opposite is true. You need to maintain your models like you maintain a machine. Machine learning models can degrade or break over time. Does this sound odd because they have no moving pieces? Then you might want to take a close look at change and drift of concept.
Change of Concept
Let’s start off with an example. If you try to build a predictive maintenance model for an airplane, you often create columns like
as an input for your model. But what happens if error number 5 is not error number 5 anymore? Software updates can drastically change the data you have: they fix known issues, but they can also encode your data in a different way. If you feed post-update data into your pre-update model, it will do something, but not what you expected. This phenomenon is called change of concept.
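To make this concrete, here is a minimal Python sketch (with made-up error codes, not real avionics data) of how a renumbered error code silently breaks a feature engineered against the old encoding:

```python
# Pre-update: error code 5 means "hydraulic pressure warning".
pre_update_log = [5, 5, 2, 5]            # hypothetical error codes from one flight
# Post-update: the firmware renumbers, and the same warning is now code 7.
post_update_log = [7, 7, 2, 7]

def feature_error5_count(log):
    # Model input engineered against the PRE-update encoding.
    return sum(1 for code in log if code == 5)

print(feature_error5_count(pre_update_log))   # -> 3
print(feature_error5_count(post_update_log))  # -> 0: same physical events,
                                              #    but the feature silently collapses
```

The model still runs on the post-update data, which is exactly why this failure mode is so easy to miss.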
Drift of Concept
A very similar phenomenon is drift of concept. This happens when the change is not drastic but emerges slowly. An industrial example is the encrustment of a sensor: it builds up over time, until a measured 100 degrees is not 100 degrees anymore. An example in customer analytics is the adoption process of new technology. People did not all start using iPhones at once; they adopted them slowly. A column like “HasAnIphone” indicated a very tech-savvy person in 2007. Today it indicates an average person.
What Can I Do?
A common approach to overcoming concept drift is window-based relearning. ... Read more on my medium.com page
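Before you head over, here is a minimal Python sketch of the idea behind window-based relearning; the window size and the trivial mean "model" are stand-ins for your real retraining setup:

```python
from collections import deque

class WindowRelearner:
    """Retrains a trivial model (the window mean) on only the most recent
    `window_size` observations, so old concepts age out of the model."""

    def __init__(self, window_size):
        self.window = deque(maxlen=window_size)  # oldest samples drop off automatically

    def observe(self, y):
        self.window.append(y)  # newest observation in, oldest out

    def predict(self):
        # "Relearning" here is just recomputing the mean over the window;
        # in practice you would refit your actual model on the window.
        return sum(self.window) / len(self.window)

model = WindowRelearner(window_size=3)
for y in [100, 100, 100, 80, 80, 80]:  # sensor drifts from 100 down to 80
    model.observe(y)
print(model.predict())  # -> 80.0, the pre-drift 100s have aged out
```

The trade-off is the window size: too small and the model becomes noisy, too large and it reacts slowly to drift.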
Welcome to another edition of the Data Science Link Roundup! In today's post we talk GPS Coordinates, JSON, and much more!
From the Community
Out on the Interwebz
As always, take it to the limit!
RapidMiner's Educational group just released 3 awesome videos on Model Selection, Optimizing Models, and Auto Model Selection and Optimization! These are a must-watch and should get your mind racing with possibilities.
By: Edwin Yaqub, PhD
At RapidMiner Research, we are addressing problems that are becoming increasingly pertinent to businesses. As part of the German research project DS4DM (http://ds4dm.de), we have now released the ‘Data Search for Data Mining’ extension, which provides data enrichment capabilities in RapidMiner.
Data analysts are increasingly confronted with the situation that the data they need for a data mining project exists somewhere on the web or in an organization’s intranet, but they are not able to find it. On the web, data is generally searched via search engines using keywords or text. This is an example of unstructured search. In cases where structured data exists, e.g. in the form of a table, structured and contextualized search is possible. The objective is to enrich an existing table with additional data by harnessing diverse sources of data in an efficient manner. In the literature, this topic is often referred to as Entity Augmentation or Search-Join [1,2]. Search-Joins are useful within a wide range of application scenarios. For example, given a dataset containing attributes like the name, GDP and region of a country, we would like to enrich the dataset by:
The ‘Data Search’ extension implements both of these capabilities and thus brings the Search-Join data enrichment method to RapidMiner.
Besides the subject matter, this post also shows that Java developers can reuse RapidMiner libraries to customize visualizations, add GUI panels and controls in their extensions to suit their needs.
Data enrichment through the Search-Join method
The Backend: For the search function, the extension uses a Search-Join data server at the backend. This is developed by our project partner, the University of Mannheim (Data and Web Science group). The backend comprises a corpus of heterogeneous data tables, which are indexed and stored after being extracted from data sources. The current implementation uses a subset of Wikipedia as a source, but more sources will be added in the future. The extension (frontend) interfaces with the backend through a web service, which uses algorithms to discover candidate tables. The discovery is based on schema (column level) and instance (row level) matches between the provided query and the tabular corpus.
The Frontend: The extension is composed of three operators: Data Search, Translate and Fuse, which work together in an operator chain as seen in Fig. 1.
Data Search operator: This operator queries the web service for relevant tables by submitting an entity query. The entity query comprises an existing dataset, of which one attribute is recognized as the subject identifier (primary identifier of a row), plus a keyword for the additional attribute to be discovered. The server returns a collection of relevant tables. The schema-level and instance-level matches are also made available at the output ports.
If you select the checkbox ‘apply manual refinements’ in the operator parameter panel, the process execution is halted in real time and you are taken to a Control Panel graphical view. Here, you see the discovered data tables matching your query as shown in Fig. 2. The customized tree view lists candidate tables, which can contribute values for your new attribute. The red legend indicates that the table (shown as a named node in the tree panel) has an attribute (columnar) match to your original table. Similarly, the blue legend indicates a match at the instance (row) level and both legends together indicate both matches, which is the ideal case.
The panel shows the distribution of two statistics over the collection to give a high-level view at a glance:
Noise Removal: Data search is inherently susceptible to noise. If the analyst deems a discovered table to be noisy, it should be deleted before the process execution is resumed.
A noisy table can be removed by selecting it in the tree, right-clicking, and choosing the Delete menu item. This changes the data model of the operator, and therefore these changes need to be committed in-memory by clicking the ‘Commit Updates’ button before resuming the process execution. If you accidentally delete a node, the original collection can be restored at any time through the ‘Restore Original’ button. These controls are shown in Fig. 3. Notice that the example sets at the output ports of the operator, i.e. the schema and instance match tables, are updated accordingly. The idea is that only refined output reaches the next (Translate) operator in the chain.
Care must be taken when deleting tables to prevent the loss of potentially valuable tables. To assist the data analyst in this exploratory task, two visualizations are provided.
Interactive Document Map: RapidMiner provides a Self-Organizing Map (SOM) visualization which can be used to expose patterns in data. We reuse and customize it to tag the dots (points shown on the map) with text showing key properties of each table, i.e. its full name and the counts of schema and instance matches. The map also provides a drill-down mechanism in that each dot is implemented as a hyperlink. If clicked, it opens the associated table in the tree-tabular view. This eases localization and filtering.
The document map helps you understand how the candidate space of discovered tables shows up in a landscape-like layout. For example, tables with higher schema or instance matches might be (but are not necessarily) stronger candidates. You may not want to delete these tables, while others may not be so interesting. The map can also reveal neighbourhoods based on (dis)similarities among the tables, derived from the table properties which are fed internally to the underlying neural network. Fig. 4 shows a document map for the results of a sample query.
3D Labelled Scatter Plot: While the interactive map provides a landscape view of the search space, the 3D scatter plot shows the tables as points along the x-y-z axes. The points are labelled with the table name. This visualization is intended to show how (and if) the tables cluster along individual axes and whether a Pareto frontier exists. If so, the Pareto-efficient tables are stronger trade-off candidates which you may want to keep. Fig. 5 shows such a plot for the results of a sample query.
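For illustration, here is a small Python sketch of a Pareto-frontier filter over hypothetical (schema matches, instance matches) pairs; the table names and counts are invented, and this is not the extension's implementation:

```python
def pareto_front(points):
    """Return the points not dominated by any other point,
    maximizing every coordinate."""
    front = []
    for p in points:
        dominated = any(
            all(q[i] >= p[i] for i in range(len(p))) and q != p
            for q in points
        )
        if not dominated:
            front.append(p)
    return front

# (schema matches, instance matches) for hypothetical candidate tables
tables = {"t1": (5, 2), "t2": (3, 9), "t3": (2, 2), "t4": (5, 9)}
front = pareto_front(list(tables.values()))
print(front)  # -> [(5, 9)]: t4 dominates every other table
```

Tables on the frontier offer the best available trade-off between the two match counts, which is why they are the ones you would hesitate to delete.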
The outputs of the Data Search operator are passed on to the Translate operator. This is where data integration, the Join step in Search-Join, starts. Translate processes the candidate tables using the schema and instance matches. As a result, a new collection of tables in the image of your original dataset is created. This collection of 'translated' tables is composed of only those candidate tables which have at least one cell value to contribute to your new attribute. Here again, the 'apply manual refinements' checkbox can be selected to filter out unwanted tables before they reach the Fuse operator. Interested readers are referred to the references below for conceptual details.
The last operator in the Search-Join process is the Fuse operator. Fuse takes the outputs of the Translate operator as input. It then selects a particular cell value for the new attribute from the collection of translated tables. The decision which value to choose from which table is made by a fusion policy, which uses criteria provided by the user in the operator parameter panel. At this stage, we provide a default fusion policy. Finally, the chosen cell values are fused to the corresponding instance (row) of your original dataset, and an enriched dataset with the new attribute is produced. This concludes the data integration (Join) step. Fig. 6 shows the enriched dataset(s), where the new attributes ‘language’ and ‘currency’ have been added to the original dataset.
In this blog post, you learned about the ‘Data Search for Data Mining’ extension, which can be used to enrich an existing dataset with relevant new attributes. The GUI features shown here reuse RapidMiner source to achieve the necessary customizations. If you perform similar customizations, just ensure that the RapidMiner security guidelines are respected. The DS4DM project is under active development, and new features being developed at the backend and the frontend will be rolled out in subsequent releases. I will stop here and urge you to go ahead, install the extension from the Marketplace, and simply execute the sample process (attached and below) for a first-hand experience.
The Data Search extension is developed as part of Data Search for Data Mining (DS4DM project, http://ds4dm.de) sponsored by the German ministry of education and research (BMBF).
[1] Bizer, Christian, et al. 'The Mannheim Search Join Engine.' Web Semantics: Science, Services and Agents on the World Wide Web, Vol. 35, Part 3, Dec. 2015.
[2] Bizer, Christian, Tom Heath, and Tim Berners-Lee. 'Linked Data - The Story So Far.' Semantic Services, Interoperability and Web Applications: Emerging Concepts (2009): 205-227.
[3] Bizer, Christian. 'Schema Mapping and Data Translation.' Lecture notes, weblink: http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/Lehre/WebDataIntegration/HWS2015/WDI0...
[4] RapidMiner documentation on Security and Restrictions, weblink: http://docs.rapidminer.com/developers/security
[5] Data Search for Data Mining (DS4DM) project, weblink: http://ds4dm.de
As promised in the first announcement of the Write for Us program, here is Phase II. In this phase we want to start giving people cold hard cash for writing and sharing interesting Building Blocks and Knowledge Base articles. I'll update the Community News section with all the particulars, but below is the gist of it all.
If you have an interesting Building Block that you'd like to share, send me a paragraph about it along with a sample process. We'll review it, and if it's accepted, go ahead and write up a detailed explanation of what the Building Block does, along with the Building Block itself and a sample process. We'll pay you $20 (USD) per Building Block once we've received your writeup, reviewed it, and posted it.
Knowledge Base Articles
Knowledge Base (KB) articles are also a great way to show off how to do something with RapidMiner Studio, Server, or Radoop. The same process applies here as with the Building Blocks above. If you have an idea for a great Knowledge Base article, send me a short paragraph with your idea. If accepted, then go ahead and start writing.
When writing a KB article, make sure to:
For a short KB article we'll pay you $25 (USD), and for a long KB article we'll pay $50 (USD). What's the difference between a short and a long KB article? That's easy.
A short KB article shows how to do one simple thing, like passing a JSON file to RapidMiner Server or saving an R or Python model in RapidMiner. A long KB article would include a full use case and the application of RapidMiner through the entire development cycle. It could include the mashup of scripts too (Groovy, Python, R, etc.). Of course, you should provide a sample process and data (anonymized if required).
Disclaimers apply, of course. Partners and employees are not allowed to take part in the program, and we reserve the right to change anything at any time.
He also warned that longer lifespans and better artificial intelligence were likely to lead to both aging labor forces and fewer jobs. “Machines should only do what humans cannot,” he said. “Only in this way can we have the opportunities to keep machines as working partners with humans, rather than as replacements.”
We're getting ready to unveil a new program at the RapidMiner Community where our members can take part in the growth and influence this place has. We'll be rolling out, in phases, a new "Write for Us" program where Community members can submit guest blog articles, knowledge base articles, and building blocks for cash and swag.
The first phase of the Write for Us program is Guest Blogging. There isn't any cash payout for writing a guest blog post for us, but you do get a link back to your blog/site and lots of Community kudos. Guest blogging is open to anyone who's using RapidMiner to do some really cool data science stuff. This includes tips and tricks or even some neat Groovy Script, Python, or R hacks with RapidMiner. Think of anything cool you do with RapidMiner and share it!
The second phase of the Write for Us program is where you can earn cold hard cash and swag. There is a wealth of knowledge stored in our Community members' heads. We get a glimpse of it when you all post in the forums and come up with novel solutions. Why not take what you've worked hard to solve and earn some $$$ with it? Created a Building Block that does something neat and cool? Submit it and get $$$. Have a great idea for a Knowledge Base article? Submit it and get $$$! We'll come up with an extra swag contest for the biggest contributor to the Community too.
Of course there will be terms and conditions to both of these phases so check out the Community News section as we roll out this program.
By Jesus Puente, PhD.
Let’s start from the beginning: what is a data core?
The data core is the component that manages the data inside any RapidMiner process. When you “ingest” data into a process from any data source (database, Excel file, Twitter, etc.), it is always converted into what we call an ExampleSet. No matter which format it had before, inside RapidMiner data always has a tabular form, with Attributes being columns and Examples being rows. Because anything can be added to an ExampleSet, from integers to text or documents, the way this table is internally handled is very important, and it has a lot of impact on how much data you can process and how fast. Well, that is exactly what the Data Core does: it keeps the data in memory, taking types and characteristics into account and making sure memory is used effectively.
Yes, but, how does it affect me?
Well, the more efficiently the Data Core manages memory, the larger the ExampleSets you can use in your processes. And, as an additional consequence, some processes can get much faster by improving access to elements of the ExampleSet.
Can you give an example?
Sure! There are different use cases; one of them is sparse data. By that, we mean data which is mostly zeros with only a few meaningful numbers here and there. Let’s imagine you run a market basket analysis in a supermarket chain. You have lots of customer receipts on one hand and lots of products and brands on your shelves on the other. If you want to represent that in a table, you end up with a matrix of mostly zeros. The reason is that most people only buy a few products from you, so most buyer-product combinations have a zero in the table. That doesn’t mean that your table is useless, on the contrary! It contains all the information you need.
Another example is text processing. Sometimes you end up with a table whose columns (i.e. Attributes) are the words that appear in the texts and the rows (i.e. Examples) are the sentences. Obviously, each sentence only contains a few words so, again, most word-sentence combinations have a zero in their cells.
Well, RapidMiner’s new Data Core automatically detects sparse data and greatly decreases the memory footprint of those tables. A much more compressed internal representation is used, the ExampleSets become easier to handle, and processes are sped up.
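As a rough illustration of why a sparse representation saves memory, here is a generic dictionary-of-keys sketch in Python (not RapidMiner's actual internal format):

```python
# Dense: every cell is stored, including the zeros.
dense = [[0] * 1000 for _ in range(1000)]   # 1,000,000 cells
dense[3][7] = 1                             # a customer bought product 7
dense[500][42] = 2

# Sparse (dictionary of keys): only the non-zero cells are stored.
sparse = {(3, 7): 1, (500, 42): 2}

def lookup(row, col):
    return sparse.get((row, col), 0)        # a missing key means zero

print(len(sparse))                 # 2 stored values instead of 1,000,000 cells
print(lookup(3, 7), lookup(0, 0))  # -> 1 0
```

The same information is preserved, but the storage cost scales with the number of non-zero entries rather than with the full table size.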
Another use case is related to categorical (nominal) data in general. Even in the “dense” (non-sparse) case, data sizes within the cells of a table can vary a lot. Integers are small in terms of memory use, while text can be much bigger. The new Data Core also optimizes the representation of this kind of data, allowing for very heterogeneous ExampleSets without unnecessarily wasting memory.
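A common trick behind this kind of nominal-data optimization is dictionary encoding: store each distinct string once and keep only small integer codes per row. A generic Python sketch of the idea (again, not RapidMiner's actual implementation):

```python
# Dictionary-encode a nominal column: each distinct string is stored once,
# and each row keeps only a small integer code.
values = ["red", "blue", "red", "green", "red", "blue"]

dictionary = {}  # string -> code
codes = []       # one small int per row
for v in values:
    codes.append(dictionary.setdefault(v, len(dictionary)))

reverse = {code: s for s, code in dictionary.items()}

print(codes)              # -> [0, 1, 0, 2, 0, 1]
print(reverse[codes[3]])  # -> 'green'
```

With only a handful of distinct values in a column of millions of rows, the savings over storing the full strings per row are substantial.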
Tell me more!
As often in life, in some cases, there is a tradeoff between speed and memory usage. Operators like Read CSV and Materialize now have an option to be speed-optimized, auto or memory-optimized. These options allow the user to choose between a faster, but potentially more memory intensive data management, or a more compact but probably slower representation. Auto, of course, decides automatically based on the properties of the data. This is the default and recommended option.
The representation of data within the new data core is based on columns (Attributes) instead of rows (Examples). This improves performance especially whenever the data transformation is based on columns, which is the most common case in data science. Some examples are:
In many processes, it’s necessary to generate new data from the existing columns. The Generate Data operator does that. Also, loops and optimization operators often create temporary attributes. The new data core also provides a nice optimization of these use cases by handling the new data in a new, much more performant way.
Loop attributes, then values
In many data preparation processes, attributes are changed, re-calculated or used in various ways. The columnar representation is ideal for this.
We have already mentioned text use cases. It’s worth mentioning that Text Processing and other extensions already benefit from the new core. Moreover, we have published the data core API so that any extension developer in our community can adapt their existing extensions, or create new ones, to use the improved mechanism.
Time for some numbers, how good is the improvement?
As should have become clear in the previous paragraphs, the degree of improvement depends a lot on the use case. Some processes will benefit a lot and others not so much.
As a benchmarking case, we have chosen a web clickstream use case. We start with a table that contains user web activity. Each row is composed of a user ID, a ‘click’ (a URL) and a timestamp. One typical transformation is to move from an event-based table to a user-based table. Just as an example, we’ll transform the data to get a table with all users and the maximum duration of their sessions. This is a process that needs a lot of data shuffling and looping on values, and even for a relatively small data set it can take a lot of time.
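To make the transformation concrete, here is a pure-Python sketch of the event-to-user aggregation; the 30-minute session gap and the click data are assumptions for illustration, not the benchmark's actual data:

```python
from collections import defaultdict

SESSION_GAP = 30 * 60  # assumption: a new session starts after 30 min of inactivity

# (user_id, url, unix timestamp) click events, assumed sorted by time per user
clicks = [
    ("u1", "/home", 0), ("u1", "/buy", 600), ("u1", "/home", 10000),
    ("u2", "/home", 0), ("u2", "/faq", 120), ("u2", "/buy", 300),
]

by_user = defaultdict(list)
for user, _url, ts in clicks:
    by_user[user].append(ts)

max_session = {}
for user, stamps in by_user.items():
    start = prev = stamps[0]
    longest = 0
    for ts in stamps[1:]:
        if ts - prev > SESSION_GAP:          # gap too large: close the session
            longest = max(longest, prev - start)
            start = ts
        prev = ts
    max_session[user] = max(longest, prev - start)

print(max_session)  # -> {'u1': 600, 'u2': 300}
```

Every click of every user has to be touched and grouped, which is why this kind of reshaping stresses both memory layout and looping speed.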
Let’s start with some small amount of data: 10,000 examples. I ran the process on my 8-core laptop with 32 GB of RAM. These are the results (runtimes in seconds) by threads used for parallelization.
With a single core (what’s available in the Free license), the new Data Core already provides 2x performance. As more cores are used, the times get smaller and smaller. See numbers below: with the old core using 1 thread, the job took more than 2 minutes to complete and, now with the parallelization and the new data core, it only takes 10 seconds!
In this case, the new data core helped improve performance. However, the data core is all about memory, and we’ll see that in the next example. Let’s run the same process, but with a five times larger data set (50,000 rows). Take a look at the numbers:
This time the runtimes are in minutes. As you can see, the pattern with the new data core is similar to that in the previous example. It’s more data, so it takes more time, but the times are reasonable. With the old data core, however, the times simply blow up. And here’s the reason:
Very soon, my 32GB of main memory are fully used and everything gets extremely slow. The same process with the new data core looks like this:
It never goes beyond 65%. Therefore, the new data core allows you to work with data set sizes which were unmanageable before given a certain memory size.
RapidMiner’s new data core is a big thing. It improves data and memory management and it allows you to work with much bigger data sets keeping your memory demand at bay.
It’s already available as a beta. Try it NOW!
Greetings Community! Here's a quick interesting link roundup for your Data Science needs!
From the Community
Interesting Links from the Interwebz
By: Edwin Yaqub, PhD
In my last post, I introduced the ‘Web Table Extraction’ extension, which provides a convenient way to retrieve data tables from Wiki-like HTML pages. In this post, I will introduce the ‘PDF Table Extraction’ extension - another extension developed at RapidMiner Research as part of the Data Search for Data Mining project (DS4DM, http://ds4dm.de) and released today. So let us see how this extension adds value to RapidMiner processes.
Problem: You may have already faced a situation where you wanted to use data tables from PDF documents. PDF has become a de-facto standard for read-only documents. It is certainly possible, and sometimes unavoidable, to extract data tables out of a PDF using fine-grained scraping techniques, but content parsing in this way is a meticulous activity. In the worst case, your efforts might not be reusable if tables in other documents use a different header structure. The problem is to raise the level of abstraction so that data tables (with arbitrary header structures) can be extracted from a PDF document in an easy way.
Solution: The ‘Read PDF Table’ operator solves this problem. It provides a generic solution to automatically detect and extract data tables from a PDF document as RapidMiner example sets. Simply provide it the path of your PDF file, or its URL if the file resides on the web, and execute the process. The output is a collection, as the operator tries to calibrate the detection of tables in the document. One of these example sets is highly likely to be the most accurate representation of your table. Let’s try some examples, along with a few hints you might find useful when dealing with tables whose headers are complex.
Examples: The first example is rather simple. We use a document where the tables have a clear single-layer header, available here. The operator accurately detects and extracts the tables as seen below.
In the second example, the document contains a table with a 3-layer header. The operator uses the first layer to construct the example set attributes. We can imagine that the second row serves as a more descriptive table header. The ‘Rename by Example Values’ operator easily resolves this task.
Now that we have the ability to extract data tables from a PDF document, let’s make use of some interesting statistics data from the European Commission (Eurostat). Eurostat offers many datasets downloadable as PDF files. One such dataset shows the percentage of individuals that obtain information from public authorities’ websites (per year between 2008-16). Governments use websites to educate the public on a variety of issues such as health awareness creation, political canvassing, travel warnings, development plans, etc. The question is whether, in certain countries, more attention is being paid to this information, and how much. If so, spending could be optimized and different means could be used to expand the audience in specific groups of countries. As we have no labels to classify the data, we turn to RapidMiner clustering to discover groupings. Here we go:
After reading the PDF document from this URL, we realize that the example set has a spurious attribute in the second position, which shifts the rest of the attributes one step to the right. We can easily fix this by using the Data Editor view from the Text Processing extension to rename the attributes and delete the last, redundant attribute. Owing to my programmer instincts, I wrote a short Groovy script that automates this and renames the first column. RapidMiner does not require you to code, but if you have small scripts that do big things, you can of course use the Execute operators.
Next, some pre-processing is performed. We remove the redundant attribute, trailing whitespace, and useless examples from the top and bottom, clean alpha-numeric values to keep only the numeric part, filter out examples with missing values, type the data, convert nominal to numeric, and perform k-means clustering. Now we face the moment of truth: what value to set for k? As we are clueless, this is where RapidMiner offers a good deal: situations like these are ideal for leveraging its Wisdom of Crowds, a guidance feature that suggests parameter values based on how community members used the same operator. Empowered with this knowledge, we quickly try k with 4 and 5, and it becomes clear that 5 provides the better inflection point in reducing the error rate, also considering the output of the Cluster Performance operator (for average in-cluster distance as well as the Davies-Bouldin index).
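If you prefer to see the "inflection point" logic spelled out, here is a minimal, self-contained 1-D k-means in Python on made-up data (not the Eurostat dataset and not RapidMiner's implementation); the sum of squared errors typically drops sharply up to the true number of groups and then flattens:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Minimal 1-D Lloyd's algorithm; returns (centroids, sse)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # initialize with k random points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assign each point to nearest centroid
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]  # recompute the means
                     for i, c in enumerate(clusters)]
    sse = sum(min((p - c) ** 2 for c in centroids) for p in points)
    return centroids, sse

data = [10, 11, 12, 50, 51, 52, 90, 91, 92]     # three obvious groups
for k in (1, 2, 3, 4):
    _, sse = kmeans(data, k)
    print(k, round(sse, 1))  # SSE drops sharply up to k=3, then flattens
```

The same error-versus-k comparison is what the Cluster Performance operator gives you, just computed over the real dataset instead of toy numbers.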
Although our dataset was relatively small, it was not easy to draw conclusions manually. Clustering allowed us to identify five groups of countries. The Centroid Table view of the cluster model provides more details on the attributes (Country, usage data for the years 2008-16) in each cluster. A simpler way to interpret the clusters in this case is to use the overall mean value of the attributes (for 2008-16).
We find that individuals of cluster 2 (Croatia and Poland) obtained the least information from public authorities’ websites, while those of cluster 4 (Netherlands, Sweden and Norway) obtained the most.
Conclusion: In this post, the RapidMiner extension for PDF data table extraction was introduced. It can boost your productivity by expanding your reach to data tables inside PDFs - the universal data format. Feel free to reuse the example process (attached), extend the dataset by joining more PDF data tables (from Eurostat or another source) that interest you, and hand the complexity over to RapidMiner clustering. Have fun discovering more insights!
By: Edwin Yaqub, PhD
Within the RapidMiner Research team, I’m developing extensions that target data enrichment and extraction as part of my work on the research project DS4DM (Data Search for Data Mining, http://ds4dm.de), so that data mining processes produce improved results. Today we released the ‘Web Table Extraction’ extension on the Marketplace, and here is an introduction to it.
Problem: Data scientists are often confronted with situations where data must be read from web pages. For instance, there are a lot of data tables available on Wikipedia which could be utilized, but fine-grained data scraping approaches get complicated for ordinary users, as they often require regular-expression-based parsing and extraction of data from a web page’s content.
Solution: To ease this task, the ‘Web Table Extraction’ extension offers a convenient alternative: it extracts data tables from Wiki-like websites and converts them to RapidMiner example sets.
You simply provide the URL of the web page, e.g. , to the ‘Read HTML Table’ operator and execute the process. Bingo! The operator extracts 9 data tables as example sets in the blink of an eye.
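Under the hood, extracting tables from HTML boils down to walking the table/tr/td structure of the page. Here is a minimal stdlib-Python sketch of that idea (not the extension's actual implementation):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collects every <table> on a page as a list of rows of cell texts."""

    def __init__(self):
        super().__init__()
        self.tables, self.row, self.cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])   # start a new table
        elif tag == "tr":
            self.row = []            # start a new row
        elif tag in ("td", "th"):
            self.cell = ""           # start collecting cell text

    def handle_data(self, data):
        if self.cell is not None:
            self.cell += data

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.row is not None:
            self.row.append(self.cell.strip())
            self.cell = None
        elif tag == "tr" and self.tables:
            self.tables[-1].append(self.row)
            self.row = None

html = """<table><tr><th>Country</th><th>GDP</th></tr>
          <tr><td>A</td><td>100</td></tr></table>"""
parser = TableExtractor()
parser.feed(html)
print(parser.tables)  # -> [[['Country', 'GDP'], ['A', '100']]]
```

Real wiki pages add nested markup, colspans and footnotes on top of this, which is exactly the complexity the operator hides from you.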
Example: Now that we have an encyclopedia at our disposal, let us use a simple example. One of the tables on  gives the GDP (Gross Domestic Product) values for past years and projections for the future. GDP is a measure of a country’s economic activity. Another table on the same page gives us GDP per capita, which can be interpreted as the productivity of a country’s work force, or their affluence. I’d like to see how these values change between 2015 and 2020. I’m also curious to see whether affluence relates to obesity levels. For the latter, we can use the BMI data at this web page.
Thanks to ‘Read HTML Table’ operator, we got the tables as example sets. Next, we apply inner join on GDP, GDP per capita and the BMI tables using the Country attribute. Here is the snapshot of the RapidMiner process for this (the process file is attached as well):
We perform basic pre-processing: we rename the numeric attributes to be descriptive and remove commas from attribute values before applying the Guess Types operator, which assigns integer and real data types to our attributes so we can process them. Finally, we select the six attributes of interest.
A picture is worth a thousand words
The Results view of RapidMiner Studio provides an Advanced Charts module, which is excellent for visualizing our dataset. We drag the attribute 2015_gdp onto the domain dimension (the x-axis); the attributes 2015_per_capita and 2020_per_capita are dragged to a Numerical axis. These now appear on the left vertical axis. Next, we drag the 2020_gdp attribute as a new Numerical axis. This makes it appear on the right vertical axis. We use Country as the Color dimension and, yes you guessed it, Obesity as the Size dimension; hence, the higher the obesity percentage, the bigger the marker.
This multi-series plot provides insights at a glance. The squares show how the GDP of countries compares between 2015 and 2020. The vertical lift between the triangles and the circles shows how per capita income will increase from 2015 to 2020. Japan’s growth is the highest among the industrialized nations. Assuming obesity levels stay the same, we see that a highly affluent nation like the US has the highest obesity (33.7%), but again Japan provides a counter example (3.3%). We also see that less affluent nations can have high obesity. Based on these quick data-driven insights, we could now consider other attributes, perhaps related to culture, eating or work habits, to understand the causes of obesity.
In this post, you learned how the new ‘Web Table Extraction’ extension can help you conveniently extract data tables from Wiki-like pages. You also learned how originally disparate data can be unified in RapidMiner and displayed as a multi-series visualization using the Advanced Charts module. To try it out yourself, go ahead and download the extension from the Marketplace and then try the attached process below. Have fun!
The idea of the Advanced Reporting Extension published by Old World Computing is to use the capabilities of RapidMiner to automate any regular reporting task that results in an Excel sheet. Many projects and data science departments simply drown in these kinds of requests, which consume all resources before you can get to the really fun part of data science. Now you can create nearly zero-overhead reporting right from the start, even if you don't have, or can't use, real business intelligence tools like Tableau or Qlik.
How does that work?
First we create a dummy sheet and add all of the desired layout components, diagrams, texts and of course areas for data.
We can use any formatting, chart type or conditional coloring that we like, including the nice sparklines. Just one thing is important: we need to reserve space for inserting the data. What happens later is that we overwrite parts of the table's content with data from RapidMiner. So if we have more than three employees, we would either need to leave more space between the table and the diagram, or put the data into a separate sheet and reference it in the diagram. But if you are used to Excel reporting, you probably know all these tricks...
Insert some dummy values so that you can see the charts in action.
Don't forget to save the file. We will need it later.
RapidMiner is very versatile when it comes to getting data into the shape you want. It can read and combine many different formats and sources, and then aggregate, join, pivot and process the data into the shape that you need.
On the right you see a process combining data from four different sources with multiple joins and preprocessing steps to match the data. Such a process could just deliver us the data we want to put into our nice Worktime sheet.
Of course it could be much simpler, containing just a single SQL query, or much more complex, involving calls to web services, Big Data analytics on Hadoop, some machine learning, or whatever. The trick is that we can leverage the entire flexibility of RapidMiner to get the data we want to put into an Excel sheet.
Once we have the data in the desired format, we add an Open Report (Excel) operator from our extension. You see it on the right-hand side in the operator tree. We need to point the operator at two files: the template file we created and saved in Step 1, which can be supplied either via the template file parameter or the tem input port, and the output file, which can be specified via the target file parameter or the tar output port.
Why are there ports for the files? Because it allows you to handle the files conveniently in scenarios where you want to do stuff with them in the process later. You could even create a template file in a RapidMiner process, or less fancy and more realistic: Store the file in the repository of a RapidMiner Server to share among many users. The output file port is most useful if you want to either zip the result or return it as a webservice result in a RapidMiner Server Webservice or Web Application.
Any data we want to insert into the Excel file needs to be forwarded to the input ports of the Open Report (Excel) operator. Don't worry, there will always be another input port when you connect the last one. We will use the data delivered to these ports in the inner subprocess to do the actual insertion.
Inside the inner process of Open Report (Excel), we can add the Write Data Entry (Excel) operator to insert an ExampleSet into the Excel file. We have done so with the first ExampleSet in the screenshot on the right. The operator lets you select which attributes to use and where to place them. You specify the sheet where the data will be inserted by its index, then point the operator to a fill range. A range can either be open ended, specified by the upper-left cell of the area, or closed, if that cell is followed by a colon and the lower-right cell. So B2 would start in the second column, second row, and B2:C3 would allow filling 2 rows and 2 columns.
For our little employee table from Step 1, we set it to B11:C13. Unless we select fit to range, the process will now fail if our data does not fit into this range.
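The A1-style fill-range addressing can be sketched in plain Python. This is only an illustration of the range semantics described above; the function names are my own, not part of the extension:

```python
def cell_to_rowcol(cell: str) -> tuple:
    """Convert an A1-style reference like 'B11' to a 1-based (row, col) pair."""
    col = 0
    i = 0
    while i < len(cell) and cell[i].isalpha():
        col = col * 26 + (ord(cell[i].upper()) - ord("A") + 1)
        i += 1
    return int(cell[i:]), col

def parse_fill_range(spec: str) -> tuple:
    """'B11:C13' -> ((11, 2), (13, 3)); an open-ended 'B2' -> ((2, 2), None)."""
    if ":" in spec:
        start, end = spec.split(":")
        return cell_to_rowcol(start), cell_to_rowcol(end)
    return cell_to_rowcol(spec), None

print(parse_fill_range("B11:C13"))  # → ((11, 2), (13, 3))
```

So the closed range B11:C13 from the example spans rows 11 to 13 and columns 2 to 3, exactly the three-employee table with two columns.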
We will add another operator of this type to output the second table.
The only thing missing is the version tag, so that people know what this report was about when they open it at some point later.
To do this, we first use a Generate Macro operator from RapidMiner's core functionality to create a process variable (or macro, as they are called) containing the current date and time. We then add a Write Cell (Excel) operator from the Advanced Reporting Extension and connect the ports. Although no data flows from the Generate Macro operator to the Write Cell (Excel) operator, the connection makes sure that Generate Macro is executed first and sets the process variable before it is read.
Then we just need to point the Write Cell (Excel) operator to the right fill position, which is F5 in our case. Set the value and type correctly and we are good to go.
A short note on dates: there are countless date formats out there. If you want to write a date to Excel, you first need to parse the date format that the value has in RapidMiner. So if you enter something like 2017-03-29 23:59:59 as the value, you should enter "yyyy-MM-dd HH:mm:ss" in the date format parameter of the Write Cell (Excel) operator. Once the operator knows the date, it will automatically transform it into the format of the Excel template sheet, where you set it with the cell format.
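For readers more at home in Python than in Java-style date patterns, here is what that pattern corresponds to in `strptime` terms (a sketch for illustration only, not something the extension runs):

```python
from datetime import datetime

# The Java-style pattern "yyyy-MM-dd HH:mm:ss" used in the date format
# parameter corresponds to this Python strptime format:
fmt = "%Y-%m-%d %H:%M:%S"
parsed = datetime.strptime("2017-03-29 23:59:59", fmt)
print(parsed.isoformat())  # → 2017-03-29T23:59:59
```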
Once the subprocess is finished, the target file will be written and you just need to mail it to someone and be done with it.
We recommend automating just about everything right from the beginning. There is no such thing as "I just need to do this once". In 90% of all cases you will need to do it twice, and by then the additional overhead of the automation has already paid off. So please feel free to download the extension, order a license and ask any questions you might have. In case you are not convinced yet, the free version lets you access the full functionality and only limits the number of Write operators to one within each subprocess.
Professional consulting for your Data Science problems
Editor's note: As a former live and in-person trainer, these online training courses are a great way to go IMHO! See the classes and link to more details.
This Spring, join us for the new online training season. We are introducing new options for our Data Science courses RapidMiner Basics Part 1 and RapidMiner Basics Part 2. For the first time, you can also enhance your data science skills in Text and Web mining with RapidMiner online.
Can’t attend the live sessions? We've got you covered! We provide access to the live session recordings for 60 days after the class has taken place, as well as access to the instructor via a message board during that period. So don’t let the time zone or your calendar stop you from joining.
Each course is delivered in a four-week program that runs on Mondays. It entails 2 hours of online, Instructor Led Training and requires an additional 2 hours of offline lab and self-study time each week.
RM Basics Part 1 and Text and Web Mining with RM starting Apr 3rd
RM Basics Part 2 starting May 22nd
This is a 2-day program that runs on Mondays & Tuesdays or Wednesdays & Thursdays respectively, plus an optional Q/A session on Fridays of the same week. For each course you will attend two 4 hour sessions of live Instructor Led Training, and spend up to 4 hours of offline lab & self-study time following each live session.
RM Basics Part 1 May 15&16
RM Basics Part 2 May 17&18
The content covered in the weekly lectures and the 2 day classes is of course equivalent so you can mix and match both delivery options as needed.
The Analyst Bootcamp is a value bundle for people attending both 2-day classes during one season. Sign up for this bundle at the same rate as the individual 2-day classes and receive a complimentary seat on our RapidMiner Analyst Certification worth $250.
For a more detailed schedule of ALL events, please visit our Training page.
By: Fabian Temme, PhD
A few weeks ago the RapidMiner Research Team published two new extensions to the Marketplace that are making a splash: the Operator Toolbox and the Converters! We didn't stop there! Today I'm happy to announce the release of version 0.2.0 for both extensions!
New in the Operator Toolbox!
Introducing the Get Decision Tree Path Operator
Do you want to know why a Decision Tree classifies a specific Example the way it does? With this Operator, you can find out. It works similarly to the 'Apply Model' Operator: it takes a trained Decision Tree and an ExampleSet at its input ports. But instead of calculating confidences for the examples, the Operator creates a new Attribute holding the path the corresponding Example takes through the Decision Tree.
This example process applies the Operator on a Decision Tree trained on the Golf data sample.
Once the process executes, here are the results:
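To make the idea concrete, here is a small Python sketch of the same path-tracing logic on a toy tree. The nested-dict tree format and the Golf-style attribute values are assumptions for illustration, not how RapidMiner represents its models:

```python
def tree_path(tree, example):
    """Walk a nominal decision tree (nested dicts), recording the path taken.

    Hypothetical tree format: {"attribute": name, "branches": {value: subtree}};
    a leaf is just a class label string.
    """
    steps = []
    node = tree
    while isinstance(node, dict):
        attr = node["attribute"]
        value = example[attr]
        steps.append(f"{attr} = {value}")
        node = node["branches"][value]
    return " -> ".join(steps) + f" : {node}"

# A tiny tree in the spirit of the Golf sample data (assumed structure):
golf_tree = {
    "attribute": "Outlook",
    "branches": {
        "overcast": "yes",
        "sunny": {"attribute": "Humidity", "branches": {"high": "no", "normal": "yes"}},
        "rain": {"attribute": "Wind", "branches": {"true": "no", "false": "yes"}},
    },
}
print(tree_path(golf_tree, {"Outlook": "sunny", "Humidity": "high"}))
# → Outlook = sunny -> Humidity = high : no
```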
Introducing the Generate Date Series Operator
Do you need some date-time series data covering a specific time range and interval? Now you can get that from the new Generate Date Series Operator. You can specify the start and end date and the interval, from years down to milliseconds!
Check out the statistics overview for a daily date series for the year 2012, created with the new Operator:
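The underlying generation logic is straightforward; here is a minimal Python sketch of the same idea (dates only, ignoring the sub-day intervals the Operator also supports):

```python
from datetime import date, timedelta

def date_series(start, end, step=timedelta(days=1)):
    """Generate dates from start to end (inclusive) at a fixed interval."""
    series = []
    current = start
    while current <= end:
        series.append(current)
        current += step
    return series

daily_2012 = date_series(date(2012, 1, 1), date(2012, 12, 31))
print(len(daily_2012))  # → 366 (2012 is a leap year)
```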
Introducing the Get Parameters Operator
Sometimes you want to retrieve all Parameters of an Operator in your Process, for example to store all Parameters of the model trained inside an Optimize Operator, not only the ones that were optimized.
The new Get Parameters Operator enables you to do so. You specify the name of the Operator whose Parameters you want to extract and the Operator creates a new Parameter Set containing all Parameters of the specified Operator.
Here is an example of the Parameters of a Decision Tree Operator:
The Parameter Set can now be stored in the Repository, written to a file with the Write Parameters Operator, or used by the Set Parameters Operator. You can even convert it into an ExampleSet using the new Parameter Set to ExampleSet Operator in version 0.2.0 of the Converters Extension.
New in the Converters!
Introducing the Parameter Set to ExampleSet Operator
This new Operator converts a given Parameter Set (coming from an Optimize Operator or for example from the Get Parameters Operator in the Operator Toolbox Extension) into an Example Set.
The resulting ExampleSet contains one Attribute for each Parameter in the Parameter Set. You can even let the Operator try to estimate the type of each Parameter (Integer, Real, Nominal).
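The type estimation can be pictured in a few lines of Python. This is my own sketch of the idea, not the extension's actual implementation:

```python
def estimate_type(value: str) -> str:
    """Guess whether a parameter string is an Integer, Real or Nominal value."""
    for cast, type_name in ((int, "Integer"), (float, "Real")):
        try:
            cast(value)
            return type_name
        except ValueError:
            pass
    return "Nominal"

print([estimate_type(v) for v in ["20", "0.25", "gain_ratio"]])
# → ['Integer', 'Real', 'Nominal']
```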
This is the result of the converted Parameter Set of an Optimized Decision Tree:
Introducing the Normalization to ExampleSet Operator
Did you ever want to get a Normalization model as an ExampleSet? Now you can! This Operator takes the preprocessing model and creates an ExampleSet with the corresponding Attributes from it.
Here are the results for each of the four Normalization methods (Z-transformation, range transformation, proportion transformation and interquartile range, respectively) on the Golf dataset:
Thanks to everyone in the community for downloading our Operator Toolbox and Converters Extensions. We welcome any comments or questions you have. Just post them in the community forums! If you have any ideas for new operators for either extension, visit our Product Ideas forum and post your wish-list items there.
Just some interesting community and data science links I've come across this week. Enjoy!
From the Community
Interesting links from the Interwebz
By: David Arnu, M.Sc.
On behalf of the RapidMiner Research Team, Edwin Yaqub, PhD and I went to GOR to present our work about Automated Mechanisms to Discover and Integrate Data from Web-based Tabular Collections, which is part of the results of our DS4DM research project (http://ds4dm.de/en/). Our poster explains the concept of extending the value of your data by automatically finding and adding additional attributes (see attached PDF of the poster).
The conference is a great mix of researchers from very different fields. There are a bunch of data scientists like us who show the value of analytics and how to use big data techniques for applied social science. Besides that, there are many people from market research and political science who analyse, for example, how social media influenced the latest election and how to build better prediction models for upcoming polls.
If you want to find out more about GOR, just check out #GOR17 on Twitter.
This is Ingo, the founder of RapidMiner. Today we are here to look for a new team member for our Boston office. Are you a data scientist with some experience in RapidMiner? Then the following might be of interest to you!
We are looking for a presales engineer. Sounds fancy, but what it means is that you work as a data scientist: you learn about the analytical problems our users want to share, introduce our products to people, and even create proofs of concept. If you mix this with a good amount of communication (including the problem-understanding part!), then this is a pretty exciting role. Well, it is what I did myself for many years, so I might be a bit biased here.
The job requires a physical presence in Boston and you need to be eligible to work in the US. If you like data science, know RapidMiner, and have fun working with great people - then you should consider this!
Here is more information and some guidance about how to apply: https://rapidminer.workable.com/jobs/440004
Looking forward to welcoming one of you to our team soon!
On the last day of exhibitions, RapidMiner had a drawing to give away the orange spectacles. We created a RapidMiner process "on the fly" to select a random person on Twitter that tweeted with the hashtags #GartnerDA and #RapidMiner. Twitter user PerlMonky won the spectacles and here's a post of him wearing them!
And in case you want to see the process we used to run the random drawing: it's below and only three operators long. Now that's Lightning Fast Data Science!
<?xml version="1.0" encoding="UTF-8"?>
<process version="7.4.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
        <parameter key="connection" value="ThomasOtt"/>
        <parameter key="query" value="#gartnerda #rapidminer"/>
        <parameter key="limit" value="1000"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.4.000" expanded="true" height="103" name="Filter Examples" width="90" x="179" y="34">
        <parameter key="invert_filter" value="true"/>
        <list key="filters_list">
          <parameter key="filters_entry_key" value="From-User.equals.Ingo Mierswa"/>
          <parameter key="filters_entry_key" value="From-User.equals.Thomas Ott"/>
          <parameter key="filters_entry_key" value="From-User.equals.Tom Wentworth"/>
          <parameter key="filters_entry_key" value="From-User.equals.RapidMiner"/>
        </list>
        <parameter key="filters_logic_and" value="false"/>
      </operator>
      <operator activated="true" class="sample" compatibility="7.4.000" expanded="true" height="82" name="Sample" width="90" x="313" y="34">
        <parameter key="sample_size" value="1"/>
        <list key="sample_size_per_class"/>
        <list key="sample_ratio_per_class"/>
        <list key="sample_probability_per_class"/>
        <parameter key="use_local_random_seed" value="true"/>
        <parameter key="local_random_seed" value="2000"/>
      </operator>
      <connect from_op="Search Twitter" from_port="output" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Sample" to_port="example set input"/>
      <connect from_op="Sample" from_port="example set output" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>
By: Fabian Temme, PhD
Editor's Note: Fabian shares some of his experiences with creating Model Management Applications in RapidMiner Server as part of his daily Data Science work.
I recently reached a point in my daily work for the PRESED project (Predictive Sensor Data mining for Product Quality Improvement, www.presed.eu) within the funded R&D Research Team at RapidMiner where I needed to put something into production. This probably sounds familiar to many data scientists once they have found their insight.
It all started in the usual way. I did some complex preprocessing to transform my data into a nice table. Once I had hundreds to thousands of examples with their attributes and assigned labels, I started training models and validating their performance, with the intent to bring them into production.
Here's the problem. Which model do I choose to put into production?
RapidMiner offers a great variety of models, and also the possibility to combine them (for example by grouping, stacking or boosting). But I still had to answer the question: which one?
I decided to test several models and needed an easy way to visualize and test the different models. I wanted to do a "model bake-off" and here's how I did it.
For this example we'll use the Sonar data sample provided in RapidMiner and start with a typical standard classification process:
Here I retrieved the input data and trained a Random Forest inside a Cross Validation operator to extract a final model with an average performance from the cross validation. I then stored the model and the performance vector inside a 'Results' folder in my repository (see below). I used a macro (a process/global variable) to define a name (in this case 'Random Forest') for the model and stored the results in a subfolder with this name.
For another model I could simply copy the process, exchange the algorithm, use the macro to automatically name the model, and then hit run. I repeated this for each algorithm I wanted to use.
This is how my 'Results' folder looked:
But how easy is it to compare the different methods and do it automatically?
For that I designed a simple Web App on the RapidMiner Server I was working on. First I needed a process that loops over the 'Results' folder, automatically retrieves the performance vectors, and transforms them into one ExampleSet. With the 'Publish to App' Operator I made the ExampleSet accessible to the new Web App.
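The loop-and-combine step can be pictured outside RapidMiner as well. Here is a hypothetical Python version, assuming each model's performance was exported to a perf.json file in its subfolder (the file name and the "accuracy" field are assumptions for illustration):

```python
import json
from pathlib import Path

def collect_performances(results_dir):
    """Loop over per-model subfolders and merge stored performances into one table.

    Assumes each subfolder holds a hypothetical perf.json with an "accuracy" field.
    """
    table = {}
    for perf_file in sorted(Path(results_dir).glob("*/perf.json")):
        table[perf_file.parent.name] = json.loads(perf_file.read_text())["accuracy"]
    return table
```

The resulting table (one row per model) is exactly the kind of ExampleSet the Web App components can then subscribe to.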
Switching back to the App Designer I added two visualization components both subscribing to the ExampleSet published by my process.
The first component uses the 'Chart (HTML5)' format with the chart type 'series' to show a graph of the results; the other uses the 'table' format to show the results directly.
A button that reruns the publishing process finalizes the Web App.
This is the resulting view.
Done! These processes can easily be adapted to test more algorithms and to visually display whatever performance vector results you want to see.
PS: Check out the attached zip file for process examples!
By: Martin Schmitz, PhD
As RapidMiner users we are used to one-operator solutions. Want to add a PCA? Add the operator. Want to do an ensemble? Add the operator. Over time the RapidMiner ecosystem has evolved so that most tasks are easy to handle like this. However, doing data science every day, I experienced a few things for which RapidMiner has no one-operator solution. How do we solve that?
In these cases you can use the scripting interfaces, build a building block, or write your own extension. The extension might be the slowest way, but it has the clear benefit of making your results easily usable for others. Recently I joined forces with the RapidMiner Research Team, and we want to share our tools with you, the community. The result is two new extensions packed with tools that make your life easier.
Generate Levenshtein Distance
In text analytics you often face the problem of misspelled words. One of the most common ways to find misspelled words is to use a distance between two words. The most frequently used distance measure is the Levenshtein distance, which is defined as the minimum number of single-character edits needed to transform one string into another.
This can be used to generate a replacement dictionary.
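For reference, the standard dynamic-programming computation of the Levenshtein distance looks like this in Python (a sketch, not the extension's code):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("recieve", "receive"))  # → 2
```

A distance of 1 or 2 between a rare token and a frequent one is a good hint that the rare token belongs in the replacement dictionary.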
Generate Phonetic Encoding
During text processing you might encounter words that are spelled differently but pronounced the same way. Often you want to map these words to the same string. A good example are names like Jennie, Jenny and Jenni. Algorithms that perform this kind of encoding are called phonetic encoders. Scott Genzer posted a building block on our Community Portal to generate the Daitch-Mokotoff Soundex encoding. Inspired by this, we created an operator that can use various algorithms to do this kind of encoding.
A typical result is depicted above. The current version of the operator supports a broad range of algorithms, namely: BeiderMorse, Caverphone2, Cologne Phonetic, Double Metaphone, Metaphone, NYSIIS, Refined Soundex, and Soundex.
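To give a feel for what phonetic encoders do, here is a compact American Soundex in Python. This is my own sketch (it skips some edge-case rules for name prefixes) and not the encoder shipped in the extension:

```python
def soundex(name: str) -> str:
    """Compact American Soundex: keep the first letter, then up to three digits."""
    codes = {}
    for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")]:
        for letter in letters:
            codes[letter] = digit
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for char in name[1:]:
        digit = codes.get(char, "")
        if digit and digit != prev:
            result += digit
        if char not in "hw":  # h and w do not break a run of equal codes
            prev = digit
    return (result + "000")[:4]

print(soundex("Jennie"), soundex("Jenny"), soundex("Jenni"))  # → J500 J500 J500
```

All three spellings of the name collapse to the same code, which is exactly what makes the encoding useful for mapping variants onto one string.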
Tukey Test
"When is a value an outlier?" is one of the most frequently asked questions in anomaly detection. No matter if you do univariate outlier detection on single attributes or use RapidMiner's Anomaly Detection extension to generate a multivariate score, you still need to define a threshold. A common technique for doing this is the Tukey test (or criterion). It results in an outlier flag as well as a confidence for each example, and it can also be applied to several attributes at a time.
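The Tukey criterion itself is easy to state: values outside the fences [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged. A minimal Python sketch (illustration only, not the operator's implementation):

```python
from statistics import quantiles

def tukey_flags(values, k=1.5):
    """Flag each value as an outlier if it falls outside the Tukey fences
    [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is the classic criterion."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [not (low <= v <= high) for v in values]

data = [10, 11, 11, 12, 12, 13, 14, 100]
print(tukey_flags(data))  # only the final value (100) is flagged
```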
Group Into Collection
This operator enables you to split an ExampleSet into several ExampleSets using a group-by. The result is a collection of ExampleSets. This can be used in combination with a Loop Collection to apply arbitrary functions per group. A possible example would be to find the last 3 transactions for each customer in transactional data.
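In plain Python terms, the group-by split followed by a "last 3 per group" step looks like this (a sketch of the idea, not the operator itself):

```python
from collections import defaultdict

def group_into_collection(rows, key):
    """Split a list of dict-rows into one list per distinct value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)

transactions = [
    {"customer": "A", "amount": 10}, {"customer": "B", "amount": 5},
    {"customer": "A", "amount": 7}, {"customer": "A", "amount": 3},
    {"customer": "A", "amount": 9},
]
# The "Loop Collection" step: keep the last 3 transactions per customer
last3 = {c: txs[-3:] for c, txs in group_into_collection(transactions, "customer").items()}
```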
Get Last Modifying Operator
If you dive a bit deeper into modelling, you might want to try different feature selection techniques and treat them as a parameter of your modelling process. This can be achieved using a Select Subprocess inside an Optimize Parameters Operator. To figure out which feature selection technique won, you would normally need to add at least one additional operator per method. To overcome this, you can now extract the last modifying Operator for every object. This way you can more easily annotate which feature selection technique was the best.
Extracting PCA, Association Rules, ROC
The Converters extension lets you do a lot of things that our users have asked for. Want to extract those Association Rules? You can do that now. Want to extract PCA results into an ExampleSet table? You can do that now. Just check out the extension on the Marketplace to see all the neat things you can do.
By: Jesus Puente
SparkRM is a new Radoop operator - but not just any new operator to be added to the 70+ collection that the Radoop extension includes - it’s an operator that opens a wealth of new use cases for exploiting and analyzing Hadoop data with RapidMiner.
SparkRM is a meta-operator, which means that you can double-click on it and a new canvas opens where you can design a new process (similar to what you would find in the 'Split Validation' operator, for instance). What's special about SparkRM is that, even though it is a Radoop operator, the inner process has to be designed using non-Radoop, regular RapidMiner operators. And whatever operators or sub-process one places inside SparkRM will be packaged and pushed to Hadoop for execution in a parallel way.
Let's take an example. Imagine you have a lot of text data in your Hadoop environment and you want to analyze it using RapidMiner's Text Processing Extension. Well, now you can: read the documents and feed them into the SparkRM operator.
The data will be passed on to the non-Radoop sub-process inside. You can process and tokenize the text, create word lists, find expressions, n-grams, etc., and everything runs within the Hadoop cluster.
A typical process would look like:
And this is what you would have inside the SparkRM operator:
Some typical parameters of SparkRM include the file format (textfile or parquet) and the partitioning mode.
Once the task is finished, the result is returned as usual through the output ports. The first output port is for data sets, and its contents can be merged. If the data coming from the different partitions is consistent (same metadata), the operator simply appends everything together. If not, there is an option to "resolve schema conflicts" and add the necessary missing values so that the full dataset contains all the information from all partitions. This is especially useful when analyzing text, because the word list of a certain text will probably not be the same as that of another.
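The schema-conflict resolution can be pictured as a column union with missing values filled in. A hedged Python sketch of the idea (not Radoop's actual code), with each partition represented as a list of word-count rows:

```python
def resolve_schema_conflicts(partitions):
    """Append rows from partitions with differing columns, inserting missing
    values (None) so every row carries the union of all columns."""
    columns = sorted({col for part in partitions for row in part for col in row})
    return [{col: row.get(col) for col in columns} for part in partitions for row in part]

part_a = [{"spark": 1, "data": 2}]   # word counts from one partition
part_b = [{"data": 1, "hadoop": 3}]  # a different word list from another
merged = resolve_schema_conflicts([part_a, part_b])
print(merged)
# → [{'data': 2, 'hadoop': None, 'spark': 1}, {'data': 1, 'hadoop': 3, 'spark': None}]
```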
I have described an example for text processing, but you can imagine any other extension or algorithm that’s not in Radoop: Series Forecasting, Deep Learning, Neural Networks, Process Mining, etc.
... but they only stay together for the sake of the kids, or so the old joke goes.
We’ve added 4 new algorithms for machine learning, and I am still having a hard time figuring out which one I like the most:
Naturally, I gave them a test run on some data sets, and was pretty freakin' impressed with the prediction accuracy, automatic tuning capabilities, and runtimes. On the well-known Sonar data set, for example, I consistently achieved performance results of 78% to 80% without any parameter tuning. This is a nice bump over other algorithms, which only get up to 70% to 75% after heavy optimization cycles.
This lift in performance can in part be attributed to the fact that these algorithms tune themselves. They are designed to find the best parameter settings for optimizing prediction accuracy. This not only delivers better accuracy, but also reduces some of the effort required for tuning these bad boys.
You can find more on the RapidMiner blog at https://rapidminer.com/gradient-boosted-trees-deep-learning-less-5-minutes-bet/
For my recent blog post I needed to filter out all attributes having at least one value above a threshold. Traditionally I did this with Transpose, Filter Examples, Transpose again.
I realized that there is a much nicer way, which I would like to share with you.
If you have a look at Select Attributes, you can choose the attribute filter type "numeric_value_filter". It can be used like this:
For example, the numeric condition '> 6' will keep all nominal attributes and all numeric attributes that have a value greater than 6 in every example.
Which is nice, but not exactly what I wanted. To filter not on the individual values of the attributes but on an aggregate, we can use an Aggregate operator to compute the minimum of all attributes. On this result we can use Select Attributes with the numeric_value_filter option. After removing the "minimum(...)" prefix with a Rename by Replacing operator, we have the schema we wanted.
The trick now is to use Data to Weights to get a weight vector of all attributes that are present. Applying this weight vector to the original data with Select by Weights yields the desired result.
The complete process looks like this
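For comparison, the original goal stated at the top of this post (keep every attribute with at least one value above the threshold) is a one-liner when the data lives in plain Python structures. A sketch for illustration, with columns as lists:

```python
def keep_attributes_above(table, threshold):
    """Keep only the columns that have at least one value above the threshold."""
    return {name: values for name, values in table.items()
            if max(values) > threshold}

data = {"att1": [1, 2, 3], "att2": [5, 8, 2], "att3": [0, 6, 0]}
print(keep_attributes_above(data, 6))  # → {'att2': [5, 8, 2]}
```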
Well, we are 6 weeks into the new community website and I have to say a very big thank you to the thousands of you that have joined (notice that counter now standing at over 150,000!) and to all of you that have contributed in some way. You may have noticed that you have been awarded badges, or that your Rank in the Community has changed. That's called 'gamification', and it's there to make things more fun and to give all our contributors recognition for their contributions.
So, what about these changes?
We took a good look at the first 6 weeks of data, what you searched for, what you accessed and where you posted, and we felt that the structure of the menus was too complicated. So here's what we have done:
We hope that these changes are helpful to you. Please give us feedback here.