
 

You might have read our previous blogs (here for a general motivation in our company blog and here for a more technical description) about the new architecture in the upcoming RapidMiner Server 8.0 release. One might think this is just an internal technical improvement, but not quite: this change has deep consequences for the many ways you can use RapidMiner Server. That's why we've moved the needle, said goodbye to the 7, and are happy to welcome 8.0!


New ways: scale out

 

The most obvious way in which RapidMiner environments will change with this new release is the option to scale out. If your computational needs exceed those of a single machine, you can now deploy multiple RapidMiner Job Agents across multiple machines or VMs and leverage all your resources. Your environment can scale both vertically and horizontally.

Job Agents are connected to queues and are constantly polling and asking for something to do. This way, RapidMiner can work in a grid-like fashion, sending jobs to free resources that can work on them.
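Conceptually this is the classic worker/queue pattern. Here is a minimal, purely illustrative Python sketch of that polling loop (this is not RapidMiner code; the queue contents are made up):

import queue
import threading
import time

job_queue = queue.Queue()           # stands in for a RapidMiner queue
job_queue.put("train churn model")  # stands in for scheduled processes
job_queue.put("score new leads")

def job_agent(agent_id):
    # Each agent constantly polls its queue and executes whatever it finds.
    while True:
        try:
            job = job_queue.get(timeout=1)   # ask the queue for something to do
        except queue.Empty:
            continue                          # nothing pending, keep polling
        print("Agent %d executing: %s" % (agent_id, job))
        time.sleep(0.1)                       # pretend to run the process
        job_queue.task_done()

# Two agents serving the same queue already behave like a small grid.
for i in (1, 2):
    threading.Thread(target=job_agent, args=(i,), daemon=True).start()

job_queue.join()  # wait until all scheduled jobs have been processed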

 

Adding some structure: the new queues

 

That grid-like architecture would be a basic configuration with all the available nodes connected to the same queue. But that’s not the only option. In RapidMiner 8.0, queues have acquired a new meaning.

 

 

queues.png

 

 

 

 

Each Job Agent can pick up jobs from only one queue, but multiple Job Agents can connect to each queue. With this 'one queue to many agents' relationship, one can effectively configure sub-clusters that serve different purposes. This is a great tool for administrators to achieve good resource management.

 

For example, different teams can have their own sub-clusters, but they can also share a common one. Or, within a group, there might be a standard queue and a high-priority one where only certain users or applications are allowed to send jobs.

 

Another option is to split the cluster depending on the needs of the user processes. Typically, one would send big training processes to a large machine with enough memory (“training queue”), while lightweight scoring processes go to another sub-cluster with less memory, but maybe more CPUs to take advantage of parallelization (the “scoring queue”).

 

One could also have Job Agents specialized in certain extensions with particular needs, like Keras (Deep Learning), which has specific installation prerequisites.

 

Reliability and Fault tolerance

 

By the way, this is all about having dedicated local or remote resources for processes, and that gives us an interesting and powerful feature: since everything runs independently and no process going amok can affect the others, we get a highly reliable and robust system.

Another interesting side effect is increased fault tolerance, especially in the execution pieces (the Job Agents). They are fault tolerant by default as soon as more than one Job Agent is connected to a queue. If, for any reason, one of them fails, another Job Agent will continue picking up jobs from the same queue and users will not be affected. Only the job running on the failed agent will be lost.

ft.png 

Future outlook

RapidMiner Server 8.0 is just a first step. We still have a lot in store for future releases, like full high availability, centralized configuration, and an improved UI. Stay tuned!


Soon, RapidMiner will start the beta program for its 8.0 version and the final public version will be out a few weeks later. Get ready!


Why a major version?

 

RapidMiner Server 8.0 comes with a brand-new architecture, delivering some exciting changes: full scalability, more useful queues, improved resource management, a nicer UI and a lot of changes that will open many new use cases. This represents a big leap in what RapidMiner Server can do for you.

 

So, what does the new architecture look like?

 

architecture.png

 

Each blue box represents a separate machine. The big box on the left represents the central RapidMiner Server, which provides the web UI and receives all user requests. The central node takes care of:

  • Scheduling of user jobs (processes).
  • User, queue and permissions management.
  • Execution of processes scheduled to the local queue (this queue is optional).
  • Execution of processes scheduled through web services, web apps or triggers.

 

The Job Agents (in dark blue) are the new kids on the block. They can be deployed on remote machines or locally on the central node, and each has to be configured to point to one queue. The Job Agents check their queue, pick up any pending jobs, spawn a Job Container and execute the job (the user process designed in Studio). This means increased scalability and better sharing and management of resources among users or projects.

 

Although scalability is the main new feature, it's still possible to run RapidMiner Server on a single machine with one (or more) local Job Agents executing the jobs. Even if you run everything on a single machine, the new architecture will provide better fault tolerance and improved reliability.

 

More about each component. What should I install?

 

There are two components:

 

  • RapidMiner Server

You will be able to download it from our website. The installation process is equivalent to what you know from the older Server versions. During the installation, you will be able to select whether you want a local Job Agent or not and, if so, which resources (memory and CPUs) will be dedicated to it.

 

If you already have a Server running, you can migrate it to the new version. There will be two migration options:

 

  1. You can migrate all your queues to an equivalent, but scaled-out, environment. In that case, you'll need to deploy and configure the Job Agents manually.
  2. You can select the single-queue option and the installer will "collapse" all your queues into one, so you can keep working on a single machine as usual.

Potentially, you could have several local queues and Job Agents, but each will take up its share of memory. That kind of configuration could be good if you have a big machine that you want to split in a logical way to share its resources among user groups or applications.

 

  • RapidMiner Job Agent

You can download the Job Agent from our downloads page. It is a zip package that you need to decompress wherever you want it to run. You need to edit its configuration file to point it to the right Server and queue; alternatively, every time you create a new queue in the Server's UI, a link will appear to download the configuration, which you can copy and paste directly into the Job Agent's folder.

 

How does it work?

When a user schedules a process from Studio or from the Server's UI, the process is placed into the corresponding queue. Any of the Job Agents connected to that queue can pick up the work and run the process. The RapidMiner Server (and the user, through the UI or Studio) gets notified and logs become available.

 

The process is fully executed in the Job Agent. It independently connects to the repository, external data sources or whatever else the process needs. There is no data flow from the Server to the Job Agents.

 

Queues and Scheduling

Unlike in previous versions, queues are now linked to Job Agents. Queues have user permissions, and sending a process/job to a queue determines which Job Agents will work on it and how many resources will be available. Many processes can run in parallel if there are enough free resources, but a single process is always run by a single Job Agent.

If no free resources are available when a process is scheduled, it waits in the queue until it's picked up by an available Job Agent.

 

 

queues.png

 

 

What doesn't change

Only processes launched or scheduled from Studio or from the Server's GUI are executed in the Job Agents. Jobs requested through Web Services, Web apps or triggers are not affected by the architecture change and they will continue to run in the central RapidMiner Server.

 

In summary

In a nutshell, these are the most noticeable differences from the old version:

 

  • Multiple Job Agents can be installed on multiple machines for process execution. You can scale your environment as much as you need.
  • Queues now have a clear role in resource management. Each Job Agent is connected to exactly one queue (but each queue can be served by multiple Job Agents). Job Agents are configured to use certain resources (memory and processes), and those resources become available for jobs scheduled in the corresponding queue. Queues are therefore a means to share and limit the system's resources among users.
  • The new queues allow you to have dedicated machines or resources for groups of users, or to specialize in different use cases: training, scoring, text analytics, etc.
  • Better resource control also provides a more orderly environment: jobs run on any Job Agent with free resources connected to the queue. If there are no free resources, jobs are queued.
  • Logs stay inside the (possibly remote) Job Agents. They can be retrieved from the central RapidMiner Server as long as the Job Agent is running.
  • First steps have been taken towards a fresh new UI design: an improved process list with filters, the possibility to stop running processes, and a new UI for creating queues.
  • Extensions have to be manually deployed on every Job Agent. Each Job Agent may have a different list of extensions, so it's possible to create dedicated Job Agents for a particular use case.
  • Executions run in separate JVMs and even on separate machines. All processes are fully independent, so problems in individual processes cannot affect the rest, and the whole system becomes much more robust and fault tolerant.

 

Future outlook

RapidMiner 8.0 is a big step for scalability and management, but we are just getting started! Take a look at this other post in our company blog. There are more architectural issues that we want to address, like moving web-service executions to the Job Agents, improving latency and performance, going for a fully highly available environment, and much more. Stay tuned!

 


Dear Community,

 

This is the second of a series of blog posts planned for the new Time Series Extension (Marketplace).

 

In this post I want to give you a short overview over the features already provided in the alpha version 0.1.2.

 

Figure 1: Image of the Time Series Extension Samples folder in the RapidMiner Repository panel after the installation of the Time Series Extension.

Time Series Extension Samples Folder

 

After you download the extension from the Marketplace, it adds a new folder, called Time Series Extension Samples Folder, to your repository panel. It contains some time series data sets and some process templates to play around with.

 

I will also use these data sets and variations of the template processes to demonstrate the features of the Time Series Extension in this blog post.
The processes shown in this post are also attached to the post, so you can try them out for yourself if you want.

 

Moving Average Filter

 

The first Operator I want to show is the Moving Average Filter. To demonstrate its purpose, I want to analyse the 'Lake Huron' data set. It describes the surface levels of Lake Huron (Wikipedia) in the years 1875 - 1972.

 

When you load the data from the Samples folder (red line in figure 2), you can see that the surface level shows variation on different scales. There are time windows with high and with low surface levels, but there are also small variations where the data is noisy.

 

To smooth this data a bit we can use the Moving Average Filter Operator. The Moving Average Filter calculates the filtered values as a weighted sum of the values around the corresponding value. The weights depend on the type of the filter. Currently three different types are supported: "SIMPLE", "BINOM", and "SPENCERS_15_POINTS".

For "SIMPLE" weighting the weights are all equal. This filter is also known as a rolling mean or rolling average. The result is shown as the blue line in figure 2.
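Outside RapidMiner, this "SIMPLE" filter is just a centered rolling mean; here is a tiny pandas sketch (the window size of 5 and the sample values are arbitrary choices for illustration):

import pandas as pd

# A few placeholder surface-level values; in the post the data comes from
# the Time Series Extension Samples folder.
level = pd.Series([580.4, 581.9, 581.0, 580.8, 579.8, 580.4, 580.4, 580.8],
                  index=range(1875, 1883))

# "SIMPLE" moving average: every value in the window gets the same weight.
smoothed = level.rolling(window=5, center=True).mean()
print(smoothed)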

 

Figure 2: Result view of the Lake Huron data set. The original data (red line) and the result of a SIMPLE Moving Average Filter (blue line) are shown.

The smoothing effect is clearly visible, but so is the loss of less prominent features, such as the large spike in the year 1929, which is removed from the filtered data. The "BINOM" filter type can improve the filtering. For this filter type the weights follow the expansion of the binomial expression (1/2 + (1/2)s)^(2q). For example, for q = 2 the weights are [1/16, 4/16, 6/16, 4/16, 1/16].

For a large filter size the weights approximate a normal (Gaussian) curve. This filter type smooths the data while preserving more of its features. The result is shown in figure 3.
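The binomial weights themselves are easy to reproduce; here is a small numpy sketch of the formula above (an illustration only, not the extension's implementation):

import numpy as np
from math import comb

def binom_weights(q):
    # Coefficients of (1/2 + (1/2)s)^(2q): binomial coefficients divided by 2^(2q).
    n = 2 * q
    return np.array([comb(n, k) for k in range(n + 1)], dtype=float) / 2 ** n

print(binom_weights(2))   # -> [0.0625 0.25 0.375 0.25 0.0625], i.e. [1/16, 4/16, 6/16, 4/16, 1/16]

# Applying the filter to a short placeholder series (edges are simply dropped here).
series = np.array([580.4, 581.9, 581.0, 580.8, 579.8, 580.4, 580.4, 580.8])
filtered = np.convolve(series, binom_weights(2), mode="valid")
print(filtered)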

 

Figure 3: Result view of the Lake Huron data set. The original data (red line) and the result of a BINOM Moving Average Filter (blue line) are shown.

The third filter type (SPENCERS_15_POINTS) is a special filter and not applicable for this use case.

 

ARIMA

 

In many use cases we not only want to analyse historic data, but also want to forecast future values. For this we can use an ARIMA model (Wikipedia) to predict the next values of a time series that is described by the model.

For example, we can use the ARIMA Trainer Operator to fit an ARIMA model to the time series values of the Lake Huron data set. For now we use the default parameters of the ARIMA Trainer Operator: p = 1 autoregressive term and q = 1 moving average term.
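If you want to reproduce the idea outside RapidMiner, a roughly comparable setup in Python uses statsmodels. This is only a sketch under my own assumptions (d = 0, since the post only mentions p and q; the data is fetched from the public Rdatasets collection and requires an internet connection), not what the ARIMA Trainer does internally:

import statsmodels.api as sm
from statsmodels.tsa.arima.model import ARIMA

# Lake Huron levels, 1875-1972; the last column of the Rdatasets CSV holds the level.
data = sm.datasets.get_rdataset("LakeHuron", "datasets").data
huron = data.iloc[:, -1]

# p = 1 autoregressive term, d = 0 differencing, q = 1 moving average term.
model = ARIMA(huron, order=(1, 0, 1)).fit()

# Forecast the next 10 values ("years") of the series.
print(model.forecast(steps=10))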

Figure 4 shows the RapidMiner process (including the above described Moving Average Filter Operators).

 

 

Figure 4: RapidMiner process to analyse the Lake Huron data set. The two Moving Average Filter Operators are included as well as the fitting of the ARIMA model and the forecasting of the next 10 years of the data set.

The Apply Forecast Operator calculates the forecasted values of the next 10 years. The result of the forecast and the original ExampleSet (containing the original data and the filtered data) are joined together and delivered to the result port. 

Figure 5 shows the original Lake Huron data (red line) and the forecasted values (blue line).

 

Figure 5: Result view of the Lake Huron data set. The original data (red line) and the result of the forecast (blue line) using the ARIMA model are shown.

 

Differentiation

 

To demonstrate the usage of the Differentiation Operator I use the Monthly Milk Production data set from the Time Series Extension Samples folder. The data is visualized in figure 6 (red line).

 

Figure 6: Result view of the Monthly Milk Production data set. The original data (red line) and the result of a Differentiation Operator with lag = 1 (blue line) are shown.

It is clearly visible that there is a seasonal variation in the data. The milk production also increases from 1962 to 1972 and then stays roughly at the same level.

 

If we are interested in the increase of the milk production itself, we can use the Differentiation Operator to differentiate the data. The result (with the parameter lag set to 1) is also shown in figure 6 (blue line). Again the data is dominated by the seasonality, so it is hard to find time windows where the increase in milk production changes its behavior.

 

At this point the parameter lag can be used. The Differentiation Operator calculates the new values as y(t + lag) - y(t). So with lag = 1 we calculate the increase from month to month. If we use lag = 12, we calculate the increase from one month to the same month of the next year, removing the seasonality from the differentiated data. The result is shown in figure 7 (red line).
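For comparison, the same lag differencing is a one-liner in pandas (the values below are just a short illustrative series, not the full data set; pandas computes y(t) - y(t - lag), which yields the same differences):

import pandas as pd

# Two "years" of illustrative monthly production values (pounds per cow).
milk = pd.Series([590, 560, 640, 655, 725, 700, 640, 600, 570, 575, 555, 580,
                  600, 565, 655, 675, 740, 715, 660, 615, 585, 585, 565, 600])

month_over_month = milk.diff(1)    # lag = 1: still dominated by seasonality
year_over_year   = milk.diff(12)   # lag = 12: seasonality removed
print(year_over_year.dropna())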

 

Figure 7: Result view of the differentiated Monthly Milk Production data set. The Differentiation was applied with lag = 12, removing the seasonality from the data set.

We can now see that between 1963 and 1973 the yearly increase is roughly 15 pounds, with some time windows showing an even higher increase in 1964, 1967 and 1972. In 1973, in 1974, and between 1975 and 1976 there is even a decrease in the monthly production.

So here the Differentiation Operator gives us the possibility to remove seasonality from the data and get a better overall picture of it.

 

Additional Operators

 

In addition there are some more Operators provided by the Time Series Extension:

 

  • The Normalization Operator gives you the possibility to normalize your Time Series data.

  • The Logarithm Operator gives you the possibility to apply the natural or the common logarithm to your Time Series data.

  • The Generate Data (ARIMA) Operator gives you the possibility to simulate Time Series data produced by an ARIMA model whose parameters can be specified by the user.

  • The Check Equidistance Operator checks whether the Index Attribute of a Time Series data set is equidistant at millisecond resolution (a rough sketch of such a check follows this list).
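As a rough illustration (not the Operator's actual implementation), such a check boils down to comparing consecutive index values at millisecond resolution:

import pandas as pd

# Hypothetical index attribute of a time series.
index = pd.to_datetime(["2017-01-01 00:00:00.000",
                        "2017-01-01 00:00:00.500",
                        "2017-01-01 00:00:01.000",
                        "2017-01-01 00:00:01.500"])

# Differences between consecutive timestamps, expressed in milliseconds.
steps_ms = pd.Series(index).diff().dropna().dt.total_seconds() * 1000

equidistant = steps_ms.round().nunique() == 1
print("equidistant at millisecond level:", equidistant)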

Figure 8 shows the RapidMiner process used to analyse the Monthly Milk Production data set. The above described Differentiation Operators are used, as well as a Normalization and a Logarithm Operator (the latter two only to demonstrate their application).

 

Figure 8: RapidMiner process to analyse the Monthly Milk Production data set. The two Differentiation Operators are included as well as a Logarithm and a Normalization Operator.

With this second process I end this blog post. In the next one I will go into detail about using the ARIMA Trainer and Apply Forecast Operators and the possibility to combine them with one of the Optimize Operators.

 

Feel free to post any bug, usability problem, feature request or other feedback you have in the Product Feedback Area in the RapidMiner Community.

 

Time Series Extension Blog Posts:

 

01 Release of the Alpha Version 0.1.2

02 Features of Version 0.1.2


Hello Community,

 

Like many of you, I spent my childhood during the late 1970s / early 1980s, which was really the beginning of the "PC era" - at least through my eyes.  The first computer I ever saw was the Radio Shack TRS-80 Model 1, which my elementary school bought in 1981 for goodness-knows-what-reason.  My buddy and I placed out of the 4th and 5th grade math curricula in one year and hence, for 5th grade, our "math class" consisted of putting the two of us in front of this TRS-80 for 45 min every day and leaving us to our own devices (pardon the pun).  In a year we taught ourselves BASIC and were able to read/write to a wonderful cassette tape drive.  We had a pretty good swagger going around school due to our BASIC prowess - we knew how to use this machine better than anyone in the building.  Life was good.

 

During this time my mother, a math prodigy in her youth, went back to school and was in the 1st cohort of "computer science" masters candidates at Pace University (now called the "Seidenberg School of Computer Science and Information Systems", founded in 1983).  I often went with her to the mainframe center where she and I wrote software with stacks of punchcards that took hours to compile.  She received a job right afterwards with Carl Zeiss, Inc. - charged with pioneering the idea of connecting a microscope with a "PC".  Having a PC back then was a novelty, and hence yet again I had a nice swagger being a person who could navigate MS-DOS at home with my own (ok, my mom's) computer.  Like many of my peers, I made a nice living on the side setting up computers for people, building databases (dB III), creating spreadsheets (Lotus 1-2-3), and word processing (WordPerfect).

 

College was more of the same.  I moved from MS-DOS to Unix and spent much of my time "finger-ing" and "ping-ing" my friends over the new internet, writing email, and using emacs for code.  MS Windows was getting popular by this time, but Apple's computers and its GUI were considered "not serious" and "watered down" for us serious computer people.  It was "good for graphics", some admitted, but all agreed that there was no way that a drag-and-drop GUI would be useful beyond the toy phase.  If you could not see what the computer was doing "under the hood", the thinking went, then you could do things on an Apple without understanding what it was doing.  And this was viewed as very dangerous.  Outwardly we said you could get into real trouble with your computer, and deep down we were probably threatened by the idea of "non-computer-people" intruding into our geeky, members-only world.

 

Time moved on and at some point I "saw the light" - moved 100% from PC to Mac - and still remain a diehard Mac user to this day.  Having Mac OS X built on a Unix kernel was a huge plus, but more importantly, I saw how the Mac OS was designed to help you do things correctly, and prevent you from doing stupid things (like accidentally downloading 100 viruses or deleting your hard drive).  In the current age, Mac OS has 100% of the functionality of a PC (if not more) but does it in a way that lowers the threshold for access to a computer's capabilities.  It is "serious computing for the masses", and the swagger that we all earned in the 1980s has become a source of mockery rather than admiration.  "Why on earth would you use command line operators to do things that I can do with one click?", people would say.  It sounds quaint now, but it really hurt back then.

 


Fast forward to today and the world of data science.  The vast majority of people in this field use Python for this work, followed by R and some other code-based environments (how Excel makes this list is beyond me).  And I will argue that the same swagger that we had for command-line operating systems like MS-DOS and Unix in the 1980s has resurfaced in the data science community today with Python and R.  "If you're serious about data science, you must be coding" is a common phrase seen on StackExchange and other platforms.  Follow aggregators such as @machinelearnbot on Twitter and you will be inundated with such swagger.  The prevailing school of thought says that using drag-and-drop platforms for data science, like RapidMiner, is "not for serious data scientists. How can you be serious if you're not coding?"

 

Case in point is the Kaggle Competition platform.  I think Kaggle is amazing - it is a platform where people with complex data science problems can leverage the entire world's brains in a fun, cost-effective way.  But the swagger there is tremendous.  If you're not solving these challenges in Python or R, you're not taken seriously.  The challenges often are not even ALLOWED to be solved any other way.  Why?  They will say that it's to keep it all open-source, blah blah blah.  Hogwash.  The entire RapidMiner core is completely open-source and the majority of RapidMiner users work with the free license.  I believe that it's the data science "swagger" that looks at platforms like RapidMiner in the same glasses-down-the-nose manner that we viewed Apple computers in 1989.  "How can you possibly solve a 'serious' data science problem in a few minutes via drag-and-drop?"

 

As someone who now has the privilege to work for RapidMiner, I will say that the onus is on us, and our community, to design the software so that we can continue to lower the threshold for people to access the groundbreaking tools of data science, exactly the way Apple did in the 1980s.  And like Apple, we must guide the user toward effective methods and techniques, and thwart ineffective, unethical, and invalid ones.  It is our mission to take ANY user with data and enable her/him to do real data science - fast and simple.  If we heed the advice of sages such as George Santayana, perhaps RapidMiner can become the "Apple of Data Science."  And wouldn't that be nice?

 

"Progress, far from consisting in change, depends on retentiveness. When change is absolute there remains no being to improve and no direction is set for possible improvement: and when experience is not retained, as among savages, infancy is perpetual. Those who cannot remember the past are condemned to repeat it." (George Santayana - "The Life of Reason" - 1905-1906)

 

 

 

Scott Genzer
Senior Community Manager
RapidMiner, Inc.

Greetings Community,

 

It has been three weeks since my last blog entry about "Pi and Pellets".  Why so long?  It's very simple: it's been too warm to burn any pellets!  We were up over 80°F / 27°C last week - hardly the weather to turn on extra heat.  Fortunately it has turned cool again with nature agreeing with the season change.  The leaves are changing finally and the wild turkey have decided to forage before winter.

 

IMG_4085.JPG

 

During this no-burn-pellets time, I decided to enrich my data set with outside weather data.  I have a strong hunch that my optimized model is going to depend, at the very least, on outside temperature.  In the U.S., the National Weather Service provides a literal mountain of data to the public on their website, and even better, via a wide array of webservices.  All you need to do is get an access token, find the nearest weather station to you, and get the data.  If only it were that simple!  To make a LONG story shorter, I eventually was able to find my nearest NOAA weather station (Union Village Dam, Thetford, VT), its station ID number (USC00438556) and the dataset that it provides (Global Historical Climatology Network – Daily, abbreviated GHCND.  See here for more info.).  What's even more interesting is that the data collection is not consistent.  You get different data depending on the day...no idea why.  Here's a PDF export for the month of September 2017:

 

GHCND_USC00438556_2017-9-1.png

So then it's only a question of querying the webservice (one day at a time, for ease of parsing the various data) in JSON format, converting to XML (because RapidMiner does not have a good JSON array parser - yet), and storing the data.  But there is a problem: these are only daily min and max temperatures!  I want hourly temperatures at the very least.

Screen Shot 2017-10-06 at 7.57.02 PM.png

How to convert these min/max values to approximate hourly values?  Well, I know the daily temperature curve is approximately sinusoidal - an almost perfect sinusoid at the vernal/autumnal equinoxes and nowhere near that at the summer/winter solstices.  I don't need perfect hourly temperatures, but as I am at latitude 43°, I need to take this into account a bit.  After a quick review of some math and a lot of googling, I found a nice paper that allows me to roughly convert daily max/min temperatures to hourly temperatures.
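The paper's method is more careful than this, but the basic idea can be sketched in a few lines of Python (the 15:00 peak hour below is my own assumption for illustration, not a value from the paper):

import numpy as np

def hourly_from_minmax(t_min, t_max, peak_hour=15):
    # Crude sinusoidal interpolation: daily mean +/- half the daily range,
    # with the maximum at peak_hour and the minimum 12 hours away from it.
    hours = np.arange(24)
    mean = (t_max + t_min) / 2.0
    amplitude = (t_max - t_min) / 2.0
    return mean + amplitude * np.cos(2 * np.pi * (hours - peak_hour) / 24.0)

# Example: a September day with a low of 45 F and a high of 70 F.
print(hourly_from_minmax(45, 70).round(1))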

 

Screen Shot 2017-10-06 at 8.04.14 PM.png

 

And then finally I can join this information to my piSensor data table for future use:

 

Screen Shot 2017-10-06 at 8.06.02 PM.png

 

 

I am attaching the process as an .rmp file to this post as the xml is rather long.  Some interesting RapidMiner Studio pieces if you are interested...

 

  • I had to use a cURL statement in an Execute Program operator, rather than the usual Enrich Data via Webservice operator (in the Web Mining extension).  For lack of a more updated operator, I sometimes need to resort to shell commands (a rough Python sketch of the equivalent webservice call follows this list).
  • I needed to recreate the "CIBSE Guide" table from 1982 that is shown in the paper, giving the typical times of day for max/min temperature as a function of the month of the year.
  • You will see a huge mess in Generate Attributes (14).  This is my implementation of the formula shown in the paper.  There are probably more elegant ways to do this, but I needed to do it myself in order to understand the math involved.
  • You will see that I needed to extract the "count" given in the NOAA webservice metadata.  This is because of the issue explained above where you get different data depending on the day (goodness knows why).  I then use Select Subprocess depending on the count to extract the respective attributes.  I could do this more nicely by extracting the datatype...just did not feel like it!
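For reference, the webservice call behind that cURL statement looks roughly like the following in Python. The endpoint, header and parameter names are my recollection of NOAA's CDO v2 API and should be checked against their documentation; the token is a placeholder:

import requests

# Placeholder token -- you request your own from NOAA's Climate Data Online site.
headers = {"token": "YOUR_NOAA_CDO_TOKEN"}
params = {
    "datasetid": "GHCND",
    "stationid": "GHCND:USC00438556",   # Union Village Dam, Thetford, VT
    "startdate": "2017-09-01",
    "enddate": "2017-09-01",            # one day at a time, as in the post
    "limit": 1000,
}
resp = requests.get("https://www.ncdc.noaa.gov/cdo-web/api/v2/data",
                    headers=headers, params=params)
for record in resp.json().get("results", []):
    print(record["date"], record["datatype"], record["value"])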

 

That's about it for this blog post.  Hope you are enjoying the journey!

Scott Genzer
Senior Community Manager
RapidMiner, Inc.

Dear Community,

 

This is the first of a series of blog posts planned for the new Time Series Extension (Marketplace).

 

We, the RapidMiner Research Team, are pleased to announce a complete rebuild of the already popular Series Extension. The existing Series Extension was received with much excitement, offering advanced and cool features such as windowing, FFTs, lagging, and much more. However, we have received your feedback about how difficult it is to use, and hence we are replacing the old Series Extension step by step with a completely new product.

 

The Alpha version is now available in the Marketplace (Time Series Extension), and updates will be published as additional features are added.

 

The new Time Series Extension will add more functionality, tutorial processes and improved usability. We are also adding more features to the extension, including the possibility to forecast Time Series values using a (currently simple) ARIMA model.

In addition, the new Time Series Extension will come with a set of Time Series data samples and several template processes to explore.

 

We are also focusing on publishing this extension with even better documentation to help users with Time Series analysis, including transformations, forecasting, feature extraction and more.

 

Figure 1: Sample process with documentation for the new Time Series Extension. The shown process is one of the template processes delivered by the extension.

Figure 2: Results of the above shown sample process. The surface level of Lake Huron is shown, as well as two kinds of filters and a forecast using an ARIMA model.

We plan to improve the extension in the coming months and gradually include more features. We would also be glad to get your feedback as soon as possible, so feel free to post any bug, usability problem, feature request or other feedback you have in the Product Feedback Area in the RapidMiner Community.

 

Stay tuned for the next blog post with details about the features provided by version 0.1.2 of the Time Series Extension!

 

 

Time Series Extension Blog Posts:

 

01 Release of the Alpha Version 0.1.2

02 Features of Version 0.1.2

  • extensions
  • Research
  • Time Series

Greetings Community,

 

It is a crisp morning here in Vermont - 44°F / 7°C - so my quest to get optimal heating efficiency has a renewed sense of urgency.  Snow in September is not unheard of around here.

 

I am still in the data collection phase of this project, although some data are already coming in nicely for analysis (see later on in this post).  Bought several more sensors that I plan on adding to this project in addition to room temp and infrared flame:

- 20A A/C current sensor (a Hall effect transistor IC chip) for measuring total current into stove (and of course easily converted to kW later on...)

- 433MHz superheterodyne receiver (this one) for getting the outside temperature measurements from my Acu-Rite Temperature Sensor hanging on a tree outside, inspired by this clever post.

- A bunch of A/D converter chips to convert some of my analog signals to the Pi GPIO ports

- A bunch of 5A current transformers to monitor the current of individual power-consuming components (later converted to kW) on the stove: the convection blower, the ignitor, the combustion blower, and the auger motor

- A USB microphone to record audio of the various actions of the stove as a verification of readings elsewhere

 

I had to move my flame sensor MUCH closer to the stove in order to get more reliable data; the IR spectrum of this cheap "flame sensor" is picking up wavelengths from sunlight as well as flame during the daytime.  Once I did that, the Pi was recording great data:

overnight data collection (x-axis unit ≈ 5 sec)

So I am immediately thinking it is time to master the RapidMiner Time Series Extension! I have played around with this before but never done FFTs or other wave analysis with it.  Last time I did an FFT was in college and I did it long-hand with an HP 42S RPN handheld calculator.  Good times.  Anyone want to give this a crack?  I am attaching the data to this post and will give you kudos in my next post if you can show me!
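If you would rather prototype in plain Python first, a minimal FFT sketch over readings taken every ~5 seconds might look like this (the file and column names are my assumptions about the attached data):

import numpy as np
import pandas as pd

# Load the attached sensor log (file and column names are hypothetical).
data = pd.read_csv("pisensor_overnight.csv")
signal = data["temperature"].to_numpy()
signal = signal - signal.mean()                 # remove the DC offset

sample_spacing = 5.0                            # seconds between readings
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(len(signal), d=sample_spacing)   # in Hz

# Print the strongest periodic components (skipping the zero-frequency bin).
top = np.argsort(spectrum[1:])[::-1][:5] + 1
for i in top:
    print("period: %.0f s, strength: %.1f" % (1.0 / freqs[i], spectrum[i]))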

 

Lastly, I want to join these data with those gathered by the U.S. National Weather Service via their REST API.  They have stations everywhere and all their data are easily accessible for free.  It took some doing to decipher what my "gridpoint" was and to deal with the JSON mess I received, but it was well worth it:

Screen Shot 2017-09-01 at 9.27.34 AM.png

Here's my XML for anyone who wants to use this API for retrieving weather data in the U.S.:

 

<?xml version="1.0" encoding="UTF-8"?><process version="7.6.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.6.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="generate_data_user_specification" compatibility="7.6.001" expanded="true" height="68" name="Generate Data by User Specification" width="90" x="45" y="34">
        <list key="attribute_values"/>
        <list key="set_additional_roles"/>
      </operator>
      <operator activated="true" class="web:enrich_data_by_webservice" compatibility="7.3.000" expanded="true" height="68" name="Enrich Data by Webservice" width="90" x="179" y="34">
        <parameter key="query_type" value="Regular Expression"/>
        <list key="string_machting_queries"/>
        <list key="regular_expression_queries">
          <parameter key="foo" value=".*"/>
        </list>
        <list key="regular_region_queries"/>
        <list key="xpath_queries"/>
        <list key="namespaces"/>
        <list key="index_queries"/>
        <list key="jsonpath_queries"/>
        <parameter key="url" value="https://api.weather.gov/gridpoints/BTV/124,29"/>
        <list key="request_properties"/>
      </operator>
      <operator activated="true" class="subprocess" compatibility="7.6.001" expanded="true" height="82" name="Subprocess (3)" width="90" x="313" y="34">
        <process expanded="true">
          <operator activated="true" class="nominal_to_text" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Text" width="90" x="45" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="foo"/>
          </operator>
          <operator activated="true" class="text:data_to_documents" compatibility="7.5.000" expanded="true" height="68" name="Data to Documents" width="90" x="179" y="34">
            <parameter key="select_attributes_and_weights" value="true"/>
            <list key="specify_weights">
              <parameter key="foo" value="1.0"/>
            </list>
          </operator>
          <operator activated="true" class="text:combine_documents" compatibility="7.5.000" expanded="true" height="82" name="Combine Documents" width="90" x="313" y="34"/>
          <operator activated="true" class="text:json_to_data" compatibility="7.5.000" expanded="true" height="82" name="JSON To Data" width="90" x="447" y="34"/>
          <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (2)" width="90" x="581" y="34">
            <parameter key="attribute_filter_type" value="regular_expression"/>
            <parameter key="regular_expression" value=".*values\[[0-9]+\].*"/>
          </operator>
          <operator activated="true" class="transpose" compatibility="7.6.001" expanded="true" height="82" name="Transpose" width="90" x="715" y="34"/>
          <operator activated="true" class="replace" compatibility="7.6.001" expanded="true" height="82" name="Replace" width="90" x="849" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="id"/>
            <parameter key="include_special_attributes" value="true"/>
            <parameter key="replace_what" value="properties[.]|values|valid|apparent"/>
          </operator>
          <operator activated="true" class="split" compatibility="7.6.001" expanded="true" height="82" name="Split" width="90" x="983" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="id"/>
            <parameter key="include_special_attributes" value="true"/>
            <parameter key="split_pattern" value="[.]+"/>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (3)" width="90" x="1117" y="34">
            <parameter key="attribute_filter_type" value="single"/>
            <parameter key="attribute" value="id_3"/>
            <parameter key="regular_expression" value=".*values\[[0-9]+\].*"/>
            <parameter key="invert_selection" value="true"/>
          </operator>
          <operator activated="true" class="trim" compatibility="7.6.001" expanded="true" height="82" name="Trim" width="90" x="1251" y="34"/>
          <operator activated="true" class="generate_attributes" compatibility="7.6.001" expanded="true" height="82" name="Generate Attributes (2)" width="90" x="1385" y="34">
            <list key="function_descriptions">
              <parameter key="type" value="if(contains(att_1,&quot;+&quot;),&quot;datetime&quot;,&quot;value&quot;)"/>
            </list>
          </operator>
          <operator activated="true" class="concurrency:loop_values" compatibility="7.6.001" expanded="true" height="82" name="Loop Values" width="90" x="1519" y="34">
            <parameter key="attribute" value="id_1"/>
            <parameter key="enable_parallel_execution" value="false"/>
            <process expanded="true">
              <operator activated="false" class="de_pivot" compatibility="7.6.001" expanded="true" height="82" name="De-Pivot" width="90" x="45" y="187">
                <list key="attribute_name">
                  <parameter key="datetime" value="sss"/>
                </list>
              </operator>
              <operator activated="true" class="filter_examples" compatibility="7.6.001" expanded="true" height="103" name="Filter Examples (2)" width="90" x="45" y="34">
                <list key="filters_list">
                  <parameter key="filters_entry_key" value="id_1.equals.%{loop_value}"/>
                </list>
              </operator>
              <operator activated="true" class="extract_macro" compatibility="7.6.001" expanded="true" height="68" name="Extract Macro" width="90" x="179" y="136">
                <parameter key="macro" value="value"/>
                <parameter key="macro_type" value="data_value"/>
                <parameter key="attribute_name" value="id_1"/>
                <parameter key="example_index" value="1"/>
                <list key="additional_macros"/>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (4)" width="90" x="179" y="34">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="id_4"/>
                <parameter key="invert_selection" value="true"/>
              </operator>
              <operator activated="true" class="pivot" compatibility="7.6.001" expanded="true" height="82" name="Pivot" width="90" x="313" y="34">
                <parameter key="group_attribute" value="id_2"/>
                <parameter key="index_attribute" value="type"/>
                <parameter key="consider_weights" value="false"/>
                <parameter key="skip_constant_attributes" value="false"/>
              </operator>
              <operator activated="true" class="rename" compatibility="7.6.001" expanded="true" height="82" name="Rename" width="90" x="447" y="34">
                <parameter key="old_name" value="att_1_value"/>
                <parameter key="new_name" value="value"/>
                <list key="rename_additional_attributes">
                  <parameter key="att_1_datetime" value="datetime"/>
                  <parameter key="id_1_datetime" value="measurement"/>
                </list>
              </operator>
              <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (5)" width="90" x="581" y="34">
                <parameter key="attribute_filter_type" value="regular_expression"/>
                <parameter key="attribute" value="id_2"/>
                <parameter key="attributes" value="id_1_datetime|id_2"/>
                <parameter key="regular_expression" value="id.*"/>
                <parameter key="invert_selection" value="true"/>
              </operator>
              <operator activated="true" class="replace" compatibility="7.6.001" expanded="true" height="82" name="Replace (2)" width="90" x="715" y="34">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="datetime"/>
                <parameter key="include_special_attributes" value="true"/>
                <parameter key="replace_what" value="T"/>
                <parameter key="replace_by" value=" "/>
              </operator>
              <operator activated="true" class="replace" compatibility="7.6.001" expanded="true" height="82" name="Replace (3)" width="90" x="849" y="34">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="datetime"/>
                <parameter key="include_special_attributes" value="true"/>
                <parameter key="replace_what" value="\/.*"/>
              </operator>
              <operator activated="true" class="replace" compatibility="7.6.001" expanded="true" height="82" name="Replace (4)" width="90" x="983" y="34">
                <parameter key="attribute_filter_type" value="single"/>
                <parameter key="attribute" value="datetime"/>
                <parameter key="include_special_attributes" value="true"/>
                <parameter key="replace_what" value="[+]00[:]00"/>
                <parameter key="replace_by" value=" +0000"/>
              </operator>
              <operator activated="true" class="nominal_to_date" compatibility="7.6.001" expanded="true" height="82" name="Nominal to Date (2)" width="90" x="1117" y="34">
                <parameter key="attribute_name" value="datetime"/>
                <parameter key="date_type" value="date_time"/>
                <parameter key="date_format" value="yyyy-MM-dd HH:mm:ss Z"/>
                <parameter key="time_zone" value="SYSTEM"/>
              </operator>
              <connect from_port="input 1" to_op="Filter Examples (2)" to_port="example set input"/>
              <connect from_op="Filter Examples (2)" from_port="example set output" to_op="Extract Macro" to_port="example set"/>
              <connect from_op="Extract Macro" from_port="example set" to_op="Select Attributes (4)" to_port="example set input"/>
              <connect from_op="Select Attributes (4)" from_port="example set output" to_op="Pivot" to_port="example set input"/>
              <connect from_op="Pivot" from_port="example set output" to_op="Rename" to_port="example set input"/>
              <connect from_op="Rename" from_port="example set output" to_op="Select Attributes (5)" to_port="example set input"/>
              <connect from_op="Select Attributes (5)" from_port="example set output" to_op="Replace (2)" to_port="example set input"/>
              <connect from_op="Replace (2)" from_port="example set output" to_op="Replace (3)" to_port="example set input"/>
              <connect from_op="Replace (3)" from_port="example set output" to_op="Replace (4)" to_port="example set input"/>
              <connect from_op="Replace (4)" from_port="example set output" to_op="Nominal to Date (2)" to_port="example set input"/>
              <connect from_op="Nominal to Date (2)" from_port="example set output" to_port="output 1"/>
              <portSpacing port="source_input 1" spacing="0"/>
              <portSpacing port="source_input 2" spacing="0"/>
              <portSpacing port="sink_output 1" spacing="0"/>
              <portSpacing port="sink_output 2" spacing="0"/>
            </process>
          </operator>
          <operator activated="true" class="append" compatibility="7.6.001" expanded="true" height="82" name="Append" width="90" x="1653" y="34"/>
          <operator activated="true" class="pivot" compatibility="7.6.001" expanded="true" height="82" name="Pivot (2)" width="90" x="1787" y="34">
            <parameter key="group_attribute" value="datetime"/>
            <parameter key="index_attribute" value="measurement"/>
            <parameter key="consider_weights" value="false"/>
          </operator>
          <operator activated="true" class="set_role" compatibility="7.6.001" expanded="true" height="82" name="Set Role (2)" width="90" x="1921" y="34">
            <parameter key="attribute_name" value="datetime"/>
            <parameter key="target_role" value="id"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="rename_by_replacing" compatibility="7.6.001" expanded="true" height="82" name="Rename by Replacing" width="90" x="2055" y="34">
            <parameter key="replace_what" value="value_"/>
          </operator>
          <operator activated="true" class="generate_attributes" compatibility="7.6.001" expanded="true" height="82" name="Generate Attributes (3)" width="90" x="2189" y="34">
            <list key="function_descriptions">
              <parameter key="temperatureNWS" value="(9/5)*parse(temperature)+32"/>
            </list>
          </operator>
          <operator activated="true" class="select_attributes" compatibility="7.6.001" expanded="true" height="82" name="Select Attributes (6)" width="90" x="2323" y="34">
            <parameter key="attribute_filter_type" value="subset"/>
            <parameter key="attribute" value="Temperature"/>
            <parameter key="attributes" value="Temperature|temperature"/>
            <parameter key="invert_selection" value="true"/>
          </operator>
          <operator activated="true" class="parse_numbers" compatibility="7.6.001" expanded="true" height="82" name="Parse Numbers" width="90" x="2457" y="34">
            <parameter key="unparsable_value_handling" value="skip attribute"/>
          </operator>
          <connect from_port="in 1" to_op="Nominal to Text" to_port="example set input"/>
          <connect from_op="Nominal to Text" from_port="example set output" to_op="Data to Documents" to_port="example set"/>
          <connect from_op="Data to Documents" from_port="documents" to_op="Combine Documents" to_port="documents 1"/>
          <connect from_op="Combine Documents" from_port="document" to_op="JSON To Data" to_port="documents 1"/>
          <connect from_op="JSON To Data" from_port="example set" to_op="Select Attributes (2)" to_port="example set input"/>
          <connect from_op="Select Attributes (2)" from_port="example set output" to_op="Transpose" to_port="example set input"/>
          <connect from_op="Transpose" from_port="example set output" to_op="Replace" to_port="example set input"/>
          <connect from_op="Replace" from_port="example set output" to_op="Split" to_port="example set input"/>
          <connect from_op="Split" from_port="example set output" to_op="Select Attributes (3)" to_port="example set input"/>
          <connect from_op="Select Attributes (3)" from_port="example set output" to_op="Trim" to_port="example set input"/>
          <connect from_op="Trim" from_port="example set output" to_op="Generate Attributes (2)" to_port="example set input"/>
          <connect from_op="Generate Attributes (2)" from_port="example set output" to_op="Loop Values" to_port="input 1"/>
          <connect from_op="Loop Values" from_port="output 1" to_op="Append" to_port="example set 1"/>
          <connect from_op="Append" from_port="merged set" to_op="Pivot (2)" to_port="example set input"/>
          <connect from_op="Pivot (2)" from_port="example set output" to_op="Set Role (2)" to_port="example set input"/>
          <connect from_op="Set Role (2)" from_port="example set output" to_op="Rename by Replacing" to_port="example set input"/>
          <connect from_op="Rename by Replacing" from_port="example set output" to_op="Generate Attributes (3)" to_port="example set input"/>
          <connect from_op="Generate Attributes (3)" from_port="example set output" to_op="Select Attributes (6)" to_port="example set input"/>
          <connect from_op="Select Attributes (6)" from_port="example set output" to_op="Parse Numbers" to_port="example set input"/>
          <connect from_op="Parse Numbers" from_port="example set output" to_port="out 1"/>
          <portSpacing port="source_in 1" spacing="0"/>
          <portSpacing port="source_in 2" spacing="0"/>
          <portSpacing port="sink_out 1" spacing="0"/>
          <portSpacing port="sink_out 2" spacing="0"/>
        </process>
      </operator>
      <connect from_op="Generate Data by User Specification" from_port="output" to_op="Enrich Data by Webservice" to_port="Example Set"/>
      <connect from_op="Enrich Data by Webservice" from_port="ExampleSet" to_op="Subprocess (3)" to_port="in 1"/>
      <connect from_op="Subprocess (3)" from_port="out 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Note there is no API key/token needed for this service, which is an added bonus.
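If you prefer scripting, the same gridpoint endpoint can be queried directly in Python. A small sketch using the URL from the process above (the properties → temperature → values structure is what the XML parses; values come back in °C):

import requests

# Same gridpoint URL as in the RapidMiner process above (Burlington, VT office).
url = "https://api.weather.gov/gridpoints/BTV/124,29"
forecast = requests.get(url, headers={"User-Agent": "pellet-stove-project"}).json()

# Hourly temperature values live under properties -> temperature -> values.
for entry in forecast["properties"]["temperature"]["values"][:5]:
    print(entry["validTime"], entry["value"], "degrees C")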

 

That's it for this post.  Stay warm and enjoy!


Scott Genzer
Senior Community Manager
RapidMiner, Inc.

Greetings Community,

  

As many of you may know, I live in a beautiful place called Vermont where the leaves are gorgeous in the fall, and there is plenty of snow and cold in the winter.  Hence the cost and logistics of heating your house is a frequent topic of conversation up here and, as an avid data geek, I am always striving to get the maximum BTUs out of my heating system.  In this series of blog posts, I am going to share my new journey of turning my heating system into a wicked-amazing IoT optimization system using a Raspberry Pi, a bunch of sensors, and of course RapidMiner to do the heavy lifting.

 

This is my house in various times of year...

Scott's house - summer

Scott's house - early winter
Scott's house - late winter

 

 

 

 

 

 

 

 

 

This is a "pellet stove" - basically a wood-burning stove that burns these small pellets made from sawdust...

my pellet stove | a handful of douglas fir wood pellets | pellet stove controller to be replaced by pi | pellets come down the chute via a stepper-motor controlled auger into this burn pot and are ignited | burning pellets

 

 Phase 1: Setup and Data Collection

 

I get my Pi going with the standard Raspbian OS, install a MySQL database that will store the sensor data, and hook up two sensors to get started: a temperature sensor (in the room) and an infrared "flame" sensor:

Pi3 and breadboard | infrared (flame) sensor | capturing data into mysql table

 

 

I am probably the worst programmer on the face of the earth - thank goodness for Google.  Here's the Python code for grabbing the sensor data and storing it in MySQL:

 

import time
import datetime
import RPi.GPIO as GPIO
import MySQLdb as mdb

# the infrared "flame" sensor is wired to GPIO pin 17 (BCM numbering)
GPIO.setmode(GPIO.BCM)
GPIO.setup(17,GPIO.IN)

# connect to the local MySQL database that stores the sensor readings
db = mdb.connect("localhost","pisensor","<pwd>","pelletdb")
curs = db.cursor()

while 1:
    # read the 1-wire temperature sensor file and pull out the "t=" field
    tempfile = open ("/sys/bus/w1/devices/28-051691a25bff/w1_slave")
    thetext = tempfile.read()
    tempfile.close()
    tempdata = thetext.split("\n")[1].split(" ")[9]
    temperature = float(tempdata[2:])
    temperature = temperature /1000   # the sensor reports thousandths of a degree Celsius
    # read the flame sensor (0/1), then insert a timestamped row
    flame=GPIO.input(17)
    curs.execute ("INSERT INTO pisensor VALUES (NOW(),%s,%s)",(temperature,flame))
    db.commit()
    print temperature
    print flame
    time.sleep(5)

The 5-second delay is a compromise that may need to be tweaked later...I'm worried about storage on my little Pi.  So I want to pull the data off the Pi and store it in my RapidMiner local repository (actually it's a Google Drive repository that RM thinks is local).

RM process to retrieve data from Pi and store in RM local (Google Drive) repository
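
If you prefer to script this step, here is a minimal Python sketch of the same pull; the host name, export path and column names are my assumptions, not part of the original setup. It reads the newest rows from the Pi's MySQL table and writes them as a CSV into the folder that is synced as the "local" repository.

import csv
import MySQLdb as mdb

# hypothetical host and credentials - adjust to your own Pi and MySQL user
db = mdb.connect("raspberrypi.local", "pisensor", "<pwd>", "pelletdb")
curs = db.cursor()
curs.execute("SELECT * FROM pisensor ORDER BY 1 DESC LIMIT 10000")   # newest readings first
rows = curs.fetchall()
db.close()

# write into the folder that is synced as the 'local' RapidMiner repository (hypothetical path)
with open("/path/to/GoogleDrive/rm-repo/pisensor_export.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "temperature", "flame"])   # assumed column names
    writer.writerows(rows)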

 

 

proof that things are working

That's it for now.  Next up: pulling data from the National Weather Service API to enrich the data set....

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
394 Views
1 Comment

Greetings from the woods of Vermont.  For those who don't know me, this is Scott Genzer (@sgenzer) - longtime RapidMiner user/fanboy and the new RapidMiner Community Manager.  I would like to write a blog note to just say hello and share a bit about me, my interests, and our vision for the Community moving forward.

Read more...

Hello RapidMiner Community,

 

Greetings from the woods of Vermont.  For those who don't know me, this is Scott Genzer (@sgenzer) - longtime RapidMiner user/fanboy and the new RapidMiner Community Manager.  I would like to write a blog note to just say hello and share a bit about me, my interests, and our vision for the Community moving forward.

 

A very quick bio of Scott (more on LinkedIn profile): B.S. Engineering (electrical), Columbia University '93, M.A. Mathematics Education, Columbia University '98.  Served in the U.S. Peace Corps as a math teacher in Gabon (Central Africa) then worked in education in various locations (U.S., Poland, Zambia, Jamaica) for 15+ years.  In 2013 I taught myself RapidMiner (version 5 back then!) and started Genzer Consulting, a data science consultancy where I worked with a variety of schools and companies - using RapidMiner as my main software platform.  And now here I am, a very happy new official member of the RapidMiner family!

 

My interests, like my experiences, are quite varied, but I have always been a proud math / science / computer geek ever since I was a kid playing with a RadioShack TRS-80 computer.  I truly enjoy puzzles and solving complicated problems, no matter what the context (I am a terrible programmer and hence love working with the code-optional environment of RapidMiner!).  I also love living in Vermont with walks in the woods, making maple syrup, and working in my woodshop during the long winters.

 

As for the Community, my vision is to pick up from the outstanding work that @Thomas_Ott did and move it forward.  Tom and I both believe that this is a true community where everyone shares and gains from the knowledge and experience of others.  Hence my top priority is to ensure that all 250,000+ users are welcome and we keep this friendly forum running as the user base expands at its current fast clip.  Other priorities include making community resources (processes, datasets, advice, building blocks, knowledge base articles, etc.) accessible in a fast and simple manner, and creating monthly data science contests for fun, skill-building, and of course money + prizes.

 

That's about it from here.  Again THANK YOU @Thomas_Ott for all you have done for the Community.  You left big shoes to fill and I am eager to carry the torch from here.

 

Hope to see you all here often!

 

Scott

 

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
117 Views
0 Comments

RapidMiner's Jeff Chowaniec writes about his favorite tips and tricks for RapidMiner Studio.

Read more...

RapidMiner Studio has helped you and your company gain a competitive edge with data-driven decisions made at a rapid pace—are you ready to make those decisions and the processes behind them even faster?

 

In a past webinar, Jeff Chowaniec, data scientist and solutions consultant at RapidMiner, keys you in on the most useful tricks and shortcuts to enhance efficiency in your RapidMiner Studio endeavors. To demonstrate these tips, Jeff utilizes the Titanic Survival Dataset (you’ll find this in the samples folder in RapidMiner Studio).

 

Read the entire post at https://rapidminer.com/ten-tips-to-master-rapidminer-studio/

  • RapidMiner Studio
  • Studio
214 Views
0 Comments

Today we released a new version of our Operator Toolbox extension. It includes one new operator and an update to the existing Get Local Interpretation operator.

Read more...

 

Today we released a new version of our Operator Toolbox extension. It includes one new operator and an update to the existing Get Local Interpretation operator.

 

Extract Statistics

This operator enables you to get the statistics you can inspect in the Results View into an ExampleSet. The resulting ExampleSet looks like this:

stats.png

 

Leave a comment if you would like to have additional measures calculated.

 

 

Performance Output for Local Models

Get Local Interpretation allows you to get a local interpretation for complex models. These interpretations include the attributes that were important for each decision.

With the new update, you are now able to measure the performance of these local interpretations using the additional performance port within the nested process.

The main criterion of the delivered performance vector is added to the resulting example set.

GLI.png

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner
416 Views
1 Comment

RapidMiner 7.6 is out - improvements in Radoop, Mail Security and more

Read more...

Hi all -

 

Yes, exciting news today with the release of RapidMiner 7.6.  Here are some highlights:

 

  • Studio: Sending notification emails now supports all modern connection security and authentication mechanisms like TLS 1.2 + PFS, the help panel text for the most used operators has been fully reviewed and explanations are now clearer and more useful, Java for Windows and Mac updated to 8u141, lots of improvements in missing / error data handling

 

  • Server: Allows admins to set recursive folder permissions, improved performance of web services under heavy load, improved logging of the installation process, implementation of secure email notifications (like Studio), Java for Windows and Mac updated to 8u141, lots of improvements in missing / error data handling

 

  • Radoop: Support for standard and premium Microsoft Azure HDInsight, container re-use support for Hive-on-Tez as well as Hive-on-Spark, support for HiveServer2 High Availability, upgrade to Hadoop 2.8.1

 

AND an amazing new Studio extension: KERAS Deep Learning.  Check out @jpuente's great new KB article and share your notes on @Thomas_Ott's new thread.

 

Enjoy!


Scott

Scott Genzer
Senior Community Manager
RapidMiner, Inc.
  • Keras
  • mail
  • radoop
  • server
  • Studio
158 Views
0 Comments

At RapidMiner Research, we just released updates of multiple extensions developed under the DS4DM research project. Here is a highlight of these updates.

Read more...

At RapidMiner Research, we just released updates of multiple extensions developed under the DS4DM research project. Here is a highlight of these updates.

 

Web Table Extraction Extension

 

The new version is 0.1.6. In this version, the ‘Read HTML Table’ operator can load HTML documents from a local file path in addition to a web URL. This is helpful when dealing with large numbers of HTML files that may have been collected through web crawling. Once the HTML data tables are retrieved and converted into ExampleSets, the operator can also guess the numeric data types of attributes.

 

Spreadsheet Table Extraction Extension

 

The new version is 0.2.1. In this version, the following updates are available:

 

  • The ‘Read Google Spreadsheet’ operator provides type guessing, so you can retrieve sheets from an online Google Spreadsheet document and directly process the numeric data.
  • The ‘Read Excel Online’ operator is a completely new operator added to the extension. It extends your reach to Excel Online spreadsheets. There is a dedicated blog post detailing the salient features of this operator available here: Reading Excel files directly from your companies OneDrive

 

PDF Table Extraction Extension

 

The new version is 0.1.4. This also adds type guessing to the ‘Read PDF Table’ operator.

 

Data Search for Data Mining Extension

 

The new version is 0.1.2. This update includes various enhancements, the most notable of which are in the ‘Translate’ operator. The extension provides a Search-Join mechanism through the joint usage of the ‘Data Search’, ‘Translate’ and ‘Fuse’ operators. Translate filters out tables that have a schema and instance match for the new attribute you want to discover and integrate into your original (query) table. Before fusion is performed, the discovered tables are converted to the schema of the query table. This requires statistical measures of interest to be defined on the cell level and table level for the new attributes. In this update, we added metrics for defining “trust” in the new data, using similarity and dissimilarity for data discovered by the Data Search operator. To this end, the following trust and mistrust measures have been added:

  • Levenstein Mistrust: Mean value of Levenstein cross-distance for each non-empty cell value present in the discovered collection.
  • Jaro Winkler Trust: Mean value of Jaro Winkler cross-distance for each non-empty cell value present in the discovered collection.
  • Fuzzy Trust: Mean value of Fuzzy cross-distance for each non-empty cell value present in the discovered collection.
  • Missing Values: The number of empty values in a translated table.

Other metrics include Coverage and Trust (please refer to the earlier post for more details [1]). The figure below shows the distributions of these metrics on the Control Panel view of the Translate operator.

 

The Control Panel view of the Translate operator shows the list of translated tables and the distribution of statistical metrics such as Coverage, Ratio, Levenstein Mistrust, Jaro Winkler Trust, Fuzzy Trust and Missing Values to be used by the Fuse operator

This update paves the way to perform data fusion not just at the data level (by using Voting, Clustered Voting, Intersection, etc.) but also at an advanced meta-data level, for example by optimizing on multiple objectives.

 

Acknowledgments

The extensions are developed as part of “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German ministry of education and research (BMBF).

 

References

[1] The Data Search for Data Mining, Release post, Web-link: http://community.rapidminer.com/t5/Community-Blog/The-Data-Search-for-Data-Mining-Extension-Release/...

  • DS4DM
  • extensions
  • Research
177 Views
0 Comments

Nowadays many corporate Excel files tend to be online. This seems quite natural when you consider the wide range of benefits: ease of sharing and collaboration, access from any PC with an Internet connection, user rights management and constant backups, just to name a few. Wouldn't it be great to include data from those files directly in your Data Mining process, even without downloading the whole file? Now you can! Introducing the new Read Excel Online Operator. It is part of the latest update of the Spreadsheet Table Extraction extension (download link).

Read more...

Nowadays many corporate Excel files tend to be online. This seems quite natural when you consider the wide range of benefits: ease of sharing and collaboration, access from any PC with an Internet connection, user rights management and constant backups, just to name a few. Wouldn't it be great to include data from those files directly in your Data Mining process, even without downloading the whole file? Now you can! Introducing the new Read Excel Online Operator. It is part of the latest update of the Spreadsheet Table Extraction extension (download link).

Example Excel Online Sheet

Above you can see an Excel Sheet stored in a OneDrive for Business instance. Its filename is 'Customer Data.xlsx', it contains one sheet called 'Sheet1', and note that the 4th column has not been given a name yet. Now let's access this sheet. Open up RapidMiner and search the Marketplace (found in the Extensions menu) for Spreadsheet Table Extraction. After the installation is complete, you will find a new Operator called Read Excel Online in your Operator view. Drag it into your process and fill in the parameters. For the sheet shown above it will look like this:

Example process extracting data from OneDrive

The File Path needed is the relative path in your OneDrive for Business. Unfortunately, Microsoft's API does not support private OneDrive accounts yet. As you can see, my Excel Sheet is located directly in the topmost folder of my OneDrive for Business. If I moved it into a folder named 'Data', I would need to provide 'Data/Customer Data.xlsx' as the File Path. Note that no leading slash is used.

The Sheet Name is set by default to Microsoft's current default sheet name. Make sure that you provide the name of the sheet you intend to access. If you don't provide a Cell Range, the whole data-containing range will be used. Leading rows and columns that are empty will be skipped. A Cell Range is provided using the A1 notation used in Excel. Selecting all cells starting from the 2nd column (B) and the 3rd row up to the 8th column (H) and the 10th row would require providing 'B3:H10' as the Cell Range.
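
Purely as an illustration of the A1 notation (this is not part of the extension), here is a tiny Python sketch showing how a Cell Range like 'B3:H10' maps to zero-based row/column indices:

import re

def parse_a1_range(cell_range):
    """Return ((row_start, col_start), (row_end, col_end)), zero-based and inclusive."""
    def parse_cell(cell):
        letters, digits = re.match(r"([A-Z]+)(\d+)", cell.upper()).groups()
        col = 0
        for ch in letters:                       # column letters are base-26: A=1 ... Z=26, AA=27 ...
            col = col * 26 + (ord(ch) - ord("A") + 1)
        return int(digits) - 1, col - 1
    start, end = cell_range.split(":")
    return parse_cell(start), parse_cell(end)

print(parse_a1_range("B3:H10"))                  # ((2, 1), (9, 7))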

Since your OneDrive for Business is an internal resource, you also need to verify that you have access to the Excel Sheet. Therefore you need a so-called authentication token. You can get one by visiting the Microsoft Graph Explorer and logging in with your OneDrive credentials (often equivalent to your Microsoft Account, e.g. Office 365). After having logged in, copy the URL from the address bar into the Auth Token field and the Operator will extract the token information automatically.

Executing the process results in the ExampleSet shown below:

Resulting ExampleSet from the extraction process

 

As you can see, all data was extracted successfully. Empty cells were filled with missing values, and the 4th column, which had no name, was given a generic name (here: Attribute 4). And the best part: the file was never completely downloaded to your machine. Only the data present in the cells you wanted to access was sent to your machine. So now you are able to extract only the data of interest, for example when facing a large Excel document with many sheets, without creating lots of temporary versions of the file on your machine.

In the rare case that your Excel Sheet only contains data without any column names, check the No column names given option to generate generic attribute names instead of using the first data row as column names.

Advanced Integration

If you have documents stored in Microsoft SharePoint sites, you can use the SharePoint Connector extension (download link) to obtain a list of existing files and load them directly into your process. Have a look at this article explaining the usage of the extension, and extract the IDs of the desired files and of the site itself as described there. Using this information you can provide them via Macros to the Read Excel Online Operator to directly access the data from your SharePoint, even without downloading the files. Checking the Read from SharePoint parameter enables you to enter the SharePoint Site ID as well as the SharePoint File ID.

Read from SharePoint optionRead from SharePoint option

An example process using both extensions for direct access to SharePoint data in Excel Sheets is given below:

<process version="7.5.003">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="7.5.003" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="sharepoint_connector:list_files_sharepoint" compatibility="0.1.000" expanded="true" height="68" name="List SharePoint Files" width="90" x="246" y="238">
        <parameter key="SharePoint Site" value="SharepointTesting"/>
      </operator>
      <operator activated="true" class="filter_examples" compatibility="7.5.003" expanded="true" height="103" name="Filter Examples" width="90" x="581" y="238">
        <list key="filters_list">
          <parameter key="filters_entry_key" value="filename.contains.\.xlsx"/>
        </list>
      </operator>
      <operator activated="true" class="extract_macro_from_annotation" compatibility="7.5.003" expanded="true" height="68" name="Extract Macro from Annotation" width="90" x="782" y="238">
        <parameter key="macro" value="siteId"/>
        <parameter key="annotation" value="sharepointSiteId"/>
      </operator>
      <operator activated="true" class="concurrency:loop_values" compatibility="7.5.003" expanded="true" height="82" name="Loop Values" width="90" x="983" y="238">
        <parameter key="attribute" value="sharepointId"/>
        <parameter key="iteration_macro" value="fileId"/>
        <parameter key="enable_parallel_execution" value="false"/>
        <process expanded="true">
          <operator activated="true" class="spreadsheet_table_extraction:read_excel_online" compatibility="0.2.001" expanded="true" height="68" name="Read Excel Online" width="90" x="447" y="85">
            <parameter key="Read from SharePoint" value="true"/>
            <parameter key="SharePoint Site ID" value="%{siteId}"/>
            <parameter key="SharePoint File ID" value="%{fileId}"/>
          </operator>
          <connect from_op="Read Excel Online" from_port="example set" to_port="output 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="source_input 2" spacing="0"/>
          <portSpacing port="sink_output 1" spacing="0"/>
          <portSpacing port="sink_output 2" spacing="0"/>
          <description align="center" color="orange" colored="true" height="201" resized="true" width="223" x="381" y="29">Obtaining the SharePoint site and file IDs from previously set macros&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;Make sure to check the Sheet Name and to provide a valid token.</description>
        </process>
      </operator>
      <connect from_op="List SharePoint Files" from_port="example set" to_op="Filter Examples" to_port="example set input"/>
      <connect from_op="Filter Examples" from_port="example set output" to_op="Extract Macro from Annotation" to_port="object"/>
      <connect from_op="Extract Macro from Annotation" from_port="object" to_op="Loop Values" to_port="input 1"/>
      <connect from_op="Loop Values" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
      <description align="left" color="orange" colored="false" height="322" resized="true" width="403" x="100" y="38">#1 Login to https://developer.microsoft.com/en-us/graph/graph-explorer and copy the URL from your browsers address bar into the &amp;quot;Auth Token&amp;quot; field.&lt;br&gt;#2 Go to your SharePoint and choose a site you want to access.&lt;br&gt;#3 Fill in the parameters 'SharePoint URL' with the URL you used to access your SharePoint and fill in the parameter 'SharePoint Site' with the name of the site you want to access.</description>
      <description align="center" color="orange" colored="false" height="322" resized="true" width="219" x="513" y="37">Select only entries containing Excle Sheets (ending with .xlsx).</description>
      <description align="center" color="orange" colored="true" height="321" resized="true" width="185" x="737" y="37">Extract the SharePoint site ID from the annotated ExampleSet and provide it as a macro called 'siteId'</description>
      <description align="center" color="orange" colored="true" height="321" resized="true" width="199" x="928" y="37">Loop over all remaining entries (only the ones being Excel Sheets) to access the SharePoint file IDs from the 'sharepointId' column and providing it via a macro called 'fileId'</description>
    </process>
  </operator>
</process>

 

Happy Mining,

Philipp for the RapidMiner Research Team

 

Acknowledgments

The extensions are developed as part of “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German ministry of education and research (BMBF).

  • DS4DM
  • extensions
  • Research
236 Views
1 Comment

Companies and organizations often store and share information via Microsoft SharePoint Sites. They are a great way of collecting and sharing information around a given topic. Many sites therefore contain lots of office documents and files in other formats. Integrating this information into a Data Mining process often involves manually searching through sites and folders as well as downloading files by hand. This is neither fast nor simple. Therefore we created the SharePoint Connector extension to speed things up. You can download it through the RapidMiner Marketplace. It consists of the List SharePoint Files operator, which creates a list of all available files and folders, and the Download from SharePoint Operator, which downloads files of interest.

Read more...

Companies and organizations often store and share information via Microsoft SharePoint Sites. They are a great way of collecting and sharing information around a given topic. Many sites therefore contain lots of office documents and files in other formats. Integrating this information into a Data Mining process often involves manually searching through sites and folders as well as downloading files by hand. This is neither fast nor simple. Therefore we created the SharePoint Connector extension to speed things up. You can download it through the RapidMiner Marketplace. It consists of the List SharePoint Files operator, which creates a list of all available files and folders, and the Download from SharePoint Operator, which downloads files of interest.

 

Below you can see the document section of a SharePoint site created for a little demonstration. This site groups together a project folder and a few documents with varying file formats.

Demo SharePoint Site

The first step for integrating your SharePoint data into your Data Mining process is to find out what the SharePoint URL of your company or organization is. Just have a look at your browser's address bar and extract it along with your site's name. Both are underlined in the picture above. Now enter this information into the List SharePoint Files Operator, which comes with the SharePoint Connector extension, as shown in the picture below.

List SharePoint Files Operator configuration

Since your SharePoint site is an internal resource, you also need to verify that you have access to the information. Therefore you need a so-called authentication token. You can get one by visiting the Microsoft Graph Explorer and logging in with your SharePoint credentials (often equivalent to your Microsoft Account, e.g. Office 365). After having logged in, copy the URL from the address bar into the Auth Token field and the Operator will extract the token information automatically.

 

If you now run the process, an ExampleSet is created that contains information about the files stored in the site you accessed. Below you can see the result from scanning my demonstration SharePoint site shown at the beginning of this post. The author and lastModifiedBy columns are redacted for this post.

Result view containing all files and folders found in the site

You gain information about the filename, its location within the site (path), a URL for downloading it manually, the author's name, the creation date and time (creationDateTime), the person who modified it last (lastModifiedBy), the date and time of the last change (lastModificationDateTime), a unique sharepointId, and whether the entry is a folder or not. The Operator always scans files at the given folder level. If you need to dig deeper, you can use the information derived above together with the Scan specific folder parameter to search for files and folders in a subfolder.

 

With this information you can, for example, filter out all entries created by a given author or of a desired file format in order to download them. You can add the Filter Examples operator or any other Operator to create a more specific list of files you want to download. Providing this list to the Download from SharePoint Operator enables you to download all files to the destination defined in the Download Path parameter, or to continue working on them by using the collection of files provided at its output port. An example process using this filtering is shown below and provided as a tutorial process that comes with the Download from SharePoint Operator.

File download and integration

To continue using the files directly in your process you can, for example, use the Loop Collection Operator to handle each file and use one of RapidMiner's many reading Operators to extract the data into your process. Don't worry, you don't need to provide the Auth Token to the Download from SharePoint Operator again. It is stored alongside the ExampleSet (as an annotation), so you don't need to handle it again. But if you store the ExampleSet in your repository and want to download files later, your token might expire. Hence the Operator offers an option to set a new token. Again, you can just provide the URL obtained after logging into the Microsoft Graph Explorer.

 

Happy Mining,

Philipp for the RapidMiner Research Team

 

Acknowledgments

The extensions are developed as part of “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com), which is sponsored by the German ministry of education and research (BMBF).

  • DS4DM
  • extensions
  • Research
90 Views
0 Comments

 

New Version 0.4.0 of the Operator Toolbox Extension available.

 

We are happy to announce the release of a new version of the Operator Toolbox Extension. With version 0.4.0 some new enhancements wait for you:

 

Stem Tokens Using ExampleSet

Read more...

 

New Version 0.4.0 of the Operator Toolbox Extension available.

 

We are happy to announce the release of a new version of the Operator Toolbox Extension. With version 0.4.0 some new enhancements wait for you:

 

Stem Tokens Using ExampleSet

 

This Operator is an enhancement to the Text Processing Extension. It can be used inside a Process Documents Operator. It replaces terms in Documents using pattern matching rules. The list of tokens to be replaced is provided by an ExampleSet containing the replacement rules.

Here are the results for the sentence “sunday monday tuesday wednesday thursday friday are all days of week. Sunday and Saturday are not”. The Stem Tokens Using ExampleSet Operator replaces all words matching .*day with weekdays. The left image shows the result of Process Documents without the Stem Tokens Using ExampleSet Operator, the right one with the Operator.

 

stem_tokens_both.png

 

The new Operator is like the Stem (Dictionary) Operator, but uses an ExampleSet instead of a file.
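
If you want to get a feel for what the Operator does, here is a rough Python sketch of the same pattern-matching idea (not the extension's actual implementation):

import re

replacement_rules = [(r".*day", "weekdays")]     # pattern -> replacement, like the rows of the ExampleSet

def stem_tokens(tokens, rules):
    stemmed = []
    for token in tokens:
        for pattern, replacement in rules:
            if re.fullmatch(pattern, token, flags=re.IGNORECASE):
                token = replacement              # first matching rule wins
                break
        stemmed.append(token)
    return stemmed

sentence = "sunday monday tuesday wednesday thursday friday are all days of week"
print(stem_tokens(sentence.split(), replacement_rules))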

 

Weight of Evidence

 

This Operator introduces a new method for discretization. The generated value expresses the chance of a binominal attribute (called the base of distribution) having a positive or negative value in the discretized group. This value will be the same for all Examples that belong to the same group. Unlike other discretization Operators, this one assigns numeric values to each class.

If the Weight of Evidence value is positive, the Examples from that group are more likely to have the positive value for the base of distribution Attribute than the data set as a whole. The higher the Weight of Evidence value, the greater the chance of a positive base of distribution value.
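
For the curious, one common formulation of Weight of Evidence looks roughly like the following Python sketch; the details may differ from the Toolbox implementation:

import math
from collections import defaultdict

def weight_of_evidence(groups, labels, positive="yes"):
    pos, neg = defaultdict(int), defaultdict(int)
    for g, y in zip(groups, labels):
        (pos if y == positive else neg)[g] += 1      # count positives and negatives per group
    total_pos, total_neg = sum(pos.values()), sum(neg.values())
    woe = {}
    for g in set(groups):
        p = pos[g] / float(total_pos)                # share of all positives falling into this group
        n = neg[g] / float(total_neg)                # share of all negatives falling into this group
        woe[g] = math.log(p / n) if p > 0 and n > 0 else float("nan")
    return woe

groups = ["1st", "1st", "1st", "3rd", "3rd", "3rd", "3rd"]
labels = ["yes", "yes", "no", "no", "no", "no", "yes"]
print(weight_of_evidence(groups, labels))            # positive for "1st", negative for "3rd"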

The attached tutorial process helps you understand the benefits of using this Operator as a substitute for other discretization methods. The following image shows the result of the tutorial process, in which the Weight of Evidence Operator is applied to the Titanic data sample.

 

WoE.png

 

Split Document into Collection

 

This Operator splits a Document into a Collection of Documents according to the split string parameter.

This is, for example, very helpful if you have read in a complete text file (with the Read Document Operator) and want to split it into separate lines to process the file line by line.

Check out the tutorial process of the Operator to get an impression how it works.

 

Dictionary Based Sentiment & Apply Dictionary Based Sentiment

 

In some cases, you want to build a sentiment model based on a given list of weights. The weight represents the negativity/positivity of a word. This dictionary should have a structure like this:

 

Word          Weight
Abnormal      -1
Aborted       -0.4
Absurd        -1
Agile          1
Affordable     1

 

The new Dictionary Based Sentiment Operator can handle such an input and creates a model out of it. We use the word list provided at https://www.cs.uic.edu/~liub/ which has two separate files for positive and negative words. After a quick pre-processing, we can build the dictionary-based model.
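
The "quick pre-processing" can be as simple as the following Python sketch, which turns the two word lists into the Word/Weight structure shown above (the file names and the ';' comment convention are assumptions about the downloaded lists):

import csv

def read_words(path):
    with open(path, encoding="latin-1") as f:
        # skip empty lines and comment lines (assumed to start with ';')
        return [line.strip() for line in f if line.strip() and not line.startswith(";")]

rows = [(w, 1.0) for w in read_words("positive-words.txt")] + \
       [(w, -1.0) for w in read_words("negative-words.txt")]

with open("sentiment_dictionary.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Word", "Weight"])
    writer.writerows(rows)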

 

Dictionary based sentiment_1.png

 

The created model, shown below, can be used with the Apply Dictionary Based Sentiment Operator. The input for this is a collection of tokenized documents. This gives you the freedom to use all text mining operators to prepare your documents. A typical workflow would be to create a Collection of Documents (e.g. via Read Documents and Loop Files) in conjunction with a Loop Collection. Inside the Loop Collection, you can use all the different Operators of the Text Processing extension.

 

Dictionary based sentiment_2.png

 

The result of the Apply Dictionary Based Sentiment Operator is an ExampleSet with the following columns (a small scoring sketch follows the list):

  • The text
  • The Score – i.e. the sum of weights for this document
  • The Positivity – i.e. the sum of positive weights for this document
  • The Negativity – i.e. the sum of negative weights for this document
  • The Uncovered Tokens – i.e. the tokens which were in the document but not in the model
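
For illustration only (this is not the operator's implementation), the per-document scoring can be thought of roughly like this in plain Python:

weights = {"abnormal": -1.0, "aborted": -0.4, "absurd": -1.0, "agile": 1.0, "affordable": 1.0}

def score_document(tokens, weights):
    matched = [weights[t] for t in tokens if t in weights]
    return {
        "Score": sum(matched),                           # sum of all matched weights
        "Positivity": sum(w for w in matched if w > 0),  # sum of positive weights only
        "Negativity": sum(w for w in matched if w < 0),  # sum of negative weights only
        "Uncovered Tokens": [t for t in tokens if t not in weights],
    }

print(score_document("the agile and affordable plan looked absurd".split(), weights))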

 

Dictionary based sentiment_3.png

The Process shown in the images is attached to this post. Feel free to check it out.

 

Performance (AUPRC)

 

The Performance (AUPRC) Operator enables you to evaluate a binominal classification problem with a new performance measure.

AUPRC stands for Area under the Precision Recall Curve and is closely related to AUC. AUC measures the area under the ROC curve, which plots the True Positive Rate against the False Positive Rate. AUPRC is very similar, but replaces the False Positive Rate with precision.

It’s beneficial because precision may be a more interpretable measure than the FPR. On the other hand, precision is a measure that depends strongly on the class balance, so AUPRC is most useful when you know the class balance of your application.
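
If you want to cross-check the measure outside RapidMiner, scikit-learn offers the building blocks for it (a sketch, assuming scikit-learn is installed):

from sklearn.metrics import precision_recall_curve, auc

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                        # toy binominal labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.55, 0.9]    # model confidences for the positive class

precision, recall, _ = precision_recall_curve(y_true, y_scores)
print("AUPRC:", auc(recall, precision))                  # area under the precision-recall curve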

 

Papers to Read:

Thanks to @SvenVanPoucke for his helpful contribution

 

Additional Changes

 

  • Improved documentation of the Tukey Test Operator
  • Added several tags for the Operators in the Extension
  • The Create ExampleSet Operator now correctly uses the separator specified in the parameters of the Operator
  • The Get Local Interpretation Operator now has an additional output port which contains the collection of all local models
  • The Get Local Interpretation Operator now correctly normalizes the input data. It also now has the option to use a locality heuristic instead of specifying the locality directly.
158 Views
0 Comments


Nithin wraps up his KDD Cup project with a higher AUC. Is it even close to what the winner got? Find out in his last post!

Read more...

By Nithin Mahesh

 

In my last post, I talked a little bit about my results and some of the most useful features of RapidMiner. In this post, I will talk about how I got to a much more accurate AUC than the 0.519 I was getting before.

 

After struggling for a couple of weeks with this data set, I was eventually given advice from @IngoRM, who told me to double-check my data import. Always take your time with data prep before jumping into models and validations. It took me a while to realize that I had imported my label data incorrectly, which is what gave me the bad AUC. When I used the Retrieve operator it automatically took the first row as the header. This moved each label up a row, but once I fixed it, all I had to do was add some Sample operators in order to balance the data, which was biased towards the -1 label. The data prep was minimal; I just had to connect it to a cross validation which, as I mentioned before, ran my model (gradient boosted tree), split my data automatically, and tested the performance. My results after some simple changes ended up being an AUC of 0.736, compared to the winning IBM result of 0.7611. This shows how powerful RapidMiner really is: with minimal prep I could try out different models, run them, and test them to see which gave me the best AUC. I attached some screenshots of the results below:

 

Image1.png

 

 

Image2.png

 

Another important thing to keep in mind was how imbalanced the data was; most of my data prep was centered around balancing the data better. I had to try a number of things before settling on the Sample operator, including feature selection and reducing the number of attributes. In the end, since the gradient boosted tree was the most powerful and dealt well with my many missing values, I kept it and cut down on my prep.

 

After a final discussion with Ingo to go over my final models vs. his, I learned that I didn't necessarily need to worry about the data being balanced, but should focus on optimizing the AUC, since that's what the results were based on. I was under the impression that getting better balance and precision within the confusion matrix would improve my AUC. Being early in my data science career, I am still learning!

 

However, after working on this project I have come to realize that RapidMiner is great at optimizing the time spent building models and running analyses. The software is very efficient in that sense; if I were to recreate this in R or Python, the same tasks would have taken much longer, maybe even a couple of weeks more.

 

Ingo had mentioned that a lot of the competitors in the KDD Cup spent months on this data set simply due to the number of models they ran. Using ensemble modeling, the cup contestants would have run hundreds of models, optimizing and training each one to get that 0.7611. In that way RapidMiner is impressive: I could take about half an hour (not counting the time I spent learning the product) and end up with a result of 0.736 AUC.

 

Thanks for following my last couple of blog posts, I hope I provided some useful tips that will aid in unleashing all of RapidMiner Studio’s full potential!

 

 

180 Views
0 Comments

Nithin gets down and dirty with building a high AUC for his task. Follow along as he tries to go from a 0.5 coin flip to a more robust AUC.

Read more...

By: Nithin Mahesh

 

In my last post, I talked about how I began prepping my data, some of the operators I used, and some of the issues I ran into. In this post, I will talk briefly about my results and some of the most useful features in RapidMiner Studio.

 

As I mentioned last week, I ended up not getting the results I wanted, even after running different validations on my data, including cross validation. I kept getting an AUC of about 0.519, which is really bad compared to the results of IBM Research at 0.7611.

 

A couple of small things to consider that I wish I had started with before jumping into the data set: signing up for the workshop earlier would have been helpful; it gave me a nice review of how to import, prep, model, and interpret my data. The instructor was good at answering any questions I had, and it helped a lot that the workshop was interactive. I also picked up a lot of simple productivity features, such as how to disable unused operators or how to organize my operators so they weren't all over the screen. Another feature I learned about later was the background process feature, worth looking at for commercial users, which gives one the ability to work on other processes while some are still running.

 

Viewing your data table can also be a challenge at times. If you've ever run any processes in RapidMiner, you have probably run into the issue of not being able to view your data table after closing the results, closing the application, or running another process (shown below).

image1a.png

 

It took me a bit to realize that breakpoints let one view the table at any operator in the process, which is really useful to debug and view changes to your set. This can be done by right clicking on an operator as shown below:

 

image2a.png

 

 

 

After a lot of data prep, I ran some models and validations as mentioned in the last post. The problem with this was that my data prep was very process-intensive, and despite having access to all the cores of my computer I ran into hours of loading time before even getting to my models. I later learned that there is a way to cut down the time using the Multiply and Store operators, meaning I essentially took a copy of my data prep (Multiply) and then stored it (Store). I then created a new process in which I used a Retrieve operator to grab the prepared data. In my new process, I could run my cross validation and models without having to reload all my data prep, which saved me some hours of waiting time. One thing to note is that any time I changed a parameter in the data prep, I would have to run that process again so that the model process had the change.

 

This brings me to another important feature to keep in mind: the logs. With a large data set, some of the validations I ran would take too long to load. I would wait for hours just to get an error telling me the computer was out of memory. I eventually found the logs, under View then Show Panel, which gave me warning errors during the process, so I wouldn't have to wait for the process to eventually end.

 

The Help tab in RapidMiner Studio is another useful resource that gives a nice overview of all the parameters and their functions for any of the hundreds of operators. The documentation includes links to hands-on tutorials right in RapidMiner under the Tutorial Process. RapidMiner's Wisdom of Crowds feature was another useful feature within Studio, great for finding the operators that would be most useful for a task, especially when I was unsure of what to use. The community page was the next best resource; any specific questions I had were either covered in past posts or I could make my own post. The response time was quick for any questions I posted as well!

 

In my next post, I will talk about my end results and what I did to my data prep to finally get the AUC I was looking for.

 

  • Data Science
  • Journey
  • New User
345 Views
0 Comments

We continue Nithin's journey using RapidMiner Studio for the very first time. How quickly does a young data scientist grasp this data science platform and get productive?

Read more...

By: Nithin Mahesh

 

In my last post, I gave an introduction of my work at RapidMiner for this summer and the Churn data science project I was given to work on. I touched briefly on the data set provided by the KDD Cup 2009 from the French Telecom Company Orange’s large marketing database. In this post, I will talk about how I began to learn RapidMiner starting with my data prep.

 

The first challenge was figuring out which models and prep were needed, since the majority of the data contained numeric values. Categorizing them was hard! Here is the data I was given once I uploaded it into RapidMiner Studio:

 

Image1.png

 

Before attempting any data prep, I opened the RapidMiner Studio tutorials under the Create Process button, then Learn, as shown below:

 Image2.png

 

 

This was useful for understanding how to navigate the software and the basic flow necessary to run analytics on a data set. The RapidMiner website also provided some good introductory getting-started videos. After playing around with several tutorials I had a basic knowledge of how things worked and began to plan prepping this data set. In the same Create New Process section there are many templates that can be run to see examples of the analytics you might use on your data. In my case I was looking at customer churn rate and found a template running analysis on a data set to find whether a customer was true for churning or false for not.

 

After going through some tutorials, I was ready to import the data and began by clicking the Add Data button, but ran into some errors. I found that the Read CSV operator was much more powerful and ended up using this instead, despite the data having an odd file type (.chunk). Initially I had some issues with how the data was being spaced and realized this was due to not configuring the right parameter in the import wizard. After getting the data in and connecting it to the results port, I started to plan how to organize the random numeric values. The first thing I noticed was that the set contained many missing values indicated by "?", so I needed to use the Replace Missing Values operator, whose parameters I set to replace these values with zero; I later changed this to use the average of the values instead.

 

I then downloaded and imported the label data using the Add Data button, then joined the data as shown below:

 

Image3.png

 

This gave me an error since I needed to create a new column to add in the label data. I tried using the Append operator, which I learned after some playing around is for merging example sets. I eventually found out that the Generate ID operator is the right one to use to create a new column.

 

When preparing data, it is useful to split it into train and test sets; this way we can train and tune the model with the training set. Once this is done, we can check how well the trained model generalizes to data it has not seen before by using the test set. One thing I was unaware of is that RapidMiner Studio contains many operators that combine multiple steps into a single one. The subprocess panels are the most powerful part of these operators, allowing me to perform a lot of analytics all in one go as shown below:

 

Image4.png

Splitting data into test and train sets, running models, and measuring performance can all be done using a single operator; one operator can run what would take several steps to complete in R. In my case, cross validation seemed like the best option since I needed to split my data into train/test and then check my model's accuracy, precision, and performance.

At this point in my data prep I ran into a couple of problems that come along with using new software for the first time. Since I was working with such a large data set, running the cross validation, which is a big process, required more memory than I had. When this occurs it's best to either narrow down the number of operators used or try to reduce the number of attributes. Using the Remove Useless Attributes operator, I could cut out some features that were not being used. Some other useful operators were the Free Memory and the Filter Examples operators.
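
For readers coming from code, the same split-train-evaluate idea looks roughly like this in scikit-learn (a sketch on synthetic data, not the actual KDD Cup process):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# synthetic, imbalanced stand-in data; the real KDD Cup set is of course much larger
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.9, 0.1], random_state=0)

# hold out a test set, fit a gradient boosted tree on the rest, score with AUC
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))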

In my next post, I will talk about my results (or lack thereof), what issues I faced, how I went about solving them, and what I found to be the most useful features in RapidMiner Studio.

  • Data Science
  • Journey
  • New User
502 Views
0 Comments

Follow this multi-part story of a brand new RapidMiner user on his journey using this real data science platform. 

Read more...

By: Nithin Mahesh

 

My name is Nithin Mahesh, and I just finished my sophomore year at the University of Massachusetts Amherst studying Informatics / Data Science. I recently took some classes on R programming and introductory statistics, so getting an internship at RapidMiner was a great way to gain some experience in my field!

 

I am currently interning on the marketing team for the summer, working on a variety of projects involving the product, RapidMiner Studio. One of the first tasks I was given was to download the software and sign up. Part of my job was to understand the process for new RapidMiner Studio users and help provide suggestions on how we can improve how users navigate, get help, and work with the product.

 

I was given the KDD Cup 2009 data set, from a competition created by the leading professional organization of data miners. Many of the top companies participate, including Microsoft, IBM Research, and many more, using their own machines and data mining techniques. The large data set consisted of 10,000 rows and 15,000 attributes with mostly numerical and nominal data, but also included some missing values. The small set consisted of 50,000 rows and 230 attributes, containing some missing values as well.

The data set was taken from the French telecom company Orange and comes from their large marketing database. The challenge of Orange's data is that one must be able to deal with a very large database containing noisy data, unbalanced class distributions, and both numerical and categorical data. The competition task was to predict customer churn, appetency, and up-selling, with the results evaluated by the Area Under the Curve (AUC). The main objective was to make good predictions for these target variables. The predictions can then be displayed in a confusion matrix to represent the number of examples falling into each possible outcome.

 

There were two types of winners, those of the slow challenge and those of the fast challenge, since KDD released both a large and a small data set. The slow challenge was to achieve results on either the large or the small data set before the deadline, and the fast challenge required a submission within five days of the release of the training labels. The fast challenge saw IBM Research take the lead, followed by ID Analytics Inc, and then Old Dogs with New Tricks. The slow challenge was won by the University of Melbourne, followed by Financial Engineering Group Inc Japan, and National Taiwan University Computer Science and Information Engineering. The AUC achieved for churn by IBM Research ended up being 0.7611, which is what I'd be comparing my results to.

 

Orange Labs already has its own customer analysis platform capable of building prediction models with a very large number of input variables. Their powerful platform implements a variety of features such as processing methods for instances, variable selection, regularization, and model averaging. Orange's platform can scale to very large datasets with hundreds of instances and thousands of variables, and the goal of the KDD challenge was to beat their in-house system.

In my next post, I will talk about how I began to learn RapidMiner, starting with how to prep the data.

  • Data Science
  • New User
  • rapidminer
314 Views
0 Comments

We have some great news, the Community hit a few milestones last week!

Read more...

First the news, then the links!

 

We have some great news, the Community hit a few milestones last week. Community member @sgenzer declared @bigD the winner of the first Community Data Challenge! This challenge was so much fun that we'll do another one in July, maybe something with an open data set. Keep your eyes peeled on the Community!

 

The next milestone was that we crossed over 250,000 Community Members! We'll be reaching out to that member shortly and giving them some swag as a way to say thank you for being part of the Community!

 

Now the links!

 

1. K-means and Davies-Bouldin! Community member @namachoco99 has some questions and @mschmitz helps him out.

 

2. The ongoing discussion about how to score a time series model in RapidMiner!

 

3. Want to convert Hex to Decimals in RapidMiner? Just use this handy script!

 

4. How do you communicate a Gradient Boosted Tree model to your boss or non Data Scientists?

 

5. Working with Radoop and Amazon's EMR Hadoop Distro.

 

  • Community
  • links
  • News
  • Roundup
174 Views
0 Comments

Welcome to another edition of the Community Roundup! Here are a few interesting links from around the Community and the Interwebz. 

Read more...

Welcome to another edition of the Community Roundup! Here are a few interesting links from around the Community and the Interwebz

 

From the Community:

  • Community
  • Link
  • Roundup
346 Views
0 Comments

We are currently working on some new data integration and enrichment operators to aid your data mining journey. Therefore, we are running a case study for testing our latest findings. Within this study you are given a short introduction with some guiding material to help you test some new RapidMiner operators. That’s it! Just test the operators in your current environment and tell us your findings and ideas.

Read more...

We are currently working on some new data integration and enrichment operators to aid your data mining journey. Therefore, we are running a case study for testing our latest findings. Within this study you are given a short introduction with some guiding material to help you test some new RapidMiner operators. That’s it! Just test the operators in your current environment and tell us your findings and ideas.

 

Objectives to test

 

If you are interested, contact us via research@rapidminer.com and we will get in touch with you. We ask you to hand in your observations within four weeks.

 

Background

Finding the right data is a crucial step in every data mining project. Data is often distributed across different places and obtaining it might be difficult. Hence integrating various formats and sources is key. In the research project ‘Data Search for Data Mining’ we are investigating new ways of aiding this process by making data from previously unavailable sources easily available within RapidMiner. Possible sources are for example tables stored in PDFs or Google Spreadsheets.

And what about those data sets you already have but are unaware of? For that, we’re working together with the University of Mannheim to enrich data sets with existing data from internet and intranet sources in a (semi-)automatic way.

 

Happy Mining,

Philipp for the RapidMiner Research Team

  • Case Study
  • extensions
  • Research
522 Views
0 Comments

MeaningCloud has released a new RapidMiner Extension that provides high-quality multilingual text mining, thanks to its broad analytical functions and customizability.

Read more...

By: Antonio Matarranz, CMO of MeaningCloud

 

MeaningCloud has released a new RapidMiner Extension that provides high-quality multilingual text mining, thanks to its broad analytical functions and customizability.

 

Would you like to extract the information underlying unstructured text (from documents, customer interactions, and social comments), combine it with structured data, and incorporate it into your RapidMiner-based analytic models?

 

The new RapidMiner Extension for MeaningCloud gives users the ability to structure all types of text and extract its meaning. It provides RapidMiner users with a set of operators performing some of MeaningCloud’s most popular functions: entity and concept extraction, theme classification using standard taxonomies, sentiment analysis, and lemmatization.

 

MeaningCloud Extension RapidMiner.png


More importantly, MeaningCloud has incorporated powerful customization tools that enable users to adapt it to their application domain (e.g. analysis of the voice of the customer in the financial industry) through the creation of personal dictionaries and classification and sentiment models. These capabilities are unique in the industry and help deliver high levels of precision and recall.

Practical applications of this Extension range from root cause analysis in customer surveys to fraud or churn prevention.

 

Download it from RapidMiner Marketplace or MeaningCloud website.

 

Learn how to use the Extension in this recorded webinar

Would you like to see the MeaningCloud Extension in action, in a real-life scenario that combines structured data and text analytics, or learn more about application scenarios? Check out this recorded webinar.

 

Build your first text+data models in a snap using these tutorials

Use these two tutorials to learn how to extract insights that combine structured data with unstructured text. We use a dataset of food reviews from Amazon, including numeric scores and free text verbatims, to

  1. Analyze sentiment from the text and assess its correlation with numeric scores (see tutorial).
  2. Extract topics from the text and use them to induce a rule-based model to predict sentiment (see tutorial).

All data, analytics workflows, models, and results are available for download from the tutorials. Happy analyzing!

--

Author:

Antonio Matarranz, CMO of MeaningCloud.

 

Antonio is an engineer turned marketer. He holds a master's degree in Electronic Engineering from the Technical University of Madrid, an MBA from IE Business School, and an Executive Certificate in Marketing & Sales Management from the Kellogg School of Management.

 

Connect with Antonio on LinkedIn.

  • Extension
  • Guest Blogger
  • MeaningCloud
528 Views
0 Comments

We are happy to release version 0.3.0 of the Converters and Operator Toolbox Extensions. We worked hard to add useful new functionality and to polish existing features. Without further ado, here are the new features for your Data Science processes.

Read more...

New Versions 0.3.0 for the Operator Toolbox and the Converters Extension available.

 

We are happy to release version 0.3.0 of the Converters and Operator Toolbox Extensions. We worked hard to add useful new functionality and to polish existing features. Without further ado, here are the new features for your Data Science processes.

 

Converters - Decision Tree to ExampleSet

 

You can now convert a decision tree model into an ExampleSet. Each individual path in the tree is thereby represented by one row in the ExampleSet. The condition for the path is given as a nominal attribute, as well as the prediction and the number of examples collected in the leaf.
This is how it looks for a decision tree which was trained on the Iris data sample.
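For intuition, here is a rough, hypothetical analogue of that conversion using scikit-learn and pandas (not the Converters extension itself): it walks a fitted decision tree and emits one row per path with its condition, prediction and leaf size.

# A rough analogue (not the Converters extension itself): turning a scikit-learn
# decision tree into a table with one row per root-to-leaf path.
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
tree = clf.tree_

rows = []

def walk(node, condition):
    if tree.children_left[node] == -1:                       # leaf node
        pred = iris.target_names[tree.value[node][0].argmax()]
        rows.append({"condition": " && ".join(condition) or "(root)",
                     "prediction": pred,
                     "examples_in_leaf": int(tree.n_node_samples[node])})
        return
    name = iris.feature_names[tree.feature[node]]
    thr = tree.threshold[node]
    walk(tree.children_left[node],  condition + [f"{name} <= {thr:.2f}"])
    walk(tree.children_right[node], condition + [f"{name} > {thr:.2f}"])

walk(0, [])
print(pd.DataFrame(rows))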

Results of the Decision Tree to ExampleSet Operator

Check out the attached process (see below) or the tutorial process in RapidMiner Studio.

 

Converters - Logistic Regression to ExampleSet

 

A logistic regression model can now be converted into an ExampleSet. The resulting ExampleSet contains the Coefficients, Std. Coefficients and Std. Error as well as the z-Values and the p-Values.
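For comparison, here is a hedged sketch of how a similar coefficient table (minus the standardized coefficients) could be assembled with statsmodels on synthetic data; it only illustrates the structure of such a table, not the operator’s implementation, and the ‘Deals’ sample itself ships with RapidMiner rather than with this snippet.

# A hedged sketch: extracting a comparable coefficient table from a logistic
# regression fitted with statsmodels. The data below is synthetic.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(200, 2)), columns=["age", "payment"])
y = (0.8 * X["age"] - 0.5 * X["payment"] + rng.normal(size=200) > 0).astype(int)

result = sm.Logit(y, sm.add_constant(X)).fit(disp=False)
table = pd.DataFrame({
    "Coefficient": result.params,   # fitted coefficients
    "Std. Error":  result.bse,      # standard errors
    "z-Value":     result.tvalues,  # Wald z statistics
    "p-Value":     result.pvalues,  # two-sided p-values
})
print(table)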

This is what you get when you apply it to a Logistic Regression model trained on the Deals data sample.

Results of the Logistic Regression to ExampleSet Operator

You can find a tutorial process attached to this Post.

 

Operator Toolbox - Create ExampleSet

 

This Operator can be used to create an ExampleSet from a text box. Just insert the data in a CSV-like format into the text box of the Operator. There is no need to create any test CSV files anymore, and since the Operator is part of the process, sharing test data is easier than before.
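As a loose analogue of the idea (not the operator itself), pasting CSV-like text and turning it into a table could look like this in pandas; the column names and values are purely illustrative.

# Illustration only: parse CSV-like text pasted into a string into a table.
import io
import pandas as pd

text_box = """id,temperature,label
1,20.5,ok
2,35.1,warning
3,41.0,error
"""

example_set = pd.read_csv(io.StringIO(text_box))
print(example_set)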

 

Operator Toolbox - Set Parameters from ExampleSet

 

You can now change the Parameters of other Operators in your process by passing the desired changes as an ExampleSet to the input of this new Operator.

Just create an ExampleSet with the Operator name, the Parameter name and the actual value of the Parameter as attributes. During execution of the process, the Set Parameters from ExampleSet Operator changes the Parameters of the corresponding Operators to the provided values.

 

Operator Toolbox - Set Macros from ExampleSet

 

If you want to provide a larger number of macros in a process, you can now use this new Operator to automatically do this. You provide an ExampleSet with the macro names and the macro values as attributes and the Operator sets the macros accordingly.

 

Operator Toolbox - Get Local Interpretation

 

This new Operator is a meta Operator that generates an approximation of the decision a given (complex) model made for specific examples. The basic idea is to generate local feature weights (“Interpretations”) for every Example, which can be interpreted more easily. This can help you understand the “reasoning” behind a decision of the complex model.
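To make the idea more concrete, here is a hypothetical LIME-style sketch (my own simplification, not the operator’s actual implementation): perturb one example, query the complex model in that neighbourhood, and fit a simple weighted linear surrogate whose coefficients serve as local feature weights.

# A hypothetical sketch of the local-interpretation idea, not the operator's code.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import Ridge

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
complex_model = GradientBoostingClassifier(random_state=0).fit(X, y)

def local_weights(model, x, n_samples=500, scale=0.5):
    rng = np.random.default_rng(0)
    # 1) sample points in the neighbourhood of x
    Z = x + rng.normal(scale=scale, size=(n_samples, x.shape[0]))
    # 2) ask the complex model what it predicts there
    p = model.predict_proba(Z)[:, 1]
    # 3) weight the samples by their proximity to x
    w = np.exp(-np.linalg.norm(Z - x, axis=1) ** 2)
    # 4) fit a simple, interpretable surrogate on that neighbourhood
    surrogate = Ridge(alpha=1.0).fit(Z, p, sample_weight=w)
    return surrogate.coef_          # local feature weights ("interpretation")

print(local_weights(complex_model, X[0]))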

The following screenshot shows how the results look for some Examples, for the decision of a Gradient Boosted Tree interpreted using Weight by Gini Index. The corresponding tutorial process is attached to the Post.

Results of the Get Local Interpretation Operator

So, the next time someone asks you why your model decides in a specific way, you can use this Operator to provide an interpretation of that decision.

The algorithm is very similar to LIME. Details on LIME can be found here:


Operator Toolbox - Collect and Persist

 

Another Operator helpful for complex process setups is the new Collect and Persist Operator.
It is used to collect various objects created during the execution of a process.
The Operator creates a new collection (holding the object provided at its input port) when it is executed for the first time. The collection is then saved in the cache of the process. Subsequent executions of the Operator add more objects to the collection.

Finally, the resulting collection can be retrieved by a simple “Recall from App” Operator.

The Operator can be used to collect arbitrary objects during an Optimization (for example all models and Performance Vectors).

 

Operator Toolbox - Filter Tokens using ExampleSet

 

This Operator is an extension to the Text Processing Extension. It is similar to the Filter Token (Dictionary) Operator, but it receives an ExampleSet as input for the filters.

It can be used inside any Process Documents Operator to filter for strings which you provide using a simple ExampleSet.

 

 

If you have not done so yet, check out the Converters Extension and the Operator Toolbox Extension.

 

521 Views
0 Comments

Read data from Google Spreadsheets, enrich using external APIs and analyse text – all in RapidMiner!

Read more...

By: Edwin Yaqub, PhD

In this article, you will be introduced to the ‘Spreadsheet Table Extraction’ - a new extension developed by RapidMiner Research. This article also provides a walk-through on how this extension may fit into your analysis process chain.

 

Motivation:

 

Many organizations are storing data in Google spreadsheets because they offer several advantages over offline spreadsheet solutions, to name a few: high availability of data and ease of collaboration based on sharing rights. The integration with Google Drive and Google Docs also allows fetching spreadsheets (even ones in different formats) from these sources.

 

The extension provides a ‘Read Google Spreadsheet’ operator, which extracts data from a Google Spreadsheet document and converts it to a RapidMiner ExampleSet. Hence, it extends the reach of your data mining processes to live documents, which may be updated regularly, e.g. if you are collecting feedback, sales or any other data from multiple customers or stakeholders. Thus, this operator enables you to stay up to date with newly arriving data, continuously assess your analytics models and adjust business decisions if necessary.

 

Let’s take an example that brings these concepts together in a unified scenario, through which you can learn about RapidMiner’s enrichment and integration capabilities. This covers not only data but also third-party services.

 

Pre-Requisites:

 

Please install the following extensions through the RapidMiner Marketplace to ensure you can reproduce the steps presented next:

  • Web Mining Extension
  • Text Processing Extension
  • Spreadsheet Table Extraction Extension

 

A Unified Example Scenario for Text Analytics (Extract, Enrich, Process, Interpret):

 

We consider the simple case of reviews for a book (or another product, for that matter), for demonstration purposes only. A good debut of a book may start with large purchases due to initial hype, but for book stores, publishers and resellers, user feedback is essential to determine the consumer/market sentiment over time. Insights based on sentiment analysis can lead, for example, to better inventory management or improved selling propositions.

 

Extract:

We take a subset of customer reviews for the book ‘The Martian’ as posted on amazon.com. After cleaning the data, the review text (excluding ranking information) is made available as a Google spreadsheet at [1]. The ‘Read Google Spreadsheet’ operator reads this sheet into RapidMiner as shown in Fig. 1. Simply provide it with the URL of the spreadsheet, the sheet name and a client secret file.

Image1.png

 

 

 

 

Obtaining the client secret file:

 

As Google spreadsheets are managed by Google API servers, you need to turn on the Google Sheets API for your Google account and get a client secret (JSON) file, which contains your authentication credentials. To obtain it, follow these steps (a small Python sketch of how such credentials can be used follows the list):

  1. Visit Google Developers Console at 'https://console.developers.google.com/flows/enableapi?apiid=sheets.googleapis.com'.
  2. Create or select a project from the list box. This also enables API for your project.
  3. Go to 'credentials'. Now you land on the 'Add credentials to your project' page. Click Cancel.
  4. Now select 'OAuth consent screen' tab. Provide email and a product name and click Save.
  5. Now you are at the 'Credentials' tab. Click the 'Create credentials' list box button and select 'OAuth client ID'.
  6. From the list of options, select 'Other', provide a name and click Create.
  7. Now you see your 'OAuth 2.0 client ID' under the Credentials tab. Click the 'Download JSON' link. This gives you your client secret file.
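For orientation, here is a minimal Python sketch of what such credentials enable, assuming the google-api-python-client and google-auth-oauthlib packages are installed; the spreadsheet ID and range are placeholders, and the RapidMiner operator itself only needs the client secret file (its internals may differ from this sketch).

# Minimal sketch: authenticate with the client secret file and read a sheet.
from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/spreadsheets.readonly"]

# Interactive OAuth flow using the downloaded client secret file
flow = InstalledAppFlow.from_client_secrets_file("client_secret.json", SCOPES)
creds = flow.run_local_server(port=0)

service = build("sheets", "v4", credentials=creds)
result = service.spreadsheets().values().get(
    spreadsheetId="YOUR_SPREADSHEET_ID",      # placeholder, taken from the sheet's URL
    range="The Martian!A1:B100",              # sheet name and cell range
).execute()

for row in result.get("values", []):
    print(row)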

 

Enrich:

We have already extracted data from a Google spreadsheet. Now let us set two objectives towards analysing this text data:

  • Understanding the polarity of users' sentiment towards this book through a third-party service, thereby enriching the data further.
  • Identifying the more frequently used words that influence the polarity classes.

  
To analyse sentiment, one approach is to use a dictionary such as the RapidMiner Wordnet dictionary, as explained here [3]. The other approach, which we will use, is a third-party service that already provides such a classification on whole sentences. In this example, we will use IBM Watson's Natural Language Understanding (NLU) API, a state-of-the-art natural language processing (NLP) service, to classify the examples in our dataset as positive, negative or neutral. Fig. 2 shows the RapidMiner process that achieves this.

Fig. 2: RapidMiner process to read a Google spreadsheet, with some filtering and enrichment by webservice invocation
As seen there, we filter the examples so that the review text is limited to one thousand characters in length, which leaves us with 600 examples. We then put this text (found under the attribute name ‘review’) into the payload of a webservice request that is invoked using the ‘Enrich Data by Webservice’ operator (from the Web Mining extension). The operator makes an API call for each example of the ExampleSet provided to it. To get authenticated by the API, just create a free user account at [4] and you will receive your login credentials. Upon first invocation, the operator prompts for your username and password as shown in Fig. 3.

Image3.png

 

 

The main parameters of the ‘Enrich Data by Webservice’ operator are configured to satisfy the API requirements [5] as follows:

  • request method = POST, request properties = Content-Type: application/json
  • query type = Regular Expression
  • attribute type = Nominal
  • Point the url to [6]
  • Fig. 4 shows the value of the body parameter (the JSON to send as a request) and the regular expression queries we apply to extract results from the response JSON (a rough sketch of the equivalent raw API call follows below).
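For readers who want to see the raw call, here is a hedged sketch of an equivalent request made directly with Python's requests library. The endpoint comes from [6] and the request shape is based on my reading of the NLU documentation referenced at [5]; treat the exact JSON structure as an assumption and adjust it to your API version. The credentials and text are placeholders.

# Hedged sketch of a direct NLU API call; not the operator's internal code.
import requests

url = ("https://gateway.watsonplatform.net/natural-language-understanding/"
       "api/v1/analyze?version=2017-02-27")

payload = {
    "text": "One of the best science fiction books I have read in years.",
    "features": {
        "sentiment": {},                             # document-level polarity + score
        "entities": {"emotion": True, "limit": 5},   # entities with emotion scores
    },
}

response = requests.post(
    url,
    json=payload,
    auth=("YOUR_NLU_USERNAME", "YOUR_NLU_PASSWORD"),  # placeholder credentials
    headers={"Content-Type": "application/json"},
)
print(response.json())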

 

Image4.png

 

 

As seen in Fig.  4, we enquire Watson about several features:

  • the sentiment (polarity and score) to be assessed from the document, i.e. the chunk of text (the value of our ‘review’ attribute). If we think about how we humans share our experience of something, we often talk about an object, a place or a personality in terms of our impressions. Such information can help us understand the context of our customer reviews. Like some other systems [8, 9], Watson can also extract entities from text, such as a Person, a Quantity, an Organization, a Movie, a Location, a Company, a Broadcaster, a Job Title, etc.
  • Furthermore, we can assess the emotion (as a score in [0, 1]) associated with the entity, e.g. disgust, joy, sadness, fear or anger.

 

In this way, we have enriched our dataset using an external system. This demonstrates how API-based systems can be harnessed rather conveniently and hooked into our RapidMiner processes to develop well-integrated, service-driven data mining workflows – all within RapidMiner Studio!

 

Other advanced features can be requested from Watson as well, but our free account is limited in terms of API calls per day, payload size (that’s why we used reviews of limited length earlier) and the number of features requested – all billable items that need careful consideration.

 

Process:

By now we have enriched the data with a third-party sentiment classification service; this data set is made available at [2] for your reference. What remains to be identified are the words that influence these class labels the most. We will use feature weights to determine this, but before that, we prepare an unbiased dataset comprising an equal number of positive and negative reviews. This gives us a reduced dataset of 304 examples, 152 each for positive and negative, while ignoring the neutral ones as they are less discriminating and hence less interesting.

 

Next, we use the Text Processing extension to build a standard operator chain inside the ‘Process Documents from Data’ operator, as seen in Fig. 5. Use TF-IDF for word vector creation and pruning methods to see the impact on the number of attributes (tokens) you get. You will notice that percentual pruning in the range [3.0, 30.0] already reduces the attributes from 10633 to 223, with no negative effect on the results but a noticeable difference in execution speed of the process. The sub-process performs tokenization; filters out tokens shorter or longer than certain lengths, as well as stop words, since these bring no value; transforms the case; applies stemming so that words of similar origin are considered related; and finally generates n-Grams as 2-Grams. The latter is helpful for analysing writing styles that use adjective combinations, e.g. brilliant plot, excellent book, etc. This complex processing is relatively straightforward in RapidMiner.
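As a rough scikit-learn analogue of this chain (stemming omitted for brevity, and the pruning thresholds really only make sense on the full corpus), the vectorization step could be sketched as follows; the review texts here are placeholders.

# Rough analogue of the Process Documents chain; not the RapidMiner operators.
from sklearn.feature_extraction.text import TfidfVectorizer

reviews = [
    "A brilliant plot and an excellent book.",
    "Great science fiction, I loved every page.",
    "Too much technical detail for my taste.",
    "Could not finish the novel, boring plot.",
    "The author writes with humour and energy.",
    "A survival story that keeps you reading.",
]

vectorizer = TfidfVectorizer(
    lowercase=True,                       # transform cases
    stop_words="english",                 # filter stop words
    token_pattern=r"(?u)\b[a-z]{3,15}\b", # drop very short / very long tokens
    ngram_range=(1, 2),                   # unigrams plus 2-Grams
    min_df=0.03, max_df=0.30,             # percentual pruning, roughly [3%, 30%]
)

tfidf = vectorizer.fit_transform(reviews)
print(vectorizer.get_feature_names_out())
print(tfidf.toarray())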

Image5.png

 

 

The complete process uses several steps and is attached with this article for your reference.

 

Depending on the text, tokenization can result in many attributes (despite pruning). The question is: which of these features are really important? Weights to the rescue! The weight of an attribute gives its importance in relation to the target attribute, also called the label. Once we have the weights, we can express our preference among attributes.

 

RapidMiner offers more than a dozen operators that can characterize attributes with weights. We assign the role of label to the polarity attribute and split the data into two sets (one containing the positive and the other the negative examples). Then we simply apply the ‘Weight by Value Average’ operator on each set to weight the attributes against their class. Next, the ‘Select by Weights’ operator can keep the attributes whose weights are above a desired threshold, say 0.6. This gives us the most influential attributes (in terms of their TF-IDF occurrence frequencies). Feature weighting is explained in a community article available at [7], which also touches on feature selection in RapidMiner.
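As a loose pandas analogue of this weighting-and-selection step (my interpretation of the idea, not the operators’ exact formulas): average each attribute’s TF-IDF values within one class, scale the averages to [0, 1], and keep attributes above the 0.6 threshold.

# Toy illustration: per-class average TF-IDF as a weight, then thresholding.
import numpy as np
import pandas as pd

# toy TF-IDF matrix: rows = reviews of one polarity class, columns = tokens
tfidf = pd.DataFrame(
    np.array([[0.9, 0.0, 0.1],
              [0.8, 0.1, 0.0],
              [0.7, 0.0, 0.2]]),
    columns=["great", "boring", "time"],
)

weights = tfidf.mean(axis=0)                        # average value per attribute
weights = weights / weights.max()                   # scale so the top weight is 1.0

selected = weights[weights > 0.6].index.tolist()    # keep attributes above 0.6
print(weights.round(2).to_dict())
print("selected:", selected)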

Interpret:

We can now interpret results of our text analytic process. Three main insights were discovered.

  1. Overall polarity score: This showed that the average positive value is higher than average negative (+0.5 vs -0.34). This sounds good for the author, publishers and resellers.
  2. The bag of influential words that are found exclusively in positive reviews are shown below:

great, enjoy, love, recommend, make, fiction, survive, read_book, book_read, mark, science_fiction

 

 

 

Notice the combinations like read_book and science_fiction, captured by the 2-Grams. Similarly, the influential words associated exclusively with negative reviews are:

martian, mar, detail, technic, page, novel, plot, know, found, finish, work, write, movie, weir

 

 

 

 

A bit trickier are the overlapping words, i.e. those that appear in both positive and negative reviews. These are: “good, time, author, interest”. One way to analyse these is to look at the distribution of their TF-IDF values in the data, as shown for the words ‘time’ and ‘author’ in Fig. 6.

 

 

Image6.png

 

 

  3. The entity analysis based on average values reveals that of the 223 reviews (quite a small dataset), 168 talked about some entities and associated a certain emotion with them, as seen in the table below:

 

Image7.png

 

 

Except for the Quantity entity, Watson seems quite confident about the entities it detected (relevance > 0.8). The entity Person tops the count and the strongest emotion associated with it is sadness (value > 0.3). The concatenation of entity names (shown as the last column) gives clues in that direction.

 

Conclusion:

 

In this article, you learned about the new extension for reading Google Spreadsheets as ExampleSets in RapidMiner. Using this as a starting point, a more holistic scenario was presented. We gradually moved from data extraction to data enrichment through a third-party API. We then performed text analytics using out-of-the-box features of RapidMiner. Finally, we got a glimpse of how amazon.com customers reviewed the book ‘The Martian’. Our limited dataset served as an educational exercise and is of course not meant to influence the book’s sales or market opinion in any way.

 

Acknowledgments:

 

The Spreadsheet Table Extraction extension is developed as part of “Data Search for Data Mining (DS4DM)” project (website: http://ds4dm.com) which is sponsored by the German ministry of education and research (BMBF).

 

 

References:

[1] A sample Google Spreadsheet, sheet name ‘The Martian’, weblink: https://docs.google.com/spreadsheets/d/1vRJi3Ur3w6-9WhOa0G-vJ6GR4-RWGiBCwrQok5ILsow/edit#gid=1779829...

[2] A sample Google Spreadsheet, sheet name ‘Sentiments’, weblink: https://docs.google.com/spreadsheets/d/1vRJi3Ur3w6-9WhOa0G-vJ6GR4-RWGiBCwrQok5ILsow/edit#gid=1445561...

[3] RapidMiner Community article, Sentiment Analysis using Wordnet Dictionary, weblink: http://community.rapidminer.com/t5/Text-Analytics-in-RapidMiner/Sentiment-Analysis-using-Wordnet-Dic...

[4] IBM API Accounts, weblink: https://myibm.ibm.com/dashboard/

[5] IBM NLU API reference, weblink: https://www.ibm.com/watson/developercloud/natural-language-understanding/api/v1/#sentiment

[6] Endpoint Reference of IBM NLU API, weblink: https://gateway.watsonplatform.net/natural-language-understanding/api/v1/analyze?version=2017-02-27

[7] RapidMiner Community article, Feature Weighting Tutorial, weblink: http://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/Feature-Weighting-Tutorial/ta-p/...

[8] Rosette extension for RapidMiner, weblink: https://marketplace.rapidminer.com/UpdateServer/faces/download.xhtml?productId=rmx_rosette_text_tool...

[9] Aylien extension for RapidMiner, weblink: https://marketplace.rapidminer.com/UpdateServer/faces/download.xhtml?productId=rmx_com.aylien.textap...

  • Extension
  • Google Sheets
  • Spreadsheet
122 Views
0 Comments

If you happen to work at WeWork Tysons in VA or in the general Washington DC area, come down and say hi! I'm going to be speaking at the Spark DC Meetup group this coming Tuesday (5/23) at 6PM. You can get more Meetup details here.

Read more...

If you happen to work at WeWork Tysons in VA or in the general Washington DC area, come down and say hi! I'm going to be speaking at the Spark DC Meetup group this coming Tuesday (5/23) at 6PM. You can get more Meetup details here.

  • Meetup
  • radoop
  • Spark
246 Views
0 Comments

Last week I started the process of hiding a few Easter Eggs inside the community. I can't tell you what they are but there are instructions for you to execute when you find them. Good luck!

Read more...

Last week I started the process of hiding a few Easter Eggs inside the community. I can't tell you what they are but there are instructions for you to execute when you find them. Good luck!

  • Easter Eggs
274 Views
0 Comments

People often think a given model can just be put into deployment forever. In fact, the opposite is true. You need to maintain your models like you maintain a machine. Machine learning models can drift or break over time. Does this sound odd because they have no moving pieces? Well, you might want to take a close look at change and drift of concept.

Read more...

People often think a given model can just be put into deployment forever. In fact, the opposite is true. You need to maintain your models like you maintain a machine. Machine learning models can drift or break over time. Does this sound odd because they have no moving pieces? Well, you might want to take a close look at change and drift of concept.

Change of Concept

Let’s start off with an example. If you try to build a predictive maintenance model for an airplane, you often create columns like

Error5_occured_last_5_mins

as an input for your model. But what happens if error number 5 is not error number 5 anymore? Software updates can drastically change the data you have. They fix known issues but also encode your data in a different way. If you take the post-update data as an input for your pre-update model, it will do something, but not what you expected. This phenomenon is called change of concept.

Drift of Concept

A very similar phenomenon is drift of concept. This happens when the change is not drastic but emerges slowly. An industrial example is the encrustation of a sensor: it builds up over time, and a measured 100 degrees is not really 100 degrees anymore. An example in customer analytics is the adoption process of new technology. People did not start using iPhones all at once, but adopted them slowly. A column like “HasAnIphone” would indicate a very tech-savvy person in 2007; today it indicates an average person.

What Can I Do?

An example of window-based relearning: the pattern to detect (the circles) moves over time, and only recent data points are included to build a model.
 

A common approach to overcome concept drift is window-based relearning. ....... Read more on my medium.com page.
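To make the window idea concrete before you head over there, here is a minimal sketch of window-based relearning under the assumption of a simple sliding window: only the most recent examples are kept, and the model is refit on that window whenever new data arrives. Data, window size and model choice are purely illustrative.

# Minimal sketch of window-based relearning on a drifting, simulated concept.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
WINDOW = 200                                         # how much history to keep

def make_batch(t, n=50):
    """Simulate a drifting concept: the decision boundary shifts with time t."""
    X = rng.normal(size=(n, 2))
    y = (X[:, 0] + 0.01 * t * X[:, 1] > 0).astype(int)
    return X, y

X_hist = np.empty((0, 2))
y_hist = np.empty(0, dtype=int)

for t in range(10):                                  # new data arrives in batches
    X_new, y_new = make_batch(t)
    X_hist = np.vstack([X_hist, X_new])[-WINDOW:]    # drop data outside the window
    y_hist = np.concatenate([y_hist, y_new])[-WINDOW:]
    model = LogisticRegression().fit(X_hist, y_hist) # relearn on the window only
    print(t, model.coef_.round(2))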

--------------------------------------------------------------------------
Head of Data Science Services at RapidMiner