Introducing the new Shapelet Extension

tftemme Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, RMResearcher, Member Posts: 103  RM Research

We, the research department of RapidMiner, are happy to announce the release of version 0.1.0 of the new Shapelet Extension. Discover new possibilities to analyse complex time series data. Perform feature transformations of your data which are specific to your problem and are based on the underlying patterns in the time series.

The Shapelet algorithm was developed within the European Union-funded research project PRESED (see [1] and [2]), which focused on quality prediction through sensor data mining in the steel industry.

The basic idea of shapelets is that a subsequence of a time series can represent a recurring pattern in the entire time series, and hence can be considered a base function of the time series data. In addition, some subsequences may occur only in certain classes of your data, and their occurrence can be used to train a machine learning model to predict these classes.


Image 1: Principle of Shapelet algorithm (also called EAST). [3]

To retrieve good shapelet candidates, subsequences are drawn at random from a collection of time series batches. These candidates are then used to perform a feature transformation on separate time series batches: each shapelet candidate is compared with a new time series batch, and the minimal distance between the candidate and the batch is calculated. If the minimal distances are small for many batches, the shapelet can be considered to occur often in the time series and can then be regarded as a base function.
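The minimal distance described above can be illustrated with a short Python sketch. This is not the extension's implementation: the function name `min_distance` is hypothetical, and the plain Euclidean distance over equally long windows is an assumption, since the post does not say which distance measure or normalization the operators use.

```python
import math


def min_distance(shapelet, series):
    """Smallest distance between the shapelet and any equally long window of the series.

    Assumption: plain Euclidean distance without normalization.
    """
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    # Slide the shapelet over every possible start position in the series.
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(shapelet, window)))
        best = min(best, dist)
    return best


shapelet = [0.0, 1.0, 0.0]
batch_with_pattern = [5.0, 5.0, 0.0, 1.0, 0.0, 5.0]
batch_without_pattern = [5.0, 5.0, 5.0, 5.0, 5.0, 5.0]

# A small value means the shapelet occurs (almost) exactly somewhere in the batch.
print(min_distance(shapelet, batch_with_pattern))
print(min_distance(shapelet, batch_without_pattern))
```

A batch that contains the pattern yields a distance of zero, while a batch without it yields a clearly larger value. That difference is what makes the distance usable as a feature.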

The extension provides four new operators:

Create Searchspace operator:

This operator is used to draw the shapelet candidates from a collection of input batches. The candidates are collected in the new shapelet model which is provided at the 'shapelet model' result port.


Image 2: Demo process to create a shapelet model with the Create Searchspace operator.

Image 3: Visualization tab of a Shapelet Model.
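As a rough illustration of what drawing candidates from a collection of batches can look like, here is a minimal Python sketch. It is not the operator's actual code: the function `draw_candidates` and its parameters (`n_candidates`, `min_len`, `max_len`, `seed`) are hypothetical names chosen for this example, and uniform random sampling is an assumption.

```python
import random


def draw_candidates(batches, n_candidates, min_len, max_len, seed=None):
    """Draw random contiguous subsequences (shapelet candidates) from the batches.

    Assumption: batch, length and start position are each sampled uniformly.
    """
    if min_len < 1 or min_len > max_len:
        raise ValueError("require 1 <= min_len <= max_len")
    rng = random.Random(seed)
    # Only batches that can hold a subsequence of at least min_len are usable.
    usable = [b for b in batches if len(b) >= min_len]
    if not usable:
        raise ValueError("no batch is long enough for min_len")
    candidates = []
    for _ in range(n_candidates):
        batch = rng.choice(usable)
        length = rng.randint(min_len, min(max_len, len(batch)))
        start = rng.randint(0, len(batch) - length)
        candidates.append(batch[start:start + length])
    return candidates


batches = [[1, 2, 3, 4, 5], [6, 7, 8, 9]]
for candidate in draw_candidates(batches, n_candidates=3, min_len=2, max_len=3, seed=0):
    print(candidate)
```

Fixing the seed makes the draw reproducible, which helps when comparing models built from the same batches.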

Shapelet Transformation operator:

This operator takes a shapelet model and performs a feature transformation on a collection of input batches. The resulting features (e.g. the minimal distances between shapelets and the new time series) are provided at the 'features' output port.

Image 4: Demo process to perform a feature transformation with the Shapelet Transformation operator.

Image 5: Resulting feature vector of the shapelet transformation.
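The feature transformation can be pictured as building one row per batch and one column per shapelet, where each cell holds the minimal distance between the two. The Python sketch below shows this under stated assumptions: `shapelet_transform` and `min_distance` are hypothetical helper names, and Euclidean distance is assumed because the post does not specify the measure used by the operator.

```python
import math


def min_distance(shapelet, series):
    """Smallest Euclidean distance between the shapelet and any window of the series."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    return min(
        math.sqrt(sum((a - b) ** 2 for a, b in zip(shapelet, series[i:i + m])))
        for i in range(len(series) - m + 1)
    )


def shapelet_transform(shapelets, batches):
    """Return one feature vector per batch: the minimal distance to each shapelet."""
    return [[min_distance(s, b) for s in shapelets] for b in batches]


shapelets = [[0.0, 1.0, 0.0], [5.0, 5.0]]
batches = [
    [5.0, 5.0, 0.0, 1.0, 0.0, 5.0],
    [5.0, 5.0, 5.0, 5.0, 5.0, 5.0],
]

features = shapelet_transform(shapelets, batches)
for row in features:
    print(row)
```

Each row can then be joined with the label of its batch and passed to any learner, which is the role the 'features' output plays in the demo process.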

Select Shapelets by Weight operator:

This operator can be used to select the most meaningful shapelets from the whole shapelet model. First, use the Create Searchspace operator to create a shapelet model. Then perform a shapelet transformation on labeled data and use any 'Weight by' operator to determine the weights of the calculated features with respect to the label. Finally, pass the resulting Attribute Weights to the Select Shapelets by Weight operator to select only the most important shapelets (base functions) and apply them to unseen data.

Image 6: Demo process to reduce the number of shapelets in the model to only the main shapelets, by using the Select Shapelets by Weight operator.

Image 7: Feature weights of the calculated features from the shapelet transformation.

Image 8: Reduced shapelet model, with only high feature weight shapelets selected.
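The selection step can be sketched in Python as keeping only the shapelets whose features received the highest weights. The function `select_by_weight` and the top-k criterion are assumptions made for this illustration; the post does not describe which selection criteria the operator offers, and the weights below are made-up example values.

```python
def select_by_weight(shapelets, weights, k):
    """Keep the k shapelets with the highest feature weights, in their original order.

    Assumption: one weight per shapelet, and a simple top-k selection.
    """
    if len(shapelets) != len(weights):
        raise ValueError("need exactly one weight per shapelet")
    if k < 0:
        raise ValueError("k must not be negative")
    # Rank shapelet indices by weight, highest first.
    ranked = sorted(range(len(shapelets)), key=lambda i: weights[i], reverse=True)
    keep = sorted(ranked[:k])
    return [shapelets[i] for i in keep]


shapelets = [[0.0, 1.0, 0.0], [5.0, 5.0], [2.0, 3.0, 4.0]]
weights = [0.10, 0.90, 0.50]  # illustrative weights, e.g. from a 'Weight by' step

reduced_model = select_by_weight(shapelets, weights, k=2)
print(reduced_model)
```

The reduced list plays the role of the smaller shapelet model, which can then be used for the transformation of unseen data.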

Shapelet Model to ExampleSet operator:

This operator can be used to convert the shapelets in a shapelet model to an ExampleSet to investigate them further.

You can download the free extension from the Marketplace (Shapelet Extension). For more information, see [3].

[1] D. Arnu, E. Yaqub, C. Mocci, V. Colla, M. Neuer, G. Fricout, X. Renard, C. Mozzati and P. Gallinari: A Reference Architecture for Quality Improvement in Steel Production. 1st International Data Science Conference 2017, Salzburg.

[2] D. Arnu, E. Yaqub, F. Temme, R. Klinkenberg and M. Neuer: Smart Data für die Qualitätskontrolle in der Stahlproduktion. Tagungsband 20. IFF-Wissenschaftstage, 21-22 June 2017, Magdeburg.

[3] X. Renard, M. Rifqi, G. Fricout and M. Detyniecki: EAST representation: fast discovery of discriminant temporal patterns from time series. ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data, Riva del Garda, Italy (2016).


Comments

  • hughesfleming68 Member Posts: 239   Unicorn
    edited February 22
    Would it be possible to have a section for extensions in the forum? If I had not seen @tftemme's post, I might never have come across this. It could also be a place for posting additional PDFs that further explain the concepts behind new tools such as this one, as well as for further discussion.

    There should also be some communication about relevant articles posted on the net. For example, if I didn't follow Martin Schmitz @mschmitz, I would have missed all his articles posted on Medium. Users should not have to come across these by accident.

    I have attached the relevant PDF. I hope this is possible, as it would be very helpful.

    regards,

    Alex


  • mschmitz Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 2,122  RM Data Scientist

    What would be a better format? A newsletter?

    BR,
    Martin
    - Head of Data Science Services at RapidMiner -
    Dortmund, Germany
  • hughesfleming68 Member Posts: 239   Unicorn
    edited February 22
    Hi Martin, a newsletter is a good idea. I think a section on the forum with a description of each article and a link would also work. There is a lot of content out there that new users won't find unless they are guided. I read your article on your Random Forest Encoder on Medium. I am sure a lot of people missed it because there isn't an easy way to find it. There should be a better way to organize this.

    Alex
  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,435  Community Manager
    So this is the "section" of the forum dedicated to extensions: https://community.rapidminer.com/discussions/tagged/Extensions
    I know it can seem odd that there are no more categories, but the new tagging system has the advantage that a topic can be in more than one, e.g. "extensions" and "time series". I believe you can then turn any of these into an RSS feed if you like. I have not done this myself, but I believe that @BalazsBarany has.

    Scott
  • BalazsBarany Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert Posts: 290   Unicorn
    Sure, just open the page source (Ctrl+U) in your browser for whatever page you are on (including a tagged page) and search for feed.rss.
  • Eamonn Member Posts: 1 Learner I
    Congrats. If I may, for context let me point to the original Shapelet paper. As it happens, Shapelets  were invented ten years ago this week.
    Lexiang Ye, Eamonn J. Keogh: Time series shapelets: a new primitive for data mining. KDD 2009: 947-956 
  • Oprick Member Posts: 19 Contributor II
    Congratulations! Looks promising :smile:
  • David_A Moderator, Employee, RMResearcher, Member Posts: 179  RM Research
    edited March 14
    Welcome to the community @Eamonn ,

    and congratulations on the anniversary. We hope you like what we have built, and any feedback from you would be much appreciated.

    Best,
    David


  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525   Unicorn
    Hello,
    great, I was looking for this to analyze sensor data. I think the shapes of the curves are relatively important for predicting what happens next.
    However, I'm struggling to apply this extension's operators. I have a dataset containing roughly 600k samples and my current selection of 60 sensors. The operator does not seem to be able to draw from this ExampleSet directly? (An error is thrown saying that -692.322.231 combinations are possible but I selected 1000, and that I would need to select fewer than possible.)
    I found that I can create the search space after converting the data into windows. However, with a window width of 20, this multiplies the memory consumption by a factor of probably 100 (every window becomes an ExampleSet, which adds a huge overhead to the data itself). So I'm running out of memory on my 32 GB machine.
    Are there any plans to make this compatible with bigger data sets? This would require simply drawing from the original data set itself, with a parameter to select the width of the windows. I would also need to apply it to the time series data itself without conversion.
    If not, is the extension open source, so that we could extend it ourselves and contribute?

    Greetings,
     Sebastian
