I work in the R&D department of a company that builds automated machinery. We have developed an automatic machine that welds photovoltaic cells in series to produce PV modules. We want to create a model of our machine. I will try to explain.
The machine consists of:
1.- an aluminium heating plate, where a resistor introduces energy and a thermocouple measures the actual temperature.
2.- an infrared lamp, which is activated when a solar cell enters the soldering zone. A pyrometer detects the actual temperature of the cell.
The machine is controlled by a PLC, which scans (and thus writes a row of data) every millisecond. Each row contains the temperature of the solar cell (the key variable we want to model) and all the other variables (lamp power, resistor power, temperature of the aluminium base).
A complete welding cycle takes around 1.2 seconds, so we have around 1200 rows, in which the temperature of the cell rises from 25 degrees to around 220 degrees and falls back to 120 degrees.
Now my question: we would like to run the machine in several configurations (e.g. lamp power 10%, 50% and 100%) and record welding cycles for each configuration. We would then feed these data to RapidMiner and get a "model" of the machine for intermediate states.
We have been looking through the tutorials and documentation, but we cannot find a "cyclic" case. I mean, our 1200 rows represent one cycle, not 1200 independent cases.
We are stuck on this "cyclic" issue and cannot figure out how it could be implemented in RapidMiner. It would be much appreciated if someone could give us some advice.
Thank you in advance for your collaboration,
Thanks for trying RapidMiner and welcome to the community. What you are facing is one of the most important tasks in Data Science: Feature Extraction. I will go into some length (and might use this as a starting point for a longer introduction to be used as a blog post or something).
Let me start by summing up your use case. I am a physicist by education and have a bit of process engineering knowledge rubbing off from my brother, who works as a process engineer in the semiconductor industry. Still, I feel we should step back and translate the problem into data science terminology.
What you have is a cyclic machine producing something. The production process takes 1.2s. During this production process, you can take data from various sensors. I assume you store it in a DB and tag it with something like a cycle id. Thus you might have a table like this:
Cycle_id  Time  Temperature  T_Al
1         1     25           20
1         2     50           40
1         1     20           60
2         2     40           130
3         3     50           150
Additionally, for every Cycle_id you have “environmental” conditions. Usually these are production parameters (like power of lamp), environmental parameters (like moisture) and recipe parameters (like details on the wafer used or something).
So you might have a second table with this information:
Cycle_id  power_of_lamp  WaferCategory
1         70             A
2         80             B
The Data Prep Challenge
For most data science algorithms you need to bring your data into a “one-line-per-cycle” format. The second table already has this format. The first one needs to be treated.
A first and easy approach is to do an aggregation per cycle. You would create something like
Cycle_id  avg(Temperature)  max(Temperature)  min(Temperature)
1         100               200               20
2         104               198               25
This can then be joined to the second table and used in your analysis. While this is quick and can be done with one operator in RapidMiner (Aggregate), it loses a lot of information. So as a next step we would need to sit down and extract more information from the first table. The key idea is that each 1200-row batch is a time series on which you can do standard time series things. My approach would be to use the new operator Group into Collection, which is available in the Operator Toolbox extension, and move forward from there.
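Outside RapidMiner, the aggregate-then-join idea can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the sensor values, the tuple layout (cycle_id, time_ms, temperature) and the function names are made up for the example and are not taken from the real machine or from any RapidMiner API.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sensor rows: (cycle_id, time_ms, cell_temperature)
sensor_rows = [
    (1, 1, 25.0), (1, 2, 120.0), (1, 3, 220.0),
    (2, 1, 26.0), (2, 2, 110.0), (2, 3, 200.0),
]

# Hypothetical recipe table: cycle_id -> settings for that cycle
recipes = {
    1: {"power_of_lamp": 70, "wafer_category": "A"},
    2: {"power_of_lamp": 80, "wafer_category": "B"},
}


def aggregate_per_cycle(rows):
    """Collapse many sensor rows into one summary row per cycle."""
    temps = defaultdict(list)
    for cycle_id, _time_ms, temperature in rows:
        temps[cycle_id].append(temperature)
    return {
        cycle_id: {
            "avg_temperature": mean(values),
            "max_temperature": max(values),
            "min_temperature": min(values),
        }
        for cycle_id, values in temps.items()
    }


def join_with_recipes(features, recipe_table):
    """Inner join on cycle_id: one row holding recipe and features."""
    return [
        {"cycle_id": cid, **recipe_table[cid], **features[cid]}
        for cid in sorted(features)
        if cid in recipe_table
    ]


one_row_per_cycle = join_with_recipes(aggregate_per_cycle(sensor_rows), recipes)
for row in one_row_per_cycle:
    print(row)
```

Each resulting row carries both the recipe settings and the summary statistics of one cycle, which is the "one-line-per-cycle" shape most learners expect.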
The Use Case?
The exact details of what you would extract from these time series depend on your use case. I think you have not yet said what you want to do. Let me give you a few examples of what we did with our customers:
Anomaly Detection
One of the straightforward things is anomaly detection, to identify abnormal behaviour in your production process. This turns into something like SPC, but multivariate and more powerful.
Root Cause Analysis
Like anomaly detection, but here you have non-optimal parts and want to figure out what the potential causes might be.
Predictive Process Control
Most production processes have a few settings an engineer can set. If you have something like a “quality” measure for your product, you can predict the result beforehand and thus optimize the settings for given environmental variables.
Assuming your data is of the form that @mschmitz describes, you may also want to look at the operator "Pivot" (which allows you to turn many rows into a single row by creating new attributes based on an index value). In the extreme case, this would create an additional 1200 attributes per example from each of the sensor readings for each cycle. As already explained, you probably don't want to leave those in their raw form but instead do some feature engineering. For that you could also look at Generate Aggregation or perhaps PCA. But this is all doable in RapidMiner.
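To make the reshaping concrete, here is a small plain-Python sketch of what a pivot does to the data. It does not reproduce RapidMiner's Pivot operator; the row layout (cycle_id, time_ms, value), the sample values and the attribute naming scheme are assumptions made for the example.

```python
# Hypothetical long-format rows: (cycle_id, time_ms, cell_temperature)
long_rows = [
    (1, 1, 25.0), (1, 2, 120.0), (1, 3, 220.0),
    (2, 1, 26.0), (2, 2, 110.0), (2, 3, 200.0),
]


def pivot_cycles(rows, value_name="temperature"):
    """Turn many rows per cycle into one wide row per cycle.

    Each time index becomes its own attribute, e.g. temperature_t1.
    """
    wide = {}
    for cycle_id, time_ms, value in rows:
        wide.setdefault(cycle_id, {})[f"{value_name}_t{time_ms}"] = value
    return wide


wide_rows = pivot_cycles(long_rows)
for cycle_id, attributes in sorted(wide_rows.items()):
    print(cycle_id, attributes)
```

With one sample per millisecond over a 1.2 s cycle, the same function would produce roughly 1200 attributes per sensor, which is why some feature engineering or dimensionality reduction usually follows.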
Thank you very much for such a detailed explanation! In fact I feel that you understood our problem much better than I could explain it. :-)
If I try to summarize the recommended approach, it consists of carefully selecting the relevant features of our process from each time series and making 1 row out of them. This "feature table" should be joined with the "recipe" table, ending up with 1 row per "process recipe".
Ok, I think now I have the "framework" or boundary conditions for building my process model.
Regarding our goal with RapidMiner: what we would like to obtain is some sort of predictive model into which we could enter the environmental parameters (recipe), and which could predict the resulting temperature profile.
I guess that we should first feed RM with several combinations of recipes (lamp power %, baseplate temperature, lamp distance from the wafer, etc.) and the corresponding "relevant features". It will create a model with which we could predict the resulting cycle of an unknown recipe (not previously seen during training).
To be honest, what we would ideally prefer is to get as output a prediction of the "temperature-time" profile over the 1.2 seconds. But I understood from the explanation that this is not possible, because we already have to combine the data into a single row for each cycle, so the output will always be a single row.
For the moment I think we will start with feature selection and at least try to build a model able to predict "not-seen-recipes".
Thank you so much again!
I think we mostly agree. One thing is that you usually do not just select relevant features, but also build them. Typical examples are peaks in distributions or slopes. What you create depends on your use case, and usually the domain expert (= engineer) knows it best.
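As an illustration of "building" features such as peaks and slopes, the following plain-Python sketch derives a few candidates from one temperature profile. The feature names and the sample profile are assumptions for the example; which features actually matter is a question for the process engineer.

```python
def profile_features(times_ms, temps):
    """Derive a handful of hand-built features from one welding cycle.

    times_ms and temps are equally long sequences, ordered by time.
    """
    peak_index = max(range(len(temps)), key=temps.__getitem__)
    # Finite-difference slope between consecutive samples (degrees per ms)
    slopes = [
        (temps[i + 1] - temps[i]) / (times_ms[i + 1] - times_ms[i])
        for i in range(len(temps) - 1)
    ]
    return {
        "peak_temperature": temps[peak_index],
        "time_to_peak_ms": times_ms[peak_index] - times_ms[0],
        "max_heating_rate": max(slopes),  # steepest rise
        "max_cooling_rate": min(slopes),  # steepest fall (negative)
        "final_temperature": temps[-1],
    }


# Hypothetical, heavily down-sampled cycle (a real one has about 1200 samples)
example_times = [0, 300, 600, 900, 1200]
example_temps = [25.0, 120.0, 220.0, 160.0, 120.0]

example_features = profile_features(example_times, example_temps)
for name, value in example_features.items():
    print(name, value)
```

Running this once per cycle gives one feature row per cycle, which can then be joined to the recipe table in the same way as the simple aggregates.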
On your use case: extrapolating the time series is tough. Usually you do not predict a whole series, but only single values (regression) or strings (classification). Are you really interested in the whole curve? Or are you interested in "good" and "bad" curves? The latter is way easier to do.
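A minimal sketch of the single-value regression case: predict one built feature (here the peak temperature) from one recipe setting (lamp power) and query it for a setting that was not in the training data. The numbers are invented for illustration, not measurements, and the straight-line relationship is an assumption; a real model would use more recipe parameters and would have to be validated against recorded cycles.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: one row per recorded cycle
lamp_power_percent = [10.0, 50.0, 100.0]
peak_temperature = [100.0, 160.0, 235.0]

slope, intercept = fit_line(lamp_power_percent, peak_temperature)

# Predict the peak temperature for a recipe not seen during training
unseen_power = 75.0
predicted_peak = slope * unseen_power + intercept
print(f"predicted peak at {unseen_power}% lamp power: {predicted_peak:.1f}")
```

Repeating this for several built features (peak, time to peak, final temperature) gives a coarse description of the expected curve without predicting all 1200 points.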
You should have a mail from my colleague Jess in your inbox. We would be happy to jump on a call with you to discuss the options. Just drop him or me a note. I am available at mschmitz at rapidminer.com