About Data Pipeline Structure

tim0128 Member Posts: 6 Newbie
I am using RapidMiner for a big data analysis project, and I have built my workflow entirely out of Execute Python operators. The workflow reads a big pandas DataFrame at the very beginning and processes it row by row in the following operators. I have noticed that each operator only starts when the previous operator has finished all of the rows. Is it possible for a row, once it is finished processing, to be passed immediately to the next operator? In other words, does RapidMiner support data pipelines?

Best Answer

  • CKönig Administrator, Moderator, Employee, Member Posts: 68 RM Team Member
    Solution Accepted
    RapidMiner does support building data pipelines for streaming data. For enterprise projects this can involve writing to a Kafka queue where multiple worker nodes are listening to continue calculations. For the basic operators, you are right that these are usually "atomic" in nature: calculations are performed "in order". In your specific case, since you are mainly using Execute Python operators, it would make sense to build the data pipeline there as well.
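    Building the pipeline inside a single Execute Python operator could, for example, use chained generators so that each row flows through every step as soon as it is ready. This is only a sketch: `rm_main` is the real entry point that Execute Python calls, but the step functions and column names below are hypothetical placeholders.

    ```python
    # Sketch: a generator-based row pipeline inside one Execute Python operator.
    # clean_rows/enrich_rows and the "value" column are illustrative, not RapidMiner APIs.
    import pandas as pd

    def clean_rows(rows):
        for row in rows:
            row["value"] = row["value"] * 2  # example per-row transformation
            yield row  # handed downstream immediately, row by row

    def enrich_rows(rows):
        for row in rows:
            row["flag"] = row["value"] > 10  # example derived column
            yield row

    def rm_main(data: pd.DataFrame) -> pd.DataFrame:
        # itertuples feeds rows into the chained generators; each row
        # passes through every step before the next row is read.
        rows = (row._asdict() for row in data.itertuples(index=False))
        return pd.DataFrame(enrich_rows(clean_rows(rows)))
    ```

    Because generators are lazy, no intermediate full-table copy is materialized between the steps; only the final `pd.DataFrame(...)` collects the results.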

    Another comment: if you are only using the visual workflows of RapidMiner to orchestrate steps done purely in Python, be aware that any Python-based operator (Execute Python, Python Learner, Python Transformer, ...) introduces a small overhead into your overall runtime. Currently, every such operator starts a Python environment, and all of the data is serialized and read into a pandas DataFrame. When chaining multiple Python operators, this overhead can be significant compared to doing it all in one Python operator.
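    Concretely, three chained Execute Python operators that each do one transformation can be merged into a single `rm_main` so that the environment-startup and serialization cost is paid only once. The steps and column names here are purely illustrative:

    ```python
    # Sketch: three logical pipeline steps merged into one Execute Python
    # operator. The "price"/"qty" columns are hypothetical example data.
    import pandas as pd

    def rm_main(data: pd.DataFrame) -> pd.DataFrame:
        data = data.dropna()                         # step 1: clean
        data["total"] = data["price"] * data["qty"]  # step 2: derive
        return data[data["total"] > 0]               # step 3: filter
    ```

    The same DataFrame stays in memory across all three steps, instead of being serialized and deserialized between separate operators.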

    Further information can be found here: RapidMiner and Python - RapidMiner Documentation