About Data Pipeline Structure

tim0128 Member Posts: 6 Newbie
I am using RapidMiner for a big data analysis project, and I have built my workflow entirely out of Execute Python operators. The workflow reads a big pandas DataFrame at the very beginning and processes it row by row in the following operators. I have noticed that each operator only starts when the previous operator has finished all of the rows. Is it possible for a row, once it is finished processing, to be passed immediately to the next operator? In other words, does RapidMiner support data pipelines?

Best Answer

  • CKönig Administrator, Moderator, Employee, Member Posts: 68 RM Team Member
    Solution Accepted
    RapidMiner does support building data pipelines for streaming data. For enterprise projects this can involve writing to a Kafka queue where multiple worker nodes are listening to continue calculations. For the basic operators, you are right that these are usually "atomic" in nature: calculations are performed "in order". In your specific case, since you are mainly using Execute Python operators, it would make sense to build the data pipeline there as well.
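    Building the pipeline inside a single Execute Python operator could, for example, use chained generators so that each row flows through every step as soon as it is ready. This is only a sketch: `rm_main` is the real entry point that Execute Python calls, but the step functions and column names below are hypothetical placeholders.

    ```python
    # Sketch: a generator-based row pipeline inside one Execute Python operator.
    # clean_rows/enrich_rows and the "value" column are illustrative, not RapidMiner APIs.
    import pandas as pd

    def clean_rows(rows):
        for row in rows:
            row["value"] = row["value"] * 2  # example per-row transformation
            yield row  # handed downstream immediately, row by row

    def enrich_rows(rows):
        for row in rows:
            row["flag"] = row["value"] > 10  # example derived column
            yield row

    def rm_main(data: pd.DataFrame) -> pd.DataFrame:
        # itertuples feeds rows into the chained generators; each row
        # passes through every step before the next row is read.
        rows = (row._asdict() for row in data.itertuples(index=False))
        return pd.DataFrame(enrich_rows(clean_rows(rows)))
    ```

    Because generators are lazy, no intermediate full-table copy is materialized between the steps; only the final `pd.DataFrame(...)` collects the results.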

    Another comment: if you are only using the visual workflows of RapidMiner to orchestrate steps done purely in Python, be aware that any Python-based operator (Execute Python, Python Learner, Python Transformer, ...) introduces a small overhead into your overall runtime. Currently, every such operator starts a Python environment, and all of the data is serialized and read into a pandas DataFrame. When chaining multiple Python operators, this overhead can be significant compared to doing it all in one Python operator.
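    Concretely, three chained Execute Python operators that each do one transformation can be merged into a single `rm_main` so that the environment-startup and serialization cost is paid only once. The steps and column names here are purely illustrative:

    ```python
    # Sketch: three logical pipeline steps merged into one Execute Python
    # operator. The "price"/"qty" columns are hypothetical example data.
    import pandas as pd

    def rm_main(data: pd.DataFrame) -> pd.DataFrame:
        data = data.dropna()                         # step 1: clean
        data["total"] = data["price"] * data["qty"]  # step 2: derive
        return data[data["total"] > 0]               # step 3: filter
    ```

    The same DataFrame stays in memory across all three steps, instead of being serialized and deserialized between separate operators.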

    Further information can be found here: RapidMiner and Python - RapidMiner Documentation