
"How to split large dataset containing multiple smaller datasets"

jan_kvacek Member Posts: 4 Contributor I
edited June 9 in Help

Hi.

I have the following problem. I want to import a large dataset in which different rows belong to different tables (with completely different sets of attributes).

The dataset consists of “batches”, where every batch has a “headline” and “tables”. A headline is a single row containing a defined set of attributes, and there are several kinds of headlines. All rows below a headline belong to it until a new headline occurs. These rows can be of different kinds and can therefore belong to different tables (containing different information represented by different sets of attributes).

Example:

See the example below. I have 8 rows. I know that a first character of “A” or “F” marks the beginning of a new batch and, for example, that headline “A” has the attributes “XXX”, “TTT”, “ZZZ” and “YYY” (each of a specific length at a specific position). I also know that rows beginning with “V” or “L” contain information related to the batch they belong to. But when a “V” row follows headline “A”, it consists of different attributes than when it follows headline “F”.

Now I want several tables as a result. I want a separate table for each kind of headline (i.e. a table containing all information from headlines “A”). Then I want a table for all values in rows beginning with “V” from batches “A”, another table for all values in rows beginning with “V” from batches “F”, and so on. I would also like to copy selected attributes from the headline to the rows that follow it, because when I group, for example, all “V” rows from “A” batches, I need some information that is contained only in the headers for my later work.

1 AXXXTTTZZZYYY

2 VZZZTTTGGGFFF

3 VZZZTTTGGGFFF

4 LHHHBBBVVV

5 FXXXTTTSSSFFF

6 VDDDFFFGGTT

7 VDDDFFFGGTT

8 AXXXTTTZZZYYY
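
To make the desired result concrete, here is a minimal Python sketch of the grouping logic described above. It is not a RapidMiner process, only an illustration of the expected output. It assumes that the leading numbers 1 to 8 in the example are row numbers rather than part of the data, and that only “A” and “F” start a new batch. The names `split_batches`, `HEADLINE_TYPES`, `batch_id` and `headline_raw` are made up for this illustration.

```python
from collections import defaultdict

# Assumption taken from the post: a row starting with "A" or "F" opens a new batch.
HEADLINE_TYPES = {"A", "F"}


def split_batches(lines):
    """Split rows into headline tables and (headline type, row type) tables."""
    headline_tables = defaultdict(list)  # "A" -> all headline rows of type A
    detail_tables = defaultdict(list)    # ("A", "V") -> all V rows inside A batches
    current = None                       # the most recent headline row
    batch_id = 0

    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip empty lines
        kind = line[0]
        if kind in HEADLINE_TYPES:
            # A new headline starts a new batch.
            batch_id += 1
            current = {"batch_id": batch_id, "type": kind, "raw": line}
            headline_tables[kind].append(current)
        else:
            if current is None:
                raise ValueError(f"row found before any headline: {line!r}")
            # Copy headline information onto the row so it survives the split.
            detail_tables[(current["type"], kind)].append(
                {"batch_id": batch_id, "headline_raw": current["raw"], "raw": line}
            )
    return dict(headline_tables), dict(detail_tables)


# The eight example rows, without the leading row numbers.
SAMPLE = [
    "AXXXTTTZZZYYY",
    "VZZZTTTGGGFFF",
    "VZZZTTTGGGFFF",
    "LHHHBBBVVV",
    "FXXXTTTSSSFFF",
    "VDDDFFFGGTT",
    "VDDDFFFGGTT",
    "AXXXTTTZZZYYY",
]

headlines, details = split_batches(SAMPLE)
for key, rows in details.items():
    print(key, [row["raw"] for row in rows])
```

Each resulting table could then be cut into its own fixed-width attributes, since the column positions depend on both the headline type and the row type. The `batch_id` and `headline_raw` fields stand in for whichever headline attributes need to be copied down to the rows.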

Does anybody have an idea how to solve this problem?

Thank you!
