"How to split large dataset containing multiple smaller datasets"

jan_kvacek · August 2016

Hi.

I have the following problem. I want to import a large dataset in which different rows belong to different tables (with entirely different sets of attributes).

The dataset consists of “batches”, where every batch has a “headline” and “tables”. A headline is a single row containing a defined set of attributes, and there are several kinds of headlines. All rows below a headline belong to that headline until the next headline occurs. These rows can be of different kinds and can therefore belong to different tables (each holding different information, represented by a different set of attributes).

Example:

See the example below. I have 8 rows. I know that a first character of “A” or “F” marks the beginning of a new batch and, for example, that headline “A” has the attributes “XXX”, “TTT”, “ZZZ” and “YYY” (each of a specific length at a specific position). I also know that rows beginning with “V” or “L” contain information related to the batch they belong to. But when “V” follows headline “A”, it consists of different attributes than when it follows headline “F”.

Now I want several tables as a result. I want a separate table for each kind of headline (i.e. a table containing all information from headlines “A”). Then I want a table for all values in rows beginning with “V” from batches “A”, another table for all values in rows beginning with “V” from batches “F”, etc. I would also like to copy selected attributes from the headline to the rows that follow it, because when I group, for example, all rows starting with “V” from batches starting with “A”, I need some information for my later work that is contained only in the headlines.

1 AXXXTTTZZZYYY

2 VZZZTTTGGGFFF

3 VZZZTTTGGGFFF

4 LHHHBBBVVV

5 FXXXTTTSSSFFF

6 VDDDFFFGGTT

7 VDDDFFFGGTT

8 AXXXTTTZZZYYY

Does anybody have an idea how to solve this problem?

Thank you!
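
One possible way to do the split, sketched in plain Python rather than with RapidMiner operators: read the file line by line, remember the most recent headline, and route every following row into a table keyed by (headline code, row code). The function names `split_batches` and `parse_fixed`, the `batch_id` column and the `A_LAYOUT` widths are assumptions made for illustration. Only the “A”/“F” headline codes and the four three-character attributes of headline “A” come from the description above, and the rows are assumed to be stored without the leading row numbers.

```python
from collections import defaultdict

# From the post: a first character of "A" or "F" starts a new batch.
HEADLINE_CODES = {"A", "F"}

# Assumed layout for headline "A": four consecutive 3-character fields.
A_LAYOUT = [("XXX", 3), ("TTT", 3), ("ZZZ", 3), ("YYY", 3)]


def parse_fixed(body, layout):
    """Slice a fixed-width string into named fields, given (name, width) pairs."""
    fields, pos = {}, 0
    for name, width in layout:
        fields[name] = body[pos:pos + width]
        pos += width
    return fields


def split_batches(lines):
    """Group rows into tables keyed by (headline code, row code).

    Headline rows go to the key (code, None). Detail rows go to
    (current headline code, own code) and carry the headline text along,
    so headline attributes can be copied onto them later.
    """
    tables = defaultdict(list)
    headline_code = None
    headline_body = None
    batch_id = 0
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue  # skip blank lines
        code, body = line[0], line[1:]
        if code in HEADLINE_CODES:
            batch_id += 1
            headline_code, headline_body = code, body
            tables[(code, None)].append({"batch_id": batch_id, "body": body})
        else:
            if headline_code is None:
                raise ValueError(f"detail row before any headline: {line!r}")
            tables[(headline_code, code)].append(
                {"batch_id": batch_id, "headline_body": headline_body, "body": body}
            )
    return dict(tables)


rows = [
    "AXXXTTTZZZYYY", "VZZZTTTGGGFFF", "VZZZTTTGGGFFF", "LHHHBBBVVV",
    "FXXXTTTSSSFFF", "VDDDFFFGGTT", "VDDDFFFGGTT", "AXXXTTTZZZYYY",
]
tables = split_batches(rows)
for key, table in tables.items():
    print(key, len(table))
```

Each value in `tables` is a list of dictionaries, so it could be written out as one CSV per key and imported separately. Calling `parse_fixed(row["headline_body"], A_LAYOUT)` on a row from `tables[("A", "V")]` would give the headline attributes to copy onto that row; layouts for the other row kinds would have to be filled in from the real file specification.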
