suggestions for (pre)processing my datasource

harri678harri678 Member Posts: 34 Contributor II
edited November 2018 in Help
hello,

i am new with rapidminer and maybe somebody can help me with my problem! my datasource is the mysql database of an analysis tool containing measurement data. every 5 minutes a so called "transmission" is received. one transmission consists of 1-900.000 key-value pairs. those key-value-pairs represent a specific measured property ("descriptor") and the according value ("counter"). for better understanding i put the mysql-table-create statement here:

CREATE TABLE `temp_mem` (
  `transmission_id` int(4) unsigned NOT NULL,
  `descriptor` int(4) unsigned NOT NULL,
  `counter` int(4) unsigned default NULL,
  PRIMARY KEY  (`transmission_id`,`descriptor`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1

the content of this table looks like:

transmission_id  descriptor  counter
1788115            1         170257725
1788115            7              40078
1788115            32            5933
1788115            33            15167
.....
1788116            1         170255453
1788116            7              5863
.....


the so called descriptors are all defined in another sql-table with a short description. only descriptors with counter != 0 are submitted in a transmission. a typical transmission in reallife consists of about 5.000 to 300.000 different descriptors (and that every 5 minutes).

the thing i want to do with rapidminer is analyzing the collected data. i suspect there is a lot of redundant information in this measurement system, so i want to try some exploratory research on it and find the essential descriptors. my problem is the actual format of the stored data. for example using PCA, i would need some format like:

transmission_id    desc_1  desc_2  desc_3 ......  desc_976638
1788115            170257725    0    0    ........    0
1788116            170255453    44  0    ........    1
......

i already tried pivot and transpose but it didn't work. rapidminer has its problems with datasets and approx. 900.000 attributes. can you give me some advice how to handle this kind of problem?

thanks in advance,
harald

Answers

  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Harald,
    this is of course a very huge table, at least the number of attributes is enormous. You are already touching the dimensions, where you have to make extra effort, to handle the amount of data at all. A little example:

    If you decide to use the PCA on a dataset with 900.000 attributes, the covariance matrix would have to be estimated. The covariance matrix would consist of 810.000.000.000 entries, each using up 8 bytes. This would fill around 6 TB (!) of data. And then you would need enough examples to make an estimation for each of these cells...So PCA is simply unusable.

    So which ways are there to cope with the problem?

    I would try to load the data in chunks, meaning, that you divide the complete data set into subsets, each containing only a limited number of attributes. You then could analyze this subset, removing attributes. Then combining two reduced subsets, analyze them again and so on.
    The analysis again depends on what you are going to use the data for.

    Greetings,
      Sebastian
  • harri678harri678 Member Posts: 34 Contributor II
    Thanks for the response Sebastian!

    During the afternoon I came to the same conclusion that it's impossible to analyse all measurements at once. Currently I've started to build "groups of related descriptors" where mutual analysis makes sense. My idea is to apply PCA to every group (5-10) and reduce them to 3-5 principal components. Is it possible to merge multiple ExampleSets together into one based on an index (in my case the "transmission_id") ?

    e.g.:
    group1 --> PCA --> transmission_id + PC1_1 + PC2_1 + PC3_1
    group2 --> PCA --> transmission_id + PC1_2 + PC2_2
    group3 --> PCA --> transmission_id + PC1_3 + PC2_3 + PC3_3

    merged exampleSet: transmission_id + PC1_1 + PC2_1 + PC3_1 + PC1_2 + PC2_2 + PC1_3 + PC2_3 + PC3_3

    Is this possible? I'm planning to use a merged exampleSet for further analysis, like clustering or learning.


    Greetings, Harald
  • landland RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    the ExampleSetJoin operator will do this for you :)

    Greetings,
      Sebastian
  • harri678harri678 Member Posts: 34 Contributor II
    Hello,

    I've already figured it out after some trying! The last two days and nights I've been working with rapidminer or reading manual and lots of things are clear now :)
    It's a real good piece of software -- speeds up my thesis progress alot!

    Greets,
    Harald
Sign In or Register to comment.