
mining high dimensional data...

yogafire Member Posts: 43 Contributor II
Hello!

I have a high-dimensional data set: it consists of about 2000 records, and its dimension is also about 2000.

I admit I am still an amateur at mining high-dimensional data. ;D

What I would like to ask is: what strategies are there for mining high-dimensional data using RM5?


Thank you in advance for your reply!

regs,

dimas yogatama

Answers

  • haddock Member Posts: 849 Maven
    Hi there Dimas,

    I'm not clear as to what you want to know, but you should understand that RM can handle much larger datasets than you are talking about. For example if I run the following to get a 10k * 10k matrix ...
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="391" width="915">
          <operator activated="true" class="generate_massive_data" expanded="true" height="60" name="Generate Massive Data" width="90" x="135" y="90">
            <parameter key="sparse_representation" value="false"/>
          </operator>
          <connect from_op="Generate Massive Data" from_port="output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    It doesn't take too long.
    Mar 24, 2010 6:16:23 PM INFO: Decoupling process from location //R5 Forum/data. Process is now associated with file //R5 Forum/data.
    Mar 24, 2010 6:17:15 PM INFO: No filename given for result file, using stdout for logging results!
    Mar 24, 2010 6:17:15 PM INFO: Loading initial data.
    Mar 24, 2010 6:17:15 PM INFO: Process starts
    Mar 24, 2010 6:17:21 PM INFO: Saving results.
    Mar 24, 2010 6:17:21 PM INFO: Process finished successfully after 5 s
    Just so you can compare: I'm on XP64, double quad-core with 16G. For Windows boxes it is that 64 that matters, as 32-bit boxes can only address 3??G (you'll have to Google for the right number).

    So the bottom line is that the main strategy is to have lots of memory, if I remember correctly.. ;)
  • yogafire Member Posts: 43 Contributor II
    Oh, maybe I didn't make myself clear, sorry.

    What I mean by strategy is: how can I optimize accuracy by selecting only the "good" attributes among all the available ones? If the specs are critical: I only have a laptop (Lenovo Y450-310) with a Core 2 Duo processor at 2.2 GHz and 2 GB of DDR3 RAM. Is that really a problem? :D

    Also, I would like to apologize for my English; I am still learning.

    regs,

    Dimas Yogatama
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi Dimas,
    Of course the amount of memory does make a difference. If the data doesn't fit into memory, the process either fails or you will need to stream the data from a database, which might slow down your process a lot.

    Coming back to the strategy question: RapidMiner offers several methods for selecting attributes. You might use either the Forward Selection or the Backward Elimination operator as a simple start. If those do not suit your needs or take too long, you might try another operator from the package Data Transformation / Attribute Set Reduction and Transformation / Selection and its subpackages.

    Greetings,
      Sebastian
  • yogafire Member Posts: 43 Contributor II
    How about attribute weighting? Based on your experience in mining high-dimensional data, how does attribute selection / attribute set reduction compare with attribute weighting in terms of performance?

    Also, what in the data generally determines how long model learning takes: the number of records or the number of dimensions?
  • IngoRM Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,751 RM Founder
    Hello Dimas,

    Well, there is no general answer to this. In some settings the complete removal of attributes works better, in others a rescaling based on weights. The same is true when it comes to "weight by wrapper" vs. "weight by filtering". From my experience, I would say that if you have severe problems with data set size and no other option is available to you, calculating weights followed by a weight-based selection can help without losing too much accuracy.

    Cheers,
    Ingo
  • yogafire Member Posts: 43 Contributor II
    thank you very much...
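
For readers who want to see the "calculate weights, then select by weight" idea from this thread outside of RapidMiner, here is a minimal Python sketch. It is not RapidMiner's implementation and not any poster's exact method: the weighting scheme (absolute Pearson correlation with a numeric label, a simple filter approach), the function names `correlation_weights` and `select_top_k`, and the synthetic data are all assumptions chosen for illustration.

```python
import math
import random


def correlation_weights(rows, labels):
    """Weight each attribute by |Pearson correlation| with a numeric label.

    Illustrative filter-style weighting; hypothetical helper, not a RapidMiner API.
    """
    n = len(rows)
    n_attrs = len(rows[0])
    mean_y = sum(labels) / n
    var_y = sum((y - mean_y) ** 2 for y in labels)
    weights = []
    for j in range(n_attrs):
        col = [row[j] for row in rows]
        mean_x = sum(col) / n
        var_x = sum((x - mean_x) ** 2 for x in col)
        cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(col, labels))
        denom = math.sqrt(var_x * var_y)
        # A constant attribute (or constant label) gets weight 0.
        weights.append(abs(cov) / denom if denom > 0 else 0.0)
    return weights


def select_top_k(weights, k):
    """Return the indices of the k highest-weighted attributes, best first."""
    ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return ranked[:k]


# Synthetic data (assumption): 200 examples, 50 attributes,
# and only attributes 3 and 7 actually drive the label.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(50)] for _ in range(200)]
labels = [2.0 * r[3] - 1.5 * r[7] + rng.gauss(0, 0.1) for r in rows]

weights = correlation_weights(rows, labels)
selected = select_top_k(weights, k=2)
print("selected attribute indices:", sorted(selected))
```

A filter like this touches every value once per attribute, so its cost grows with records times attributes and it never retrains a model. Wrapper approaches such as forward selection retrain a learner for each candidate attribute set, which is why they can take much longer on data with thousands of attributes.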
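
As a rough check on the memory discussion in this thread, the raw size of a dense numeric table can be estimated with simple arithmetic. The sketch below assumes 8 bytes per value (a double) and ignores all overhead; how RapidMiner actually stores data internally is not stated in the thread, so treat the numbers as a lower bound for a dense table of doubles, not as a measurement.

```python
def dense_matrix_bytes(rows: int, cols: int, bytes_per_value: int = 8) -> int:
    """Raw size of a dense numeric table, ignoring any per-object overhead.

    bytes_per_value=8 assumes double-precision values (an assumption).
    """
    return rows * cols * bytes_per_value


def to_mib(n_bytes: int) -> float:
    """Convert bytes to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 ** 2)


question_data = dense_matrix_bytes(2000, 2000)        # data set from the question
generated_data = dense_matrix_bytes(10_000, 10_000)   # the 10k * 10k example

print(f"2000 x 2000:   {to_mib(question_data):.1f} MiB")
print(f"10000 x 10000: {to_mib(generated_data):.1f} MiB")
```

Under these assumptions the 2000 x 2000 data set is a few tens of megabytes of raw values, far smaller than the 10k * 10k example, though learners and intermediate copies can need considerably more memory than the raw table.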
Sign In or Register to comment.