mining high dimensional data...

yogafire Member Posts: 43  Maven
hello!!!!

I have a high-dimensional data set: about 2000 records, each with about 2000 attributes.

I admit I am still an amateur at mining high-dimensional data... ;D

My question is: what strategies are there for mining high-dimensional data with RM5?


thank you for your immediate reply!!!

regs,

dimas yogatama

Answers

  • haddock Member Posts: 849  Guru
    Hi there Dimas,

    I'm not clear as to what you want to know, but you should understand that RM can handle much larger datasets than you are talking about. For example if I run the following to get a 10k * 10k matrix ...
    <?xml version="1.0" encoding="UTF-8" standalone="no"?>
    <process version="5.0">
      <context>
        <input>
          <location/>
        </input>
        <output>
          <location/>
          <location/>
        </output>
        <macros/>
      </context>
      <operator activated="true" class="process" expanded="true" name="Process">
        <process expanded="true" height="391" width="915">
          <operator activated="true" class="generate_massive_data" expanded="true" height="60" name="Generate Massive Data" width="90" x="135" y="90">
            <parameter key="sparse_representation" value="false"/>
          </operator>
          <connect from_op="Generate Massive Data" from_port="output" to_port="result 1"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
        </process>
      </operator>
    </process>
    It doesn't take too long.
    Mar 24, 2010 6:16:23 PM INFO: Decoupling process from location //R5 Forum/data. Process is now associated with file //R5 Forum/data.
    Mar 24, 2010 6:17:15 PM INFO: No filename given for result file, using stdout for logging results!
    Mar 24, 2010 6:17:15 PM INFO: Loading initial data.
    Mar 24, 2010 6:17:15 PM INFO: Process starts
    Mar 24, 2010 6:17:21 PM INFO: Saving results.
    Mar 24, 2010 6:17:21 PM INFO: Process finished successfully after 5 s
    Just so you can compare, I'm on XP64, dual quad-core with 16 GB. On Windows boxes it is the 64 that matters, as a 32-bit box can address at most 4 GB in total, and a single process gets less than that.

    So the bottom line is that the main strategy is to have lots of memory, if I remember correctly.. ;)
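A rough back-of-the-envelope check of the memory point above, as a minimal sketch. It assumes a dense table in which every value takes 8 bytes (a double); RapidMiner's real in-memory footprint will be higher because of object and metadata overhead, and the function names here are purely illustrative.

```python
def dense_matrix_bytes(rows: int, cols: int, bytes_per_value: int = 8) -> int:
    """Raw size of a dense rows x cols table of fixed-width numeric values.

    Assumption: 8 bytes per value (double precision), no per-object overhead.
    """
    return rows * cols * bytes_per_value


def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 ** 2)


if __name__ == "__main__":
    # The 10k x 10k example from this post, and the 2000 x 2000 set from the question.
    for rows, cols in [(10_000, 10_000), (2_000, 2_000)]:
        size = dense_matrix_bytes(rows, cols)
        print(f"{rows} x {cols}: {to_mib(size):.1f} MiB")
```

Under these assumptions the 10k * 10k matrix needs roughly 763 MiB of raw values, while a 2000 * 2000 table needs only about 30 MiB, so the data itself is small; the cost in the original question is more likely to come from the learning and selection steps than from loading the table.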
  • yogafire Member Posts: 43  Maven
    Oh, maybe I didn't make that clear, sorry...

    By strategy I mean: how can I optimize accuracy by selecting only the "good" attributes among all the available ones? If the specs are critical: I only have a laptop (Lenovo Y450-310) with a Core 2 Duo processor @ 2.2 GHz and 2 GB of DDR3 RAM. Is that really a problem? :D

    Also, sorry for my English, I am still learning.

    regs,

    Dimas Yogatama
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,525   Unicorn
    Hi Dimas,
    of course the amount of memory does make a difference. If the data doesn't fit into memory, the process either fails or you will need to stream the data from a database, which might slow down your process a lot.

    Coming back to the strategy question: RapidMiner offers several methods for selecting attributes. You might use either the Forward Selection or the Backward Elimination operator as a simple start. If they do not suit your needs or take too long, you might try another operator from the package Data Transformation / Attribute Set Reduction and Transformation / Selection and its subpackages.

    Greetings,
      Sebastian
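To make the idea of forward selection concrete, here is a minimal sketch of the greedy scheme in plain Python. It is not RapidMiner's Forward Selection operator: the `score` callback is a hypothetical stand-in for "train and validate a model on this attribute subset", and `toy_score`, the attribute names, and `min_gain` are made up for illustration.

```python
from typing import Callable, List, Sequence


def forward_selection(
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    min_gain: float = 1e-9,
) -> List[str]:
    """Greedily add the attribute that improves the score most, until nothing helps."""
    selected: List[str] = []
    best = score(selected)
    remaining = list(candidates)
    while remaining:
        # Evaluate every remaining attribute added to the current subset.
        trials = [(score(selected + [attr]), attr) for attr in remaining]
        top_score, top_attr = max(trials, key=lambda t: t[0])
        if top_score - best < min_gain:
            break  # no candidate improves the score enough
        selected.append(top_attr)
        remaining.remove(top_attr)
        best = top_score
    return selected


# Hypothetical scoring function: two attributes carry signal, every attribute
# in the subset costs a small penalty. A real run would use cross-validated accuracy.
GAIN = {"a1": 0.3, "a7": 0.2}


def toy_score(subset: List[str]) -> float:
    return 0.5 + sum(GAIN.get(a, 0.0) for a in subset) - 0.01 * len(subset)


if __name__ == "__main__":
    attributes = [f"a{i}" for i in range(1, 11)]
    print(forward_selection(attributes, toy_score))
```

Each round scores every remaining attribute once, so with about 2000 attributes the first round alone needs about 2000 model evaluations. That is the reason such wrapper methods can "take too long" on wide data sets.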
  • yogafire Member Posts: 43  Maven
    How about attribute weighting? In your experience with mining high-dimensional data, how does attribute selection / attribute set reduction compare with attribute weighting in terms of performance?

    Also, in general, what affects model training time more when it comes to the data: the number of records or the number of dimensions?
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,643  RM Founder
    Hello Dimas,

    well, there is no general answer for this. In some settings the complete removal of attributes works better, in others a rescaling based on weights. The same is true when it comes to "weight by wrapper" vs. "weight by filtering". From my experience, I would say that if you have severe problems with data set size and no other option is possible for you, calculating weights followed by a weight-based selection can help without losing too much accuracy.

    Cheers,
    Ingo
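The "weights first, then weight-based selection" idea can be sketched as follows. This is only an illustration in plain Python, not RapidMiner's weighting operators: absolute Pearson correlation with the label is used here as one possible filter weight, and the column names and toy data are invented.

```python
import math
from typing import Dict, List, Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences; 0.0 if either is constant."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def weight_by_correlation(
    columns: Dict[str, Sequence[float]], label: Sequence[float]
) -> Dict[str, float]:
    """Filter-style weighting: one pass per attribute, no model training needed."""
    return {name: abs(pearson(values, label)) for name, values in columns.items()}


def select_top_k(weights: Dict[str, float], k: int) -> List[str]:
    """Keep the k attributes with the largest weights."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


if __name__ == "__main__":
    # Invented toy data: six records, four attributes, binary label.
    label = [0, 0, 1, 1, 1, 0]
    columns = {
        "signal": [0.1, 0.2, 0.9, 1.0, 0.8, 0.1],
        "inverse": [1, 1, 0, 0, 0, 1],
        "noise": [1, 0, 1, 0, 1, 0],
        "constant": [5, 5, 5, 5, 5, 5],
    }
    weights = weight_by_correlation(columns, label)
    print(select_top_k(weights, 2))
```

A filter weight like this touches each attribute once, so its cost grows linearly with the number of attributes. A wrapper has to train a model for every candidate subset it tries, which is why the filter route is the cheaper option on a small machine.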
  • yogafire Member Posts: 43  Maven
    thank you very much...