Work on disease data

student_compute Member Posts: 63 Contributor II
edited June 12 in Help
Hi friends. I used to work on text data before.
Now I have a dataset about a disease, with 18 features, 106 samples, and two classes.
There are 79 samples from healthy specimens that do not have the disease, and 25 samples from sick patients.
The remaining 2 samples are unknown.
I wanted to know: should I do normalization and other pre-processing?
Should I do oversampling or undersampling?
Is this possible in RapidMiner?
Do you know what a typical process would look like for me?
As always, I'm grateful for your help.
Have a happy day, everyone!


  • DocMusher Member Posts: 242 Unicorn
    I can have a look at your data. Send your data as a PM, and I will answer you ASAP.
  • student_compute Member Posts: 63 Contributor II
    I am unable to send the data.
    Could you tell me what specifications the data should have?
    I'm waiting for your help.
  • Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,150 Unicorn
    With such a small sample of both classes, I wouldn't do either upsampling or downsampling. You could try Weight by Stratification instead and use a simple machine learning model such as Naive Bayes or Decision Trees to start. But your ability to discriminate between classes may be very limited due to your small sample size. You should review the in-program tutorials for the referenced operators if you need help setting up the process, or you could post the XML of your own process if you are running into a specific problem and need more help.
    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
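    The Weight by Stratification idea above can be imitated outside RapidMiner; here is a minimal sketch using scikit-learn's class_weight="balanced" (an analogue I am assuming for illustration, not the RapidMiner operator itself) on synthetic data shaped like the 79/25 split from the question:

    ```python
    # Sketch of the class-weighting idea with scikit-learn's class_weight
    # (an analogue of Weight by Stratification, not the RapidMiner operator).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # toy stand-in for the ~79 vs ~25 class imbalance in the original question
    X, y = make_classification(n_samples=104, n_features=18, weights=[0.76],
                               random_state=0)

    plain = DecisionTreeClassifier(random_state=0)
    weighted = DecisionTreeClassifier(class_weight="balanced", random_state=0)

    for name, model in [("unweighted", plain), ("class-weighted", weighted)]:
        # balanced accuracy weights both classes equally, so the effect
        # of the class weights on the minority class becomes visible
        scores = cross_val_score(model, X, y, cv=5, scoring="balanced_accuracy")
        print(f"{name}: {scores.mean():.3f}")
    ```

    Balanced accuracy is used here because plain accuracy can look good on imbalanced data even when the minority class is mostly misclassified.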
  • varunm1 Member Posts: 497 Unicorn
    Hello @student_compute

    These are basic suggestions; more specific ones would depend on the actual data. Since we don't have your data, you can try the following.

    Normalization: check whether the feature values have very different ranges. For example, if one feature has values between 1 and 10 and another has values between 1000 and 10000, then normalize; otherwise there is no need to do so.
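    That range check can be sketched in a few lines of numpy (the ratio threshold below is a rough rule of thumb I am assuming, not a standard):

    ```python
    import numpy as np

    # Quick check of whether feature ranges differ enough to justify
    # normalization. Toy data mirrors the example above: one feature in
    # [1, 10], another in [1000, 10000].
    rng = np.random.default_rng(0)
    X = np.column_stack([rng.uniform(1, 10, 100),
                         rng.uniform(1000, 10000, 100)])

    ranges = X.max(axis=0) - X.min(axis=0)
    print("feature ranges:", ranges)

    if ranges.max() / ranges.min() > 10:   # rough rule of thumb, not a standard
        # z-score normalization: zero mean, unit variance per feature
        X_norm = (X - X.mean(axis=0)) / X.std(axis=0)
        print("normalized std per feature:", X_norm.std(axis=0))
    ```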

    Preprocessing: separate the samples with missing labels from the data; later you can predict their labels with the trained model. Use feature selection techniques, if possible, to see whether all 18 features are actually important.
    Over- or undersampling: first try without any sampling and check how the models perform. If you then feel sampling is necessary, I recommend SMOTE for upsampling. As your dataset is small, I don't think downsampling is a good idea.
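    The core idea behind SMOTE is to synthesize new minority samples on the line segments between existing minority samples and their nearest minority neighbours. A minimal numpy sketch of that idea (an illustration only, not the full SMOTE algorithm; the function name is my own):

    ```python
    import numpy as np

    def smote_upsample(X_min, n_new, k=5, seed=None):
        """Minimal SMOTE-style sketch: each synthetic point lies on the
        segment between a random minority sample and one of its k nearest
        minority neighbours. Illustration only, not full SMOTE."""
        rng = np.random.default_rng(seed)
        # pairwise distances within the minority class
        d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)                # never pick yourself
        nn = np.argsort(d, axis=1)[:, :k]          # k nearest neighbours each
        base = rng.integers(0, len(X_min), n_new)  # random base sample per point
        nbr = nn[base, rng.integers(0, k, n_new)]  # one of its neighbours
        gap = rng.random((n_new, 1))               # random spot on the segment
        return X_min[base] + gap * (X_min[nbr] - X_min[base])

    # e.g. grow a 25-sample minority class by 50 points to match a 75-sample
    # majority class
    X_min = np.random.default_rng(0).random((25, 18))
    X_new = smote_upsample(X_min, n_new=50, seed=0)
    print(X_new.shape)  # (50, 18)
    ```

    Because every synthetic point is a convex combination of two real minority points, the new samples stay inside the region the minority class already occupies.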

    Build your models using Cross Validation, and add the feature selection techniques inside this operator.
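    The reason for putting feature selection inside cross validation is to avoid leakage: the selection must be refit on each training fold so it never sees the corresponding test fold. A sketch of the same pattern with a scikit-learn Pipeline (my assumed analogue of the nested RapidMiner operators):

    ```python
    # Sketch of "feature selection inside cross validation" with scikit-learn.
    # SelectKBest sits inside the Pipeline, so it is refit on each training
    # fold and never sees the test fold (no leakage).
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import Pipeline

    # toy stand-in: 100 samples, 18 features, only a few of them informative
    X, y = make_classification(n_samples=100, n_features=18, n_informative=5,
                               random_state=0)

    pipe = Pipeline([
        ("select", SelectKBest(f_classif, k=8)),  # keep the 8 best features
        ("model", GaussianNB()),
    ])

    scores = cross_val_score(pipe, X, y, cv=10)
    print(f"10-fold accuracy: {scores.mean():.3f}")
    ```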

    Finally, yes, all of these things are possible in RapidMiner.

    Hope this helps

  • student_compute Member Posts: 63 Contributor II
    edited 3:27AM
    Thank you very much to all the friends who replied in this thread.

    I changed my data: I picked a different dataset and collected 100 samples.
    I still have 18 features: 23 samples in the unsuccessful class, 75 samples in the successful class, and 2 samples with an unknown class.
    These data are about four-ball athletes.
    Can you help me with this kind of data now?
    Should I grow or shrink my data with oversampling or undersampling? How does that work in RapidMiner?
    Is my process correct?
    I created and attached this sample process:

    <?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
      <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
        <process expanded="true">
          <operator activated="true" class="read_excel" compatibility="8.1.000" expanded="true" height="68" name="Read Excel" width="90" x="45" y="34">
            <parameter key="excel_file" value="C:\data.xlsx"/>
            <list key="annotations"/>
            <list key="data_set_meta_data_information"/>
          </operator>
          <operator activated="true" class="replace_missing_values" compatibility="8.2.000" expanded="true" height="103" name="Replace Missing Values" width="90" x="179" y="34">
            <list key="columns"/>
          </operator>
          <operator activated="true" class="normalize" compatibility="8.2.000" expanded="true" height="103" name="Normalize" width="90" x="313" y="34"/>
          <operator activated="true" class="set_role" compatibility="8.2.000" expanded="true" height="82" name="Set Role" width="90" x="447" y="34">
            <parameter key="attribute_name" value="class"/>
            <parameter key="target_role" value="label"/>
            <list key="set_additional_roles"/>
          </operator>
          <operator activated="true" class="concurrency:cross_validation" compatibility="8.2.000" expanded="true" height="145" name="Cross Validation" width="90" x="581" y="34">
            <process expanded="true">
              <operator activated="true" class="k_nn" compatibility="8.2.000" expanded="true" height="82" name="k-NN" width="90" x="45" y="34">
                <parameter key="k" value="3"/>
                <parameter key="nominal_measure" value="JaccardSimilarity"/>
              </operator>
              <connect from_port="training set" to_op="k-NN" to_port="training set"/>
              <connect from_op="k-NN" from_port="model" to_port="model"/>
              <portSpacing port="source_training set" spacing="0"/>
              <portSpacing port="sink_model" spacing="0"/>
              <portSpacing port="sink_through 1" spacing="0"/>
            </process>
            <process expanded="true">
              <operator activated="true" class="apply_model" compatibility="8.2.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
                <list key="application_parameters"/>
              </operator>
              <operator activated="true" class="performance_classification" compatibility="8.2.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34">
                <list key="class_weights"/>
              </operator>
              <connect from_port="model" to_op="Apply Model" to_port="model"/>
              <connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
              <connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
              <connect from_op="Performance" from_port="performance" to_port="performance 1"/>
              <connect from_op="Performance" from_port="example set" to_port="test set results"/>
              <portSpacing port="source_model" spacing="0"/>
              <portSpacing port="source_test set" spacing="0"/>
              <portSpacing port="source_through 1" spacing="0"/>
              <portSpacing port="sink_test set results" spacing="0"/>
              <portSpacing port="sink_performance 1" spacing="0"/>
              <portSpacing port="sink_performance 2" spacing="0"/>
            </process>
          </operator>
          <connect from_op="Read Excel" from_port="output" to_op="Replace Missing Values" to_port="example set input"/>
          <connect from_op="Replace Missing Values" from_port="example set output" to_op="Normalize" to_port="example set input"/>
          <connect from_op="Normalize" from_port="example set output" to_op="Set Role" to_port="example set input"/>
          <connect from_op="Set Role" from_port="example set output" to_op="Cross Validation" to_port="example set"/>
          <connect from_op="Cross Validation" from_port="model" to_port="result 1"/>
          <connect from_op="Cross Validation" from_port="example set" to_port="result 2"/>
          <connect from_op="Cross Validation" from_port="performance 1" to_port="result 3"/>
          <portSpacing port="source_input 1" spacing="0"/>
          <portSpacing port="sink_result 1" spacing="0"/>
          <portSpacing port="sink_result 2" spacing="0"/>
          <portSpacing port="sink_result 3" spacing="0"/>
          <portSpacing port="sink_result 4" spacing="0"/>
        </process>
      </operator>
    </process>
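    For comparison, the posted process roughly corresponds to the following scikit-learn sketch (an assumed analogue for illustration, not generated by RapidMiner): replace missing values, normalize, then run 10-fold cross validation of a 3-NN classifier. Rows with an unknown label would be filtered out before this and predicted afterwards with the fitted pipeline.

    ```python
    # Rough scikit-learn analogue of the posted RapidMiner process:
    # Replace Missing Values -> Normalize -> Cross Validation with k-NN (k=3).
    import numpy as np
    from sklearn.impute import SimpleImputer
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    # toy stand-in for the Excel file: 100 samples, 18 features
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 18))
    X[rng.random(X.shape) < 0.05] = np.nan   # sprinkle in missing values
    y = rng.integers(0, 2, 100)              # stand-in for the "class" column

    pipe = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),    # Replace Missing Values
        ("scale", StandardScaler()),                   # Normalize
        ("knn", KNeighborsClassifier(n_neighbors=3)),  # k-NN with k = 3
    ])

    scores = cross_val_score(pipe, X, y, cv=10)
    print(f"10-fold accuracy: {scores.mean():.3f}")
    ```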

    I would be grateful for your help, friends.
    (I should say that I searched the forum myself, but I did not find a solution to my problem.)
    How should I choose the features? And how can I build a model with a neural network?

    Thankful <3
    I'm waiting for your help
