Defining target variable

ram_nit05 Member Posts: 12 Contributor II
edited November 2018 in Help

I am a new user of RapidMiner, hence this query.

I have to use RapidMiner for a classification problem in the text mining domain, where the dependent variable is a class variable (taking values such as Account Status Issues, Fraud Issues, etc.) and the independent variable is free-flowing text (e.g. "my account status changes without me knowing about it").

When loading the CSV dataset as texts with the TextInput operator, which of the following forms of data should be used?
1. One dataset: the dependent variable (with all the classes in one variable) together with the independent variable (free-flowing text).
2. Multiple datasets: split the files into binary dependent variables (e.g. one file for whether it is an Account Status problem or not, another for Fraud Issues or not), each with the independent variable (free-flowing text).
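
To make the two layouts concrete, here is a minimal Python sketch (outside RapidMiner) showing that option 2 can be derived mechanically from option 1. The column names `label` and `text`, the sample rows, and the helper names are all hypothetical and chosen only for illustration.

```python
import csv
import io

# Hypothetical sample in the "one dataset" layout (option 1):
# one label column holding every class, plus one free-text column.
SAMPLE_CSV = """label,text
Account Status Issues,my account status changes without me knowing about it
Fraud Issues,there is a charge on my card that I did not make
Account Status Issues,my account was closed without notice
"""


def read_rows(csv_text):
    """Parse CSV text into a list of {'label': ..., 'text': ...} dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def to_binary_datasets(rows):
    """Derive option 2 from option 1: one binary dataset per class."""
    classes = sorted({row["label"] for row in rows})
    return {
        cls: [{"label": row["label"] == cls, "text": row["text"]} for row in rows]
        for cls in classes
    }


rows = read_rows(SAMPLE_CSV)
binary_sets = to_binary_datasets(rows)

for cls, dataset in binary_sets.items():
    positives = sum(1 for row in dataset if row["label"])
    print(f"{cls}: {positives} positive of {len(dataset)} rows")
```

Since the binary files can always be regenerated from the single file, keeping the single multi-class dataset as the source of truth loses nothing.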

Many thanks in advance for your assistance.


  • fischer Member Posts: 439  Guru
    This depends very much on your domain, and, as always, the answer is: Try both.

    If the domain really has a hierarchical structure like the one you described in option 2, it may be a good idea to split the data set accordingly. Also, if there are many classes, multi-class classifiers will probably not perform too well.

    Note that there is a Binary2MultiClassLearner that could help if you want to use a classifier that supports only binary classification.
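
One common way to build a multi-class classifier out of binary ones is the one-vs-rest strategy: train one binary model per class and predict the class whose model scores highest. The Python sketch below illustrates that idea only. It is not RapidMiner code, it is an assumption that the operator mentioned above works along these lines, and `KeywordBinaryClassifier` is a deliberately crude, hypothetical stand-in for a real binary learner.

```python
from collections import Counter


def tokenize(text):
    """Lower-case whitespace tokenizer (illustrative only)."""
    return text.lower().split()


class KeywordBinaryClassifier:
    """Toy binary scorer; a hypothetical stand-in for any binary learner."""

    def fit(self, texts, is_positive):
        self.pos = Counter()
        self.neg = Counter()
        for text, flag in zip(texts, is_positive):
            (self.pos if flag else self.neg).update(tokenize(text))
        return self

    def score(self, text):
        # Higher score means "more likely to belong to the positive class".
        return sum(self.pos[tok] - self.neg[tok] for tok in tokenize(text))


class OneVsRest:
    """Wrap a binary learner so it can handle more than two classes."""

    def fit(self, texts, labels):
        # One binary problem per class: "this class" versus "everything else".
        self.models = {
            cls: KeywordBinaryClassifier().fit(texts, [lab == cls for lab in labels])
            for cls in sorted(set(labels))
        }
        return self

    def predict(self, text):
        # Pick the class whose binary model is most confident.
        return max(self.models, key=lambda cls: self.models[cls].score(text))


texts = [
    "my account status changed",
    "my account was closed",
    "fraud charge on my card",
    "unknown charge looks like fraud",
]
labels = [
    "Account Status Issues",
    "Account Status Issues",
    "Fraud Issues",
    "Fraud Issues",
]

model = OneVsRest().fit(texts, labels)
print(model.predict("there is a fraud charge"))
print(model.predict("my account is closed"))
```

With this wrapper the data can stay in the single multi-class layout; the binary splits are created internally at training time.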

  • fischer Member Posts: 439  Guru
    Reading over your posting again, it occurs to me that I may have misunderstood your question. If your question was purely how you should organize your data, then the answer should be this:

    - You can read the data with any ExampleSource operator and use a StringTextInput, provided that the example set contains both a string column with the text and another column with the label (dependent variable).

    - You can organize your texts into directories, using the directory name as the dependent variable by using a TextInput operator.

    Again, which method you choose depends only on which is closer to how your data is currently organized and which is more convenient to you.
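
The directory-based layout can be pictured with a short Python sketch (again outside RapidMiner): each subdirectory name becomes the label of the text files inside it. The directory names, file names, and the `load_labelled_texts` helper are hypothetical and serve only to show the convention.

```python
import tempfile
from pathlib import Path


def load_labelled_texts(root):
    """Return (label, text) pairs, using each subdirectory name as the label."""
    pairs = []
    for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        for file_path in sorted(class_dir.glob("*.txt")):
            pairs.append((class_dir.name, file_path.read_text(encoding="utf-8")))
    return pairs


# Hypothetical sample documents, keyed by the class they belong to.
samples = {
    "account_status": ["my account was closed without notice"],
    "fraud": ["there is a charge I did not make"],
}

with tempfile.TemporaryDirectory() as root:
    # Write one directory per class and one text file per document.
    for label, texts in samples.items():
        class_dir = Path(root) / label
        class_dir.mkdir()
        for index, text in enumerate(texts):
            (class_dir / f"{index}.txt").write_text(text, encoding="utf-8")

    pairs = load_labelled_texts(root)

for label, text in pairs:
    print(f"{label}: {text}")
```

The resulting pairs carry the same information as a two-column table with a label column and a text column, which is why either organization works.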

  • ram_nit05 Member Posts: 12 Contributor II
    Many thanks for your help.

    I have put the dependent variable and the independent variable in the same dataset, and it seems to be working fine.