"data mining on school database"

Elmo Member Posts: 7 Contributor II
edited May 23 in Help
Hi all, I am new to data mining.

I am trying to analyse a set of information on the students of a small school:
the grades in different subjects (math, sciences, English, French, ...), the distance or travel time to school, family background, and the disciplinary record.
I have the data in an Excel file.

Can any experienced user of RapidMiner give me some advice?

Answers

  • steffen Member Posts: 347  Guru
    Hello Elmo and welcome to RapidMiner

    Since you are a complete newbie to data mining, I suggest that you go and get a good book. Here is a thread with suggestions for some literature: click

    It is much easier to give you more advice and help when you are able to specify your questions ... starting from such a general position one could start a data mining lecture ... hope you understand :)

    kind regards,

    Steffen

  • Elmo Member Posts: 7 Contributor II
    Thank you, Steffen, for your answer.

    I think I expressed myself badly :-(

    I wanted to say I am new to RapidMiner; I downloaded it last week.

    I have been reading about data mining for four months, but, as you know, theory is not the same as application.

    May I ask some technical questions from time to time?

    regards
  • steffen Member Posts: 347  Guru
    Hello Elmo

    Of course! :)

    I just wanted to clarify that "I have data, what now?" questions are hard to answer.

    kind regards,

    Steffen
  • Elmo Member Posts: 7 Contributor II
    Hello Steffen

    Thank you very much.

    Would you please tell me what mistake I am making to get this message:

    "Many operators like classification and regression methods or the PerformancEvaluator require the input example sets to have a label or class attribute. If this not the case, applying these operators is pointless. If you read the data using an ExampleSource, you can specify the label attribute by using a 'label' tag in the attribute description file."

    I am trying to load my data, which is on an Excel 2003 sheet. I have deleted the other sheets and saved it as a CSV file.

    best regards

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,523   Unicorn
    Hi Elmo,
    you have to specify which attribute (that is what we call the columns) is the target of your analysis. This attribute is then called the label.
    If you perform a regression to predict a numerical value, this label has to be numerical; otherwise you have to choose a nominal label for classification.

    Greetings,
      Sebastian
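
Outside RapidMiner, the idea of a label can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names and sample rows are made up (they are not Elmo's real file), and `split_label` and `is_numerical` are hypothetical helpers, not RapidMiner functions.

```python
import csv
import io

# Made-up sample data; the column names are assumptions for illustration only.
SAMPLE_CSV = """distance_km,family_status,warnings,math_grade
1.5,together,0,85
12.0,divorced,3,62
4.2,together,1,78
"""


def split_label(rows, label_name):
    """Separate the label column (the target) from the remaining attributes."""
    if not rows or label_name not in rows[0]:
        raise ValueError(f"label column {label_name!r} not found")
    features = [{k: v for k, v in row.items() if k != label_name} for row in rows]
    labels = [row[label_name] for row in rows]
    return features, labels


def is_numerical(values):
    """Return True if every value parses as a number, False if any is nominal."""
    try:
        for value in values:
            float(value)
    except ValueError:
        return False
    return True


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
features, labels = split_label(rows, "math_grade")

# A numerical label suggests regression; a nominal one suggests classification.
task = "regression" if is_numerical(labels) else "classification"
print(task)
```

Choosing `family_status` as the label instead would make `is_numerical` return `False`, which corresponds to the classification case described above.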
  • Elmo Member Posts: 7 Contributor II
    Thank you, Sebastian.

    May I ask other questions?

    Which is better for RM, Excel 2003 or Excel 2007? Or is it the same?

    Which is easier: data from MS Access or MS Excel?

    I am having trouble with the Excel files. How can I make sure I am doing it the right way?

    many thanks
  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,523   Unicorn
    Hi Elmo,
    I think you can use both Excel versions, but you have to save the files in the "old" .xls format instead of the new XML-style document format.
    Excel is probably easier to use than Access, but if your data exceeds some number of lines (64k if I remember correctly) you will have to change to Access. But with a school database this should not be a problem :)

    Greetings,
      Sebastian
  • Elmo Member Posts: 7 Contributor II
    Thank you, Sebastian, you are really kind.

    Can you please tell me what mistake I am making?

    When I try to work on my data, on an Excel sheet, I get the following error message:

    " Parameter 'excel_file' is not set and has no default value. "

    best regards
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,621  RM Founder

    Parameter 'excel_file' is not set and has no default value.
    It seems that you did not specify the file. Just set the parameter "excel_file" to the Excel file you want to read the data from.

    Cheers,
    Ingo
  • Elmo Member Posts: 7 Contributor II
    Thank you, Ingo. It worked.

    My intention is to find a correlation, if there is one, between the distance from home to school and the behavior of a student (measured by warnings), or the achievement in a certain subject, let's say Math or English.
    I also want to find out the influence of home background (parents together or divorced) on the student's achievement in school, and whether this influence varies according to gender.
    Would you please tell me which operators to use in order to measure the correlation?

    thanks & best regards
  • TobiasMalbrecht Moderator, Employee, Member Posts: 289  Maven
    Hi,

    For the computation of correlations you can use the [tt]CorrelationMatrix[/tt] operator. However, be aware that the correlation coefficients in the matrix might not be the right concept for gaining insight into your data, as the correlation coefficient only measures a linear relationship among numerical values. When faced with nominal attributes (e.g. family status), the correlation coefficient has almost no useful meaning.

    Since your problem seems to be a standard classification task with the achievement at school (i.e. the grade) as the label, I would use a classification learner (Decision Tree, Naive Bayes, etc.) to model the data and find relationships.

    Kind regards,
    Tobias
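
The point that a correlation coefficient only captures linear relationships can be illustrated with a small sketch of the Pearson coefficient in plain Python. This is not what the CorrelationMatrix operator runs internally (that is not stated in the thread); the `pearson` helper and all numbers below are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Made-up values: distance to school (km) and number of warnings per student.
distance_km = [1.0, 2.0, 3.0, 4.0, 5.0]
warning_counts = [0, 1, 1, 2, 3]
print(pearson(distance_km, warning_counts))

# A perfect but non-linear dependence (y = x squared) gives a coefficient of zero,
# even though y is fully determined by x.
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
print(pearson(xs, ys))
```

A nominal column such as family status cannot be passed to `pearson` at all without first inventing a numeric coding, which is one reason a classification learner is suggested for that part of the analysis.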
  • Elmo Member Posts: 7 Contributor II
    Thank you, Tobias.

    I am thinking of replacing the column on family status with two columns with numerical entries.

    The first: the number of parents at home
    (2 if the parents live together, 1 if one parent is divorced, dead, or working abroad, and 0 if the student is living with grandparents or by him/herself)

    The second: the status converted into a number
    (a positive value if the student lives with two parents, since that has a positive effect on the student's well-being;
    a zero value if one of the parents is abroad; a negative value if one of the parents is dead or divorced; and a lower negative value if both parents are absent. I think I have to ask the school counsellor which has the worst effect on the student.)

    Do you think this is reasonable? Can I do it in RapidMiner? If yes, which operators are the best?

    best regards
  • IngoRM Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, Community Manager, RMResearcher, Member, University Professor Posts: 1,621  RM Founder
    Hi,

    Yes, you can do something like this in RapidMiner with the AttributeConstruction operator. This operator is able to work with conditions like

    if (family_status == "together", 2, if (family_status == "divorced", 1, 0))

    Cheers,
    Ingo
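
For comparison, the same nested condition can be sketched in Python. The function below mirrors the expression above; the lookup table is one possible way to cover the fuller scheme Elmo described, and its status names ("deceased", "abroad", "grandparents", "alone") are hypothetical, since the actual values in the data are not given in the thread.

```python
def parents_at_home(family_status):
    """Mirror of: if(status == "together", 2, if(status == "divorced", 1, 0))."""
    if family_status == "together":
        return 2
    if family_status == "divorced":
        return 1
    return 0  # every other status falls through to 0


# Hypothetical status names covering the fuller scheme described in the thread:
# 2 for both parents, 1 for one parent, 0 for no parent at home.
PARENTS_AT_HOME = {
    "together": 2,
    "divorced": 1,
    "deceased": 1,
    "abroad": 1,
    "grandparents": 0,
    "alone": 0,
}

statuses = ["together", "divorced", "abroad"]
print([parents_at_home(s) for s in statuses])
print([PARENTS_AT_HOME[s] for s in statuses])
```

Note that with the two-branch condition a status such as "abroad" ends up as 0, whereas the table assigns it 1, so the nested condition would need one more branch per additional status to match the intended coding.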