New Extension for Applied Onomastics (name recognition) on GitHub + help needed

NamSor Member Posts: 6 Contributor I
edited November 2018 in Help
Hi,

Last month we prototyped a RapidMiner integration with the NamSor GendRE API to recognize the gender of names, using 'Enrich Data by Webservice':
http://namesorts.com/2014/04/23/rapidminer-to-enrich-gender-data/

We've started building a custom extension to offer more functionalities, but we're running into problems.
https://github.com/namsor/rapidminer-onomastics-extension

1) The firstName in the CSV output doesn't correspond to the input
2) The REAL value shows a rounded value instead of full precision (ignore the value itself, it is randomly generated)
3) We had to create a 'DummyOperator' with the name 'generate_extract', otherwise RM complains that the documentation is missing

Otherwise, the integration seems to work with RM 5.3.015; the operator appears under /Onomastics/Name2Gender

Any help welcome!
Thanks,
Elian

Input file:
firstName;lastName;countryIso2
Blas;PEREZ+HENRIQUEZ;
A.+Craig;COPETAS;
Abdel;AISSOU;
Abderrahman;BEDDI;
Achmad+Danny;GAZALI;
Ada;COLAU;
Adam;GREEN;
Adam+S.;POSEN;
Adeline;BRAESCU+KERLAN;
Aditya;GARG;
Adnan;BALI;
Adnane;EL+FASSI;
Adriaan;SMIT;
Adrian;MCGINN;
Adrián;MICHEL+ESPINO;
Adriana;VERDIER;
Adrien;REGNIER+LAURENT;fr
Adrien;SURU;
Илья;Ковальчук;ru


What we get in the output (genderScale is a random number):

"firstName";"lastName";"countryIso2";"genderScale";"gender"
"Blas";"PEREZ+HENRIQUEZ";;0.0;"Male"
"A.+Craig";"COPETAS";;1.0;"Female"
"Abdel";"AISSOU";;2.0;"Unknown"
"Blas";"BEDDI";;0.0;"Male"
"A.+Craig";"GAZALI";;1.0;"Female"
"Blas";"COLAU";;0.0;"Male"
"Abdel";"GREEN";;2.0;"Unknown"
"Blas";"POSEN";;0.0;"Male"
"Blas";"BRAESCU+KERLAN";;0.0;"Male"
"Blas";"GARG";;0.0;"Male"
"Abdel";"BALI";;2.0;"Unknown"
"A.+Craig";"EL+FASSI";;1.0;"Female"
"Blas";"SMIT";;0.0;"Male"
"A.+Craig";"MCGINN";;1.0;"Female"
"Abdel";"MICHEL+ESPINO";;2.0;"Unknown"
"Abdel";"VERDIER";;2.0;"Unknown"
"A.+Craig";"REGNIER+LAURENT";"fr";1.0;"Female"
"A.+Craig";"SURU";;1.0;"Female"
"Blas";"Ковальчук";"ru";0.0;"Male"

Answers

  • Marco_Boeck Team Lead Software Engineering Administrator, Moderator, Employee, Member, University Professor Posts: 1,928   RM Engineering
    Hi,

    cool stuff 8)

    1) I don't quite get the problem. What CSV output?
    2) RapidMiner is by default rounding to 3 fraction digits when displaying data. You can change the default setting in the preferences under "General" -> "rapidminer.general.fractiondigits.numbers". When calculating, the actual numbers are used.
    3) Not quite sure what that is about. Do you also get this warning in the console when you remove your extension? I don't think it has anything to do with it.

    Regards,
    Marco
  • NamSor Member Posts: 6 Contributor I
    Hi Marco! Thanks for helping out.

    I've created a simple process loading data from an Excel file with

    >firstName;lastName;countryIso2
    >Blas;PEREZ+HENRIQUEZ;
    >A.+Craig;COPETAS;
    >Abdel;AISSOU;

    Then I've connected this Import Excel operator with my custom Extension operator Name2Gender, and connected the output to a CSV file. Unfortunately, the output of my Extension operator seems completely mixed up, with the same firstName being repeated several times, incorrect numeric values, etc.

    I think the problem comes from the way I pass parameters in and out in the doWork method


    @Override
    public void doWork() throws OperatorException {

        ExampleSet exampleSet = inputSet.getData();
        Attributes attributes = exampleSet.getAttributes();
        Attribute fnAttribute = attributes.get(ATTRIBUTE_FN);
        Attribute lnAttribute = attributes.get(ATTRIBUTE_LN);
        Attribute iso2Attribute = attributes.get(ATTRIBUTE_ISO2);

        String mashapeAPIKey = getParameterAsString(MASHAPE_API_KEY);
        String defaultISO2 = getParameterAsString(DEFAULT_COUNTRY_ISO2);
        double threshold = getParameterAsDouble(ATTRIBUTE_THRESHOLD);

        Attribute genderScaleAttribute = AttributeFactory.createAttribute(
                ATTRIBUTE_GENDERSCALE, Ontology.REAL);
        genderScaleAttribute.setTableIndex(fnAttribute.getTableIndex());
        attributes.addRegular(genderScaleAttribute);

        Attribute genderAttribute = AttributeFactory.createAttribute(
                ATTRIBUTE_GENDER, Ontology.STRING);
        genderAttribute.setTableIndex(fnAttribute.getTableIndex());
        attributes.addRegular(genderAttribute);

        for (Example example : exampleSet) {
            String firstName = example.getValueAsString(fnAttribute);
            String lastName = example.getValueAsString(lnAttribute);
            String iso2 = example.getValueAsString(iso2Attribute);
            if (iso2 != null && iso2.trim().length() == 2) {
                // real value
            } else if (defaultISO2 != null && defaultISO2.trim().length() == 2) {
                iso2 = defaultISO2.trim();
            } else {
                // invalid value, set to null
                iso2 = null;
            }

            double genderScale = 0d;
            if (MOCKUP) {
                genderScale = RND.nextDouble() * 2 - 1;
            } else {
                // API stuff goes here
            }
            String gender = "Unknown";
            if (genderScale > threshold) {
                gender = "Female";
            } else if (genderScale < -threshold) {
                gender = "Male";
            }
            example.setValue(genderScaleAttribute, genderScale);
            example.setValue(genderAttribute, gender);
        }
        outputSet.deliver(exampleSet);
    }

    Any idea?
    Thx,
    Elian
  • Marco_Boeck Team Lead Software Engineering Administrator, Moderator, Employee, Member, University Professor Posts: 1,928   RM Engineering
    Hi,

    the call

    genderScaleAttribute.setTableIndex(fnAttribute.getTableIndex());
    seems dangerous. Generally speaking, you can only append new attribute columns on the right. Does removing said line fix your problem?

    Regards,
    Marco
  • NamSor Member Posts: 6 Contributor I
    Hi Marco,

    Without this call, I get an ArrayIndexOutOfBoundsException. I took this method from the "How-to-Extend-RapidMiner-5" documentation. Is there an updated document?

    Thx in advance for your help,
    Elian

    SEVERE: java.lang.ArrayIndexOutOfBoundsException: -1
    java.lang.ArrayIndexOutOfBoundsException: -1
            at com.rapidminer.example.table.DoubleArrayDataRow.set(DoubleArrayDataRow.java:61)
            at com.rapidminer.example.table.AbstractAttribute.setValue(AbstractAttribute.java:184)
            at com.rapidminer.example.table.DataRow.set(DataRow.java:85)
            at com.rapidminer.example.Example.setValue(Example.java:140)
            at com.namsor.api.rapidminer.Name2GenderOperator.doWork(Name2GenderOperator.java:160)
            at com.rapidminer.operator.Operator.execute(Operator.java:866)
            at com.rapidminer.operator.execution.SimpleUnitExecutor.execute(SimpleUnitExecutor.java:51)
            at com.rapidminer.operator.ExecutionUnit.execute(ExecutionUnit.java:711)
            at com.rapidminer.operator.OperatorChain.doWork(OperatorChain.java:375)
            at com.rapidminer.operator.Operator.execute(Operator.java:866)
  • Marco_Boeck Team Lead Software Engineering Administrator, Moderator, Employee, Member, University Professor Posts: 1,928   RM Engineering
    Hi,

    the document will be updated, but I cannot give a date yet.
    Please use these calls to add new attributes to an existing ExampleSet:

    exampleSet.getExampleTable().addAttribute(newAttribute);
    exampleSet.getAttributes().addRegular(newAttribute);

    Regards,
    Marco
  • NamSor Member Posts: 6 Contributor I
    Thanks a lot Marco, that worked! E.
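
Note on the likely cause: the mixed-up output in the question is consistent with both new attributes being pointed at the firstName column via setTableIndex. The sketch below is a deliberately simplified model, not RapidMiner's actual classes. It assumes that each data row is a double array and that a nominal value is stored as an index into its attribute's own value list. The class SharedColumnDemo, its helper names, and the genderScale numbers are all hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedColumnDemo {

    /** Simplified stand-in for an attribute (assumption: not the RapidMiner class). */
    static final class Attr {
        int column = -1;                               // -1 = not attached to any table column
        final List<String> mapping = new ArrayList<>(); // nominal values, in order of first use

        // Return the index of a nominal value, registering it on first use.
        int indexOf(String value) {
            int i = mapping.indexOf(value);
            if (i < 0) {
                mapping.add(value);
                i = mapping.size() - 1;
            }
            return i;
        }
    }

    static void setNominal(double[] row, Attr a, String value) {
        row[a.column] = a.indexOf(value);              // only the index is stored in the row
    }

    static void setNumeric(double[] row, Attr a, double value) {
        row[a.column] = value;
    }

    static String getNominal(double[] row, Attr a) {
        return a.mapping.get((int) row[a.column]);
    }

    /** Builds a tiny table and renders it; shareColumn=true mimics the setTableIndex call. */
    static String run(boolean shareColumn) {
        String[] names = {"Blas", "A.+Craig", "Abdel", "Abderrahman"};
        String[] genders = {"Male", "Female", "Unknown", "Male"};
        double[] scales = {-0.75, 0.5, 0.0625, -0.25};  // made-up values

        Attr firstName = new Attr();
        firstName.column = 0;
        Attr genderScale = new Attr();
        Attr gender = new Attr();
        // Shared: both new attributes reuse column 0. Separate: each gets a fresh column on the right.
        genderScale.column = shareColumn ? firstName.column : 1;
        gender.column = shareColumn ? firstName.column : 2;

        double[][] table = new double[names.length][shareColumn ? 1 : 3];
        for (int i = 0; i < names.length; i++) {
            setNominal(table[i], firstName, names[i]);
        }
        for (int i = 0; i < names.length; i++) {
            setNumeric(table[i], genderScale, scales[i]);
            setNominal(table[i], gender, genders[i]);   // in shared mode this overwrites column 0
        }

        StringBuilder out = new StringBuilder();
        for (double[] row : table) {
            out.append(getNominal(row, firstName)).append(';')
               .append(row[genderScale.column]).append(';')
               .append(getNominal(row, gender)).append('\n');
        }
        return out.toString();
    }

    /** Writing through an attribute that was never attached fails with index -1. */
    static boolean unattachedWriteFails() {
        try {
            setNumeric(new double[1], new Attr(), 1.0);
            return false;
        } catch (ArrayIndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.print("shared column:\n" + run(true));
        System.out.print("separate columns:\n" + run(false));
        System.out.println("unattached write fails: " + unattachedWriteFails());
    }
}
```

In this model, the shared-column run yields Blas;0.0;Male, A.+Craig;1.0;Female, Abdel;2.0;Unknown, Blas;0.0;Male. The last value written to the shared cell is the gender index, so firstName and genderScale both read that index back, which matches the pattern in the CSV output in the question. With separate columns each row keeps its own name and scale, and an attribute left at index -1 reproduces the ArrayIndexOutOfBoundsException: -1. This is why adding the column to the example table first, as in Marco's two calls, resolves both symptoms.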