Contributor

# Predict age and gender with multilabel classification for Twitter users


Hello,

I need to predict the age and gender of Twitter users from their tweets.

I have collected more than 300 known user profiles with their age and gender,

and I divided the file into 4 groups (female over 20, female under 20, male over 20, male under 20).

I have finished processing the text (tokenize, remove stopwords, stem, replace tokens).

Now, how can I do that in RapidMiner?

Moderator

Contributor


Thanks

Power User

## Re: Predict age and gender with multilabel classification for Twitter users

As I'm sure you have already learned in your studies, some algorithms can only be applied to binary labels and some only to regression (numbers), but did you know that many algorithms can handle multiple categories? For example, a kNN algorithm can predict all 4 categories in your label without much trouble.
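To make the kNN point concrete, here is a minimal Python sketch (not a RapidMiner process). The feature vectors are made up for illustration, and the function name `knn_predict` is hypothetical. It only shows that a majority vote among neighbours works the same way for 4 classes as for 2.

```python
from collections import Counter
import math

def knn_predict(train, query, k=3):
    """Return the majority label among the k training points nearest to query.

    train is a list of (feature_vector, label) pairs. Any number of distinct
    labels is allowed, which is why kNN is naturally multi-class.
    """
    # Sort training examples by Euclidean distance to the query point.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    # Majority vote over the labels of the k nearest neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy, made-up feature vectors (assumption: two numeric features per user).
train = [
    ((0.9, 0.1), "female_over20"), ((0.8, 0.2), "female_over20"),
    ((0.1, 0.9), "female_under20"), ((0.2, 0.8), "female_under20"),
    ((0.9, 0.9), "male_over20"), ((0.8, 0.8), "male_over20"),
    ((0.1, 0.1), "male_under20"), ((0.2, 0.2), "male_under20"),
]

print(knn_predict(train, (0.85, 0.15)))
```

With real tweets the feature vectors would be the word vectors produced by your text processing rather than two hand-picked numbers.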

Have a look at http://mod.rapidminer.com/#app and use it to understand a small selection of the algorithms available for your solution.

(As this is RapidMiner, there are a large number of different ways to solve your problem, but let's begin here, as it is a very simple way to get you started.)

Happy mining!

Contributor

## Re: Predict age and gender with multilabel classification for Twitter users

I think we first need to predict the first label (gender: male/female); after that we can predict age (over 20, under 20).

I tried browsing your link, but I don't know the process steps to do that.

Could anyone help me, please?

Moderator

## Re: Predict age and gender with multilabel classification for Twitter users

As @JEdward pointed out, there are several algorithms that can handle multi-label. My link shows how the process would work.

For your example, I would make labels of male_under20, male_over20, female_under20, and female_over20. This way the label is all in one attribute column and you can test the predictions and measure the performance of the classification. Assuming the model is good, then the testing (scoring) data set will spit out those labels with confidences.
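As a rough illustration of the single-label-column idea, here is a small Python sketch (not RapidMiner). The helper names `combined_label` and `accuracy` and the sample profiles are hypothetical, and treating an age of exactly 20 as "over20" is an assumption, since the thread does not say where 20 belongs.

```python
from collections import Counter

def combined_label(gender, age):
    """Merge gender and age (in years) into one of four class names."""
    # Assumption: an age of exactly 20 is counted as "over20".
    age_group = "under20" if age < 20 else "over20"
    return f"{gender.lower()}_{age_group}"

def accuracy(actual, predicted):
    """Fraction of examples whose predicted label equals the actual label."""
    hits = sum(a == p for a, p in zip(actual, predicted))
    return hits / len(actual)

# Made-up profiles: (gender, age).
profiles = [("Female", 17), ("Male", 34), ("Female", 25), ("Male", 19)]
labels = [combined_label(g, a) for g, a in profiles]

# Made-up predictions, as a scoring step might return them.
predicted = ["female_under20", "male_over20", "female_under20", "male_under20"]

# Count (actual, predicted) pairs: a minimal confusion matrix.
confusion = Counter(zip(labels, predicted))

print(labels)
print(accuracy(labels, predicted))
```

Because everything lives in one label column, a single accuracy figure and a single confusion matrix cover both gender and age errors.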

You can build a model that will first classify the gender via a Cross Validation, then pipe that information to another Cross Validation. You'd have to use a Set Role operator and a Select Attributes operator to remove the confidence attributes and change the label role to a regular attribute, but that seems very complicated.
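For comparison, the two-stage idea can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the data is made up, a toy nearest-centroid classifier stands in for whatever learner you would use, and the helper names (`fit_centroids`, `predict`, `with_gender`, `predict_both`) are hypothetical.

```python
import math

def fit_centroids(rows, labels):
    """Toy nearest-centroid 'model': the mean feature vector per label."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}

def predict(model, row):
    """Return the label whose centroid is closest to row."""
    return min(model, key=lambda lab: math.dist(model[lab], row))

# Made-up features; gender and age group are kept as separate label columns.
rows = [[0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7]]
genders = ["female", "female", "male", "male"]
ages = ["over20", "under20", "over20", "under20"]

# Stage 1: learn gender from the raw features.
gender_model = fit_centroids(rows, genders)

def with_gender(row):
    # Append the stage-1 prediction as a regular feature, similar in spirit
    # to turning the predicted label into a regular attribute.
    return row + [1.0 if predict(gender_model, row) == "female" else 0.0]

# Stage 2: learn the age group from the features plus the predicted gender.
age_model = fit_centroids([with_gender(r) for r in rows], ages)

def predict_both(row):
    """Chain both stages and return a combined label such as 'male_under20'."""
    gender = predict(gender_model, row)
    age = predict(age_model, with_gender(row))
    return f"{gender}_{age}"

print(predict_both([0.9, 0.1]))
```

Note that any mistake in stage 1 is passed on to stage 2, which is one more reason the single combined label is usually the simpler starting point.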

Contributor

## Re: Predict age and gender with multilabel classification for Twitter users

Thanks @Thomas_Ott,

I appreciate that, but how can I apply multi-label classification with the best accuracy and performance to more than 184,000 tweets or 3,000,000 tokens? Do you have a full example that explains how to handle MLC in RapidMiner?


Moderator

## Re: Predict age and gender with multilabel classification for Twitter users

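Yes, here is a process that uses 3 classes. It clusters the tweets with k-means (k = 3), sets the cluster attribute as the label, and trains a Deep Learning model inside a Cross Validation.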

<?xml version="1.0" encoding="UTF-8"?><process version="7.4.000">
<context>
<input/>
<output/>
<macros/>
</context>
<operator activated="true" class="process" compatibility="7.4.000" expanded="true" name="Process">
<parameter key="encoding" value="SYSTEM"/>
<process expanded="true">
<operator activated="true" class="social_media:search_twitter" compatibility="7.3.000" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="34">
<parameter key="connection" value="NewConnection"/>
<parameter key="query" value="machinelearning"/>
<parameter key="result_type" value="recent"/>
<parameter key="limit" value="1000"/>
<parameter key="language" value="en"/>
</operator>
<operator activated="true" class="select_attributes" compatibility="7.4.000" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="34">
<parameter key="attribute_filter_type" value="subset"/>
<parameter key="attributes" value="Retweet-Count|Text"/>
<parameter key="include_special_attributes" value="true"/>
</operator>
<operator activated="true" class="nominal_to_text" compatibility="7.4.000" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="34">
<parameter key="attribute_filter_type" value="single"/>
<parameter key="attribute" value="Text"/>
</operator>
<operator activated="true" class="text:process_document_from_data" compatibility="7.4.001" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="34">
<parameter key="prune_method" value="percentual"/>
<list key="specify_weights"/>
<process expanded="true">
<operator activated="true" class="text:transform_cases" compatibility="7.4.001" expanded="true" height="68" name="Transform Cases (2)" width="90" x="45" y="34"/>
<operator activated="true" class="text:filter_tokens_by_content" compatibility="7.4.001" expanded="true" height="68" name="Filter Tokens (by Content)" width="90" x="179" y="34">
<parameter key="string" value="http"/>
<parameter key="invert condition" value="true"/>
</operator>
<operator activated="true" class="text:tokenize" compatibility="7.4.001" expanded="true" height="68" name="Tokenize" width="90" x="313" y="34">
<parameter key="mode" value="specify characters"/>
<parameter key="characters" value=" .:;#!(){}[]/"/>
</operator>
<operator activated="true" class="text:filter_stopwords_english" compatibility="7.4.001" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="514" y="34"/>
<operator activated="true" class="text:filter_by_length" compatibility="7.4.001" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="648" y="34"/>
<connect from_port="document" to_op="Transform Cases (2)" to_port="document"/>
<connect from_op="Transform Cases (2)" from_port="document" to_op="Filter Tokens (by Content)" to_port="document"/>
<connect from_op="Filter Tokens (by Content)" from_port="document" to_op="Tokenize" to_port="document"/>
<connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
<connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
<connect from_op="Filter Tokens (by Length)" from_port="document" to_port="document 1"/>
<portSpacing port="source_document" spacing="0"/>
<portSpacing port="sink_document 1" spacing="0"/>
<portSpacing port="sink_document 2" spacing="0"/>
</process>
</operator>
<operator activated="true" class="k_means" compatibility="7.4.000" expanded="true" height="82" name="Clustering" width="90" x="581" y="34">
<parameter key="k" value="3"/>
<parameter key="measure_types" value="NumericalMeasures"/>
<parameter key="numerical_measure" value="CosineSimilarity"/>
</operator>
<operator activated="true" class="set_role" compatibility="7.4.000" expanded="true" height="82" name="Set Role" width="90" x="715" y="85">
<parameter key="attribute_name" value="cluster"/>
<parameter key="target_role" value="label"/>
<list key="set_additional_roles"/>
</operator>
<operator activated="true" class="concurrency:cross_validation" compatibility="7.4.000" expanded="true" height="145" name="Validation" width="90" x="849" y="85">
<parameter key="sampling_type" value="stratified sampling"/>
<process expanded="true">
<operator activated="true" class="h2o:deep_learning" compatibility="7.4.000" expanded="true" height="82" name="Deep Learning" width="90" x="203" y="34">
<enumeration key="hidden_layer_sizes">
<parameter key="hidden_layer_sizes" value="50"/>
<parameter key="hidden_layer_sizes" value="50"/>
</enumeration>
<enumeration key="hidden_dropout_ratios"/>
<list key="expert_parameters"/>
<list key="expert_parameters_"/>
</operator>
<connect from_port="training set" to_op="Deep Learning" to_port="training set"/>
<connect from_op="Deep Learning" from_port="model" to_port="model"/>
<portSpacing port="source_training set" spacing="0"/>
<portSpacing port="sink_model" spacing="0"/>
<portSpacing port="sink_through 1" spacing="0"/>
<description align="left" color="green" colored="true" height="80" resized="true" width="248" x="114" y="135">In the training phase, a model is built on the current training data set. (90 % of data by default, 10 times)</description>
</process>
<process expanded="true">
<operator activated="true" class="apply_model" compatibility="7.4.000" expanded="true" height="82" name="Apply Model" width="90" x="45" y="34">
<list key="application_parameters"/>
</operator>
<operator activated="true" class="performance" compatibility="7.4.000" expanded="true" height="82" name="Performance" width="90" x="179" y="34"/>
<connect from_port="model" to_op="Apply Model" to_port="model"/>
<connect from_port="test set" to_op="Apply Model" to_port="unlabelled data"/>
<connect from_op="Apply Model" from_port="labelled data" to_op="Performance" to_port="labelled data"/>
<connect from_op="Performance" from_port="performance" to_port="performance 1"/>
<connect from_op="Performance" from_port="example set" to_port="test set results"/>
<portSpacing port="source_model" spacing="0"/>
<portSpacing port="source_test set" spacing="0"/>
<portSpacing port="source_through 1" spacing="0"/>
<portSpacing port="sink_test set results" spacing="0"/>
<portSpacing port="sink_performance 1" spacing="0"/>
<portSpacing port="sink_performance 2" spacing="0"/>
<description align="left" color="blue" colored="true" height="103" resized="true" width="315" x="38" y="137">The model created in the Training step is applied to the current test set (10 %).&lt;br/&gt;The performance is evaluated and sent to the operator results.</description>
</process>
</operator>
<connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
<connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
<connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
<connect from_op="Process Documents from Data" from_port="example set" to_op="Clustering" to_port="example set"/>
<connect from_op="Clustering" from_port="cluster model" to_port="result 1"/>
<connect from_op="Clustering" from_port="clustered set" to_op="Set Role" to_port="example set input"/>
<connect from_op="Set Role" from_port="example set output" to_op="Validation" to_port="example set"/>
<connect from_op="Validation" from_port="performance 1" to_port="result 2"/>
<portSpacing port="source_input 1" spacing="0"/>
<portSpacing port="sink_result 1" spacing="0"/>
<portSpacing port="sink_result 2" spacing="0"/>
<portSpacing port="sink_result 3" spacing="0"/>
</process>
</operator>
</process>
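
The Validation operator in the process above is set to stratified sampling, and its description mentions training on 90 % of the data, 10 times. As a rough Python sketch of what that kind of split does (this is not RapidMiner's implementation; the function names, the fixed seed, and the toy labels are assumptions for illustration):

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=42):
    """Assign each example index to one of k folds, keeping class ratios similar."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def cross_validate(labels, train_and_score, k=10):
    """Average train_and_score(train_idx, test_idx) over k stratified folds."""
    folds = stratified_folds(labels, k)
    scores = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / len(scores)

# Toy labels with an uneven class distribution (made up for illustration).
labels = ["cluster_0"] * 30 + ["cluster_1"] * 20 + ["cluster_2"] * 10
folds = stratified_folds(labels, k=10)
print([len(fold) for fold in folds])
```

In a real run, `train_and_score` would train a model on `train_idx` and return its accuracy on `test_idx`; here it is left as a callback so the sketch stays independent of any particular learner.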