
Set Class Label to The Dataset

Fatin_Fezarudin Member Posts: 2 Contributor I
edited December 2018 in Help

Hi All,

I have a question about my dataset. I have a lot of text, and I only want to extract the relevant words and use them as the class label.
Is this related to text mining? How should I set the class label?
Here is an example of the text in my dataset:
(The words in red are the class labels that I want to set.)
Plan, lead, organize production schedule.Conduct necessary checking of all raw materials, packaging materials and supervise production process to ensure quality assurance. Handling production documentation filing and monitoring company safety and quality programs in accordance with standard of HACCP, ISO, JAKIM Halal and etc.Responsible for inventory management to ensure supply always available.Implementing safe work environment, maintain good housekeeping and ensure compliance with safety standard.Assist in production planning by coordinate production process improvement, raw materials, packaging, storage and manpower to minimize production downline and wastage.Maintain great communication at all level in the organization.
Thank you for your help.


    rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    Hello @Fatin_Fezarudin,


    Let me see if I get it:


    - There is text.

    - There should be a text classification based on certain words.




    It all depends on how you want to determine what words are important, and there are at least three ways (that I know of) to determine such a thing:


    1.- Having a collection of words.




    - Have a list of words somewhere.

    - Use Loop Examples to walk through that list of words, and inside this loop:

    ----> Filter Examples and use a "contains" filter.

    ----> Add your word as an attribute.

    ----> Join and save (you can use a Remember/Retrieve operator so you can handle what is saved and how)

    - Retrieve the final results. Add Set Role to create your labels.


    Apart from the Remember/Retrieve part, this is the easiest thing you can do, but it requires that you already know which words are important.
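For readers who want to see the logic of option 1 outside RapidMiner, here is a minimal Python sketch. It is not a RapidMiner process: the function name `label_by_keywords`, the sample texts (taken from the question) and the keyword list are assumptions made for illustration. It applies a "contains" check per keyword and uses the first keyword found as the label.

```python
def label_by_keywords(texts, keywords):
    """Pair each text with the first keyword it contains (None if no keyword matches)."""
    labeled = []
    for text in texts:
        lowered = text.lower()
        # Equivalent of a "contains" filter, tried keyword by keyword.
        label = next((kw for kw in keywords if kw.lower() in lowered), None)
        labeled.append((text, label))
    return labeled


# Sample sentences from the question; the keyword list is hypothetical.
texts = [
    "Plan, lead, organize production schedule.",
    "Responsible for inventory management to ensure supply always available.",
    "Maintain great communication at all level in the organization.",
]
keywords = ["production", "inventory", "safety"]

for text, label in label_by_keywords(texts, keywords):
    print(label, "|", text)
```

Texts that match none of the keywords get no label here; whether to drop them or give them a default class is a decision left to you.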


    2.- Creating a collection of words.


    Our great community manager @sgenzer posted this solution a while ago. I'm using Bold to indicate the names of the operators you should use on each step. Unfortunately I'm abroad and don


    - (This is a suggestion) Filter Stopwords before doing the rest. Stopwords are words that connect other words but don't add meaning by themselves.

    - Take your text and use the Split operator to create a ton of attributes.

    - Transpose this mess so that your text is listed word by word in one attribute and a ton of examples.

    - Use the Join operator with your keyword database list to see overlap.

    - Aggregate to see word frequencies.

    - (This is my addition) Filter Examples to get the most important words, Select Attributes to get a good grasp of your data, and then Label by the word list, and you will have many classes for each doc.


    Now, since you have many classes here, I wouldn't save the result of the Join in a dataset, because that will result in a huge file.


    This is not difficult either, but since you don't have control over which words appear, expect to add and remove breakpoints a lot to get an idea of how things go.
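The steps of option 2 can be sketched in plain Python as well. This is only an illustration under stated assumptions, not the RapidMiner process described above: `keyword_frequencies`, the tiny `STOPWORDS` set and the keyword set are made up for the example. Splitting the text into a word list plays the role of Split plus Transpose, the membership check plays the role of the Join, and `Counter` does the aggregation.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"and", "of", "to", "the", "in", "with", "for", "all", "by", "at"}


def keyword_frequencies(text, keywords):
    """Count how often each known keyword occurs in the text."""
    # Split + Transpose: one lowercase word per "row".
    words = re.findall(r"[a-z]+", text.lower())
    # Filter Stopwords before doing the rest.
    words = [w for w in words if w not in STOPWORDS]
    # Join with the keyword list: keep only words that overlap with it.
    matched = [w for w in words if w in keywords]
    # Aggregate: word frequencies.
    return Counter(matched)


text = (
    "Assist in production planning by coordinate production process "
    "improvement, raw materials, packaging, storage and manpower."
)
keywords = {"production", "packaging", "safety"}  # hypothetical keyword database

freq = keyword_frequencies(text, keywords)
print(freq.most_common())
```

Each keyword with a non-zero count is a candidate class for the document, which is why one document can end up with several classes.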


    3.- Analyze the text with text mining operators.


    The usual process is:


    - Use the Process Documents From Files or one of the appropriate text mining tools to:

    ----> create TF-IDF vectors,

    ----> Tokenize,

    ----> Lowercase,

    ----> Filter Stopwords,

    ----> Generate N-Grams if you need associations of words.

    ----> Or Filter Tokens by POS to get only verbs, nouns, adjectives...

    ----> Or Filter Tokens by one of the others.

    ----> Or Lemmatize to create some meaning.

    - Once you get your results, you can apply some kind of segmentation operator (it's up to you, I'm running out of knowledge here) to define which words are important.

    - Once you get that segmentation, you can do some magic to associate these important words to the original texts.
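As a rough illustration of option 3, the sketch below tokenizes and lowercases a few documents, builds TF-IDF vectors, and picks the highest-weighted term of each document as a candidate label. Everything here is an assumption for illustration: the function names, the sample documents, and the weighting itself, which is a plain textbook TF-IDF (term frequency times log of inverse document frequency) and may differ from what RapidMiner's text operators compute. Picking the top term is also just one simple stand-in for the "segmentation" step.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def top_term(vector):
    """Return the term with the highest TF-IDF weight (vector must not be empty)."""
    return max(vector, key=vector.get)


# Hypothetical mini-corpus loosely based on the text in the question.
docs = [
    "supervise production process and production schedule",
    "responsible for inventory management and inventory supply",
    "maintain safety standard and safety compliance",
]

vectors = tfidf(docs)
labels = [top_term(v) for v in vectors]
print(labels)
```

Words that occur in every document (such as "and" above) get a weight of zero, so they never become labels even without an explicit stopword filter.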


    That said, I consider text mining and natural language processing to be a complete area within Machine Learning. There is so much to know about how languages work, how sentences are built and all that... But as a first step, this should serve as your initial guide.


    All the best,


    Mary61 Member Posts: 2 Contributor I
    @rfuentealba Hi, I have the same problem, and thank you for your answer. I also have text, and I used "Process Documents" to split each text into words. I have 100 texts that I need to classify. Could you please let me know how I can choose one word for each row as the class, so that I can use a clustering operator?
    Thanks in advance for your reply.