normalize informal text

jabrajabra Member Posts: 20 Contributor I
edited December 2018 in Help

Hi everyone,
I want to normalize my text, but I do not know how.
Does the Normalize operator do this?
I want to identify words that users have written in shortened form and write them out in full. For example:
img -> image
Or replace abbreviations with their full form, like:
DM -> dimension reduction
Or correct misspelled words, like:
gud -> good
Please advise.
Thanks in advance.

Answers

  • kypexinkypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @jabra

     

    You can use 'Replace Tokens' operator from text processing extension for performing exactly this kind of replacement.

    Note, however, that this operator requires you to create the replacement dictionary manually. It cannot automatically detect that 'img' stands for 'image', 'gud' for 'good', and so on.

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    To clarify further, the regular Normalize operator transforms numerical attributes onto a consistent scale (range) for comparison; it is not related to text data at all. What you need for your text data is not "normalization" from a statistical perspective. You need text substitutions, which can be accomplished in RapidMiner, as @kypexin said, with the Replace Tokens operator. You can also use the Stopwords (Dictionary) operator if you want to remove certain words/tokens entirely. Or you can use either the Map or Replace operator if the text you want to modify sits inside an ordinary polynominal attribute in an example set. All of these, however, require you to build your own list of the specific substitutions or eliminations you want to make.

     

    Brian T.
    Lindon Ventures 
    Data Science Consulting from Certified RapidMiner Experts
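To make the distinction concrete, here is a minimal sketch in Python (not RapidMiner) of what numeric normalization means: rescaling numbers onto a common range. The function name `range_normalize` and the sample values are hypothetical and chosen only for illustration; this is one simple form of scaling, not a description of the Normalize operator's internals.

```python
def range_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale numbers so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    span = new_max - new_min
    return [new_min + (v - lo) / (hi - lo) * span for v in values]


ages = [20, 30, 40, 60]  # made-up numeric attribute
print(range_normalize(ages))
```

Nothing in this operation looks at words or spelling, which is why it does not help with text such as 'img' or 'gud'.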
  • jabrajabra Member Posts: 20 Contributor I

    Hello,
    Thank you.
    What do you suggest?
    How do I create a dictionary and use it?
    Please help.
    Thanks

     

  • kypexinkypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @jabra

     

    If you use the 'Replace Tokens' operator, the dictionary has to be created manually.

    First, go to the operator's parameters tab:

     

    Screenshot 2018-04-17 09.31.37.png

     

    After pressing the 'Edit List' button, you can add dictionary entries one by one: the token to replace goes on the left, and the token to replace it with goes on the right:

     

    Screenshot 2018-04-17 09.31.45.png

  • jabrajabra Member Posts: 20 Contributor I

    Hello
    Can you send me a small sample dictionary?
    I want to know what format it should have and what kind of dictionary I should create.
    Many thanks

  • kypexinkypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @jabra

     

    In this case the dictionary lives within the operator's parameter settings; it is not a standalone file or table.

    What I meant is that in the 'Replace Tokens' parameters you have to list each pair of entries manually. With the examples from your first post it would look like the screenshot below, so each time the operator sees 'img' in the input document, it replaces it with 'image', and so on. Note that there is no 'magic' here: all the pairs have to be 'hardcoded', in a sense.

     

    Screenshot 2018-04-18 10.23.53.png
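As a rough illustration outside RapidMiner, the same hardcoded pair list can be sketched in Python using the examples from the first post. The name `REPLACEMENTS`, the case-insensitive lookup, and the letters-only token pattern are assumptions made for this sketch; they do not describe how Replace Tokens works internally.

```python
import re

# Hand-built dictionary: token to replace -> replacement.
# Every pair must be listed explicitly; nothing is inferred.
REPLACEMENTS = {
    "img": "image",
    "dm": "dimension reduction",
    "gud": "good",
}


def replace_tokens(text, replacements=REPLACEMENTS):
    """Replace whole alphabetic tokens that appear in the dictionary."""
    def swap(match):
        token = match.group(0)
        # Look the token up in lowercase; unknown tokens are left untouched.
        return replacements.get(token.lower(), token)

    return re.sub(r"[A-Za-z]+", swap, text)


print(replace_tokens("gud img, uses DM"))
```

Because only whole tokens are matched, a word such as 'imgs' is left alone unless it gets its own entry, which mirrors the point that every pair has to be listed by hand.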

  • jabrajabra Member Posts: 20 Contributor I

    Thank you very much.
    How should I correct the spelling of words? For example, how can I detect that a user misspelled the word 'image' as 'Imege'?

    Or how can I figure out which word 'gud' (or 'GUD') was meant to be, that is, 'good'?
    Thanks for helping with this too.
    Can I use WordNet? How? My data is Twitter data.

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    The Replace Tokens operator can be used to correct misspellings or otherwise substitute one token for another.

  • kypexinkypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    I have never used WordNet myself, so I can't really help much with that.

    The 'Replace Tokens' operator does not offer any way to "guess" which word a given misspelling stands for; you still have to handle all the token pairs manually.

  • jabrajabra Member Posts: 20 Contributor I

    Hello
    Thank you very much for the time and guidance
    The problem is that the words are not known in advance.
    I do not know exactly which words users have written,
    so I need a way to predict the intended words and correct them.
    What do you suggest?

  • kypexinkypexin Moderator, RapidMiner Certified Analyst, Member Posts: 291 Unicorn

    Hi @jabra

     

    If you want to build an automated spell-checker that 'guesses' the intended word from a wrong spelling, then I am afraid there is no easy solution at all. I can think of different approaches, but I guess all of them are far beyond the scope of your original task. I have googled a few things, though...

     

    For example, there are examples of using neural networks / deep learning for spell checking:

     

    Or, there are a bit simpler tips for handling spelling errors in RapidMiner: 

     

    In any case, automated spelling correction is never going to be an easy task compared to compiling a dictionary manually. Unfortunately, no RapidMiner operator is capable of this kind of 'magic'.
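One of the simpler approaches hinted at above is fuzzy matching against a list of correctly spelled words. The sketch below uses Python's standard `difflib` module rather than any RapidMiner operator; the `VOCABULARY` list and the `suggest` helper are hypothetical, and a real corpus would need a much larger word list.

```python
from difflib import get_close_matches

# Tiny stand-in vocabulary; an assumption for this sketch only.
VOCABULARY = ["good", "image", "great", "message"]


def suggest(token, vocabulary=VOCABULARY, cutoff=0.6):
    """Return the closest vocabulary word, or None if nothing is similar enough."""
    matches = get_close_matches(token.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


print(suggest("Imege"))
print(suggest("gud"))
print(suggest("gud", cutoff=0.5))
```

With this vocabulary, 'Imege' resolves to 'image', but 'gud' finds no match at the default cutoff of 0.6 and only reaches 'good' once the cutoff is lowered to 0.5. A lower cutoff also lets in more wrong guesses, which illustrates why automated correction of informal spellings is hard to get right.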

     

  • Telcontar120Telcontar120 Moderator, RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,635 Unicorn

    As @kypexin said, there really isn't an automated solution here. However, depending on the size of your document corpus, I don't think it is too cumbersome to create a wordlist from the original text, note any tokens you want to map or replace, and then enter them manually.
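The wordlist idea can be sketched outside RapidMiner as follows, assuming a small set of documents and a hand-supplied set of known words. The names `build_wordlist` and `unknown_tokens` and the sample tweets are hypothetical; the output is only a candidate list for manual review, not an automatic correction.

```python
import re
from collections import Counter


def build_wordlist(documents):
    """Count how often each lowercase token occurs across all documents."""
    counts = Counter()
    for doc in documents:
        counts.update(re.findall(r"[a-z']+", doc.lower()))
    return counts


def unknown_tokens(counts, known_words):
    """List tokens missing from known_words, most frequent first, for manual review."""
    return [(tok, n) for tok, n in counts.most_common() if tok not in known_words]


tweets = ["gud img today", "this img is gud", "good image"]  # made-up sample
known = {"good", "image", "today", "this", "is"}
print(unknown_tokens(build_wordlist(tweets), known))
```

Sorting by frequency puts the most common unknown tokens first, so the replacement pairs that matter most can be entered before the long tail.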
