How to correct the wrong words?

jozeftomas_2020 · June 2018

Hello,
How can I correct the spelling of words in RapidMiner?
For example:

meeseg -> message
or
veeeery gooood -> very good

Does anyone know?

Telcontar120 · June 2018

I think I answered this same question in another thread. If you generate a wordlist first and compile a list of the substitutions you want to make, then you can use the "Replace Tokens" operator. If you are looking for an automated way to do this (i.e., RapidMiner identifies misspellings and replaces them automatically), there isn't a built-in solution for that. There might be some third-party software you could access via an API, though.
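To illustrate the idea of a hand-made substitution list outside RapidMiner, here is a minimal Python sketch. It is not the "Replace Tokens" operator itself; the `SUBSTITUTIONS` dictionary and the `replace_tokens` function are hypothetical names made up for this example.

```python
import re

# Hypothetical substitution list: misspelling -> intended word.
SUBSTITUTIONS = {
    "meeseg": "message",
    "veeeery": "very",
    "gooood": "good",
}


def replace_tokens(text, substitutions=SUBSTITUTIONS):
    """Replace every alphabetic token that appears in the substitution list."""
    def swap(match):
        token = match.group(0)
        # Look the token up in lower case; keep it unchanged if it is not listed.
        return substitutions.get(token.lower(), token)

    return re.sub(r"[A-Za-z]+", swap, text)


print(replace_tokens("veeeery gooood meeseg"))
```

This only fixes misspellings you have already listed, which is why it needs the wordlist step first.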

lionelderkrikor · June 2018

Hi @jozeftomas_2020,

As @Telcontar120 said, there isn't a built-in solution for what you want to do.

So I propose using a Python script with the textblob library.

However, when the words are too badly misspelled (like the examples you gave), the script is not able to correct them properly.
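One possible workaround for stretched words such as "veeeery gooood" is to shorten long runs of the same letter before spell checking. This is only a sketch under that assumption and is not part of the process shared here; `collapse_repeats` and its `max_run` parameter are hypothetical names.

```python
import re


def collapse_repeats(text, max_run=2):
    """Shorten any run of the same letter to at most max_run characters."""
    # A letter followed by max_run or more copies of itself is a run that is too long.
    pattern = re.compile(r"([A-Za-z])\1{%d,}" % max_run)
    return pattern.sub(lambda m: m.group(1) * max_run, text)


print(collapse_repeats("veeeery gooood"))
```

Runs are cut to two letters rather than one so that normal double letters ("good", "message") survive. The result may still be misspelled, but it is much closer to a real word, which should give a spelling corrector a better chance.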

FYI, spelling correction is based on Peter Norvig’s “How to Write a Spelling Corrector” as implemented in the pattern library. It is about 70% accurate.
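For readers who want to see the idea behind that approach, here is a minimal Norvig-style sketch in pure Python. It is not the textblob or pattern code. The word counts in `WORD_COUNTS` are invented for illustration (a real corrector learns them from a large corpus), and all function names are my own.

```python
from collections import Counter

# Toy word counts, purely illustrative.
WORD_COUNTS = Counter(
    {"message": 50, "very": 120, "good": 100, "hello": 30, "held": 60, "meet": 40}
)
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """All strings that are one delete, transpose, replace or insert away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """All strings that are two edits away."""
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


def known(words):
    """Keep only the words that appear in the dictionary."""
    return {w for w in words if w in WORD_COUNTS}


def correct(word):
    """Pick the most frequent known word within at most two edits."""
    candidates = (
        known([word]) or known(edits1(word)) or known(edits2(word)) or {word}
    )
    return max(candidates, key=lambda w: WORD_COUNTS[w])


for w in ["gooood", "veeeery", "meeseg"]:
    print(w, "->", correct(w))
```

Because only candidates within two edits are considered, "gooood" can reach "good", but "veeeery" and "meeseg" are three edits away from "very" and "message" and stay unchanged in this sketch.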

The process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.1.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
        <parameter key="generator_type" value="comma_separated_text"/>
        <list key="function_descriptions"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="input_csv_text" value="Id,text&#10;1,meeseg&#10;2,veeeery gooood"/>
      </operator>
      <operator activated="true" class="set_macros" compatibility="8.2.000" expanded="true" height="82" name="Set Text Atribute" width="90" x="246" y="34">
        <list key="macros">
          <parameter key="textAttribute" value="'text'"/>
        </list>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="34">
        <parameter key="script" value="import pandas&#10;from textblob import TextBlob&#10;&#10;# Name of the text column, filled in by the Set Macros operator&#10;Text_Attribute = %{textAttribute}&#10;&#10;&#10;def spellingCorrection(text):&#10;  # Convert the corrected TextBlob back to a plain string&#10;  b = TextBlob(text)&#10;  return str(b.correct())&#10;&#10;&#10;def rm_main(data):&#10;  # Add a new column holding the corrected text&#10;  data['corrected_text'] = data[Text_Attribute].apply(spellingCorrection)&#10;  return data"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Set Text Atribute" to_port="through 1"/>
      <connect from_op="Set Text Atribute" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

To execute this process, you have to:

- Install Python on your computer

- Install the textblob library

- Install the Execute Python operator from the marketplace

- Set the name of your text attribute in the Set Macros operator

I hope it helps,

Regards,

Lionel

jozeftomas_2020 · June 2018

Hello,
Thank you.

I did not understand the last part you mentioned, the macro setup.
What exactly should I do?

I would like to use this R code:

https://cran.r-project.org/web/packages/hunspell/vignettes/intro.html
But I do not know how to run it in RapidMiner on my data.
Could you help me?

I installed Anaconda, but I do not know how to install textblob and use it in RapidMiner.
Can someone help?
Thank you

lionelderkrikor · June 2018

Hi @jozeftomas_2020,

1. To set up the Set Macros operator:

Have you tried importing the process I shared? In the parameters of this operator, you have to enter (in the "value" column) the name of the attribute that contains the misspelled words.

2. To install textblob:

a. Press Win + R to open the Run window

b. Type "cmd" and then click OK

c. Type "pip install textblob" and press Enter

textblob will be automatically installed on your computer.
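To check that the installation worked, a small Python snippet like the following can help. It is only a sketch; `is_installed` is a hypothetical helper name. Run it with the same Python interpreter that RapidMiner is configured to use, since a package installed for one interpreter is not visible to another.

```python
import importlib.util
import sys


def is_installed(package_name):
    """Return True if this Python interpreter can import the package."""
    return importlib.util.find_spec(package_name) is not None


# Show which interpreter is running and whether it can see textblob.
print("Python used:", sys.executable)
print("textblob installed:", is_installed("textblob"))
```

If the second line prints False, textblob was probably installed for a different Python installation than the one shown on the first line.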

Regards,

Lionel

jozeftomas_2020 · July 2018

Hi dear friend,
I did all the steps.
I want to correct spelling mistakes in my data, which has a text column.

I loaded the data, then chose my text column with the 'Select Attributes' operator, and then connected it to the 'Execute Python' operator.

The name of the column I want to correct is 'text'.

But when I run the process, I get an error.

I do not know how to solve it.
Can you help me once more?

Thanks a lot

lionelderkrikor · July 2018

Hi

Have you set the name of your text attribute (text) in the Set Macros operator with quotes? (value = 'text')
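The quotes matter because the macro value is pasted into the script as plain text before Python runs it. The sketch below imitates that substitution with a simple string replace. It is an approximation of what happens, not RapidMiner's actual code, and `expand_macro` and `run_line` are hypothetical helpers.

```python
SCRIPT_LINE = "Text_Attribute = %{textAttribute}"


def expand_macro(line, value):
    """Imitate the plain text substitution of %{textAttribute}."""
    return line.replace("%{textAttribute}", value)


def run_line(line):
    """Execute one line of Python and return the resulting Text_Attribute."""
    namespace = {}
    exec(line, namespace)
    return namespace["Text_Attribute"]


# With quotes the line becomes: Text_Attribute = 'text'
print(run_line(expand_macro(SCRIPT_LINE, "'text'")))

# Without quotes the line becomes: Text_Attribute = text
try:
    run_line(expand_macro(SCRIPT_LINE, "text"))
except NameError as err:
    print("NameError:", err)
```

Without the quotes, Python looks for a variable called text, which does not exist, so the script fails before any correction is attempted.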

Regards,

Lionel

jozeftomas_2020 · July 2018

Hello,
Yes, I did that.
But I still get an error.

Could you help me? May I send a screenshot or a sample process?
Thanks a lot
With respect

student_compute · July 2018

Hi, I followed the same steps to install textblob (Win + R, then "cmd", then "pip install textblob"), but I get an error.

What should I do?

lionelderkrikor · July 2018

Jozeftomas, can you share your process and your dataset? Tomorrow, I will try to find and fix the bug you mentioned.

Regards.

Lionel

lionelderkrikor · July 2018

Hi Student_compute,

The 'pip' command is installed with Python.
So first install Python (Python 3.x) via Anaconda.

Regards,

Lionel.

student_compute · July 2018

Hello
But I installed Python first.
What should I do now?
Thank you, my friend

jozeftomas_2020 · July 2018

Hello, thank you very much for your response and kindness.
I got my data from Twitter, as in the screenshot above.
I have a Search Twitter operator before Nominal to Text.
Can you tell what the problem is?
And how can I run the preprocessing code from this article on my tweets in RapidMiner?
https://www.kdnuggets.com/2018/03/text-data-preprocessing-walkthrough-python.html
Thank you if you can help me get started.
With respect

lionelderkrikor · July 2018

Hi @jozeftomas_2020,

It will be very hard for us to understand your bug without your process. Can you share it?

And what do you ultimately want to do? Correct the misspelled tweets?

Regards,

Lionel

lionelderkrikor · July 2018

Hi @student_compute,

If Python is indeed installed, 'pip' must be installed too. So I see only one solution:

You have to update your "environment variables":

1/

- Search for the pip.exe file on your computer. By default it is located in C:\Users\username\Anacondax\Scripts or C:\Users\username\Pythonx\Scripts (where x = 2 or 3 according to the version of Python you installed).

or

- Type 'pip.exe' (with quotes) in the Windows 10 search bar (bottom-left), then right-click on the result and select "Open file location".

2/ Then (here on Windows 10):

- open an Explorer window, right-click on "This PC", then click on Properties,

- then open the advanced system settings and click on "Environment Variables",

- then edit the "Path" variable and add the folder that contains pip.exe.
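If Python itself starts, a short script can tell you whether 'pip' is visible on your PATH. This is a minimal sketch; `find_command` is a hypothetical helper name. The fallback it prints, running pip as a module of the current interpreter, is a standard way to call pip without relying on PATH.

```python
import shutil
import sys


def find_command(name):
    """Return the full path of a command found on PATH, or None if it is missing."""
    return shutil.which(name)


pip_path = find_command("pip")
if pip_path is None:
    # pip is not on PATH: call it through the interpreter instead.
    print("pip is not on PATH; try:", sys.executable, "-m pip install textblob")
else:
    print("pip found at:", pip_path)
```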

I hope it helps,

Regards,

Lionel

jozeftomas_2020 · July 2018

Hello,
This is my process.
I want to correct the spelling mistakes in the tweets, and then do k-means clustering. But I'm new to Python,
and I do not know how to write the Python code in RapidMiner to achieve this goal.
Please help me if possible, dear friend.
With respect
I will be grateful. I'm waiting for your help.

lionelderkrikor · July 2018

Hi @jozeftomas_2020,

Here is a working process to correct misspelled tweets:

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="social_media:search_twitter" compatibility="8.0.010" expanded="true" height="68" name="Search Twitter" width="90" x="45" y="136">
        <parameter key="connection" value="dkk"/>
        <parameter key="query" value="iphone"/>
        <parameter key="limit" value="10"/>
        <parameter key="language" value="en"/>
      </operator>
      <operator activated="true" class="select_attributes" compatibility="8.2.001" expanded="true" height="82" name="Select Attributes" width="90" x="179" y="136">
        <parameter key="attribute_filter_type" value="subset"/>
        <parameter key="attributes" value="Text"/>
      </operator>
      <operator activated="true" class="nominal_to_text" compatibility="8.2.001" expanded="true" height="82" name="Nominal to Text" width="90" x="313" y="136"/>
      <operator activated="true" class="text:process_document_from_data" compatibility="8.1.000" expanded="true" height="82" name="Process Documents from Data" width="90" x="447" y="136">
        <parameter key="keep_text" value="true"/>
        <parameter key="prune_method" value="percentual"/>
        <parameter key="prune_below_percent" value="2.0"/>
        <parameter key="prune_above_percent" value="70.0"/>
        <list key="specify_weights"/>
        <process expanded="true">
          <operator activated="true" class="text:transform_cases" compatibility="8.1.000" expanded="true" height="68" name="Transform Cases" width="90" x="45" y="34"/>
          <operator activated="true" class="text:tokenize" compatibility="8.1.000" expanded="true" height="68" name="Tokenize" width="90" x="179" y="34"/>
          <operator activated="true" class="text:filter_stopwords_english" compatibility="8.1.000" expanded="true" height="68" name="Filter Stopwords (English)" width="90" x="313" y="34"/>
          <operator activated="true" class="text:filter_by_length" compatibility="8.1.000" expanded="true" height="68" name="Filter Tokens (by Length)" width="90" x="447" y="34"/>
          <operator activated="true" class="text:stem_porter" compatibility="8.1.000" expanded="true" height="68" name="Stem (Porter)" width="90" x="581" y="34"/>
          <connect from_port="document" to_op="Transform Cases" to_port="document"/>
          <connect from_op="Transform Cases" from_port="document" to_op="Tokenize" to_port="document"/>
          <connect from_op="Tokenize" from_port="document" to_op="Filter Stopwords (English)" to_port="document"/>
          <connect from_op="Filter Stopwords (English)" from_port="document" to_op="Filter Tokens (by Length)" to_port="document"/>
          <connect from_op="Filter Tokens (by Length)" from_port="document" to_op="Stem (Porter)" to_port="document"/>
          <connect from_op="Stem (Porter)" from_port="document" to_port="document 1"/>
          <portSpacing port="source_document" spacing="0"/>
          <portSpacing port="sink_document 1" spacing="0"/>
          <portSpacing port="sink_document 2" spacing="0"/>
        </process>
      </operator>
      <operator activated="true" class="set_macros" compatibility="8.2.001" expanded="true" height="82" name="Set Text Atribute" width="90" x="514" y="238">
        <list key="macros">
          <parameter key="textAttribute" value="'text'"/>
        </list>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="581" y="136">
        <parameter key="script" value="import pandas&#10;from textblob import TextBlob&#10;&#10;# Name of the text column, filled in by the Set Macros operator&#10;Text_Attribute = %{textAttribute}&#10;&#10;&#10;def spellingCorrection(txt):&#10;  # Convert the corrected TextBlob back to a plain string&#10;  b = TextBlob(txt)&#10;  return str(b.correct())&#10;&#10;&#10;def rm_main(data):&#10;  # Add a new column holding the corrected text&#10;  data['corrected_text'] = data[Text_Attribute].apply(spellingCorrection)&#10;  return data"/>
      </operator>
      <connect from_op="Search Twitter" from_port="output" to_op="Select Attributes" to_port="example set input"/>
      <connect from_op="Select Attributes" from_port="example set output" to_op="Nominal to Text" to_port="example set input"/>
      <connect from_op="Nominal to Text" from_port="example set output" to_op="Process Documents from Data" to_port="example set"/>
      <connect from_op="Process Documents from Data" from_port="example set" to_op="Set Text Atribute" to_port="through 1"/>
      <connect from_op="Set Text Atribute" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Note that, depending on the number of tweets, the correction may take several minutes.
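One possible way to reduce the running time is to correct each distinct word only once and reuse the result. The sketch below shows that idea with a stand-in corrector so it runs without textblob. `make_cached_corrector` and `toy_correct` are hypothetical names, and this is not what the process above does.

```python
from functools import lru_cache


def make_cached_corrector(correct_word, maxsize=100_000):
    """Wrap a word-level corrector so each distinct word is corrected only once."""
    cached = lru_cache(maxsize=maxsize)(correct_word)

    def correct_text(text):
        # Split on whitespace, correct word by word, and join again.
        return " ".join(cached(word) for word in text.split())

    return correct_text


# Stand-in corrector that records every word it is asked to correct.
calls = []


def toy_correct(word):
    calls.append(word)
    return {"helo": "hello"}.get(word, word)


correct_text = make_cached_corrector(toy_correct)
print(correct_text("helo iphone helo"))
print(correct_text("iphone helo"))
print("words actually corrected:", len(calls))
```

With textblob, the stand-in could be replaced by something like `lambda w: str(Word(w).correct())`, with `Word` imported from textblob. Keep in mind that splitting on whitespace treats punctuation differently from correcting the whole text at once, so the results may not be identical.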

Regards,

Lionel

Thomas_Ott · July 2018

@lionelderkrikor this is quite handy, thank you for this!

lionelderkrikor · July 2018

Hi,

You're welcome, @Thomas_Ott.

Happy corrections !

Regards,

Lionel

jozeftomas_2020 · July 2018

Hello,
Thank you so much.
Your code really amazes me.
I do not know how to thank you.
But, master,
in one comment I typed a misspelled word and ran the process. As a result, the word was not corrected.
Could you check?
For example:
iphon worst phone appl made helo meseg
After running:
iphon worst phone appl made helo meseg
I wanted the two words helo and meseg to be corrected to hello and message.
Thank you

lionelderkrikor · July 2018

Hi @jozeftomas_2020,

I executed the script with your examples, and "helo" is corrected to "held" and "meseg" to "meet" (in your case, I don't know why no correction is performed).

That's not what you were expecting, but the spelling corrector tries to find the nearest correct word to the misspelled word.

So:

- "held" is preferred over "hello" for "helo" (both are one letter away, so the corrector presumably picks the word it considers more common).

- "meet" is nearer to "meseg" than "message".

I think it will be very difficult to do better.
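To make "nearest" concrete, here is a small sketch that counts the single-letter edits between two words (Levenshtein distance). It is my own illustration, not the code used inside textblob, and Norvig-style correctors additionally count swapping two adjacent letters as one edit.

```python
def edit_distance(a, b):
    """Minimum number of single-letter insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(
                    previous[j] + 1,         # delete a letter from a
                    current[j - 1] + 1,      # insert a letter into a
                    previous[j - 1] + cost,  # substitute (or keep) a letter
                )
            )
        previous = current
    return previous[-1]


for wrong, candidate in [
    ("helo", "held"),
    ("helo", "hello"),
    ("meseg", "meet"),
    ("meseg", "message"),
]:
    print(wrong, "->", candidate, ":", edit_distance(wrong, candidate))
```

By this count "held" and "hello" are both one edit away from "helo", so the choice between them comes down to which word the corrector considers more likely. "meet" is two edits away from "meseg" while "message" is three, which puts "message" out of reach of a corrector that only looks two edits deep.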

Regards,

Lionel

jozeftomas_2020 · July 2018

Hello.
Yes, you are right.
Thanks again.
Could you send just the XML file of your last example?
Thank you

lionelderkrikor · July 2018

Hi @jozeftomas_2020,

Here is the last process:

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.001">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.001" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="operator_toolbox:create_exampleset" compatibility="1.2.000" expanded="true" height="68" name="Create ExampleSet" width="90" x="112" y="34">
        <parameter key="generator_type" value="comma_separated_text"/>
        <list key="function_descriptions"/>
        <list key="numeric_series_configuration"/>
        <list key="date_series_configuration"/>
        <list key="date_series_configuration (interval)"/>
        <parameter key="input_csv_text" value="Id,text&#10;1,helo&#10;2,meseg&#10;3,iphon worst phone appl made helo meseg"/>
      </operator>
      <operator activated="true" class="set_macros" compatibility="8.2.001" expanded="true" height="82" name="Set Text Atribute" width="90" x="246" y="34">
        <list key="macros">
          <parameter key="textAttribute" value="'text'"/>
        </list>
      </operator>
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="380" y="34">
        <parameter key="script" value="import pandas&#10;from textblob import TextBlob&#10;&#10;# Name of the text column, filled in by the Set Macros operator&#10;Text_Attribute = %{textAttribute}&#10;&#10;&#10;def spellingCorrection(text):&#10;  # Convert the corrected TextBlob back to a plain string&#10;  b = TextBlob(text)&#10;  return str(b.correct())&#10;&#10;&#10;def rm_main(data):&#10;  # Add a new column holding the corrected text&#10;  data['corrected_text'] = data[Text_Attribute].apply(spellingCorrection)&#10;  return data"/>
      </operator>
      <connect from_op="Create ExampleSet" from_port="output" to_op="Set Text Atribute" to_port="through 1"/>
      <connect from_op="Set Text Atribute" from_port="through 1" to_op="Execute Python" to_port="input 1"/>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

Regards,

Lionel
