"Regarding Text Mining"

maria_godric Member Posts: 20 Maven
edited May 2019 in Help
Hi,

I have a text document. How can I delete the content between two special characters (for example, my document contains #something#)? I want to delete the special characters as well. I tried TextCleaner, but it requires listing the exact content to delete, so I don't think it will work for a large amount of data. Is there an operator available in RM for this?

Thanks,
Maria

Answers

  • land RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 2,531 Unicorn
    Hi,
    you might add a TokenReplace operator before the tokenizer during text processing and then use regular expressions to capture whatever you want to remove.

    Here's an example process setup:
    <operator name="Root" class="Process" expanded="yes">
        <operator name="TextInput" class="TextInput" expanded="yes">
            <list key="texts">
            </list>
            <list key="namespaces">
            </list>
            <operator name="TokenReplace" class="TokenReplace">
                <list key="replace_dictionary">
                  <parameter key="#[^#]*#" value=" "/>
                </list>
            </operator>
            <operator name="StringTokenizer" class="StringTokenizer">
            </operator>
        </operator>
    </operator>
    For more information about regular expressions, you could visit Wikipedia: http://en.wikipedia.org/wiki/Regular_expression. To try out an expression without executing the process, you could use the online form at http://en.wikipedia.org/wiki/Regular_expression.
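
To see what the pattern `#[^#]*#` from the replace dictionary matches, here is a minimal sketch outside RapidMiner using Python's `re` module. This is only an illustration under that assumption; the function name `strip_marked` is hypothetical and not a RapidMiner operator, and regex dialects can differ in details.

```python
import re

# Same pattern as the replace_dictionary key in the process above:
# a '#', then any run of characters that are not '#', then the closing '#'.
MARKED_SPAN = re.compile(r"#[^#]*#")


def strip_marked(text: str, replacement: str = " ") -> str:
    """Replace every #...# span, delimiters included, with `replacement`."""
    return MARKED_SPAN.sub(replacement, text)


if __name__ == "__main__":
    sample = "keep this #something# and this #another one# too"
    print(strip_marked(sample))
```

Because `[^#]*` stops at the next `#`, each marked span is matched separately rather than everything from the first `#` to the last one. A single unpaired `#` is left untouched.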

    Greetings,
      Sebastian
  • maria_godric Member Posts: 20 Maven
    Thanks, Sebastian.

    It worked fine. But I would like to get the edited text in the same format as the original data, i.e. I need to save it in .txt format.

    Thanks,
    Maria