Remove all HTML tags in a message field

bea11005 Member Posts: 20 Maven
edited November 2018 in Help

Hi everyone!

I want to remove all HTML tags from a message field so that I can count the characters in the message without them, using the length operator.

How can I do it?

 

Answers

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    You'll need the Web Mining extension for that. It has the ability to get rid of HTML tags. 

  • bea11005 Member Posts: 20 Maven

    I need to remove the HTML tags from an attribute of a dataset.

    Which operator should I use?

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    It depends on how your data is set up, but I would look at the Extract Content, Unescape HTML, or Unescape HTML Document operators.

  • bea11005 Member Posts: 20 Maven

    I will try. Can I do it with a regular expression that deletes everything between the < and > symbols?

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Yes, you can do it with a RegEx. Just use the Replace operator.

  • bea11005 Member Posts: 20 Maven

    What RegEx can I use?

  • Thomas_Ott RapidMiner Certified Analyst, RapidMiner Certified Expert, Member Posts: 1,761 Unicorn

    Without seeing your data, I would guess something like this: \<.*\>

    and replace the matches with a space or something else.

  • kayman Member Posts: 662 Unicorn

    That's a greedy regex, so it would match everything from the first < to the last > in one go, tags and text alike, and leave you with not much content.

     

    Remove tags with either <.*?> (note the question mark, which makes the regex non-greedy) or <[^>]+>
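
To make the difference between the greedy and non-greedy patterns concrete, here is a minimal sketch in Python's re module rather than RapidMiner's Replace operator. The sample message is made up for illustration, and the three patterns should behave the same way in RapidMiner's regex flavor for input like this.

```python
import re

# Hypothetical sample value for the message attribute.
message = "<p>Hello <b>world</b>!</p>"

# Greedy: .* runs from the first "<" to the last ">", so the text is swallowed too.
greedy = re.sub(r"<.*>", "", message)

# Non-greedy: .*? stops at the nearest ">", so each tag is removed on its own.
lazy = re.sub(r"<.*?>", "", message)

# Negated class: "<", then one or more characters that are not ">", then ">".
negated = re.sub(r"<[^>]+>", "", message)

print(repr(greedy))  # the whole message was matched and removed
print(lazy)          # only the tags were removed
print(len(negated))  # character count of the message without tags
```

The sketch replaces tags with an empty string, not a space, because the goal is a character count; replacing with a space would add one character per tag. Note that neither pattern copes with a > inside an attribute value, and <.*?> will not match a tag that spans several lines unless the dot is allowed to match newlines.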

     

     
