Problem removing URLs and hashtags from data (from Excel)
I'm having a problem removing URLs and hashtags from my data (imported from Excel). I loaded the data (tweets) with three Read Excel operators and combined them with Append. After that, I connected the Append operator to Replace and entered a regex for URLs and hashtags in the parameters named regular expression and replace what. Then I connected it to Data to Documents and then Process Documents, which contains Transform Cases, Tokenize, and Filter Stopwords (Dictionary), in that order. The results were tokenized and the stopwords I created were removed. For the hashtags, however, only the # symbol is removed: for example, the original text #vscocam becomes vscocam. The URLs are not removed at all; they were just tokenized too.
Answers
Hello @fangirl96, welcome to the community. I think I understand the problem, and I believe you just need to adjust your regex. Can you share some example data and the process you're using (see the "Read Before Posting" instructions on the right)?
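As a starting point, here is a minimal Python sketch of the kind of pattern that removes a whole URL and a whole hashtag rather than only the # symbol. The patterns (`https?://\S+` and `#\w+`) and the function name are assumptions for illustration, not taken from your process, so check them against your own data before relying on them.

```python
import re

# Assumed patterns (not from the original process):
# a URL is "http://" or "https://" followed by any run of non-whitespace,
# a hashtag is "#" followed by one or more word characters.
URL_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#\w+")


def strip_urls_and_hashtags(text: str) -> str:
    """Remove whole URLs and whole hashtags, then tidy up the whitespace."""
    # Remove URLs first, so a "#fragment" inside a link is not
    # treated as a hashtag and the rest of the link left behind.
    text = URL_PATTERN.sub("", text)
    text = HASHTAG_PATTERN.sub("", text)
    # Collapse the gaps left by the removals into single spaces.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(strip_urls_and_hashtags("Sunset #vscocam https://t.co/abc123 today"))
```

The important part is that the cleanup runs on the raw text before any tokenizing, because once the text has been split into tokens there is no complete URL or hashtag left for the pattern to match.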
Scott
This is the full XML of my process.
The links are not removed, but the hashtags were.
PS: The links in my data start with https.
Thank you, @fangirl96. Can you share one of those Excel sheets as well?
Scott
@fangirl96, take a look at my tutorial process here: http://www.neuralmarkettrends.com/blog/entry/use-rapidminer-discover-twitter-content
In it, I extract the hashtags and replace anything starting with https: with a generic word, 'link'.
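For readers who want to see the idea outside RapidMiner, the following is a rough Python sketch of that approach: links are swapped for a placeholder word and the hashtags are pulled out into a list. It is not the tutorial's actual process; the function names, the regular expressions, and the default placeholder argument are assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Assumed patterns, chosen for illustration only.
LINK_PATTERN = re.compile(r"https?://\S+")
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def replace_links(text: str, placeholder: str = "link") -> str:
    """Replace every http/https URL with a generic placeholder word."""
    return LINK_PATTERN.sub(placeholder, text)


def extract_hashtags(text: str) -> List[str]:
    """Return the hashtags found in the text, without the leading '#'."""
    return HASHTAG_PATTERN.findall(text)


def preprocess(text: str) -> Tuple[str, List[str]]:
    """Replace links first, then collect hashtags from the cleaned text."""
    # Links go first so that a '#fragment' inside a URL is not
    # mistaken for a hashtag.
    cleaned = replace_links(text)
    return cleaned, extract_hashtags(cleaned)


if __name__ == "__main__":
    cleaned, tags = preprocess("Sunset #vscocam #nofilter https://t.co/abc123")
    print(cleaned)
    print(tags)
```

Keeping the hashtags as a separate list means they can still be counted or analyzed later, while the placeholder records that a tweet contained a link without cluttering the word list with URL fragments.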