extract URL from text
Dear community,
I am trying to extract a URL from a text. I want to parse not only Twitter posts for mentioned URLs but also other news content.
I then want to feed the URLs to the Get Page operator. I am fine with that part, but so far I have not managed to extract the URLs. I have already tried the Extract Information operator...
Help is much appreciated!
Thanks,
Julian
Answers
@julian_d pretty easy, and you were on the right track. You have to use the RegEx parameter in the Extract Information operator.
The Aylien and Rosette "Extract Entity" operators within RapidMiner will also let you pull out URLs if you want to go down the third-party API route.
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts
Thanks @Thomas_Ott and @Telcontar120 for your quick replies. However, it seems I am not managing to feed the Extract Information operator with the right document type. I am trying to convert the content of the Twitter post into a document.
I created a sample process. Have you got an idea?
Thanks again
Julian
Hi @julian_d,
Your XML process is broken. To share your process properly, see:
https://community.rapidminer.com/t5/RapidMiner-Studio-Knowledge-Base/How-can-I-share-processes-without-RapidMiner-Server/ta-p/37047
1. However, I think you can use this simple process with the Extract Entities operator of the Aylien extension (the latest version of this extension can be installed from the Marketplace), as indicated by @Telcontar120.
2. Like you, I tried to feed the Extract Information operator from the Search Twitter operator in RapidMiner, but I only get one Tweet in the results.
So I am curious to see answers that solve this problem.
Here is my process:
I hope it helps / thank you
Regards,
Lionel
@lionelderkrikor you're making it a bit hard on yourself, try this process on for size
Hi @Thomas_Ott,
Thank you for your solution.
Why did I not think of Process Document to Data? I don't know... Maybe because it's the end of the day here and it's time to sleep.
Thanks again,
Best regards,
Lionel
Thanks for your solutions and feedback @Thomas_Ott @lionelderkrikor
The Process Documents operator nearly gets me there. However, if the link is followed by further text, the operator does not seem able to keep only the link. I attached my sample process; I hope it works this time.
Furthermore, the Extract Information operator does not really isolate the URL. Since I want to feed a Get Page operator, I need to extract the URL only. It would be awesome if you had another hint.
Thanks
Julian
@julian_d it's kinda funny but I'm working on something similar. I have something that technically works below, BUT it only matches http.*.com URLs, which limits it to .com domains. Not great.
I think the trick here is to tokenize the text so that you don't destroy the full http://link.com, select it in the Process Documents operator, and set it as the URL attribute. Then, outside the Process Documents operator, you'll have to use Extract Macro and loop over Get Page to pull in the URLs.
@julian_d this is incredibly crude but fast. You'll have to tune how you want to tokenize and extract only URLs.
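For readers without the attached process, the tokenizing idea described above can be pictured in a few lines of Python. This is only a rough sketch of the concept, not a RapidMiner process: it assumes that splitting on whitespace is enough to keep a link in one piece, and the function name `extract_url_tokens` is made up for illustration.

```python
def extract_url_tokens(text):
    """Split on whitespace so a full link stays one token, then keep URL-like tokens."""
    # Assumption: URLs never contain whitespace, so whitespace tokenization
    # cannot cut a link such as http://link.com into pieces.
    return [
        token
        for token in text.split()
        if token.startswith(("http://", "https://"))
    ]


tweet = "Great read https://t.co/abc123 via @someone"
print(extract_url_tokens(tweet))  # ['https://t.co/abc123']
```

Each token that survives the filter would then play the role of the URL attribute that is handed to Get Page in the loop.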
Thanks very much @Thomas_Ott for your immediate and awesome support. I managed to implement it in my main process. Unfortunately, the Get Page operator seems to have an issue with redirects. Many of the gathered URLs are forwards from Twitter to other pages, starting with https://t.co/.. I have tried hard to build a workaround to get to the final page but have not managed it yet. To demonstrate my problem I extended your process a bit. Sorry to ask again, but... any idea?
Do you also know how I can eliminate dots and commas that directly follow a URL? Currently I get an error if somebody mentions a URL followed immediately by a comma or a dot, because the Process Documents operator treats the dot as part of the URL and the Get Page operator then complains about a corrupt input.
Thank you!
Julian
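Both problems in the post above (punctuation glued to a URL, and short links that redirect) can be illustrated outside RapidMiner with a small Python sketch. It is not a RapidMiner workaround. The function names and the set of punctuation characters are assumptions made for illustration, and whether t.co answers this particular client with a plain HTTP redirect is not something the sketch verifies.

```python
import urllib.request

# Assumed set of characters that should not end a URL in running text.
TRAILING_PUNCTUATION = ".,;:!?"


def clean_url(token):
    """Remove sentence punctuation that directly follows a URL."""
    return token.rstrip(TRAILING_PUNCTUATION)


def resolve_final_url(url, timeout=10):
    """Follow HTTP redirects and return the URL of the final response."""
    # urlopen follows standard 3xx redirects by default; geturl() reports
    # the address that was actually retrieved.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.geturl()


print(clean_url("https://example.com/story,"))  # https://example.com/story
print(clean_url("https://example.com/story."))  # https://example.com/story
```

In a loop over extracted tokens, `resolve_final_url(clean_url(token))` would be the combined step. It needs network access, so it is not called here. Stripping punctuation this way would also damage the rare URL that legitimately ends in one of these characters.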
Yes, URLs can be tricky! Right now the regex here is just http.* which basically picks up anything that starts with http but does not specify any kind of terminal character restrictions.
So you could reformulate your URL regex to specify something like this: http.*\.[a-zA-Z]{3}
This will work for any domain that ends in a three-letter TLD (.com, .net, .org, etc.), and it will stop there rather than picking up other trailing characters. You can use similar logic to create other versions if you need to capture longer URLs or deal with TLDs that don't have three letters.
Or, if all this regex starts to give you a headache, you could look at the Extract Entity operators from either MonkeyLearn or Rosette, since both of them support URL extraction.
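To see what the pattern above does on concrete text, here is a small Python check. The first pattern is the one quoted in this post. The second, `https?://\S+` followed by stripping trailing punctuation, is an alternative added here as an assumption and does not come from the thread. The function names are likewise made up for illustration.

```python
import re

# Pattern quoted in the post above.
THREAD_PATTERN = re.compile(r"http.*\.[a-zA-Z]{3}")
# Assumed alternative: take everything up to the next whitespace,
# then strip sentence punctuation from the end.
ALT_PATTERN = re.compile(r"https?://\S+")


def extract_with_thread_pattern(text):
    """Return the first match of the quoted pattern, or None."""
    match = THREAD_PATTERN.search(text)
    return match.group(0) if match else None


def extract_with_alt_pattern(text):
    """Return all whitespace-delimited URLs without trailing punctuation."""
    return [url.rstrip(".,;:!?") for url in ALT_PATTERN.findall(text)]


simple = "Read http://example.com, then reply"
deep = "See https://example.org/news/item?id=7 today"

print(extract_with_thread_pattern(simple))  # http://example.com
print(extract_with_thread_pattern(deep))    # https://example.org
print(extract_with_alt_pattern(deep))       # ['https://example.org/news/item?id=7']
```

The quoted pattern drops the trailing comma as intended, but it also cuts off the path in the second example, and because `.*` is greedy it runs to the last dot on the line that is followed by three letters. That matters when the goal is to hand complete article links to Get Page.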
Lindon Ventures
Data Science Consulting from Certified RapidMiner Experts