Web Mining - Web Page Similarity
Hello,
I am new to RapidMiner, so please excuse my lack of knowledge. I am very excited about RapidMiner.
I want to find similarities between some web pages I am interested in. I have a list of web page links stored in an Excel sheet. I use the "Read Excel" operator to read the links and then the "Get Pages" operator to fetch the pages. Next I use the "Data to Documents" and "Process Documents" operators. Inside, I tokenize the web pages, filter stopwords, and transform cases. Finally, I use the "Data to Similarity" operator.
However, my results contain a lot of HTML tokens that I do not want. I know that the "Extract Content" operator can strip away the HTML, but it only seems to work with the "Get Page" operator and not with "Get Pages". This means I cannot strip the HTML if I fetch multiple pages at once using "Get Pages".
Could somebody advise on how to do this? I will be really thankful!
Have a good day!
- Prat
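For reference, the steps described above (strip the HTML, tokenize, lowercase, filter stopwords, then compare pages) can be sketched outside RapidMiner in plain Python. This is only a rough illustration using the standard library, not a RapidMiner process: the names `TextExtractor`, `strip_html`, `tokens`, and `cosine`, the tiny stopword list, and the sample pages are all made up for the example, and cosine similarity on raw term counts is just one possible similarity measure.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on", "for"}


class TextExtractor(HTMLParser):
    """Collects text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join with spaces so text from adjacent tags does not fuse together.
    return " ".join(parser.parts)


def tokens(html):
    """Strip HTML, lowercase, split into alphabetic tokens, drop stopwords."""
    words = re.findall(r"[a-z]+", strip_html(html).lower())
    return [w for w in words if w not in STOPWORDS]


def cosine(a, b):
    """Cosine similarity between two token lists, based on term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    # Hypothetical page contents standing in for the fetched pages.
    page_a = "<html><body><h1>Web Mining</h1><p>Mining the web</p></body></html>"
    page_b = "<html><body><p>Text mining and web pages</p></body></html>"
    print(cosine(tokens(page_a), tokens(page_b)))
```

The point of the sketch is the order of the steps: the HTML is removed from each page before tokenizing, so tag and attribute names never become tokens in the similarity calculation.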
Answers
Happy Mining!
~Marius