[ALMOST SOLVED] Web Crawling and Text Editing challenge
Kind people of the rapid-i,
I'm a complete beginner in the RapidMiner world, and I am dealing with a project that seems harder than expected. Maybe it's just that I am still learning all the tools and operators of RM... but here's the situation:
I've got a website with news and articles (i.e. www.parolibero.it).
I would like to do three things:
1. Extract the articles from the website (in text format or, even better, in XML format, keeping tags such as title, subtitle, body...)
2. Create an Excel list of the articles with the title and URL of each article
3. Export the data in a graphic format that highlights some chosen differences: for example, I would like to get a diagram showing how many articles were written in a specific year or by a specific journalist (how is it possible to use search filters once I have downloaded the data files?)
I've tried to use web crawling, but all I get is the home page in txt format and then an Excel file with just one record.
Can you please help me? At least I would like to know where I am going wrong or which operators to use for that.
Thank you very much indeed for your help!
Leon
P.S. There is no copyright issue at all, as I am on the staff of that website.
Answers
To extract information from the site you can, for example, use the Get Page operator followed by Cut Documents and Extract Information, see here: One thing you have to notice is that in XPath every HTML element name must be prefixed with 'h:'. Otherwise it won't work.
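To make the 'h:' point a bit more concrete, here is a minimal sketch in plain Python (not a RapidMiner process) of the same idea: cut a page into one segment per article, then extract the title and URL from each segment with namespace-prefixed XPath. The page markup, the `article` class name and the function names are all made up for illustration; the real site will almost certainly be structured differently, so treat the queries as placeholders.

```python
import xml.etree.ElementTree as ET

# XHTML namespace; the "h" prefix is bound to it for the XPath queries below.
NS = {"h": "http://www.w3.org/1999/xhtml"}

# Hypothetical page markup, invented for illustration only.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <div class="article">
      <h2><a href="/news/1">First title</a></h2>
      <p>First body.</p>
    </div>
    <div class="article">
      <h2><a href="/news/2">Second title</a></h2>
      <p>Second body.</p>
    </div>
  </body>
</html>"""


def extract_articles(xhtml):
    """Cut the page into article segments, then pull title and URL from each."""
    root = ET.fromstring(xhtml)
    rows = []
    # "Cut" step: one segment per (assumed) article container.
    for segment in root.findall(".//h:div[@class='article']", NS):
        # "Extract" step: query inside the current segment only.
        link = segment.find("h:h2/h:a", NS)
        rows.append({"title": link.text, "url": link.get("href")})
    return rows


def count_unprefixed(xhtml):
    """Run the same query without the h: prefix to see how many nodes match."""
    root = ET.fromstring(xhtml)
    return len(root.findall(".//div[@class='article']"))


articles = extract_articles(PAGE)
for row in articles:
    print(row["title"], row["url"])
print("matches without prefix:", count_unprefixed(PAGE))
```

The unprefixed query finds nothing because the elements live in the XHTML namespace, which is the same reason the 'h:' prefix is needed in the XPath expressions. The resulting title/URL rows are the kind of table that could then be written out as the Excel list.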
Best,
Nils