Web scraping on site created in javascript

marcelolimabati · July 2018

Hi experts,

I'm trying to scrape a site that is built with JavaScript, but I can't get it to work. Which operator can I use to scrape this kind of site?
I was using the webtable operator, but that only works for sites delivered as static HTML, correct?
Could you help me, please?

Thank you in advance for your help.

Marcelo Batista

Telcontar120 · July 2018

Get Page and Get Pages are the basic web scrapers that work for specific given URLs. Crawl Web is a more advanced version that can actually go through a site and follow links that match a specified pattern.

Thomas_Ott · July 2018

@marcelolimabati and @Telcontar120 I've been running into problems where websites completely disallow web scrapers. While I have not implemented this yet, the solution appears to be web browser automation. There are several non-RapidMiner packages that can do this. The big one is Selenium (Python).

Something to think about.
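
A minimal sketch of what that browser automation could look like in Python, assuming Selenium 4 and a local Chrome install. The function names chrome_arguments and fetch_rendered_html are illustrative and do not come from any post in this thread, and the "--headless=new" switch assumes a reasonably recent Chrome.

```python
def chrome_arguments(headless=True, user_agent=None):
    """Build the Chrome command-line switches for a scraping session."""
    args = ["--window-size=1280,1024"]
    if headless:
        # Assumption: a recent Chrome that understands the "new" headless mode.
        args.append("--headless=new")
    if user_agent:
        args.append(f"--user-agent={user_agent}")
    return args


def fetch_rendered_html(url, wait_seconds=10):
    """Open url in Chrome, wait for the document to load, return the DOM as HTML."""
    # Imported here so the helper above stays usable without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    for arg in chrome_arguments():
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # readyState only says the document finished loading; a real scraper
        # would more likely wait for a specific element to appear.
        WebDriverWait(driver, wait_seconds).until(
            lambda d: d.execute_script("return document.readyState") == "complete"
        )
        return driver.page_source
    finally:
        # Always close the browser, even if the page fails to load.
        driver.quit()
```

Calling fetch_rendered_html("https://www.rapidminer.com/") would return the HTML after the browser has run the page's scripts, which is the part a plain HTTP fetch does not give you.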

rfuentealba · July 2018

Hello all,

I know Ruby isn't as popular as Python over here (and I can almost certainly feel the crowd chanting "Switch-to-python! Switch-to-python!"), but it is quite handy when it comes to automating Web manipulation stuff (I used it on a daily basis as part of my testing process when I was a Ruby developer) and you can still use it with RapidMiner: just use the Execute Program operator and read its output.

Please find attached a zip file with the code. You need Ruby from https://www.ruby-lang.org/, Google Chrome installed, and the Selenium Chrome Driver from http://chromedriver.chromium.org/downloads to make this work.

[Screenshot: Screen Shot 2018-07-19 at 00.33.50.png] Here it is!!!

Once you have Ruby installed, uncompress the zip file in your $HOME, run bundle install to install the required libraries, and execute the code with ruby website.rb. If, on the other hand, you want to pass the URL as a parameter, you only need to replace line 6 (browser.get 'https://www.rapidminer.com/') with browser.get ARGV[0] and that's it. Beware that with this modification the script will throw an error if you don't call it with a URL as the last parameter, like this:

ruby website.rb https://www.datasciencegems.com/

(BTW, I haven't tried this on Windows. Mac is immensely more popular among Rubyists).
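
For anyone going the Python route instead, the same "URL as the last parameter" idea could be sketched as below, with a clearer failure than a raw error when the URL is missing. The file name website.py and the function url_from_args are hypothetical; nothing here is taken from the attached Ruby script.

```python
import sys
from urllib.parse import urlparse


def url_from_args(argv, default=None):
    """Return the URL given as the last command-line argument.

    Falls back to default when no argument is given, and exits with a
    readable message when there is neither an argument nor a default.
    """
    if len(argv) < 2:
        if default is None:
            raise SystemExit("usage: python website.py URL")
        return default

    url = argv[-1]
    parsed = urlparse(url)
    # Only accept absolute http(s) URLs; anything else is likely a typo.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise SystemExit(f"not an http(s) URL: {url}")
    return url


def main():
    # Whatever is printed to standard output is what an operator such as
    # Execute Program would read back.
    print(url_from_args(sys.argv, default="https://www.rapidminer.com/"))
```

Calling main() at the bottom of the script and running python website.py https://www.datasciencegems.com/ would then print that URL, while running it without arguments falls back to the default.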

Thomas_Ott · July 2018

@rfuentealba I barely know Python, you want me to learn Ruby now? lol.

Ruby does not play nicely with Windows, so I'll default to Python.

kayman · July 2018

(silently chanting python / python / python ...)

Using Selenium in combination with Python / RapidMiner works really nicely. Attached is an example that I used to get the IP addresses of users on a forum, as these were not retrievable through the normal API but required a login. Not that this matters, but it sets the scene a bit.

What the script below does is open a specific page, enter the username and password, click the login button, look for the element stating a user's name, then open that page and get the IP information from his / her profile. Next, the logic takes all of these IP addresses and user codes and exports them nicely into one big table / ExampleSet.

It doesn't work anymore as we closed down the forum itself, but it shows you can do pretty complex stuff. You will have to modify it for your own uses, but it may get you started more quickly.

<?xml version="1.0" encoding="UTF-8"?><process version="8.2.000">
  <context>
    <input/>
    <output/>
    <macros/>
  </context>
  <operator activated="true" class="process" compatibility="8.2.000" expanded="true" name="Process">
    <process expanded="true">
      <operator activated="true" class="python_scripting:execute_python" compatibility="7.4.000" expanded="true" height="82" name="Execute Python" width="90" x="246" y="34">
        <parameter key="script" value="import pandas&#10;&#10;from selenium import webdriver&#10;from selenium.common.exceptions import NoSuchElementException&#10;import time&#10;&#10;# login sequence, need to modify to hide password&#10;def login_com():&#10;    driver.get(&quot;your_page.html&quot;)&#10;    driver.find_element_by_id(&quot;loginPageV1&quot;).click()&#10;    driver.find_element_by_id(&quot;lia-login&quot;).clear()&#10;    driver.find_element_by_id(&quot;lia-login&quot;).send_keys(&quot;my_login&quot;)&#10;    driver.find_element_by_id(&quot;lia-password&quot;).click()&#10;    driver.find_element_by_id(&quot;lia-password&quot;).clear()&#10;    driver.find_element_by_id(&quot;lia-password&quot;).send_keys(&quot;my_password&quot;)&#10;    driver.find_element_by_id(&quot;submitContext_0&quot;).click()&#10;&#10;# logic to strip IP from page&#10;def get_ip(cid):&#10;    ip_xpath = &quot;//td[@class='lia-property-value lia-data-cell-secondary lia-data-cell-text lastVisitIpAddress']&quot;&#10;    try:&#10;        driver.get(&quot;http://yourpage/user/viewprofilepage/user-id/&quot; + str(int(cid)))&#10;        ip = driver.find_element_by_xpath(ip_xpath).text&#10;    except NoSuchElementException:&#10;        return 'null'&#10;    return ip&#10;&#10;# We use selenium webdriver to mimic the behaviour of a human&#10;path = 'C:\\Users\\me_myself_and_AI\\PycharmProjects\\selenium'&#10;# using chrome but any driver should do&#10;driver = webdriver.Chrome(path + '\\driver\\chromedriver')&#10;driver.maximize_window()&#10;# login so we have admin rights&#10;login_com()&#10;# give the page some time to load&#10;time.sleep(2)&#10;&#10;def rm_main(data):&#10;&#9;cid = data['cid']&#10;&#9;# loop through all Customer ID's and get the IP for all of them&#10;&#9;data['ip'] = cid.apply(lambda x: get_ip(x))&#10;&#9;# Close the browser afterwards&#10;&#9;driver.quit()&#10;&#9;return data"/>
        <description align="center" color="transparent" colored="false" width="126">Use selenium webdriver to get ip by batch</description>
      </operator>
      <connect from_op="Execute Python" from_port="output 1" to_port="result 1"/>
      <portSpacing port="source_input 1" spacing="0"/>
      <portSpacing port="sink_result 1" spacing="0"/>
      <portSpacing port="sink_result 2" spacing="0"/>
    </process>
  </operator>
</process>

The trick for non-Python gurus is to use Selenium with your browser first. Install the Katalon Automation Recorder (great plugin, should be standard in everybody's scraping toolkit), let it record, and copy and paste the generated Python code. You may need to tune the code a bit, but it should get you started.
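
The script above carries the comment "need to modify to hide password". One possible way to do that, sketched here under assumptions, is to read the credentials from environment variables instead of hard-coding them. The variable names FORUM_USER and FORUM_PASSWORD and the function names are hypothetical; the element IDs are the ones from the script above. The login function uses the Selenium 4 find_element(By.ID, ...) form, since the find_element_by_id helpers used in 2018 were removed in later Selenium releases.

```python
import os


def load_credentials(env=None):
    """Read the login and password from environment variables.

    FORUM_USER and FORUM_PASSWORD are made-up names for this sketch.
    """
    env = os.environ if env is None else env
    user = env.get("FORUM_USER")
    password = env.get("FORUM_PASSWORD")
    if not user or not password:
        raise RuntimeError("Set FORUM_USER and FORUM_PASSWORD before running")
    return user, password


def login(driver, login_url, user, password):
    """Fill in and submit the login form on an already created driver."""
    # Imported here so load_credentials works without Selenium installed.
    from selenium.webdriver.common.by import By

    driver.get(login_url)
    driver.find_element(By.ID, "loginPageV1").click()

    login_box = driver.find_element(By.ID, "lia-login")
    login_box.clear()
    login_box.send_keys(user)

    password_box = driver.find_element(By.ID, "lia-password")
    password_box.clear()
    password_box.send_keys(password)

    driver.find_element(By.ID, "submitContext_0").click()
```

With this in place, the credentials never appear in the process XML; they only need to be set in the environment of whatever launches the Python interpreter.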

rfuentealba · July 2018

Hello Sensei @Thomas_Ott,

I just wrote that Ruby code to test whether Selenium does what @marcelolimabati wants (it does!), and thought it was better to share it than to keep it to myself. You will still need to install Google Chrome and the Chrome Driver, which were my main concern, but it wasn't difficult at all, on my Mac at least.

And no, Ruby doesn't play nice with Windows, but that's mostly true for server (e.g. Rails, Puma, Sinatra) applications. The one I attached shouldn't be much of a problem, though.

(BTW, I went back to sleep before pressing the "Post" button and woke up with @kayman chanting "Python! Python! Python!", was about to say it would be nice if someone else could post Python code that does the same. Thanks, mate!)

Thomas_Ott · July 2018

@kayman man, you just saved me hours of trying to figure this stuff out. Are you going to Wisdom? If so, I'm buying you a beer.

kayman · July 2018

Naah, I wish... My budget is too small for this.

But if you're ever in the neighbourhood I'll remind you of the offer...

SGolbert · July 2018

Hi,

Selenium is a good choice for PHP- or cookie-heavy websites. But it is very slow.

Regarding web scraping limitations, they can be partially addressed by using pauses and setting the user agent correctly. But if they really don't want any scraper, i.e. they have a robots.txt against it, even using Selenium would be illegal.
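
As a rough sketch of the robots.txt check and the pauses mentioned above, Python's standard library can parse the rules before any request is sent. The robots.txt content, the user agent string and the function names below are made up for illustration; a real scraper would read the site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt):
    """Parse robots.txt text into a rule set that can be queried offline."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser, user_agent, url):
    """True if the rules allow this user agent to request the URL."""
    return parser.can_fetch(user_agent, url)


def pause_seconds(parser, user_agent, default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default
```

A scraping loop would call may_fetch before each page and time.sleep(pause_seconds(...)) between pages, whether the pages are fetched with Get Page, Selenium or anything else.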

OT: I am also no Ruby fan. It is very powerful but I value code clarity above anything else.

rfuentealba · July 2018

Hello @SGolbert,

Remember that the site was created in JavaScript, so the only ways to scrape it are with a browser, a headless browser, or a JavaScript parser. You might be able to render and save it with other solutions such as PhantomJS, but the amount of work needed just for that is sometimes infeasible.
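
To see why a plain HTML fetch comes back empty on such a site, here is a small heuristic sketch using only the Python standard library: it checks whether the downloaded HTML contains scripts but no visible text. The class and function names and both sample pages are invented for this illustration, and the check is only a rough indicator, not a reliable detector.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text outside <script>, <style> and <title>; count script tags."""

    SKIPPED = ("script", "style", "title")

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.text_chunks = []
        self.script_count = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        if tag == "script":
            self.script_count += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_chunks.append(data.strip())


def looks_js_rendered(html):
    """Heuristic: scripts are present but the static HTML shows no text."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return parser.script_count > 0 and not parser.text_chunks


# Invented sample pages: an empty shell filled in by a script, and a static table.
JS_SHELL = (
    "<html><head><title>App</title></head><body>"
    '<div id="root"></div><script src="/bundle.js"></script>'
    "</body></html>"
)
STATIC_PAGE = "<html><body><table><tr><td>42</td></tr></table></body></html>"
```

For the first sample the function returns True, because all the content would only appear after a browser runs the script; for the second it returns False, because the table is already in the HTML.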

OT: I began telling everyone that I used plain TextMate so that they couldn't include me in their vim vs emacs flamewars. Rest assured that I won't bring a knife with gems to a snake-firing gunfight. Mine was just a partial solution, and with Python one has to install Selenium and the Chrome Driver too, so it's not that different.

sgenzer · July 2018

@rfuentealba: "...that they couldn't include me in their vim vs emacs flamewars"

LOL didn't know people still used emacs. I used it on a VT-100 dumb terminal in 1989...wow I'm old!

Scott

holyswede · October 2018

Hi Marcelo, I'd have a look at a web scraping library that actually scrapes using JavaScript, like https://github.com/apifytech/apify-js from Apify.

Until now, JavaScript hasn't had a library comparable to Scrapy for Python. It simplifies doing deep crawls of complex JavaScript sites using lists of 100k URLs, and many other things.

Here are the docs: https://www.apify.com/docs/sdk/apify-runtime-js/latest

Cheers

Holy

Altair RapidMiner Community