Including a filter in the Get Page operator

s_nektarijevic RapidMiner Certified Analyst, Member Posts: 12 Contributor II
edited December 2018 in Help

Dear experts,

 

I have an issue with retrieving the info I need from a web page. My issue might go beyond RapidMiner, but I hope there'll still be some useful input :-)

 

I am trying to retrieve all documents from the search engine of a standardisation organisation, and in particular some information about these documents that isn't displayed by default in the search results. This is the page:

 

https://eur-lex.europa.eu/search.html?qid=1538673501151&scope=EURLEX&type=quick&lang=en&FM_CODED=REG

 

On the page there is an option to modify the information displayed by clicking on "Change displayed metadata" and selecting the desired fields. However, when I apply the filter, I do see the info I wanted, but nothing changes in the URL, and consequently the content I get from the Get Page operator stays the same.

 

Any idea how to solve this? I thought that using the query parameters of the Get Page operator could be useful, but I didn't manage to find any examples of what these parameters do and how they can be used.

 

Any input would be much appreciated! Many thanks in advance!

 

Cheers,

Snežana

Answers

  • sgenzer Administrator, Moderator, Employee, RapidMiner Certified Analyst, Community Manager, Member, University Professor, PM Moderator Posts: 2,959 Community Manager

    Hi @s_nektarijevic, that's a good question. As far as I know, the short answer is no: you cannot do what you're asking with the "query parameters" feature in the way you want. The "Change displayed metadata" selections trigger a JS request that goes back to their server via https://eur-lex.europa.eu/change-displayed-metadata.html, which builds a new list (and gives it a new qid) and sends it back to you. If you look at the network traffic when you make this selection, you will see something like this:

     

    Request URL: https://eur-lex.europa.eu/change-displayed-metadata.html
    Request Method: POST
    Status Code: 302 Moved Temporarily
    Remote Address: 147.67.210.44:443
    Referrer Policy: no-referrer-when-downgrade
    Connection: Keep-Alive
    Content-Language: en
    Date: Thu, 04 Oct 2018 18:33:53 GMT
    Location: https://eur-lex.europa.eu/search.html?qid=1538673501151&FM_CODED=REG&scope=EURLEX&type=quick&lang=en
    Server: Europa
    Transfer-Encoding: chunked
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
    Accept-Encoding: gzip, deflate, br
    Accept-Language: en-US,en;q=0.9,fr;q=0.8
    Cache-Control: max-age=0
    Connection: keep-alive
    Content-Length: 3161
    Content-Type: application/x-www-form-urlencoded
    Cookie: ELX_SESSIONID=e4dAV0CaIYd7y9OXWEi-bZmoGqhZGVoRGB9585PVa6TET1xvcJP7!1567286828; validateConsentCookies=true; WT_FPC=id=10.235.250.103-663370864.30694416:lv=1538695991009:ss=1538695755372; ACOOKIE=C8ctADEwLjIzNS4yNTAuMTAzLTY2MzM3MDg2NC4zMDY5NDQxNgAAAAAAAAABAAAAAwAAAOdctlv7W7ZbAQAAAAEAAADnXLZb+1u2WwEAAAADAAAAITEwLjIzNS4yNTAuMTAzLTY2MzM3MDg2NC4zMDY5NDQxNg--
    DNT: 1
    Host: eur-lex.europa.eu
    Origin: https://eur-lex.europa.eu
    Referer: https://eur-lex.europa.eu/search.html?qid=1538673501151&FM_CODED=REG&scope=EURLEX&type=quick&lang=en
    Upgrade-Insecure-Requests: 1
    User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36
    qid: 1538673501151
    callingUrl: /search.html?qid=1538673501151&FM_CODED=REG&scope=EURLEX&type=quick&lang=en
    id: 1538677936442
    defaultProfile: true
    profileName: Custom profile
    multilingualLink: false
    firstMultilingualLanguage: en
    secondMultilingualLanguage:
    thirdMultilingualLanguage:
    firstSortCriteria: LEGAL_RELEVANCE_SORT
    firstSortCritAsc: DESC
    secondSortCriteria: NULL
    secondSortCritAsc: DESC
    isExpertMode: false
    nbResultPerPage: 10
    highlightResult: true
    _highlightResult: on
    filter:
    filter:
    metadataSelected[DD_DISPLAY]: DD_DISPLAY
    _metadataSelected[DD_DISPLAY]: on
    _metadataSelected[CELLAR_ID_ACT_DISPLAY]: on
    _metadataSelected[XC_DISPLAY]: on
    _metadataSelected[XA_DISPLAY]: on
    _metadataSelected[DC]: on
    _metadataSelected[CT]: on
    _metadataSelected[CC]: on
    _metadataSelected[RJ]: on
    metadataSelected[ECLI]: ECLI
    _metadataSelected[ECLI]: on
    metadataSelected[AU]: AU
    _metadataSelected[AU]: on
    metadataSelected[FM]: FM
    _metadataSelected[FM]: on
    _metadataSelected[DN-old]: on
    metadataSelected[DTS]: DTS
    _metadataSelected[DTS]: on
    _metadataSelected[DTA]: on
    metadataSelected[DTT]: DTT
    _metadataSelected[DTT]: on
    _metadataSelected[DTC]: on
    _metadataSelected[TT]: on
    _metadataSelected[PAGES_TOTAL]: on
    metadataSelected[SO]: SO
    _metadataSelected[SO]: on
    _metadataSelected[PD_DISPLAY]: on
    _metadataSelected[IF_DISPLAY]: on
    _metadataSelected[EV_DISPLAY]: on
    _metadataSelected[SG_DISPLAY]: on
    _metadataSelected[DB_DISPLAY]: on
    _metadataSelected[LO_DISPLAY]: on
    _metadataSelected[DL_DISPLAY]: on
    _metadataSelected[DH_DISPLAY]: on
    _metadataSelected[NF_DISPLAY]: on
    _metadataSelected[RP_DISPLAY]: on
    _metadataSelected[TP_DISPLAY]: on
    _metadataSelected[VO_DISPLAY]: on
    _metadataSelected[MS_DISPLAY]: on
    _metadataSelected[BF_DISPLAY]: on
    _metadataSelected[CI_DISPLAY]: on
    _metadataSelected[AJ_DISPLAY]: on
    _metadataSelected[EA_DISPLAY]: on
    _metadataSelected[CD_DISPLAY]: on
    _metadataSelected[MD_DISPLAY]: on
    _metadataSelected[SP_DISPLAY]: on
    _metadataSelected[LB_DISPLAY]: on
    _metadataSelected[AP]: on
    _metadataSelected[DF]: on
    _metadataSelected[OB]: on
    _metadataSelected[PR]: on
    _metadataSelected[AG_DISPLAY]: on
    _metadataSelected[JR_DISPLAY]: on
    _metadataSelected[NA]: on
    _metadataSelected[NO]: on
    _metadataSelected[NC]: on
    _metadataSelected[COLL_DISPLAY]: on
    _metadataSelected[NO_OJ]: on
    _metadataSelected[NO_OJ_CLASS]: on
    _metadataSelected[COLL_OJ_DISPLAY]: on
    _metadataSelected[AS_DISPLAY]: on
    _metadataSelected[CM]: on
    _metadataSelected[IC]: on
    _metadataSelected[AF]: on
    _metadataSelected[MI]: on
    _metadataSelected[LG]: on
    _metadataSelected[RI]: on
    _metadataSelected[REP]: on
    _metadataSelected[TOC_DISPLAY]: on
    _metadataSelected[PROC_GR_DISPLAY]: on
    _metadataSelected[DP]: on
    _metadataSelected[AD]: on
    _metadataSelected[LF]: on
    _metadataSelected[RS_DISPLAY]: on
    metadataSelected[MNE_IMPLEMENTS_DIR_DISPLAY]: MNE_IMPLEMENTS_DIR_DISPLAY
    _metadataSelected[MNE_IMPLEMENTS_DIR_DISPLAY]: on
    _metadataSelected[ELI]: on
    button.apply: Apply

    So this is a POST request with a form. In order to do this yourself in RapidMiner you would need to make the same query (with your sessionID etc). It would be a lot of work.
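
    As a rough illustration of what "making the same query" involves, here is a minimal Python sketch that rebuilds a form POST like the one in the trace above, using only the standard library. It builds the request but does not send it. The function name `build_metadata_request`, the reduced set of form fields, and the placeholder session value are assumptions made for illustration; whether the server accepts a reduced form is not something the trace confirms.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Endpoint taken from the network trace above.
POST_URL = "https://eur-lex.europa.eu/change-displayed-metadata.html"


def build_metadata_request(qid, selected_fields, session_id):
    """Build (but do not send) a form POST resembling the traced one.

    Hypothetical helper: only a subset of the traced form fields is
    included, which the server may or may not accept.
    """
    calling_url = f"/search.html?qid={qid}&FM_CODED=REG&scope=EURLEX&type=quick&lang=en"
    fields = [
        ("qid", qid),
        ("callingUrl", calling_url),
        ("nbResultPerPage", "10"),
    ]
    # Each ticked checkbox appears twice in the trace:
    # metadataSelected[X]=X and _metadataSelected[X]=on.
    for name in selected_fields:
        fields.append((f"metadataSelected[{name}]", name))
        fields.append((f"_metadataSelected[{name}]", "on"))
    fields.append(("button.apply", "Apply"))

    body = urlencode(fields).encode("ascii")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded",
        # The session cookie would have to come from an earlier visit.
        "Cookie": f"ELX_SESSIONID={session_id}",
    }
    return Request(POST_URL, data=body, headers=headers, method="POST")


req = build_metadata_request("1538673501151", ["DD_DISPLAY", "AU"], "PASTE-YOUR-SESSION-ID")
print(req.get_method(), req.full_url)
print(req.data.decode("ascii"))
```

    Sending the request (for example with `urllib.request.urlopen`) and following the 302 redirect to the search page would be the next step, but that part depends on a valid session and is left out here.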

     

    If it were me I would do it in RapidMiner because (a) I am a terrible coder, and (b) I really enjoy puzzles like this. But most likely you just want to get it done. I am sure there is a Python library somewhere that will do something like this. If you know Python, I'd poke around GitHub and see what you find. Otherwise perhaps some of my coder friends will have a suggestion. :)

     

    Scott

     

  • rfuentealba Moderator, RapidMiner Certified Analyst, Member, University Professor Posts: 568 Unicorn

    Hi @s_nektarijevic,

     

    I would refrain from using the Get Page operator for JS-heavy pages and/or for handling POST requests. In my experience, your best bet would be to use Selenium (yes, it's slow and has other drawbacks, but it is indeed useful). A few months ago some of us had a similar discussion that might help you. You can find it here.

     

    Basically, Selenium automates a real browser, and it can run that browser headless, meaning that you get all the benefits of a browser except for the visual representation. You have to interact with the browser programmatically. I use it with Ruby, but (oh no, the crowd again!) you can find tutorials for using it with Python. If you use the Anaconda Python distribution and the Python Extension for RapidMiner, you can program it directly.

    Hope this helps,

  • s_nektarijevic RapidMiner Certified Analyst, Member Posts: 12 Contributor II

    Dear Scott,

     

    Many thanks for the response, this is very useful! I'll try something with the Python extension in RapidMiner, fingers crossed :-)

     

    Cheers,

    Snežana

  • s_nektarijevic RapidMiner Certified Analyst, Member Posts: 12 Contributor II

    Dear Rodrigo,

     

    Thanks a lot for the input! I'll try the Python Extension and see how it goes :-)

     

    Cheers,

    Snežana

  • kayman Member Posts: 662 Unicorn

    One way to tackle this is as follows:

     

    -> Use Firefox and load your page. You can do the same with other browsers, but Firefox is a bit easier

    -> use Ctrl+Shift+E to open the network inspector

    -> select HTML in the network menu / pane to avoid too much clutter showing up, and click on the trashcan to remove everything stored

    -> click on "Change displayed metadata" and apply your settings

    -> click OK and you will see a POST request appear in the network pane, going to change-displayed-metadata.html

    -> right click this entry and select -> copy -> copy post data, and save this somewhere for now (like a text file)

     

    -> next use the Get Page operator (I agree there are better ways using Python, but this one also works)

    -> set the URL of the page (copy -> copy url)

    -> set the request method to POST

    -> enable follow redirects

    -> in the 'query parameters' add the details you got from your post data above.

     

    So if you have this in your file: _metadataSelected[SP_DISPLAY]=on, set it as follows:

     

    query key : _metadataSelected[SP_DISPLAY]

    query value : on

     

    You may not need to use all of these, as some of them may be default values, so use trial and error. In the worst case you may need to include them all, but it's a one-time effort.
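
    Typing the key/value pairs in by hand can be tedious, so here is a small Python sketch that splits copied POST data into the pairs to enter as query parameters. It assumes the copied text is a single URL-encoded string with fields joined by "&"; the exact format of what the browser copies may differ. The function name `post_data_to_query_parameters` and the sample string are made up for illustration, using field names from the trace earlier in the thread.

```python
from urllib.parse import parse_qsl


def post_data_to_query_parameters(post_data):
    """Split URL-encoded POST data into decoded (key, value) pairs.

    Assumes fields are joined by '&'. Blank values (such as 'filter=')
    are kept, because the original form sent them too.
    """
    return parse_qsl(post_data.strip(), keep_blank_values=True)


# Hypothetical sample, shortened from the kind of data seen in the trace.
copied = (
    "qid=1538673501151&nbResultPerPage=10&filter="
    "&metadataSelected%5BDD_DISPLAY%5D=DD_DISPLAY"
    "&_metadataSelected%5BSP_DISPLAY%5D=on"
)

for key, value in post_data_to_query_parameters(copied):
    print(f"query key: {key}    query value: {value}")
```

    Each printed line corresponds to one row in the 'query parameters' list of the Get Page operator.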

     

    Good luck!

  • SGolbert RapidMiner Certified Analyst, Member Posts: 344 Unicorn

    Hi Snežana,

     

    The Scrapy framework can also do the trick. Don't forget to check the robots file:

     

    https://eur-lex.europa.eu/robots.txt

     

    In particular these fields can be problematic:

    Crawl-delay: 10
    Disallow: /autocomplete
    Disallow: /change-displayed-metadata

    As far as I understood, what you are trying to do is illegal, but I know very little about crawling restrictions.
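
    For checking such rules programmatically, Python's standard library has `urllib.robotparser`. The sketch below feeds it the three fields quoted above and asks whether two paths may be fetched. The `User-agent: *` line is an assumption (the quote above does not show which user agents the rules apply to), and `load_rules` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_LINES = [
    "User-agent: *",  # assumption: the quoted rules apply to all crawlers
    "Crawl-delay: 10",
    "Disallow: /autocomplete",
    "Disallow: /change-displayed-metadata",
]


def load_rules(lines):
    """Parse robots.txt lines into a RobotFileParser (no network access)."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


rules = load_rules(ROBOTS_LINES)
base = "https://eur-lex.europa.eu"
for path in ("/search.html?qid=1538673501151", "/change-displayed-metadata.html"):
    print(path, "allowed:", rules.can_fetch("*", base + path))
print("crawl delay (seconds):", rules.crawl_delay("*"))
```

    To check against the live file instead of the quoted lines, `set_url()` followed by `read()` on the parser would replace the `parse()` call.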

     

    Regards,

    Sebastian
