Different results each time I run this process

carlcarl Member Posts: 30 Guru
edited November 2018 in Help

I've got my first RM personal-learning process working, but oddly it gives me a different answer each time I run it.  


If I store the data after Step 1 and re-run Step 2, it gives the same answer each time, but a slightly wrong answer.  If I re-run Steps 1 & 2, then I get different results each time, even though I can verify directly that the source site has not changed.


So I'm not sure if it's a bug, or if I've introduced an error somewhere in my process.  I've attached my files in the hope that someone may be able to spot something.



Best Answer

  • carlcarl Member Posts: 30 Guru
    Solution Accepted

    In case anyone is interested, I found that if I run Process Documents from Web 5 times (4 didn't quite get there), then Append the 5 outputs and Remove Duplicates, I get the entire set of data with nothing missing and nothing duplicated.
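For anyone following along outside RapidMiner, the append-and-dedupe idea is essentially a union of the runs keyed on a unique identifier. A minimal Python sketch (the `(id, title)` row shape and the function name are just illustrative, not part of my process):

```python
def merge_crawl_runs(runs):
    """Append all crawl runs, then drop duplicate ids, keeping the
    first occurrence of each id (what Remove Duplicates does)."""
    seen = set()
    merged = []
    for run in runs:
        for row_id, title in run:
            if row_id not in seen:
                seen.add(row_id)
                merged.append((row_id, title))
    return merged
```

If each individual run misses a few items but the runs overlap enough, the merged result covers the full set, which matches what I saw with 5 runs.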


  • carlcarl Member Posts: 30 Guru

    I've rebuilt the process in a different way, using Extract Information inside Cut Document inside Process Documents from Web.  This approach is also producing a small percentage of incorrect results; e.g. the following entry is somehow extracted twice.  The total number of examples extracted is correct, but some are duplicated and some are missing.

            <a href="/g-cloud/services/100201645788425">Peer-to-peer support planning and brokerage</a>

    This is proving a great learning exercise, but I can't for the life of me see where the problem is in my set-up.  I wondered if anyone can help?


    (One interesting learning here for me is that the Max Crawl Depth seems to be controlling the page iteration without the need for a Loop operator.)

  • carlcarl Member Posts: 30 Guru



    So, I’ve found the root cause of the problem.  It seems that both the RapidMiner process and the source web site are working correctly, but the web site itself has a rather curious feature.


    Inspecting the data stored after the Process Documents from Web operator, I could see that less than 3% of the 25k examples were duplicated, and I noticed the duplicates always bridged consecutive pages.
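The "bridging consecutive pages" pattern is easy to check for once you have each item's id and the page it came from. A hypothetical Python sketch of the check I did by eye (the `(item_id, page_no)` row shape is an assumption about how you'd tag the stored data):

```python
from collections import defaultdict

def bridging_duplicates(rows):
    """rows: iterable of (item_id, page_no).  Return the ids that
    appear on two or more pages where at least one pair of those
    pages is consecutive, i.e. the duplicate 'bridges' a page break."""
    pages = defaultdict(set)
    for item_id, page in rows:
        pages[item_id].add(page)
    bridging = []
    for item_id, page_set in pages.items():
        ordered = sorted(page_set)
        if any(b - a == 1 for a, b in zip(ordered, ordered[1:])):
            bridging.append(item_id)
    return sorted(bridging)
```

If nearly all duplicate ids come back from this check, the duplication is a pagination artefact rather than an extraction bug.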


    It turns out that when I refresh a results-page URL directly in the browser (say “page=125”), the site's sort order occasionally toggles to an alternative sequence.  It presumably does this every x number of views.


    So Process Documents from Web picks up 100 items from page x; then, when crawling through page x+1, it may pick something up again from the prior page because of the re-sort, and lose something in exchange.  Hence the overall total of 25,260 examples returned by RapidMiner always (misleadingly) matched the web site total.


    I'm not sure if there is a clever way to overcome that.  Instead of crawling the 253 search-results pages, each with 100 items that have the unfortunate tendency to hop about and hide, I could go direct to the 25k lower-level pages, though I would have preferred to mine only the results summary pages.
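Going direct would mean collecting the service ids from the summary pages (which are reliable in aggregate once deduplicated) and then building the detail URLs from them, so the unstable page sort no longer matters. A hypothetical sketch; the base URL here is an assumption, not the real site:

```python
# BASE is a made-up placeholder mirroring the /g-cloud/services/<id>
# link shape seen in the extracted anchors; substitute the real host.
BASE = "https://example.gov.uk/g-cloud/services/{}"

def detail_urls(service_ids):
    """Build one detail-page URL per unique service id, so each item
    is fetched exactly once regardless of results-page sort order."""
    return [BASE.format(sid) for sid in sorted(set(service_ids))]
```

Duplicated ids from overlapping runs collapse to a single fetch, which sidesteps the hop-about-and-hide problem entirely, at the cost of 25k requests instead of 253.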
