Different results each time I run this process

carlcarl Member Posts: 30 Guru
edited November 2018 in Help

I've got my first RM personal-learning process working, but oddly it gives me a different answer each time I run it.

 

If I store the data after Step 1 and re-run Step 2, it gives the same answer each time, but a slightly wrong one.  If I re-run Steps 1 and 2, I get different results each time, even though I can verify directly that the source site has not changed.

 

So I'm not sure whether it's a bug or whether I've introduced an error somewhere in my process.  I've attached my files in the hope that someone may be able to spot something.

 

 

Best Answer

  • carlcarl Member Posts: 30 Guru
    Solution Accepted

    In case anyone is interested, I found that if I run Process Documents from Web 5 times (4 didn't quite get there), then Append the 5 outputs, and Remove Duplicates, I get the entire set of data with nothing missing and nothing duplicated.
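Outside RapidMiner, the same append-then-deduplicate idea can be sketched in a few lines of Python. This is only an illustration under assumptions: each crawl pass is taken to be a list of records, and the `href` of a service is treated as its unique key. The function `merge_passes` and the sample records are hypothetical and not part of the original process.

```python
def merge_passes(passes):
    """Union several crawl passes, keeping the first record seen per href."""
    merged = {}
    for records in passes:
        for rec in records:
            # setdefault keeps the earlier record and ignores later duplicates
            merged.setdefault(rec["href"], rec)
    return list(merged.values())


# Hypothetical sample data: each pass has a duplicate and a missing item,
# but the gaps differ between passes.
pass_1 = [
    {"href": "/g-cloud/services/1", "title": "A"},
    {"href": "/g-cloud/services/2", "title": "B"},
    {"href": "/g-cloud/services/2", "title": "B"},  # duplicated, item 3 missing
]
pass_2 = [
    {"href": "/g-cloud/services/1", "title": "A"},
    {"href": "/g-cloud/services/1", "title": "A"},  # duplicated, item 2 missing
    {"href": "/g-cloud/services/3", "title": "C"},
]

merged = merge_passes([pass_1, pass_2])
merged_hrefs = sorted(rec["href"] for rec in merged)
print(len(merged), merged_hrefs)
```

How many passes are needed depends on how often the site re-sorts, so the count of unique records should be compared against the site's stated total before trusting the merged set.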

Answers

  • carlcarl Member Posts: 30 Guru

    I've rebuilt the process in a different way using Extract Information inside Cut Document inside Process Documents from Web.  This approach is also producing a small percentage of incorrect results, e.g. the following entry is somehow extracted twice.  The total number of examples extracted is correct, but some are duplicated and some missing.


            <a href="/g-cloud/services/100201645788425">Peer-to-peer support planning and brokerage</a>
        

    This is proving a great learning exercise, but I can't for the life of me see where the problem is in my set-up.  Can anyone help?

     

    (One interesting learning here for me is that the Max Crawl Depth seems to be controlling the page iteration without the need for a Loop operator.)

  • carlcarl Member Posts: 30 Guru

    Interesting.

     

    So, I’ve found the root cause of the problem.  It seems that both the RapidMiner process and the source web site are working correctly, but the web site itself has a rather curious feature.

     

    Inspecting the data stored after the Process Documents from Web operator, I could see that less than 3% of the 25k examples were duplicated.  And I noticed the duplicates always bridged consecutive pages.
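A check of this kind could be sketched in Python as follows. It assumes the stored examples can be exported as (page number, href) pairs; the function names and the page numbers in the sample are hypothetical, chosen only to illustrate the test for duplicates that bridge consecutive pages.

```python
from collections import defaultdict


def duplicate_pages(records):
    """Map each href seen more than once to the list of pages it appeared on."""
    pages_by_href = defaultdict(list)
    for page, href in records:
        pages_by_href[href].append(page)
    return {h: p for h, p in pages_by_href.items() if len(p) > 1}


def bridges_consecutive(pages):
    """True if the occurrences fall on exactly two adjacent pages."""
    return max(pages) - min(pages) == 1


# Hypothetical export: one entry shows up on both page 124 and page 125.
records = [
    (124, "/g-cloud/services/100201645788425"),
    (125, "/g-cloud/services/100201645788425"),
    (125, "/g-cloud/services/other-service"),
]

dups = duplicate_pages(records)
all_bridging = all(bridges_consecutive(p) for p in dups.values())
print(dups, all_bridging)
```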

     

    Turns out, in the web site itself, that when I refresh a page URL directly in the browser (say “page=125”), the sort occasionally toggles to an alternative sequence.  It presumably does this every x number of views.

     

    So Process Documents from Web picks up 100 items from page x, then when crawling through page x+1, it may pick something up again from the prior page because of the re-sort, and lose something in exchange.  Hence the overall total of 25,260 examples returned by RapidMiner always conspiratorially matched the web site total.
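The effect can be reproduced with a toy model in Python. This is a simplified sketch, not the site's actual behaviour: it assumes just two alternative sort orders and a page size of 3 instead of 100, and the `crawl` function and item numbers are hypothetical.

```python
def crawl(orderings, page_size):
    """Fetch page i using orderings[i], the sort order in effect at that moment."""
    rows = []
    for page, order in enumerate(orderings):
        start = page * page_size
        rows.extend(order[start:start + page_size])
    return rows


order_a = [0, 1, 2, 3, 4, 5]
order_b = [0, 1, 3, 2, 4, 5]  # items 2 and 3 swap places across the page boundary

# Page 0 is fetched under order A, page 1 under order B.
rows = crawl([order_a, order_b], page_size=3)

duplicated = {item for item in rows if rows.count(item) > 1}
missing = set(order_a) - set(rows)
print(rows, duplicated, missing)
```

The total row count still equals the number of items on the site, which is why the totals agree even though one item is extracted twice and another is lost.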

     

    Not sure if there is a clever way to overcome that?  Instead of crawling the 253 search results pages, each with 100 items which have the unfortunate tendency to hop about and hide, I could go direct to the 25k lower-level pages.  I would have preferred to mine only the results summary pages.
