DBSCAN taking very long time

moritz_moeller Member Posts: 5 Learner I
edited January 2019 in Help
Hello there,

I am currently trying to run a cluster analysis with DBSCAN. Since this is my first time doing a cluster analysis and my first time using DBSCAN, all I know comes from papers and online documents. Maybe one of you can help me out:

I am analyzing a fairly large amount of data (I know that's relative): 10 columns and around 6 million rows. I select attributes, filter and normalize them, and then pass them to the DBSCAN clustering. My parameters are epsilon=0.5 and minpts=4. I want to look at 2 attributes at a time, since I'll compare the results to k-means.
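To make the setup above concrete, here is a minimal sketch of the described steps (pick two numeric attributes, normalize them, cluster with epsilon=0.5 and minpts=4) in plain Python. It is not RapidMiner's implementation. The function names `zscore`, `region_query`, and `dbscan` are hypothetical, the normalization is assumed to be a z-transformation, and minpts is assumed to count the point itself. The neighbor search is brute force, so every point triggers a scan over all points.

```python
import math

def zscore(column):
    """Standardize one numeric column to mean 0 and standard deviation 1."""
    mean = sum(column) / len(column)
    var = sum((x - mean) ** 2 for x in column) / len(column)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for a constant column
    return [(x - mean) / std for x in column]

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps=0.5, min_pts=4):
    """Brute-force DBSCAN sketch. Returns one label per point; -1 is noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
    return labels

# Made-up toy data: two attributes, two dense groups, one outlier.
attr_a = [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0, 20.0]
attr_b = [2.0, 2.1, 1.9, 2.0, 8.0, 8.1, 7.9, 8.0, -10.0]
points = list(zip(zscore(attr_a), zscore(attr_b)))
labels = dbscan(points, eps=0.5, min_pts=4)
print(labels)
```

Because `region_query` runs once per point and looks at every point each time, the work grows with the square of the row count, which is why this approach is only practical on small inputs.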

The problem is that it already takes over an hour to preprocess the data (the loading circle is shown on the clustering step) before the progress even starts to go from 1 to 100. Is there anything I can change in my process to make it faster? There may well be some beginner mistakes involved.

Thanks for your answers and have a nice day.

EDIT: I have 64GB of RAM, and the process uses around 32GB at the moment. I set the maximum to 50GB. In addition, I only have numeric attributes.

Best Answer

  • MartinLiebig Administrator, Moderator, Employee, RapidMiner Certified Analyst, RapidMiner Certified Expert, University Professor Posts: 3,525 RM Data Scientist
    Solution Accepted
    Hi Moritz,
    I guess 6M rows are just a lot for this. If I remember correctly, the runtime is in O(n²).

    - Sr. Director Data Solutions, Altair RapidMiner -
    Dortmund, Germany
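
To put the O(n²) figure in perspective, the following back-of-the-envelope sketch counts distance evaluations under the assumption that each of the n points triggers a neighbor scan over all n points (a brute-force search without a spatial index). The function name `distance_evaluations` is hypothetical, and whether RapidMiner's operator behaves exactly this way is not confirmed in this thread.

```python
def distance_evaluations(n):
    """Distance computations if each of n points scans all n points once."""
    return n * n

full = distance_evaluations(6_000_000)   # the full data set from the question
tenth = distance_evaluations(600_000)    # one tenth of the rows

print(f"{full:.1e}")    # 3.6e+13
print(full // tenth)    # 100
```

Under this assumption, cutting the row count by a factor of 10 cuts the work by a factor of 100.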


  • moritz_moeller Member Posts: 5 Learner I
    Well, it seems like you're correct. I am now working with only a range of my rows, and the runtime is considerably lower.

    Thanks for the answer, I assume that this is the correct one.
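
If the rows are stored in some meaningful order, a contiguous range may not be representative of the whole data set. One alternative, which is an assumption and not something discussed in the thread, is to draw a reproducible random sample of rows before clustering. The helper name `sample_rows` and the 10% fraction below are hypothetical.

```python
import random

def sample_rows(rows, fraction, seed=42):
    """Return a random subset of rows; the same seed gives the same subset."""
    rng = random.Random(seed)                 # private generator, fixed seed
    k = max(1, int(len(rows) * fraction))     # keep at least one row
    return rng.sample(rows, k)                # sampling without replacement

# Stand-in for real data: 1000 rows with two numeric attributes each.
rows = [(float(i), float(i % 7)) for i in range(1000)]
subset = sample_rows(rows, fraction=0.1)
print(len(subset))  # 100
```

The sample size can then be increased step by step until the runtime is no longer acceptable.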