"Importing data from UCI Machine Learning Repository"

DrSnuggels Member Posts: 4 Contributor I
edited May 2019 in Help
Hi!

I'm quite new to RapidMiner and want, for a start, to import some of the data sets from the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/. Judging from the RapidMiner documentation, these data sets seem to be in some sort of C4.5 format (at least they come with .data and .names files). But when I use the c45-importer, I always get errors like
: Line 1: the number of tokens in each line must be the same as the number of attributes (15), was: 10
Does anyone have a hint for me on how to import these files, or how I would have to alter them? Any help is welcome.

Greetings, Marius
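
The error above complains about the number of tokens per line, so a first check is to count the comma-separated fields in the .data file itself. The following is only a minimal sketch: `field_counts` is a hypothetical helper, not part of RapidMiner, and the sample text is a made-up stand-in for a real .data file, assuming the file is comma-separated.

```python
import csv
import io
from collections import Counter


def field_counts(text, delimiter=","):
    """Map 'number of fields in a line' to 'how many lines have that count'."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Blank lines come back as empty lists and are skipped.
    return Counter(len(row) for row in reader if row)


# Made-up stand-in for the contents of a .data file (assumption, not real data).
sample = "a,1,0,x\nb,0,1,y\nc,1,1,x\n"
print(field_counts(sample))  # Counter({4: 3})
```

If every line reports the same count, that count is the number of columns an importer has to be told about. For a real file, the text would be read with `open(path).read()` first.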

Answers

  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    Hi Marius,

    Could you please post which data set you tried to load that resulted in the error you mentioned? We can then check whether the operator has a bug.

    Normally, you can also load the .data files with the CSVExampleSource or SimpleExampleSource operators, with the parameters set appropriately. This, however, will not load the attribute names.

    Regards,
    Tobias
  • DrSnuggels Member Posts: 4 Contributor I
    Hi Tobias!

    Thanks for your answer. The problem occurred with every dataset I tried to load, seven or eight at least. If you want to try it yourself, you can take the zoo dataset (it's the last one in the complete list), which gives me the same error message; only the numbers are different (...as the number of attributes (16), was: 18.).

    The background: I'm helping to prepare a college course, and we want the students to try out some learning algorithms, not only by hand but with real data. RapidMiner is quite new to us all, but it looks very promising. However, for that purpose it would be nice if we could say "import dataset xy which you can find at the UCI repo".

    Worst case, we could edit a dataset manually (without attribute names it would only be half the fun), but it strikes me as odd that every single set I tried fails. Maybe it's just a misconfiguration, or I'm missing something, although it doesn't work on two different machines. In case it matters, we're running Windows XP, and we're in Germany. Could it be a codepage problem? If we could find a way to import the data, the students could experiment for themselves, play with the data, and try to evaluate which kind of data fits which algorithm best.

    BTW, there are many nice datasets at the repo. If we find a solution, I'd place a hint in the RM-Wiki--I guess other users might benefit, too.

    Best Regards, Marius
  • DrSnuggels Member Posts: 4 Contributor I
    Just pushing this back to the top before it falls off the first page. I don't want to spam the board, but maybe Tobias or another poster has had an idea in the meantime. We converted two datasets by hand and will let our students work on them soon. But of course it would be far more interesting for them to pick out data sets they would like to use. And as the error occurred with all sets I tried -- maybe it's an error in the import filter itself?

    If I can provide any more information that might be helpful, just tell me. Alternatively, if anyone knows of data repositories on the web with freely available training data, that would help, too. I've done several searches myself but found none even nearly comparable to the UCI repo.

    Regards, Marius
  • TobiasMalbrecht Moderator, Employee, Member Posts: 294 RM Product Management
    Hi Marius,

    I just had a look at the data provided at UCI. The problem is that most data sets are not really in the C4.5 format. In most cases the actual data files are in a kind of comma-separated value format, whereas the names files are mostly simple textual descriptions or summaries of the data. Have a look into the data files and you will see what I mean. Hence, loading the data with the C4.5 data reader in RM will of course cause problems. The only way I see to use the UCI data is to download the data files, prepend the attribute names and provide the files yourself. I think this will be no real problem; downloading the data sets from UCI and preparing them should take less than an hour. Once prepared, the data should be readable, e.g. by using the CSVExampleSource or the SimpleExampleSource operator.

    Just a hint: did you check out the examples that are shipped with RM? There are at least some basic sets contained!

    Regards,
    Tobias
  • DrSnuggels Member Posts: 4 Contributor I
    Hi!

    Thanks very much! That indeed explains why it didn't work. I had searched the whole UCI page for some information on the datasets' file format, but I didn't find anything anywhere, and there was nothing to be found in the .names files either. So I had guessed from the file endings, which matched the respective section in the RM docs, that it was supposed to be C4.5.

    We'll fix some of the sets by hand then and have another look at the samples that come with RM, too. However, using some data from the UCI repo would be quite nice, as some sets are very large and partly noisy, too. That would be a nice contrast to the small and consistent sunny-rainy-play_tennis examples.

    Again, thank you very much for your assistance. BTW, if you know of any places on the web where we might find a real-world data set or two that works with one of RM's import filters, maybe you can post the link here. The more sources to build on, the more fun (and agony) the students will have.

    Best Regards, Marius
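
For anyone preparing the files as Tobias describes (take the .data file, put a row of attribute names in front, then load it as CSV), the manual step could be scripted roughly as below. This is a hedged sketch, not an official RapidMiner tool: `prepend_header` is a hypothetical helper, the sample rows and attribute names are invented for illustration, and the real names still have to be copied by hand from the textual .names description.

```python
import csv
import io


def prepend_header(data_text, names, delimiter=","):
    """Return data_text as CSV text with a header row of attribute names."""
    reader = csv.reader(io.StringIO(data_text), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip blank lines
    # Fail early if a row does not match the header length, which is the
    # same kind of mismatch the importer complained about.
    for number, row in enumerate(rows, start=1):
        if len(row) != len(names):
            raise ValueError(
                f"row {number}: expected {len(names)} fields, got {len(row)}"
            )
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    writer.writerow(names)
    writer.writerows(rows)
    return out.getvalue()


# Invented sample data and attribute names (assumptions, not a real data set).
sample = "sparrow,1,0,bird\ntrout,0,1,fish\n"
names = ["name", "feathers", "fins", "type"]
print(prepend_header(sample, names))
```

The returned text can be written to a new .csv file and then loaded with a CSV-style operator whose parameters are set to treat the first row as attribute names.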