At the 13th European Semantic Web Conference (ESWC2016) in June 2016 the award for Best Demonstration went to a team from the Data Science Group at the University of Mannheim. The title of their presentation was "Extending RapidMiner with Data Search and Integration Capabilities".
We all know that large numbers of data sets have become widely available for download, but it is often difficult to find the right ones and then to integrate them. This can be a major headache when creating a new data science workflow.
The team at Mannheim are developing an extension that introduces the concept of 'SearchJoins', combining search functions with logic that helps the user find relevant datasets for a given task and integrate newly discovered data with data that they already know. Within a GUI the user can then make any needed corrections and refinements.
Here's the demo:
A number of other materials have been made available by the Mannheim team. A PDF describing the demo is attached here. In addition, you can visit:
http://web.informatik.uni-mannheim.de/ds4dm/ for an overview of the project
and download the source files from GitHub here: https://github.com/AnLiGentile/DS4DM
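To make the 'SearchJoin' idea a little more concrete, here is a toy sketch in plain Python. It is not the Mannheim team's implementation and does not use the RapidMiner API. The function names (`key_overlap`, `search_tables`, `search_join`), the in-memory `catalogue` and the overlap score are all assumptions made for illustration. It only shows the general pattern: search a collection of candidate tables for one that holds a wanted column and shares key values with your own table, then join that column in.

```python
from typing import Dict, List, Optional, Tuple

Table = List[Dict[str, Optional[str]]]


def key_overlap(local: Table, candidate: Table, key: str) -> float:
    """Fraction of the local table's key values that also occur in the candidate."""
    local_keys = {row[key] for row in local if key in row}
    candidate_keys = {row[key] for row in candidate if key in row}
    if not local_keys:
        return 0.0
    return len(local_keys & candidate_keys) / len(local_keys)


def search_tables(local: Table, catalogue: Dict[str, Table],
                  key: str, wanted: str) -> List[Tuple[float, str]]:
    """Search step: rank catalogue tables that contain the wanted column."""
    hits = []
    for name, table in catalogue.items():
        if any(wanted in row for row in table):
            score = key_overlap(local, table, key)
            if score > 0:
                hits.append((score, name))
    return sorted(hits, reverse=True)  # best overlap first


def search_join(local: Table, catalogue: Dict[str, Table],
                key: str, wanted: str) -> Table:
    """Join step: add the wanted column from the best-ranked table (left join)."""
    ranked = search_tables(local, catalogue, key, wanted)
    if not ranked:
        return [{**row, wanted: None} for row in local]
    best_name = ranked[0][1]
    lookup = {row[key]: row.get(wanted)
              for row in catalogue[best_name] if key in row}
    # Rows without a match keep None so the user can correct them by hand.
    return [{**row, wanted: lookup.get(row.get(key))} for row in local]


# Hypothetical example data: a local table plus three candidate tables.
local = [
    {"country": "AT", "name": "Austria"},
    {"country": "DE", "name": "Germany"},
    {"country": "FR", "name": "France"},
]
catalogue = {
    "capitals_a": [{"country": "AT", "capital": "Vienna"},
                   {"country": "DE", "capital": "Berlin"}],
    "capitals_b": [{"country": "AT", "capital": "Vienna"}],
    "currencies": [{"country": "AT", "currency": "EUR"}],
}

for row in search_join(local, catalogue, key="country", wanted="capital"):
    print(row)
```

In this sketch the row for France ends up with no capital because neither candidate table covers it. That gap is the kind of thing a user would then correct or refine by hand.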
Here I'd like to list a few Open Data sites from European governments that I often use. Feel free to add more!
Austria: https://www.data.gv.at/ - Aggregator site for Open Data from municipalities and the state
European Data Portal: http://www.europeandataportal.eu/ - Aggregates national portals
Maintained on the rs.io blog is this somewhat eclectic set. "Interesting" though is an understatement. Some subjects covered in the links include:
- Drone strikes
- The cost of hiring your favourite band
- Human genome data
- Yelp restaurant rankings
Web traffic dataset "To foster the study of the structure and dynamics of Web traffic networks, we make"
Some young whippersnapper in the office recently asked me who Enron were - oh how time flies.
HadCRUT4 is a global temperature dataset, providing gridded temperature anomalies across the world as well as averages for the hemispheres and the globe as a whole.
Hundreds of databases sortable by topic or country.
Excel or XML formats, or API
Of course this is also available, along with a host of other data sets, through the RapidMiner Economics Extension.
It is what it is!
But also there are some extras
All available in Excel format
Collation from various sources, including some listed elsewhere here. Focussed on demographics. Around 500 entries include disease, employment, birth and mortality rates, trade, debt, wealth, sanitation, gender measures etc.
A consortium of more than 700 academic institutions and research organizations. ICPSR maintains a data archive of more than 500,000 files of research in the social sciences. The vast majority of ICPSR data holdings are public-use files with no access restrictions.
Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more. There are over 180,000 data sets neatly arranged by topic.
Production, food security, nutrition, prices
US Trade in Goods and Services, Trade Flows, Expenditures by Consumers and Business
Coastal flooding, food resilience, human health, Arctic region
US Trade, Family Expenditures, Food retailing.
Weather, sea-levels, soil, vegetation etc.
300 data sets covering topics as diverse as loans, employment, enrollment, crime and safety
280+ data sets covering consumer and business usage, oil and gas, nuclear
Banking, lending, retirement, investments, and insurance.
FDA, Five Decades of Data on Smoking, the Health Datapalooza and lots more.
State, local and tribal
Federal R&D, Production, Capital Expenditure
Vessel tracking and marine planning
Information on product recalls, policing, transportation (but not the Titanic data – you can find that here)
A mixed bag of data on R&D, employment and technology licensing
New York City Open Data
Court cases, jobs, demographics, real estate etc etc. Something like 1300 data sets available