Combined Data Search and Integration Project

by Community Manager ‎10-05-2016 06:15 AM - edited ‎10-05-2016 06:33 AM

At the 13th European Semantic Web Conference (ESWC2016) in June 2016 the award for Best Demonstration went to a team from the Data Science Group at the University of Mannheim. The title of their presentation was "Extending RapidMiner with Data Search and Integration Capabilities".


We all know that large numbers of data sets have become widely available for download, but it is often difficult to find the right data sets needed and then to integrate them. This can be a major headache in the creation of a new data science project workflow. 


The team at Mannheim are developing an extension that introduces the concept of 'SearchJoins', which combine search functions with logic that helps the user find relevant datasets for a given task and integrate newly discovered data with data that they already know. Within a GUI the user can then make any needed corrections and refinements.
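
To make the idea concrete, here is a minimal sketch of a search-then-join step in plain Python. It is not the Mannheim extension's code or algorithm: the function names (`key_overlap`, `search_join`), the overlap threshold and the toy tables are all assumptions made up for illustration.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def key_overlap(local: List[Row], candidate: List[Row], key: str) -> float:
    """Fraction of the local key values that also appear in the candidate table."""
    local_keys = {row[key] for row in local if key in row}
    candidate_keys = {row[key] for row in candidate if key in row}
    if not local_keys:
        return 0.0
    return len(local_keys & candidate_keys) / len(local_keys)


def search_join(local: List[Row], candidates: Dict[str, List[Row]],
                key: str, new_column: str, min_overlap: float = 0.5) -> List[Row]:
    """Hypothetical SearchJoin: find the best-matching table, then left-join one column."""
    best_name, best_score = None, 0.0
    for name, table in candidates.items():
        # "Search": only tables that actually offer the wanted column qualify.
        if not any(new_column in row for row in table):
            continue
        score = key_overlap(local, table, key)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is None or best_score < min_overlap:
        return [dict(row) for row in local]  # nothing suitable: unchanged copy
    # "Join": look up the new value by key; unmatched rows get None.
    lookup = {row[key]: row.get(new_column)
              for row in candidates[best_name] if key in row}
    return [{**row, new_column: lookup.get(row.get(key))} for row in local]


# Toy data, invented for this example only.
cities = [{"city": "Alpha"}, {"city": "Beta"}, {"city": "Gamma"}]
found_tables = {
    "population_table": [{"city": "Alpha", "population": "10"},
                         {"city": "Beta", "population": "20"}],
    "unrelated_table": [{"city": "Omega", "population": "99"}],
}
enriched = search_join(cities, found_tables, "city", "population")
```

In this sketch the unmatched row ("Gamma") ends up with `None` in the new column, which is the kind of gap a user would then correct or refine by hand.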


Here's the demo:


A number of other materials have been made available by the Mannheim team. A PDF describing the demo is attached here. In addition you can visit:  for an overview of the project


and download the source files from GitHub here





Government Open Data

by BalazsBarany on ‎10-05-2016 06:06 AM

Here I'd like to list a few Open Data sites from European governments that I often use. Feel free to add more!


Austria: - Aggregator site for Open Data of municipalities and from the state


European Data Portal: - Aggregates national portals

100+ Interesting Data Sets

by Community Manager on ‎08-08-2016 05:25 AM

Maintained on the blog is this somewhat eclectic set. "Interesting" though is an understatement.  Some subjects covered in the links include:


  • Drone strikes
  • The cost of hiring your favourite band
  • Human genome data
  • Yelp restaurant rankings



UCI Machine Learning Data Archive

by Community Manager on ‎07-22-2016 04:58 AM

My thanks to @Harry4RM for pointing this out:  350 data sets as a service to the machine learning community. Huge variety of interesting data sets here - everything from heart disease to poker hands.






53.5 billion clicks

by Community Manager on ‎07-12-2016 06:43 AM

US Energy Information Administration

by Community Manager on ‎07-12-2016 06:39 AM


Data on numerous energy sources including fossil, renewables and nuclear.

by Community Manager on ‎07-12-2016 06:28 AM


22 million forum messages.

Titanic Survival Data Set:

by Community Manager ‎05-26-2016 03:16 PM - edited ‎07-12-2016 06:21 AM

You've watched the film

You've seen Ingo's YouTube (no? It's here: 


Now you can play with the data yourself.

Titanic Survival Data Set:




Enron Email Dataset

by Community Manager on ‎07-12-2016 06:20 AM


All you text miners - this is the classic dataset. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.


Some young whippersnapper in the office recently asked me who Enron were - oh how time flies.

Global Temperatures

by Community Manager on ‎07-12-2016 06:12 AM


HadCRUT4 is a global temperature dataset, providing gridded temperature anomalies across the world as well as averages for the hemispheres and the globe as a whole.
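
If "anomaly" is unfamiliar: it is a temperature expressed as a difference from a reference-period average rather than as an absolute value. The sketch below shows only that arithmetic. The function name, the baseline years and the temperatures are invented for illustration, and nothing here reflects how the HadCRUT4 files are actually laid out or processed.

```python
from statistics import mean
from typing import Dict, Iterable


def temperature_anomalies(temps_by_year: Dict[int, float],
                          baseline_years: Iterable[int]) -> Dict[int, float]:
    """Return each year's temperature minus the mean over the baseline years."""
    baseline = mean(temps_by_year[year] for year in baseline_years)
    return {year: round(temp - baseline, 3) for year, temp in temps_by_year.items()}


# Made-up values in degrees C, purely to show the calculation.
toy_temps = {2000: 14.0, 2001: 14.2, 2002: 14.6}
toy_anomalies = temperature_anomalies(toy_temps, baseline_years=[2000, 2001])
```

Here the baseline is the average of the first two toy years, so the third year comes out as a positive anomaly relative to it.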

Machine Learning Data Set Repository

by Community Manager on ‎07-12-2016 06:10 AM


Find data sets and upload your own.

Airlines and Airports

by Community Manager on ‎07-12-2016 06:07 AM


Gives data on flight delays over many years



United Nations Data

by Community Manager on ‎07-12-2016 06:04 AM


  • Crime
  • Environment
  • Meteorology
  • Finance
  • Agriculture
  • Health
  • Industry
  • Technology
  • Labour
  • Population
  • Tourism
  • Trade





UK Government Statistics

by Community Manager on ‎07-12-2016 06:00 AM


Nearly 40,000 data sets covering:


  • Environment
  • Towns and Cities
  • Mapping
  • Government
  • Society
  • Health

Formats vary, including HTML, CSV, WMS, Excel and an API.


World Bank Data

by Community Manager on ‎07-12-2016 05:55 AM


Hundreds of databases sortable by topic or country.

Excel or XML formats, or API


Of course this is also available, with a host of other data sets, through the RapidMiner Economics Extension.


Million Song Database

by Community Manager on ‎07-12-2016 05:49 AM

It is what it is!



But also there are some extras



Stanford Large Network Dataset Collection

by Community Manager on ‎07-12-2016 05:45 AM


Social networks, communities, roads, communications

Gapminder Demographics

by Community Manager on ‎07-12-2016 05:22 AM


All available in Excel format


Collation from various sources, including some listed elsewhere here. Focussed on demographics. Around 500 entries cover disease, employment, birth and mortality rates, trade, debt, wealth, sanitation, gender measures etc.



CDC - Centers for Disease Control and Prevention

by Community Manager on ‎07-12-2016 05:16 AM



Mainly PDFs but some Excel files.


Subjects are mainly US or global.



1001 data sets

by Community Manager on ‎05-26-2016 03:13 PM

Everyone is pulling together lists



2010 US Census

by Community Manager on ‎05-26-2016 03:11 PM

US Census Site


Numerous data sets and tools


by Community Manager on ‎05-26-2016 03:09 PM

ICPSR - Inter-university Consortium for Political and Social Research

A consortium of more than 700 academic institutions and research organizations.  ICPSR maintains a data archive of more than 500,000 files of research in the social sciences. The vast majority of ICPSR data holdings are public-use files with no access restrictions.

Yahoo Data Sets

by Community Manager on ‎05-26-2016 03:05 PM

Yahoo Data Sets

Numerous data sets from the Yahoo! Developer Network.

US Government Open Data

by Community Manager on ‎05-26-2016 03:03 PM

US Government Open Data



Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more. There are over 180,000 data sets neatly arranged by topic.



Production, food security, nutrition, prices


US Trade in Goods and Services, Trade Flows, Expenditures by Consumers and Business

Climate Change

Coastal flooding, food resilience, human health, Arctic region


US Trade, Family Expenditures, Food retailing.


Weather, sea-levels, soil, vegetation etc.


300 data sets covering topics as diverse as loans, employment, enrollment, crime and safety


280+ data sets covering consumer and business usage, oil and gas, nuclear


Banking, lending, retirement, investments, and insurance.


FDA, Five Decades of Data on Smoking, the Health Datapalooza and lots more.

Local Government

State, local and tribal


Federal R&D, Production, Capital Expenditure


Vessel tracking and marine planning

Public Safety

Information on product recalls, policing, transportation (but not the Titanic data – you can find that here)

Science and Research

A mixed bag of data on R&D, employment and technology licensing

New York City Open Data

by Community Manager on ‎05-26-2016 03:02 PM

New York City Open Data

Court cases, jobs, demographics, real estate etc. Something like 1300 data sets are available.

Citi Bike - NYC Bike Share Data

by Community Manager on ‎05-26-2016 03:01 PM

Open Data - rounding out your analytics datasets

by Community Manager on ‎05-26-2016 02:59 PM

Presentation given by Bob Lytle of at Wisdom 2016