Combined Data Search and Integration Project

by Community Manager ‎10-05-2016 06:15 AM - edited ‎10-05-2016 06:33 AM

At the 13th European Semantic Web Conference (ESWC2016) in June 2016 the award for Best Demonstration went to a team from the Data Science Group at the University of Mannheim. The title of their presentation was "Extending RapidMiner with Data Search and Integration Capabilities".

 

We all know that large numbers of data sets have become widely available for download, but it is often difficult to find the right data sets needed and then to integrate them. This can be a major headache in the creation of a new data science project workflow. 

 

The team at Mannheim are developing an extension that introduces the concept of 'SearchJoins', combining search functions with logic that helps the user find relevant datasets for a given task and integrate newly discovered data with data they already have. Within a GUI the user can then make any needed corrections and refinements. 
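
To give a rough feel for the idea, here is a minimal sketch in plain Python. It is not the Mannheim extension's actual algorithm: the function names (`key_overlap`, `search_join`), the ranking by key overlap and the sample tables are all assumptions made up for illustration. It simply picks the candidate table that shares the most key values with the local table and left-joins the new columns onto it.

```python
from typing import Dict, List

Row = Dict[str, str]


def key_overlap(local: List[Row], candidate: List[Row], key: str) -> float:
    """Fraction of local key values that also appear in the candidate table."""
    local_keys = {row[key] for row in local if key in row}
    candidate_keys = {row[key] for row in candidate if key in row}
    if not local_keys:
        return 0.0
    return len(local_keys & candidate_keys) / len(local_keys)


def search_join(local: List[Row], candidates: Dict[str, List[Row]], key: str) -> List[Row]:
    """Pick the candidate with the highest key overlap and left-join it (hypothetical helper)."""
    if not candidates:
        return [dict(row) for row in local]
    # "Search" step: rank candidate tables by how well their keys match ours.
    best_name = max(candidates, key=lambda name: key_overlap(local, candidates[name], key))
    lookup = {row[key]: row for row in candidates[best_name] if key in row}
    joined = []
    for row in local:
        extra = lookup.get(row.get(key), {})
        # "Join" step: add new columns; values already in the local row win.
        joined.append({**extra, **row})
    return joined


# Illustrative tables only, not taken from the project.
local_table = [
    {"country": "Austria", "capital": "Vienna"},
    {"country": "Germany", "capital": "Berlin"},
]
candidate_tables = {
    "currencies": [
        {"country": "Austria", "currency": "EUR"},
        {"country": "Germany", "currency": "EUR"},
    ],
    "unrelated": [{"country": "Japan", "currency": "JPY"}],
}
result = search_join(local_table, candidate_tables, "country")
```

In a real tool the ranking would presumably be far smarter than exact key matching, and the user would review the match before accepting it, which is where the GUI corrections come in.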

 

Here's the demo:

 

A number of other materials have been made available by the Mannheim team. A PDF describing the demo is attached here. In addition you can visit:

 

http://web.informatik.uni-mannheim.de/ds4dm/  for an overview of the project

 

and download the source files from GitHub here: https://github.com/AnLiGentile/DS4DM

Government Open Data

by BalazsBarany on ‎10-05-2016 06:06 AM

Here I'd like to list a few Open Data sites from European governments that I often use. Feel free to add more!

 

Austria: https://www.data.gv.at/ - Aggregator site for Open Data of municipalities and from the state

Germany: https://www.govdata.de/ 

European Data Portal: http://www.europeandataportal.eu/ - Aggregates national portals

100+ Interesting Data Sets

by Community Manager on ‎08-08-2016 05:25 AM

Maintained on the rs.io blog is this somewhat eclectic set. "Interesting" though is an understatement.  Some subjects covered in the links include:

 

  • Drone strikes
  • The cost of hiring your favourite band
  • Human genome data
  • Yelp restaurant rankings

 

http://rs.io/100-interesting-data-sets-for-statistics/

 

Enjoy!

UCI Machine Learning Data Archive

by Community Manager on ‎07-22-2016 04:58 AM

My thanks to @Harry4RM for pointing this out: 350 data sets offered as a service to the machine learning community. There's a huge variety of interesting data sets here, everything from heart disease to poker hands.

 

https://archive.ics.uci.edu/ml/

53.5 billion clicks

by Community Manager on ‎07-12-2016 06:43 AM

US Energy Information Administration

by Community Manager on ‎07-12-2016 06:39 AM

http://www.eia.gov/

 

Data on numerous energy sources including fossil, renewables and nuclear.

Ancestry.com

by Community Manager on ‎07-12-2016 06:28 AM

http://www.cs.cmu.edu/~jelsas/data/ancestry.com/

 

22 million Ancestry.com forum messages.

Titanic Survival Data Set: http://bit.ly/1kJ4pkF

by Community Manager ‎05-26-2016 03:16 PM - edited ‎07-12-2016 06:21 AM

You've watched the film

You've seen Ingo's YouTube (no? It's here: 

 

Now you can play with data yourself. 

Titanic Survival Data Set: http://bit.ly/1kJ4pkF

 

 

 

Enron Email Dataset

by Community Manager on ‎07-12-2016 06:20 AM

http://www.cs.cmu.edu/~enron/

 

All you text miners - this is the classic dataset. This data was originally made public, and posted to the web, by the Federal Energy Regulatory Commission during its investigation.

 

Some young whippersnapper in the office asked me who Enron were recently - oh how time flies.

Global Temperatures

by Community Manager on ‎07-12-2016 06:12 AM

https://crudata.uea.ac.uk/cru/data/temperature/#datter

 

HadCRUT4 is a global temperature dataset, providing gridded temperature anomalies across the world as well as averages for the hemispheres and the globe as a whole.
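
If "anomaly" is a new term: it is the difference between a temperature and the average over a reference period (for HadCRUT4 that period is, as far as I know, 1961-1990). Here is a small sketch of the arithmetic. The `anomalies` function, the shortened three-year baseline and the temperature values are made up purely for illustration and are not taken from the dataset.

```python
from statistics import mean
from typing import Dict


def anomalies(temps: Dict[int, float], base_start: int, base_end: int) -> Dict[int, float]:
    """Return each year's temperature minus the mean over the baseline years."""
    baseline_values = [t for year, t in temps.items() if base_start <= year <= base_end]
    if not baseline_values:
        raise ValueError("no data inside the baseline period")
    baseline = mean(baseline_values)
    return {year: round(t - baseline, 2) for year, t in temps.items()}


# Made-up yearly mean temperatures in degrees Celsius (illustration only).
yearly_mean_c = {1961: 14.0, 1962: 14.2, 1963: 13.8, 2015: 14.9}

# Use 1961-1963 as a toy baseline; a real analysis would use the full reference period.
result = anomalies(yearly_mean_c, 1961, 1963)
```

A positive value means that year was warmer than the baseline average, a negative value means it was cooler.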

Machine Learning Data Set Repository

by Community Manager on ‎07-12-2016 06:10 AM

http://mldata.org/

 

Find data sets and upload your own.

Airlines and Airports

by Community Manager on ‎07-12-2016 06:07 AM

http://www.transtats.bts.gov/OT_Delay/OT_DelayCause1.asp

 

Gives data on flight delays over many years.

 

 

United Nations Data

by Community Manager on ‎07-12-2016 06:04 AM

http://data.un.org/

 

  • Crime
  • Environment
  • Meteorology
  • Finance
  • Agriculture
  • Health
  • Industry
  • Technology
  • Labour
  • Population
  • Tourism
  • Trade

UK Government Statistics

by Community Manager on ‎07-12-2016 06:00 AM

https://data.gov.uk/data/search

 

Nearly 40,000 data sets covering:

 

  • Environment
  • Towns and Cities
  • Mapping
  • Government
  • Society
  • Health

Formats vary, including HTML, CSV, WMS, Excel and an API.

 

World Bank Data

by Community Manager on ‎07-12-2016 05:55 AM

http://data.worldbank.org/

 

Hundreds of databases sortable by topic or country.

Excel or XML formats, or API

 

Of course this is also available, with a host of other data sets, through the RapidMiner Economics Extension. 

https://marketplace.rapidminer.com/UpdateServer/faces/product_details.xhtml?productId=rmx_quantx1

 

Million Song Database

by Community Manager on ‎07-12-2016 05:49 AM

It is what it is!

 

http://labrosa.ee.columbia.edu/millionsong/

 

 

But also there are some extras

 

 

Stanford Large Network Dataset Collection

by Community Manager on ‎07-12-2016 05:45 AM

http://snap.stanford.edu/data/index.html

 

Social networks, communities, roads, communications

Gapminder Demographics

by Community Manager on ‎07-12-2016 05:22 AM

https://www.gapminder.org/data/

 

All available in Excel format

 

A collation from various sources, including some listed elsewhere here, focussed on demographics. Around 500 entries cover disease, employment, birth and mortality rates, trade, debt, wealth, sanitation, gender measures and more.

 

 

CDC - Centers for Disease Control and Prevention

by Community Manager on ‎07-12-2016 05:16 AM

http://www.cdc.gov/DataStatistics/

 

 

Mainly PDFs but some Excel files. 

 

Subjects are mainly US or global.

 

 

1001 data sets

by Community Manager on ‎05-26-2016 03:13 PM

Everyone is pulling together lists

 

https://dreamtolearn.com/ryan/1001_datasets

 

 

2010 US Census

by Community Manager on ‎05-26-2016 03:11 PM

US Census Site

 

Numerous data sets and tools

ICPSR Data

by Community Manager on ‎05-26-2016 03:09 PM

ICPSR - Inter-university Consortium for Political and Social Research

A consortium of more than 700 academic institutions and research organizations.  ICPSR maintains a data archive of more than 500,000 files of research in the social sciences. The vast majority of ICPSR data holdings are public-use files with no access restrictions.

Yahoo Data Sets

by Community Manager on ‎05-26-2016 03:05 PM

Yahoo Data Sets

Numerous data sets from the Yahoo! Developer Network.

US Government Open Data

by Community Manager on ‎05-26-2016 03:03 PM

US Government Open Data

 

 

Here you will find data, tools, and resources to conduct research, develop web and mobile applications, design data visualizations, and more. There are over 180,000 data sets neatly arranged by topic.

 

Agriculture

Production, food security, nutrition, prices

Business

US Trade in Goods and Services, Trade Flows, Expenditures by Consumers and Business

Climate Change

Coastal flooding, food resilience, human health, Arctic region

Consumer

US Trade, Family Expenditures, Food retailing.

Ecosystems

Weather, sea-levels, soil, vegetation etc.

Education

300 data sets covering topics as diverse as loans, employment, enrollment, crime and safety

Energy

280+ data sets covering consumer and business usage, oil and gas, nuclear

Finance

Banking, lending, retirement, investments, and insurance.

Health

FDA, Five Decades of Data on Smoking, the Health Datapalooza and lots more.

Local Government

State, local and tribal

Manufacturing

Federal R&D, Production, Capital Expenditure

Ocean

Vessel tracking and marine planning

Public Safety

Information on product recalls, policing, transportation (but not the Titanic data – you can find that here)

Science and Research

A mixed bag of data on R&D, employment and technology licensing

New York City Open Data

by Community Manager on ‎05-26-2016 03:02 PM

New York City Open Data

Court cases, jobs, demographics, real estate etc. Something like 1300 data sets are available.

 

https://data.cityofnewyork.us/data?cat=city%20government

Citi Bike - NYC Bike Share Data

by Community Manager on ‎05-26-2016 03:01 PM

Open Data - rounding out your analytics datasets

by Community Manager on ‎05-26-2016 02:59 PM

Presentation given by Bob Lytle of rel8ted.to at Wisdom 2016