Accidents
The traffic accident database covers all accidents that happened in Ljubljana, Slovenia’s capital city, between 1995 and 2005.
Airline
Airline on-time data are reported each month to the U.S. Department of Transportation (DOT), Bureau of Transportation Statistics (BTS) by the 16 U.S. air carriers that have at least 1 percent of total domestic scheduled-service passenger revenues, plus two other carrie…
Atherosclerosis
STULONG is a 20-year longitudinal primary prevention study of middle-aged men. The study aims to identify the prevalence of atherosclerosis risk factors (RFs) in a population generally considered to be the most endangered by possible atherosclerosis com…
Biodegradability
This is an older data set of chemical structures containing 328 compounds labeled by their half-life for aerobic aqueous biodegradation (a regression task).
Bupa
Evaluation of patients for liver disorders.
Carcinogenesis
The task is to predict whether a given molecule is carcinogenic. The dataset contains 182 positive carcinogenicity tests and 148 negative tests.
CCS
Transactional data from a Czech debit card company specialising in payments at petrol pumps.
CDESchools
A database containing geospatial information, as well as SAT average scores and Free-or-Reduced-Price Meal eligibility data, for California schools.
Chess
The goal is to predict the outcome of a match.
CiteSeer
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding …
ClassicModels
The schema is for Classic Models, a retailer of scale models of classic cars. The database contains typical business data such as customers, orders, order line items, products and so on.
ConsumerExpenditures
The Consumer Expenditure Survey (CE) collects data on expenditures, income, and demographics in the United States. The public-use microdata (PUMD) files provide this information for individual respondents without any information that could identify respondents. PUMD fi…
CORA
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding wo…
Countries
The task is to predict "Forest area (% of land area)" for 247 countries in 2012 based on the previous values.
CraftBeer
Craft beers labeled by styles and composition. A separate dataset lists breweries by state.
Credit
A somewhat more complex artificial database with loops.
CS
Artificial data from a Czech bank.
Dallas
Officer-involved shootings as disclosed by the Dallas Police Department. Includes separate tables for officer and subject/suspect information.
DCG
The set of positive examples consists of all sentences of up to seven words that can be generated by the DCG in Bratko's book (565 positive examples). The set of negative examples was generated by randomly selecting one word in each positive example and replacing it by …
Dunur
Dunur is a relationship between two people by marriage: A is dunur of B if a child of A is married to a child of B.
Elti
Elti is a relationship between two people by marriage: A is elti of B if A's husband is a brother of B's husband.
Employee
The employees test database: small, fake database of employees.
ErgastF1
Ergast.com is a webservice that provides a database of Formula 1 races, starting from the 1950 season until today. The dataset includes information such as the time taken in each lap, the time taken for pit stops, the performance in the qualifying rounds etc. of all Fo…
Facebook
This dataset consists of 'circles' (or 'friends lists') from Facebook.
Financial
The PKDD'99 Financial dataset contains 606 successful and 76 unsuccessful loans along with their information and transactions. The standard task is to predict the loan outcome for finished loans (A vs B in loan.status) at the time of the loan start (defined by loan.dat…
FNHK
Anonymised data from a hospital in Hradec Kralove, Czech Republic, about treatment and medication.
FTP
PAKDD'15 Data Mining Competition: The task is to reconstruct a user’s gender from product viewing logs. The data were obtained from simulations of product viewing activities of users with known gender. The data closely follow the real-life distribut…
Geneea
Data on deputies and senators in the Czech Republic.
Genes
KDD Cup 2001 prediction of gene/protein function and localization.
Hepatitis
The PKDD'02 Hepatitis dataset describes 206 instances of Hepatitis B (contrasting them against 484 cases of Hepatitis C).
IMDb
The IMDb database: moderately large, real database of movies.
MovieLens
MovieLens data set from the UC Irvine machine learning repository.
KRK
The task is to identify whether a position of two kings and a rook on a chessboard is legal or illegal.
Lahman
Lahman’s baseball database contains complete batting and pitching statistics from 1871 to 2014, plus fielding statistics, standings, team stats, managerial records, post-season data, and more.
LegalActs
Bulgarian court decision metadata.
Thrombosis
The PKDD'99 Medical dataset describes 41 patients with thrombosis.
Mesh
This domain is about finite element methods in engineering. The task is to predict how many elements should be used to model each edge of a structure. The target predicate is mesh(Edge,Number) where the Number of elements in the Mesh model can vary between 1 and 17.
MooneyFamily
The dataset describes a family composed of 86 people across 5 generations. The family dataset includes 744 positive instances and 1488 randomly generated negative instances.
Musk
The Musk database describes molecules occurring in different conformations. Each molecule is either musk or non-musk and one of the conformations determines this property. Such a problem is known as a multiple-instance problem, and is modeled by two tables molecule and…
Mutagenesis
The dataset comprises 230 molecules tested for mutagenicity on Salmonella typhimurium. A subset of 188 molecules is learnable using linear regression. This subset was later termed the "regression friendly" dataset. The remaining subset of 42 molecules is named the …
Nations
A sample database from the Alchemy website.
NBA
A database with information about basketball matches from the National Basketball Association. Lists Players, Teams, and matches with action counts for each player.
NCAA
2015 NCAA Basketball Tournament.
Northwind
The Northwind database contains the sales data for a fictitious company called Northwind Traders, which imports and exports specialty foods from around the world.
Pima
The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix.
PremiereLeague
A database with information about football matches from the English Premier League. Lists Players, Teams, and matches with action counts for each player.
PTE
A database from the Predictive Toxicology Evaluation Challenge (1997). The task is to predict whether a compound is carcinogenic.
PubMed_Diabetes
The PubMed Diabetes dataset consists of 19717 scientific publications from the PubMed database pertaining to diabetes, classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word …
Pubs
The pubs sample database is modeled after a book publishing company.
Pyrimidine
A pyrimidine QSAR dataset. The goal is to predict the inhibition of dihydrofolate reductase by pyrimidines.
Restbase
A database of restaurants in San Francisco. The goal is to predict customer satisfaction.
Sakila
The venerable sakila test database: small, fake database of movies.
SalesDB
A simple artificial database in star schema.
SameGen
The task is to predict whether two given people are from the same generation.
SAP
You are a member of the Sales Management team in a large retail bank. The current date is July 02, 2007. Your Sales Director has just asked you to generate additional revenues of $1,500,000 before September 01, 2007. You must find ways to sell more "Credit++" – the ne…
SAT
The task is to diagnose power-supply failures in a communications satellite.
Seznam
Seznam.cz is a web portal and search engine in the Czech Republic. The data represent online advertisement expenditures from Seznam's "wallet". Table description: client: location and domain field of the client (anonymized) dobito: prepaid into a wallet in Czech cur…
SFScores
The San Francisco Dept. of Public Health’s database of eateries, inspections of those eateries, and violations found during the inspections. The task is to predict the unscheduled inspection scores from 2013 to 2016. The scores range from 1 to 100, where 100 means that…
Shakespeare
The Open Source Shakespeare is a collection of Shakespeare's complete works. This is a much more interesting data set than some boring imaginary online retailer. In this dataset, people die! The task is to predict the character who speaks the lines.
Stats
An anonymized dump of all user-contributed content on the Stats Stack Exchange network.
StudentLoan
Student Loan contains data about students' enrollment and employment status; the aim is to find rules that define a student's obligation to pay back their loan.
PTC
The Predictive Toxicology Challenge (2000) dataset consists of more than three hundred organic molecules marked according to their carcinogenicity on male and female mice and rats.
TPCDS
TPC-DS is the new decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. Although the underlying business model of TPC-DS is a retail product supplier, the database schema, data …
Trains
The East-West challenge (1980) database describes eastbound and westbound trains.
Triazine
A triazine QSAR dataset. The goal is to predict the inhibition of dihydrofolate reductase by triazines.
University
An artificial database from Simon Fraser University describing students, professors and courses.
UTube
The task is to learn rules that identify the legal states of the U-tube dynamical system.
UW-CSE
This dataset lists facts about the Department of Computer Science and Engineering at the University of Washington (UW-CSE), such as entities (e.g., Student, Professor) and their relationships (e.g., AdvisedBy, Publication).
Walmart
Walmart challenges participants to accurately predict the sales of 111 potentially weather-sensitive products (like umbrellas, bread, and milk) around the time of major weather events at 45 of their retail locations.
WebKB
The WebKB dataset consists of 877 scientific publications classified into one of five classes. The citation network consists of 1608 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding wor…
World
A database of 239 states and their cities.