AdventureWorks
Adventure Works 2014 (OLTP version) is a sample database for Microsoft SQL Server that replaced the Northwind and Pubs sample databases shipped with earlier versions. The database models a fictitious multinational bicycle manufacturer called Adventure Works Cycles.
Airline
Airline on-time data are reported each month to the U.S. Department of Transportation (DOT), Bureau of Transportation Statistics (BTS) by the 16 U.S. air carriers that have at least 1 percent of total domestic scheduled-service passenger revenues, plus two other carrie…
BasketballMen
The task is to predict the rank of teams.
BasketballWomen
The task is to predict whether a team makes the playoffs.
Biodegradability
This is an older data set of chemical structures containing 328 compounds labeled by their half-life for aerobic aqueous biodegradation (a regression task).
Carcinogenesis
The task is to predict whether a given molecule is carcinogenic. The dataset contains 182 positive carcinogenicity tests and 148 negative tests.
CiteSeer
The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding …
CORA
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding wo…
Credit
A somewhat more complex artificial database with loops.
CS
Artificial data from a Czech bank.
Dunur
Dunur is a relationship between two people through marriage: A is dunur of B if a child of A is married to a child of B.
Elti
Elti is a relationship between two people through marriage: A is elti of B if A's husband is a brother of B's husband.
Employee
The employees test database: a small, fake database of employees.
ErgastF1
Ergast.com is a web service that provides a database of Formula 1 races from the 1950 season to the present. The dataset includes information such as the time taken in each lap, the time taken for pit stops, the performance in the qualifying rounds, etc. of all Fo…
Facebook
This dataset consists of 'circles' (or 'friends lists') from Facebook.
Financial
The PKDD'99 Financial dataset contains 606 successful and 76 unsuccessful loans along with their information and transactions. The standard task is to predict the loan outcome for finished loans (A vs B in loan.status) at the time of the loan start (defined by loan.dat…
Geneea
Data on deputies and senators in the Czech Republic.
Genes
KDD Cup 2001 prediction of gene/protein function and localization.
Hockey
The Hockey Database follows the same general design as the Lahman Baseball Database. In addition to the NHL, the Hockey DB covers the following early and alternative leagues: NHA, PCHA, WCHL and WHA. It contains individual and team statistics from 1909-10 through the 2…
Lahman
Lahman’s baseball database contains complete batting and pitching statistics from 1871 to 2014, plus fielding statistics, standings, team stats, managerial records, post-season data, and more.
LegalActs
Bulgarian court decision metadata.
Mesh
This domain is about finite element methods in engineering. The task is to predict how many elements should be used to model each edge of a structure. The target predicate is mesh(Edge,Number) where the Number of elements in the Mesh model can vary between 1 and 17.
Mondial
A geography dataset from the University of Göttingen that describes 114 Christian countries and 71 non-Christian countries.
MooneyFamily
The dataset describes a family composed of 86 people across 5 generations. The family dataset includes 744 positive instances and 1488 randomly generated negative instances.
Mutagenesis
The dataset comprises 230 molecules tested for mutagenicity on Salmonella typhimurium. A subset of 188 molecules is learnable using linear regression. This subset was later termed the "regression friendly" dataset. The remaining subset of 42 molecules is named the …
Nations
A sample database from the Alchemy website.
NBA
A database with information about basketball matches from the National Basketball Association. It lists players, teams, and matches, with action counts for each player.
NCAA
2015 NCAA Basketball Tournament.
PremiereLeague
A database with information about football matches from the English Premier League. It lists players, teams, and matches, with action counts for each player.
PTE
A database from the Predictive Toxicology Evaluation Challenge (1997). The task is to predict whether a compound is carcinogenic.
PubMed_Diabetes
The PubMed Diabetes dataset consists of 19717 scientific publications from the PubMed database pertaining to diabetes, classified into one of three classes. The citation network consists of 44338 links. Each publication in the dataset is described by a TF/IDF weighted word …
Restbase
A database of restaurants in San Francisco. The goal is to predict customer satisfaction.
Sakila
The venerable Sakila test database: a small, fake database of movies.
SameGen
The task is to predict whether two given people are from the same generation.
SAT
The task is to diagnose power-supply failures in a communications satellite.
Stats
An anonymized dump of all user-contributed content on the Stats Stack Exchange network.
TPCC
TPC-C is the benchmark published by the Transaction Processing Performance Council (TPC) for Online Transaction Processing (OLTP).
TPCD
TPC-D represents a broad range of decision support (DS) applications that require complex, long-running queries against large, complex data structures.
TPCDS
TPC-DS is the new decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. Although the underlying business model of TPC-DS is a retail product supplier, the database schema, data …
TPCH
TPC-H is the benchmark published by the Transaction Processing Performance Council (TPC) for decision support.
UW-CSE
This dataset lists facts about the Department of Computer Science and Engineering at the University of Washington (UW-CSE), such as entities (e.g., Student, Professor) and their relationships (e.g., AdvisedBy, Publication).
VisualGenome
Visual Genome is a dataset, a knowledge base, and an ongoing effort to connect structured image concepts to language.
WebKP
The WebKB dataset consists of 877 scientific publications classified into one of five classes. The citation network consists of 1608 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding wor…