Biodegradability
This is an older data set of chemical structures containing 328 compounds labeled by their half-life for aerobic aqueous biodegradation (a regression task).
Bupa
Evaluation of patients for liver disorders.
Carcinogenesis
The task is to predict whether a given molecule is carcinogenic. The dataset contains 182 positive carcinogenicity tests and 148 negative tests.
CCS
Transactional data from a Czech debit card company specialising in payments at petrol pumps.
Chess
The goal is to predict the outcome of a match.
CORA
The Cora dataset consists of 2708 scientific publications classified into one of seven classes. The citation network consists of 5429 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding wo…
CS
Artificial data from a Czech bank.
DCG
The set of positive examples consists of all sentences of up to seven words that can be generated by the DCG in Bratko's book (565 positive examples). The set of negative examples was generated by randomly selecting one word in each positive example and replacing it by …
Elti
Elti is a kinship relation by marriage: A is elti of B if A's husband is a brother of B's husband.
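The definition above can be sketched in a few lines of Python. This is only an illustration of the relation, not the dataset's schema: the function name elti_pairs, the input formats, and the sample names are hypothetical, and brotherhood is assumed to be symmetric.

```python
def elti_pairs(husband_of, brothers):
    """Return the set of ordered pairs (a, b) such that a is elti of b.

    husband_of: dict mapping a wife to her husband (hypothetical format).
    brothers:   iterable of (x, y) pairs meaning x is a brother of y.
    """
    # Assumption: brotherhood is symmetric, so store both directions.
    brother = set()
    for x, y in brothers:
        brother.add((x, y))
        brother.add((y, x))

    pairs = set()
    for a, husband_a in husband_of.items():
        for b, husband_b in husband_of.items():
            # a is elti of b if a's husband is a brother of b's husband.
            if a != b and (husband_a, husband_b) in brother:
                pairs.add((a, b))
    return pairs


# Made-up example data: ali and veli are brothers, can is unrelated.
husband_of = {"ayse": "ali", "fatma": "veli", "zeynep": "can"}
brothers = [("ali", "veli")]
result = elti_pairs(husband_of, brothers)
```

With this sample input, ayse and fatma are elti of each other, while zeynep is elti of no one because her husband has no recorded brother.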
Employee
The employees test database: a small, fake database of employees.
Facebook
This dataset consists of 'circles' (or 'friends lists') from Facebook.
Genes
KDD Cup 2001 prediction of gene/protein function and localization.
Hepatitis
The PKDD'02 Hepatitis dataset describes 206 instances of Hepatitis B (contrasted against 484 cases of Hepatitis C).
MovieLens
MovieLens data set from the UC Irvine machine learning repository.
KRK
The task is to identify whether a position of two kings and a rook on a chessboard is legal or illegal.
Mesh
This domain is about finite element methods in engineering. The task is to predict how many elements should be used to model each edge of a structure. The target predicate is mesh(Edge,Number) where the Number of elements in the Mesh model can vary between 1 and 17.
Musk
The Musk database describes molecules occurring in different conformations. Each molecule is either musk or non-musk and one of the conformations determines this property. Such a problem is known as a multiple-instance problem, and is modeled by two tables molecule and…
Mutagenesis
The dataset comprises 230 molecules tested for mutagenicity on Salmonella typhimurium. A subset of 188 molecules is learnable using linear regression. This subset was later termed the "regression friendly" dataset. The remaining subset of 42 molecules is named the …
Nations
A sample database from the Alchemy website.
NBA
A database with information about basketball matches from the National Basketball Association. Lists Players, Teams, and matches with action counts for each player.
Pima
The National Institute of Diabetes and Digestive and Kidney Diseases conducted a study on 768 adult female Pima Indians living near Phoenix.
PremiereLeague
A database with information about football matches from the UK Premier League. Lists Players, Teams, and matches with action counts for each player.
PTE
A database from the Predictive Toxicology Evaluation Challenge (1997). The task is to predict whether a compound is carcinogenic or not.
Pyrimidine
A pyrimidine QSAR dataset. The goal is to predict the inhibition of dihydrofolate reductase by pyrimidines.
SameGen
The task is to predict whether two given people are from the same generation.
SAT
The task is to diagnose power-supply failures in a communications satellite.
StudentLoan
Student Loan contains data about students' enrollment and employment status, and the aim is to find rules that define a student's obligation to pay back his/her loan.
TPCD
TPC-D represents a broad range of decision support (DS) applications that require complex, long running queries against large complex data structures.
TPCH
TPC-H is the benchmark published by the Transaction Processing Performance Council (TPC) for decision support.
Trains
The East-West challenge (1980) database describes east-bound and west-bound trains.
Triazine
A triazine QSAR dataset. The goal is to predict the inhibition of dihydrofolate reductase by triazines.
University
An artificial database from Simon Fraser University describing students, professors and courses.
UW-CSE
This dataset lists facts about the Department of Computer Science and Engineering at the University of Washington (UW-CSE), such as entities (e.g., Student, Professor) and their relationships (e.g., AdvisedBy, Publication).
VisualGenome
Visual Genome is a dataset, a knowledge base, and an ongoing effort to connect structured image concepts to language.
WebKB
The WebKB dataset consists of 877 scientific publications classified into one of five classes. The citation network consists of 1608 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding wor…