CiteSeer

The CiteSeer dataset consists of 3312 scientific publications classified into one of six classes. The citation network consists of 4732 links. Each publication in the dataset is described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary. The dictionary consists of 3703 unique words.
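To make the 0/1 word-vector encoding concrete, here is a minimal sketch in Python. The helper name `binary_word_vector` and the five-word toy dictionary are illustrative assumptions; the real dictionary has 3703 entries.

```python
def binary_word_vector(words_in_paper, dictionary):
    """Return a 0/1 list: 1 if the dictionary word occurs in the paper, else 0."""
    present = set(words_in_paper)
    return [1 if word in present else 0 for word in dictionary]


# Hypothetical toy dictionary and paper, not taken from the dataset.
toy_dictionary = ["agent", "database", "learning", "network", "query"]
toy_paper = ["learning", "network", "learning"]

vector = binary_word_vector(toy_paper, toy_dictionary)
print(vector)  # [0, 0, 1, 1, 0]
```

Repeated words do not change the vector, since only presence or absence is recorded.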

Original source: linqs.soe.ucsc.edu

Versions

  • CiteSeer (by Jan Motl)

    • Note that some papers appear in the cite table without having a content entry.
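The note above can be checked once the tables are exported. The sketch below assumes the cite table has been loaded as pairs of paper IDs and the content table as a collection of paper IDs; the function name and the sample IDs are hypothetical.

```python
def papers_without_content(cite_pairs, content_paper_ids):
    """Return sorted IDs that occur in the cite table but have no content entry."""
    ids_in_cite_table = {paper_id for pair in cite_pairs for paper_id in pair}
    return sorted(ids_in_cite_table - set(content_paper_ids))


# Hypothetical sample data, not taken from the dataset.
sample_cites = [("p1", "p2"), ("p2", "p3"), ("p4", "p1")]
sample_content_ids = ["p1", "p2", "p4"]

missing = papers_without_content(sample_cites, sample_content_ids)
print(missing)  # ['p3']
```

Such papers have no word vector, so they may need to be dropped or handled separately before training a classifier.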

Dataset details

Associated task: Classification
Domain: Education
Data types:
Size: 5.9 MB
Count of tables: 3
Count of rows: 113,760
Count of columns: 6
Missing values: No
Compound keys: No
Loops: Yes
Type: Real
Instance count: 3,312
Target table: paper
Target column: class_label
Target ID: paper_id
Target timestamp: ?

Algorithms

| Dataset version | Target | Algorithm | Author text | Measure | Value |
|---|---|---|---|---|---|
| CiteSeer | | CBCC | Case-Based Collective Classification | Accuracy | 0.669 |
| CiteSeer | | MLN | Investigating Markov Logic Networks for Collective Classification | Accuracy | 0.742 |

How to download the dataset

The datasets are publicly available directly from a MariaDB database.

  1. Open your favourite MariaDB client (MySQL Workbench works, but see the FAQ).
  2. Use the following credentials:
    • hostname: db.relational-data.org
    • port: 3306
    • username: guest
    • password: relational
  3. Export the "CiteSeer" database (or another version of the dataset, if available) in your favourite format (e.g. CSV or SQL dump).
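
The steps above can also be scripted. The following is a minimal sketch in Python, not an official export tool: it assumes the third-party PyMySQL package as the client (any MariaDB client would do), and the function names `write_csv` and `export_database` are illustrative. Table names are discovered with `SHOW TABLES` rather than assumed.

```python
import csv
import io
import os

# Credentials as listed in the steps above.
CONNECTION = {
    "host": "db.relational-data.org",
    "port": 3306,
    "user": "guest",
    "password": "relational",
    "database": "CiteSeer",
}


def write_csv(header, rows, out):
    """Write one table (column names plus rows) to a file-like object as CSV."""
    writer = csv.writer(out)
    writer.writerow(header)
    writer.writerows(rows)


def export_database(connection_params, directory="."):
    """Dump every table of the database to <directory>/<table>.csv."""
    # Assumption: PyMySQL is installed (pip install pymysql).
    import pymysql

    conn = pymysql.connect(**connection_params)
    try:
        with conn.cursor() as cursor:
            cursor.execute("SHOW TABLES")
            tables = [row[0] for row in cursor.fetchall()]
            for table in tables:
                cursor.execute(f"SELECT * FROM `{table}`")
                header = [column[0] for column in cursor.description]
                path = os.path.join(directory, f"{table}.csv")
                with open(path, "w", newline="", encoding="utf-8") as out:
                    write_csv(header, cursor.fetchall(), out)
    finally:
        conn.close()


# Offline demonstration of the CSV step with a made-up row (no network needed).
buffer = io.StringIO()
write_csv(["paper_id", "class_label"], [("p1", "example")], buffer)
csv_lines = buffer.getvalue().splitlines()
```

Calling `export_database(CONNECTION)` would then connect with the guest credentials and write one CSV file per table into the current directory; this requires network access to the server.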