
Available datasets

What data is available for analysis?


CiteULike Datasets

The CiteULike database is potentially useful for researchers in various fields. Physicists and computer scientists have expressed an interest in trying to analyse the structure of the data, and frequently ask for datasets to be made available.

Previously this was done on an ad-hoc basis, and it relied on us remembering to update the data file. Now an automatic process runs every night and produces a snapshot of which articles have been posted with which tags.

Who-posted-what data

The latest data snapshot can always be downloaded at http://static.citeulike.org/data/current.bz2

Older datasets are available on a daily basis and can be found at URLs of the form http://static.citeulike.org/data/2007-05-30.bz2

Data is available from 2007-05-30 onwards.
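The URL pattern above can be built programmatically. The following is a minimal sketch in Python; the names `BASE_URL`, `FIRST_SNAPSHOT` and `snapshot_url` are illustrative and not part of any CiteULike API, and the sketch does not check that a file actually exists for a given day.

```python
from datetime import date

# Base location and first available date, both taken from the text above.
BASE_URL = "http://static.citeulike.org/data/"
FIRST_SNAPSHOT = date(2007, 5, 30)


def snapshot_url(day):
    """Return the URL of the who-posted-what snapshot for the given date."""
    if day < FIRST_SNAPSHOT:
        raise ValueError("no snapshot is available before 2007-05-30")
    # date.isoformat() yields YYYY-MM-DD, which matches the file naming.
    return f"{BASE_URL}{day.isoformat()}.bz2"
```

For the most recent data, use the `current.bz2` URL given above rather than guessing at today's date.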

The file constitutes an anonymous dump of who posted what and when the posting took place. There is no data in this file which is not already available publicly through the web site, so there are no privacy implications for making it available. The advantage is that it's available in one file rather than having to spider the entire site to get at the information (please don't do that!).

The file is a simple unix ("\n" line endings) text file with pipe ("|") delimiters. The columns are:

  1. The CiteULike article id which was posted
  2. An obfuscated representation of the username (a salted MD5 hash of the true username). It is possible to piece together the true username by scraping the site, but I'd rather you didn't do that. The obfuscation is primarily a slightly paranoid anti-spam measure
  3. The date and time the article was posted to the site
  4. The tag the user used to post it

NB If a user posts an article with n tags, this results in n rows in the file
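The four-column layout described above can be read with a few lines of Python. This is only a sketch under stated assumptions: the file is assumed to decode as UTF-8, the timestamp is kept as an uninterpreted string because its exact format is not described here, and the split is limited to four fields on the assumption that any extra pipe belongs to the tag. The names `Posting`, `parse_posting_line`, `read_postings` and `tags_per_posting` are illustrative.

```python
import bz2
from collections import defaultdict, namedtuple

# One row of the who-posted-what file, in the column order listed above.
Posting = namedtuple("Posting", ["article_id", "user_hash", "posted_at", "tag"])


def parse_posting_line(line):
    """Split one pipe-delimited line into its four columns."""
    # maxsplit=3 is an assumption: any further "|" is treated as part of the tag.
    article_id, user_hash, posted_at, tag = line.rstrip("\n").split("|", 3)
    return Posting(article_id, user_hash, posted_at, tag)


def read_postings(path):
    """Yield Posting rows from a downloaded .bz2 snapshot."""
    # UTF-8 is an assumption; undecodable bytes are replaced, not fatal.
    with bz2.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if line.strip():
                yield parse_posting_line(line)


def tags_per_posting(postings):
    """Collapse the n rows of an n-tag posting into one list of tags."""
    grouped = defaultdict(list)
    for posting in postings:
        grouped[(posting.article_id, posting.user_hash)].append(posting.tag)
    return dict(grouped)
```

Because `read_postings` is a generator, a snapshot can be streamed without decompressing the whole file into memory first.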

Article linkout data

Mapping CiteULike article_ids to resources on the web can be done with the linkout table. The current snapshot is available at http://static.citeulike.org/data/linkouts.bz2

Older datasets are available on a daily basis and can be found at URLs of the form http://static.citeulike.org/data/linkouts-2008-02-02.bz2

Data is available from 2008-02-02 onwards.

To understand the data in this file, you should refer to "The linkout formatter" section of the plugin developer's guide.

This file contains a number of spam links. Although CiteULike filters spam postings, traces of the spam still remain in this table. This spam content will eventually be removed.

The file is a simple unix ("\n" line endings) text file with pipe ("|") delimiters. Literal pipes within the fields are escaped as "\|". The columns are:

  1. Linkout type
  2. ikey_1
  3. ckey_1
  4. ikey_2
  5. ckey_2
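
Because literal pipes are escaped in this file, a plain split on "|" would break fields apart. A minimal Python sketch of one way to handle this is shown below; `parse_linkout_line` is an illustrative name, and the sketch assumes that a backslash immediately before a pipe always marks an escaped pipe (how a literal backslash is itself represented is not described here).

```python
import re

# A pipe that is NOT preceded by a backslash is treated as a delimiter.
# Assumption: "\|" is the only escape sequence used in the file.
_UNESCAPED_PIPE = re.compile(r"(?<!\\)\|")


def parse_linkout_line(line):
    """Split one linkout line into fields and restore escaped pipes."""
    fields = _UNESCAPED_PIPE.split(line.rstrip("\n"))
    # Turn each escaped "\|" back into a literal "|".
    return [field.replace("\\|", "|") for field in fields]
```

The returned list follows the column order above: linkout type, ikey_1, ckey_1, ikey_2, ckey_2.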