Nathan Grigg

Organizing journal articles from the arXiv

This week, my method for keeping track of journal articles I use went from kind of cool to super awesome thanks to pdftotext and a tiny Perl script.

Most math papers I read come from the arXiv (pronounced archive), a scientific paper repository where the vast majority of mathematicians post their journal articles. This is the best place to find recent papers, since they are often posted here a year or more before they appear in a journal. It is also a convenient place to find slightly older papers because journals have terrible websites. (Papers from the 1960s, which I do sometimes read, usually require a trip to the library.)

I use BibDesk to organize the papers I read, mostly because it works so well with LaTeX, which all mathematicians use to write papers. Also, it stores its database in a plain text file, and has done so since long before it was cool.

Every now and then I gather all the papers from my Dropbox and iPad and import them into BibDesk. For each PDF I got from the arXiv I do the following:

  1. Find the arXiv identification number, which is watermarked on the first page of the PDF.

  2. Use my script arxiv2bib, which I have written about before, to get the paper’s metadata from the arXiv API. An AppleScript takes the result of the script and imports it into BibDesk.

  3. Drag the PDF onto the reference in BibDesk. BibDesk automatically renames the paper based on the metadata and moves it to a Dropbox subfolder.

Three steps is better than the ten it would take without AppleScript and the arXiv API, but why can’t the computer extract the identification number automatically?

Oh yeah, of course it can.

#!/bin/bash
pdftotext "$1" - | perl -ne 'if (/^arXiv:(\d{4}\.\d{4}v\d+)/) {print "$1\n"; last}'

The pdftotext utility comes with xpdf, which is available from Homebrew. Or you can download the binary linked at foolabs. It works as advertised.

The -n argument tells Perl to wrap the script in a while loop that processes stdin one line at a time. Here is what the Perl script would look like if I had put it in its own file.

#!/usr/bin/perl
while (<>) {
    if (/^arXiv:(\d{4}\.\d{4}v\d+)/) {
        print "$1\n";
        last;
    }
}

The regular expression looks for a line beginning with an arXiv identifier, which looks like arXiv:1203.1029v1. If a line matches, the script prints the captured part, that is, the identifier itself without the arXiv: prefix. Then it exits the loop, so only the first match is printed.
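One caveat: the pattern above expects exactly four digits after the dot, which fits identifiers like 1203.1029v1. Identifiers issued since 2015 have five digits there, and papers from before April 2007 use an archive/number form such as math/0309136v1. Here is a minimal sketch of a broader pattern, written with sed instead of the Perl one-liner. The function name extract_arxiv_id is made up for illustration, and the pattern is an assumption about how the watermark shows up in pdftotext output, not something checked against the full range of papers.

```shell
# Hypothetical helper: read text on stdin and print the first arXiv
# identifier that starts a line, without the "arXiv:" prefix.
extract_arxiv_id() {
    # First alternative:  YYMM.NNNN or YYMM.NNNNN   (e.g. 1203.1029, 1501.00001)
    # Second alternative: archive/NNNNNNN, with an optional subject class
    #                     (e.g. math/0309136, math.GT/0309136)
    # Both must be followed by a version suffix such as v1.
    sed -nE 's#^arXiv:(([0-9]{4}\.[0-9]{4,5}|[a-z-]+(\.[A-Z]{2})?/[0-9]{7})v[0-9]+).*#\1#p' |
        head -n 1   # keep only the first match, like "last" in the Perl loop
}
```

It would slot into the same place as the Perl filter, for example pdftotext paper.pdf - | extract_arxiv_id, where paper.pdf stands in for any file name.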

I can pipe the output of this script into arxiv2bib to fetch the metadata from the arXiv API. An AppleScript glues it all together, allowing me to select a whole bunch of PDFs and run the script. A few seconds later, all the paper metadata is in BibDesk, and the files are renamed and in the proper place.
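The AppleScript itself is not shown here, but the shell side of handling several PDFs at once might look roughly like the sketch below. The function name arxiv_ids_for_pdfs is hypothetical, and the sketch uses sed for the matching step rather than the Perl one-liner. The -l 1 option asks pdftotext to stop after the first page, which is where the watermark sits.

```shell
# Hypothetical helper: print one arXiv identifier per PDF named on the
# command line, and warn on stderr about files with no watermark.
arxiv_ids_for_pdfs() {
    local pdf id
    for pdf in "$@"; do
        # Convert only the first page, then take the first line that
        # starts with an identifier like arXiv:1203.1029v1.
        id=$(pdftotext -l 1 "$pdf" - |
            sed -nE 's/^arXiv:([0-9]{4}\.[0-9]{4}v[0-9]+).*/\1/p' |
            head -n 1)
        if [ -n "$id" ]; then
            printf '%s\n' "$id"
        else
            printf 'no arXiv identifier found in %s\n' "$pdf" >&2
        fi
    done
}
```

Assuming arxiv2bib reads identifiers from standard input, as the piping described above suggests, a call such as arxiv_ids_for_pdfs *.pdf | arxiv2bib would fetch the metadata for every file that has a watermark and skip the rest with a warning.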