Organizing journal articles from the arXiv

This week, my method for keeping track of journal articles I use went from kind of cool to super awesome thanks to pdftotext and a tiny Perl script.

Most math papers I read come from the arXiv (pronounced archive), a scientific paper repository where the vast majority of mathematicians post their journal articles. This is the best place to find recent papers, since they are often posted here a year or more before they appear in a journal. It is a convenient place to find slightly older papers because journals have terrible websites. (Papers from the 60’s, which I do sometimes read, usually require a trip to the library.)

I use BibDesk to organize the papers I read, mostly because it works so well with LaTeX, which all mathematicians use to write papers. Also, it stores its database in a plain text file, and has done so since long before it was cool.

Every now and then I gather all the papers from my Dropbox and iPad and import them into BibDesk. For each PDF I got from the arXiv I do the following:

Three steps is better than the ten it would take without AppleScript and the arXiv API, but why can’t the computer extract the identification number automatically?

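#!/bin/bash
# Print the arXiv identifier stamped on the PDF given as $1.
# Sequence numbers have four digits through 2014 and five from 2015 on.
pdftotext "$1" - | perl -ne 'if (/^arXiv:(\d{4}\.\d{4,5}v\d+)/) {print "$1\n"; last}'

The pdftotext utility comes with xpdf, which is available from Homebrew. Or you can download the binary from foolabs. It works as advertised.

The -n argument tells Perl to wrap the script in a while loop that processes stdin one line at a time. Here is what the Perl script would look like if I had put it in its own file.

#!/usr/bin/perl
while (<>) {
    # Match an identifier at the start of a line and capture it without the "arXiv:" prefix.
    if (/^arXiv:(\d{4}\.\d{4,5}v\d+)/) {
        print "$1\n";
        last;    # stop after the first match
    }
}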

The regular expression looks for a line beginning with an arXiv identifier, which looks like arXiv:1203.1029v1. If it finds something, it prints the captured part, that is, the actual number. Then it exits the loop.
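
To try the pattern without a PDF at hand, here is a minimal shell sketch. The function name arxiv_id_from_text and the sample lines are made up for illustration, it uses sed instead of Perl, and the pattern is assumed to accept five-digit sequence numbers as well as four-digit ones.

```bash
#!/bin/bash
# Hypothetical helper: read text on stdin and print the first arXiv
# identifier that starts a line, without the "arXiv:" prefix.
arxiv_id_from_text() {
    sed -n 's/^arXiv:\([0-9]\{4\}\.[0-9]\{4,5\}v[0-9][0-9]*\).*/\1/p' | head -n 1
}

# Invented lines standing in for the text pdftotext might produce.
printf '%s\n' \
    'A made-up paper title' \
    'arXiv:1203.1029v1 [math.GT] 5 Mar 2012' \
    'Running text that mentions arXiv:1203.1029v1 mid-line.' |
    arxiv_id_from_text
```

Only the second sample line begins with the identifier, so that is the one the function reports. The mention in the middle of the third line is ignored because the pattern is anchored to the start of the line.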

I can pipe the output of this script into arxiv2bib to fetch the metadata from the arXiv API. An AppleScript glues it all together, allowing me to select a whole bunch of PDFs and run the script. A few seconds later, all the paper metadata is in BibDesk, and the files are renamed and in the proper place.
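
The batch part of that glue could be sketched in shell roughly as follows. This is not the AppleScript itself, only a guess at the command-line half. The function name arxiv_ids, the PDFTOTEXT override variable, and the paths in the usage comment are all hypothetical, and the sketch assumes arxiv2bib reads identifiers from stdin.

```bash
#!/bin/bash
# Hypothetical override hook: lets another command stand in for pdftotext.
PDFTOTEXT=${PDFTOTEXT:-pdftotext}

# Hypothetical helper: print one arXiv id per PDF passed as an argument.
# The body runs in a subshell so its variables do not leak to the caller.
arxiv_ids() (
    for pdf in "$@"; do
        id=$("$PDFTOTEXT" "$pdf" - 2>/dev/null |
            sed -n 's/^arXiv:\([0-9]\{4\}\.[0-9]\{4,5\}v[0-9][0-9]*\).*/\1/p' |
            head -n 1)
        if [ -n "$id" ]; then
            printf '%s\n' "$id"
        else
            # Complain on stderr so stdout stays clean for the pipe.
            printf 'no arXiv id found: %s\n' "$pdf" >&2
        fi
    done
)

# Usage, with made-up paths:
#   arxiv_ids ~/Dropbox/papers/*.pdf | arxiv2bib > new-papers.bib
```

PDFs without an arXiv stamp are reported on stderr and skipped, so one scanned paper from the library does not derail the whole batch.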