Nathan Grigg

New adventures

Last month I received my mathematics Ph.D. from the University of Washington. My mother-in-law said my hat looked ridiculous, but I say tams are cool.

Graduation Photo

When I began this journey six years ago, my end goal was to become a math professor. Last year, when it was time to begin applying for jobs, I was less sure. I enjoyed the academic lifestyle, the teaching, and the learning, but research was something I did because I was supposed to. A happy academic has a burning desire to break new ground and make new discoveries in their field, but I struggled to nurture my spark.

I was scared to leave academia, thinking that either I was in short-term doldrums or that my fear of not getting a job was affecting my judgment. I applied for postdocs, but as my academic future became clearer, I became more certain that I needed to do something else.

So I took the plunge, withdrew my academic applications, and started a new round of job applications. This week I started as a software engineer at a Large Tech Company. I’m excited for this next adventure!


Most common commands

I have been wanting to learn matplotlib's pyplot, but haven't found the time. Last week I was inspired by Seth Brown's post from last year on command line analytics, and I decided to make a graph of my most common commands.

I began using zsh on my home Mac about six months ago, and I have accumulated 15,000 lines of history since then:

$ wc -l ~/.zsh_history
   15273 /Users/grigg/.zsh_history

(Note it is also possible to get unlimited history in bash.)

I compiled a list of my top commands and made a bar chart using pyplot. Since git is never used by itself, I separated out the git subcommands. Here are the results:

top 20 commands

A couple of these are aliases: gis for git status and ipy for ipython. The lunchy command is a launchctl wrapper, j is part of autojump, and rmtex removes Latex log files.

Clearly, it is time to use bb as an alias for bbedit. I already have gic and gia set up as aliases for git commit and git add, but I need to use them more often.

Building the graph

The first step is parsing the history file. I won’t go into details, but I used Python and the Counter class, which takes a list and returns a dictionary-like object whose values are the frequency of each list item. After creating a list of commands, you count them like this:

from collections import Counter
top_commands = Counter(commands).most_common(20)
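
For completeness, here is a rough sketch of the parsing step that builds the commands list (my actual script differed; this assumes zsh's extended history format, where each line looks like ": 1341234567:0;git status", and it splits out git subcommands as described below):

import re

commands = []
with open('/Users/grigg/.zsh_history') as history:
    for line in history:
        # strip the ": <timestamp>:<elapsed>;" prefix of extended history
        command = re.sub(r'^: \d+:\d+;', '', line).strip()
        words = command.split()
        if not words:
            continue
        if words[0] == 'git' and len(words) > 1:
            commands.append(' '.join(words[:2]))  # e.g. "git status"
        else:
            commands.append(words[0])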

To make the bar chart, I mostly just copied the pyplot demo from the documentation. Here is what I did.

import matplotlib.pyplot as plt
import matplotlib
import numpy as np

width = 0.6
N = 20
ys = np.arange(N)

# change the font
matplotlib.rcParams['font.family'] = 'monospace'

# create a figure of a specific size
fig = plt.figure(figsize=(5, 5))

# create axes with grid
axes = fig.add_subplot(111, axisbelow=True)
axes.xaxis.grid(True, linestyle='-', color='0.75')

# set ymin, ymax explicitly
axes.set_ylim((-width / 2, N))

# set ticks and title
axes.set_yticks(ys + width / 2)
axes.set_yticklabels([x[0] for x in top_commands])
axes.set_title("Top 20 commands")

# put bars
axes.barh(ys, [x[1] for x in top_commands], width, color="purple")

# Without the bbox_inches, the longer labels got cut off
# 2x version. The fractional dpi is to make the pixel width even
fig.savefig('commands.png', bbox_inches='tight', dpi=160.1)

I still find pyplot pretty confusing. There are several ways to accomplish everything. Sometimes you use module functions and sometimes you create objects. Lots of functions return data that you just throw away. But it works!


Amazon S3 has two types of redirects

In the time between when I read up on S3 redirects and when I published a post on what I had learned, Amazon created a second way to redirect parts of S3 websites.

The first way redirects a single URL at a time. These are the redirects I already knew about, which were introduced last October. They are created by attaching a special piece of metadata to an S3 object.

The second way was introduced in December, and redirects based on prefix. This is probably most useful for redirecting entire folders. You can either rewrite a folder name, preserving the rest of the URL, or redirect the entire folder to a single URL. This kind of redirect is created by uploading an XML document containing all of the redirect rules. You can create and upload the XML, without actually seeing any XML, by descending through boto’s hierarchy until you find boto.s3.bucket.Bucket.configure_website.
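
For example, renaming a folder might look roughly like this in boto (a sketch based on my reading of boto's docs, not tested; the bucket and folder names are made up):

from boto.s3.connection import S3Connection
from boto.s3.website import RoutingRule, RoutingRules

conn = S3Connection("your-aws-access-key", "your-aws-secret-key")
bucket = conn.get_bucket("name.of.bucket")

# send old/whatever to new/whatever, preserving the rest of the URL
rules = RoutingRules()
rules.add_rule(RoutingRule.when(key_prefix="old/")
               .then_redirect(replace_key_prefix="new/"))

# boto builds and uploads the XML rules document for you
bucket.configure_website(suffix="index.html", routing_rules=rules)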


Managing Amazon S3 Redirects

This week I put together a Python script to manage Amazon S3’s web page redirects. It’s a simple script that uses boto to compare a list of redirects to files in an S3 bucket, then upload any that are new or modified. When you remove a redirect from the list, it is deleted from the S3 bucket. The script is posted on GitHub.

I use Amazon S3 to host this blog. It is a cheap and low-maintenance way to host a static website, although these advantages come with a few drawbacks. For example, up until a few months ago you couldn’t even redirect one URL to another. On a standard web host, this is as easy as making some changes to a configuration file.

Amazon now supports redirects, but they aren’t easy to configure. To set a redirect, you upload a file to your S3 bucket and set a particular piece of metadata. The contents of the file don’t matter; usually you use an empty file. You can use Amazon’s web interface to set the metadata, but this is obviously not a good long-term solution.

Update: There are actually two types of Amazon S3 redirects. I briefly discuss the other here.

So I wrote a Python script. This was inspired partly by a conversation I had with Justin Blanton, and partly by the horror I felt when I ran across a meta refresh on my site from the days before Amazon supported redirects.

Boto

The Boto library provides a pretty good interface to Amazon’s API. (It encompasses the entire API, but I am only familiar with the S3 part.) It does a good job of abstracting away the details of the API, but the documentation is sparse.

The main Boto objects I need are the bucket object and the key object, which of course represent an S3 bucket and a key inside that bucket, respectively.

The script

The script (listed below) connects to Amazon and creates the bucket object on lines 15 and 16. Then it calls bucket.list() on line 17 to list the keys in the bucket. Because of the way the API works, the listed keys have some metadata (such as size and MD5 hash) but not all of it (content type and redirect location are missing). We load the keys into a dictionary, indexed by name.

Beginning on line 20, we loop through the redirects that we want to sync. What we do next depends on whether or not the given redirect already exists in the bucket. If it does exist, we remove it from the dictionary (line 23) so it won’t get deleted later. If on the other hand it does not exist, we create a new key. (Note that bucket.new_key on line 25 creates a key object, not an actual key on S3.) In both cases, we use key.set_redirect on line 32 to upload the key to S3 with the appropriate redirect metadata set.

Line 28 short-circuits the loop if the redirect we are uploading is identical to the one on S3. Originally I was going to leave this out, since it requires a HEAD request in the hopes of preventing a PUT request. But HEAD requests are cheaper and probably faster, and in most cases I would expect the majority of the redirects to already exist on S3, so we will usually save some requests. Also, I wanted it to be able to print out only the redirects that had changed.

At the end, we delete each redirect on S3 that we haven't seen yet. Line 40 uses Python's ternary if to find each key's redirect using get_redirect, but only if the key's size is zero. This avoids unnecessary requests to Amazon.

I posted a more complex version of the code on GitHub that has a command line interface, reads redirects from a file, and does some error handling.

 1  #!/usr/bin/python
 2  from boto.s3.connection import S3Connection
 3  
 4  DRY_RUN = True
 5  DELETE = True  # delete other redirects?
 6  ACCESS = "your-aws-access-key"
 7  SECRET = "your-aws-secret-key"
 8  BUCKET = "name.of.bucket"
 9  REDIRECTS = [("foo/index.html", "/bar"),
10               ("google.html", "http://google.com"),
11               ]
12  if DRY_RUN: print "Dry run"
13  
14  # Download keys from Amazon
15  conn = S3Connection(ACCESS, SECRET)
16  bucket = conn.get_bucket(BUCKET)
17  remote_keys = {key.name: key for key in bucket.list()}
18  
19  # Upload keys
20  for local_key, location in REDIRECTS:
21      exists = local_key in remote_keys
22      if exists:
23          key = remote_keys.pop(local_key)
24      else:
25          key = bucket.new_key(local_key)
26  
27      # don't re-upload identical redirects
28      if exists and location == key.get_redirect():
29          continue
30  
31      if not DRY_RUN:
32          key.set_redirect(location)
33      print "{2:<6} {0} {1}".format(
34          local_key, location, "update" if exists else "new")
35  
36  # Delete keys
37  if DELETE:
38      for key in remote_keys.values():
39          # assume size-non-zero keys aren't redirects to save requests
40          redirect = key.get_redirect() if key.size == 0 else None
41          if redirect is None:
42              continue
43          if not DRY_RUN:
44              key.delete()
45          print "delete {0} {1}".format(key.name, redirect)

Removing Latex log and auxiliary files


How to write a shell script to delete Latex log files. Also, why you should think about using zsh. [Update: In addition, I reveal my complete ignorance of Bash. See the note at the end.]

I try not to write a lot of shell scripts, because they get long and complicated quickly and they are a pain to debug. I made an exception recently because Latex auxiliary files were annoying me, and a zsh script seemed to be a better match than Python for what I wanted to do. Of course, by the time I was finished adding in the different options I wanted, Python may have been the better choice. Oh well.

For a long time I have had an alias named rmtex which essentially did rm *.aux *.log *.out *.synctex.gz to rid the current directory of Latex droppings. This is a dangerous alias because it assumes that all *.log files in the directory come from Latex files and are thus unimportant. But I'm careful and have never accidentally deleted anything (at least not in this way). What I really wanted, though, was a way to make rmtex recurse through subdirectories, which requires more safety.

Here is what I came up with. (I warned you it was long!) I will point out some of the key points, especially the useful things that zsh provides.

#!/usr/local/bin/zsh

# suppress error message on nonmatching globs
setopt local_options no_nomatch

USAGE='USAGE: rmtex [-r] [-a] [foo]

Argument:
    [foo]   file or folder (default: current directory)

Options:
    [-h]    Show help and exit
    [-r]    Recurse through directories
    [-a]    Include files that do not have an associated tex file
    [-n]    Dry run
    [-v]    Verbose
'

# Option defaults
folders=(.)
recurse=false
all=false
dryrun=false
verb=false
exts=(aux synctex.gz log out)

# Process options
while getopts ":ranvh" opt; do
    case $opt in
    r)
        recurse=true
        ;;
    a)
        all=true
        ;;
    n)
        dryrun=true
        verb=true
        ;;
    v)
        verb=true
        ;;
    h)
        echo $USAGE
        exit 0
        ;;
    \?)
        echo "rmtex: Invalid option: -$OPTARG" >&2
        exit 1
        ;;
    esac
done

# clear the options from the argument string
shift $((OPTIND-1))

# set the folders or files if given as arguments
if [ $# -gt 0 ]; then
    folders=$@
fi

# this function performs the rm and prints the verbose messages
function my_rm {
    if $verb; then
        for my_rm_g in $1; do
            if [ -f $my_rm_g ]; then
                echo rm $my_rm_g
            fi
        done
    fi

    if ! $dryrun; then
        rm -f $1
    fi
}

# if all, then just do the removing without checking for the tex file
if $all; then
    for folder in $folders; do
        if [[ -d $folder ]]; then
            if $recurse; then
                for ext in $exts; my_rm $folder/**/*.$ext
            else
                for ext in $exts; my_rm $folder/*.$ext
            fi
        else
            # handle the case that they gave a file rather than folder
            for ext in $exts; my_rm "${folder%%.tex}".$ext
        fi
    done

else
    # loop through folders
    for folder in $folders; do
        # set list of tex files inside folder
        if [[ -d $folder ]]; then
            if $recurse; then
                files=($folder/**/*.tex)
            else
                files=($folder/*.tex)
            fi
        else
            # handle the case that the "folder" is actually a single file
            files=($folder)
        fi
        for f in $files; do
            for ext in $exts; do
                my_rm "${f%%.tex}".$ext
            done
        done
    done
fi

# print a reminder at the end of a dry run
if $dryrun; then
    echo "(Dry run)"
fi

It starts out nice and easy with a usage message. (Always include a usage message!) Then it processes the options using getopts.

Zsh has arrays! Notice line 20 defines the default $folders variable to be an array containing only the current directory. Similarly, line 25 defines the extensions we are going to delete, again using an array.

On the subject of arrays, notice that $@ in line 59, which represents the entire list of arguments passed to rmtex, is also an array. So you don’t have to worry about writing "$@" to account for arguments with spaces, like you would have to in Bash.

Lines 63 to 75 define a function my_rm which runs rm, but optionally prints the name of each file that it is deleting. It also allows a “dry run” mode.

On to the deleting. First I handle the dangerous case, which is when the -a option is given. This deletes all files of the given extensions, like my old alias. Notice the extremely useful zsh glob in line 82. The double star means to look in all subdirectories for a match. This is one of the most useful features of zsh and keeps me away from unnecessary use of find.

In lines 93 through 117, I treat the default case. The $files variable is set to an array of all the .tex files in a given folder, optionally using the double star to recurse through subdirectories. We will only delete auxiliary files that live in the same directory as a tex file of the same name. Notice lines 98 and 100, where the arrays are defined using globs.

In line 108, I delete each file using the substitution command ${f%%.tex} which removes the .tex extension from $f so I can replace it with the extension to be deleted. This syntax is also available in Bash.

My most common use of this is as rmtex -r to clean up a tree full of class notes, exams, and quizzes that I have been working on, so that I can find the PDF files more easily. If I’m feeling especially obsessive, I can always run rmtex -r ~, which takes a couple of minutes but leaves everything squeaky clean.

[Update: While zsh is the shell where I learned how to use arrays and advanced globs, that doesn’t mean that Bash doesn’t have the same capabilities. Turns out I should have done some Bash research.

Bash has arrays too! Arrays can be defined by globs, just as in zsh. The syntax is slightly different, but works just the same. Version 4 of Bash can even use ** for recursive globbing.

Thanks to John Purnell for the very gracious email. My horizons are expanded.]


Terminal Productivity

When it comes to working at the command line, there is really no limit to the amount of customization you can do. Sometimes it is hard to know where to start.

These are the three most helpful tools I use.

Autojump

Seth Brown introduced me to Autojump, and I am extremely grateful. It tracks which directories you use and lets you jump from one to another. All you have to do is type j and part of the directory name. So j arx will take me to ~/code/BibDesk/arxiv2bib, and then j nb will take me to ~/Sites/nb. It feels like magic.

If you have Homebrew, it can be installed with brew install autojump.

View man pages in BBEdit

I use this function all the time. Man pages aren’t very useful if they are clogging up the terminal. You can easily adapt this for your text editor.

function bbm() {
    cmd=$(tr '[a-z]' '[A-Z]' <<< "$1")
    man "$1" | col -b | /usr/local/bin/bbedit --view-top --clean -t "$cmd MANUAL"
}

The second line converts the name of the man page to upper case. This is used in the third line to set a title. The --clean option makes it so BBEdit doesn’t ask you if you want to save when you close the window.

Put the function definition into ~/.bash_profile.

IPython

I tried IPython several times before it stuck, but I can’t imagine going back to the standard Python interactive interpreter.

IPython has many, many features, but it is worth it just for the tab completion. Type the name of a variable, add a dot, and press tab to see all of the object’s methods and other attributes. Also, each history item contains the entire command, as opposed to just one line, making it possible to edit that for loop that you messed up.

The thing that kept me out of IPython for so long was that I didn't like its default settings and didn't want to figure out how to change them.

In case this is stopping anyone from having a much better Python experience, here is a condensed version of my ipython_config.py file for my default profile.

c = get_config()
c.TerminalIPythonApp.display_banner = False     # Skip startup message
c.TerminalInteractiveShell.confirm_exit = False # Ctrl-D means quit!
c.TerminalInteractiveShell.autoindent = False   # I can indent my own lines
c.PromptManager.in_template = '>>> '  # The IPython prompt is
c.PromptManager.in2_template = '... ' # useful, but I prefer
c.PromptManager.out_template = ''     # the standard
c.PromptManager.justify = False       # prompt.

For more information about where this should go, run ipython profile list.

IPython can be installed with easy_install ipython.


Adventures in Python Packaging

After my post two weeks ago about managing my library of academic papers, someone mentioned that I should upload the arxiv2bib script that I wrote to the Python Package Index (PyPI).

I had been curious about Python packaging before, but had never really understood it. This script was the perfect candidate for me to experiment on: a single Python file with no dependencies. So I dove in.

I’ll be honest, it wasn’t easy. In the end it worked, and I was even able to use my newfound knowledge to upload my more complicated Day One Export script, which has templates, multiple files, and dependencies. But I spent more time than I wanted to screwing things up. Worst of all, I don’t see any way I could have avoided all these mistakes. It really is that convoluted.

So here is my story. This is not a tutorial, but hopefully my journey will enlighten you. The links should be helpful too.

Distutils

Python packaging is centered around the setup.py script. Python comes with the distutils package, which makes creating the script really easy, assuming you don't want to do anything complicated. (Caveat: you often need to do something complicated.) Without any extra code, the distutils package empowers setup.py to build and install Python modules, upload them to PyPI, and even create a bare-bones Windows graphical installer.

I followed this guide from Dive into Python 3 (everything applies to Python 2). All you have to do is fill in the arguments to the setup script. Then you run python setup.py sdist to create a tar.gz containing the source. With python setup.py register, it registers the current version of the project on PyPI (you will need an account first). Finally, python setup.py upload will upload the distribution to PyPI.
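
For a single-file script, the entire setup.py can be very short. Here is a bare-bones sketch (the metadata values and paths are illustrative, not my actual configuration):

from distutils.core import setup

setup(
    name='arxiv2bib',
    version='1.0',  # illustrative
    description='Get arXiv metadata as a BibTeX entry',
    py_modules=['arxiv2bib'],
    scripts=['bin/arxiv2bib'],  # copied to /usr/local/bin or similar
)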

At this point, things were working, but not as well as I wanted. First of all, I wanted my script to work with either Python 2 or Python 3. This isn’t too difficult; I just copied some install code from Beautiful Soup.

I also wanted things to work on Windows, but this was much more difficult. You can designate certain files as “scripts”, and distutils will copy them into /usr/local/bin (or similar). On Windows, it copies to C:\Python27\Scripts, but Windows won’t recognize a Python script as executable unless it ends in .py. So I made the install script rename the file if it was installing on Windows.

Because setup.py is just a Python file, it can really do just about anything. (Another reason to use virtualenv, so that you don’t have to sudo someone else’s script.) But if you find yourself doing crazy things, take a deep breath, and just use setuptools.

Setuptools

Setuptools is not part of the Python standard library, but it is almost universally installed. This is the package that brings eggs and easy_install and is able to resolve dependencies on other packages. It extends the capabilities of distutils, and makes a lot of things possible with a lot less hacking.

There are plenty of setuptools guides. For my Day One Export script, I was most interested in declaring dependencies, which is done with the install_requires argument (as in this example). For both scripts, I was also interested in the entry_points argument, which allows you to make executable scripts that run both in Windows (by creating an exe wrapper) and in Unix (the usual way).
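
Both arguments go straight into the setup call. A sketch of the shape (the names, version, and dependency here are hypothetical, not the actual Day One Export configuration):

from setuptools import setup

setup(
    name='dayone_export',
    version='0.1',  # hypothetical
    py_modules=['dayone_export'],
    install_requires=['Jinja2'],  # dependencies resolved at install time
    entry_points={
        'console_scripts': [
            # makes a dayone_export executable on Unix
            # and a dayone_export.exe wrapper on Windows
            'dayone_export = dayone_export:main',
        ],
    },
)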

If I were to do it again, I would skip distutils and just use setuptools.

One thing I did stress about was what to do for users that don’t have setuptools installed. Some packages use distutils as a fallback, while others don’t. In the end, I settled for printing a nice error message if setuptools is not installed.

Distribute?

Here is where things get really confusing. Distribute is a fork of setuptools. For the most part, you can pretend it doesn’t exist. It acts like setuptools (but with fewer bugs), so some people will technically use distribute to install the package instead of setuptools. But this doesn’t affect you.

Distribute also has Python 3 support, so all Python 3 users will be using it instead of setuptools. Again, this doesn’t affect you much, except that distribute offers some tools to automatically run 2to3 during installation.

Update: Now you have even less reason to care about distribute, because it was merged back into setuptools.

The future!

It is confusing enough to have three different packaging systems, but the Python maintainers aren’t happy enough with setuptools/distribute to bring them into the standard library. The goal is to replace all three with a better system. It is currently being developed as distutils2, but will be called packaging when it lands in the standard library. At one point, this was scheduled to happen with Python 3.3, but that has been pushed back to version 3.4.

Was it worth it?

Well, I’m glad I finally understand how these things work. And this is the sort of thing that you will never understand until you do it yourself. So in that sense, it was worth it.

The number of people who will use a script that can be installed, especially via pip or easy_install, is likely an order of magnitude more than the number of people who would use the script otherwise. So packaging is the nice thing to do.

Even for my own use, it is nice to have these scripts “installed” instead of “in ~/bin”. I can use pip to check the version number, quickly install a new version, or uninstall.


BBEdit Grammar Highlighting

In January, Gabe Weatherhead posted a great way to proofread using BBEdit. His idea is to define a custom “language” with a list of commonly confused words as the “keywords”. So just like how in Python mode BBEdit changes the color of if and else, in Grammar mode it can highlight their and there, or whatever other words you specify.

To do this, you create a plist file and save it into the Language Modules folder inside the BBEdit Application Support folder. Gabe links to his plist file in the post referenced above.

Highlight regular expression matches

You can take this idea one step further. A language module can use a regular expression to define what a “string” should look like. This gives you a lot of flexibility. For example, you can use this to highlight all duplicate words in the document.

Screenshot with duplicate word colored red

Here is the regular expression that matches duplicate words. You put this into the language module plist as the String Pattern.

\b(?P<dupe>\w+)\s+(?P=dupe)\b

The \b is a word anchor, and matches at the beginning or end of a word. Then (?P<dupe>\w+) finds one or more word characters and captures them to the dupe register. Next, \s+ looks for one or more whitespace characters. Finally, (?P=dupe) looks for the contents of the dupe register. The final \b ensures the end of the match is at the end of a word.
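
The named-group syntax is the same one Python's re module uses, so you can test the pattern at a Python prompt:

import re

pattern = re.compile(r'\b(?P<dupe>\w+)\s+(?P=dupe)\b')
print pattern.findall('It was the the best of of times')
# prints ['the', 'of']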

BBEdit lets you customize the color scheme for each different language. So even though my strings are usually green, I made them red for the Grammar language.

Here is the language module I use: Grammar.plist.


Real time publishing with Google Reader


Google Reader fetches my RSS feed about once every hour. So if I publish a new post, it will be 30 minutes, on average, before the post appears there. If I notice a typo but Google Reader has already cached the feed, I have to wait patiently until the Feed Fetcher returns. In the meantime, everyone reads and makes fun of my mistake.

Here is how to let Google Reader know that it should cache your feed now. This uses a protocol called PubSubHubbub, sometimes abbreviated PSH or PuSH.

It seems like almost no one uses PSH. Even Google, who created the protocol, apparently has forgotten about it. But Google Reader supports it, and almost everyone uses Google Reader, so that makes this useful.

1. Add a hub to your feed

Include a rel="hub" link in your feed to specify a PSH hub. Google runs one at pubsubhubbub.appspot.com. In theory, there could be other hubs, but I don't know of any.

If you have an Atom feed, put the following line after the feed element open tag:

<link href="http://pubsubhubbub.appspot.com/" rel="hub" />

If you have an RSS feed, make sure that you first specify the Atom namespace in your rss tag:

<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">

Then include the following line after the channel element open tag:

<atom:link href="http://pubsubhubbub.appspot.com/" rel="hub" />

Now wait until the Google Bot reads your feed for the changes to take effect. Google Reader sees the link and asks the hub to notify it when the feed changes.

2. Ping the hub

Each time you publish a new post or edit a recent post, run this command at the terminal:

curl -d "hub.mode=publish&hub.url={feed_url}" "http://pubsubhubbub.appspot.com"

Of course, change {feed_url} to the URL for your feed. This command tells the hub to fetch your updated feed. The hub informs Google Reader and any other service that has asked to be notified. In my tests, it took about five seconds for the new (or updated) story to appear in Google Reader.
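
If you would rather not remember the curl incantation, the same ping is a few lines of Python (a sketch using only the standard library; substitute your own feed URL):

import urllib
import urllib2

data = urllib.urlencode({'hub.mode': 'publish',
                         'hub.url': 'http://example.com/atom.xml'})
urllib2.urlopen('http://pubsubhubbub.appspot.com', data)  # POST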

More information

At https://pubsubhubbub.appspot.com/topic-details?hub.url={feed_url}, you can see the last time the hub checked your feed.

Using PSH makes it so Google Reader polls your site less often, since it assumes you will notify them when something changes. So either go all in or don’t bother.


Jumpcut, TextExpander edition

I use Jumpcut to save and use my clipboard history. One downside to Jumpcut is that it doesn’t play nicely with TextExpander. Because TextExpander manipulates the clipboard, Jumpcut captures every expanded snippet into its history. Some people may be able to tolerate this kind of pollution, but I cannot.

So I don’t use TextExpander. I have often wanted to, but not enough to give up Jumpcut. There are clipboard managers built into Launchbar and Alfred which purportedly play nicely with TextExpander (I can verify Launchbar but not Alfred), but even this cannot persuade me to give up Jumpcut.

Jumpcut is open source, but I know nothing of Objective-C or Mac development. I do know that TextExpander identifies things it puts on the clipboard with org.nspasteboard.AutoGeneratedType. Luckily for me, Wolf Rentzsch, whose podcast I just discovered, made a patch to ignore passwords put on the clipboard by PasswordWallet, which apparently uses a similar identification scheme. It seemed straightforward to adjust his patch to avoid capturing TextExpander snippets.

Surprisingly, it worked. TextExpander, welcome to the team.

Here is the single line I changed in Jumpcut’s AppController.m:

        pbCount = [[NSNumber numberWithInt:[jcPasteboard changeCount]] retain];
        if ( type != nil ) {
        NSString *contents = [jcPasteboard stringForType:type];
-       if ( contents == nil ) {
+       if ( contents == nil || [jcPasteboard stringForType:@"org.nspasteboard.AutoGeneratedType"] ) {
    //            NSLog(@"Contents: Empty");
            } else {

I have a Jumpcut fork on Github. In addition to this change, I have incorporated some stylistic changes from Dr. Drang’s fork.

Obviously you’ll need Xcode to build it. I’d be happy to send a link to a binary if someone wants it, but I don’t really know what I’m doing. All I know is it works for me.


Organizing journal articles from the arXiv

This week, my method for keeping track of the journal articles I use went from kind of cool to super awesome, thanks to pdftotext and a tiny Perl script.

Most math papers I read come from the arXiv (pronounced archive), a scientific paper repository where the vast majority of mathematicians post their journal articles. This is the best place to find recent papers, since they are often posted here a year or more before they appear in a journal. It is a convenient place to find slightly older papers because journals have terrible websites. (Papers from the 1960s, which I do sometimes read, usually require a trip to the library.)

I use BibDesk to organize the papers I read, mostly because it works so well with Latex, which all mathematicians use to write papers. Also, it stores its database in a plain text file, and has done so since long before it was cool.

Every now and then I gather all the papers from my Dropbox and iPad and import them into BibDesk. For each PDF I got from the arXiv I do the following:

  1. Find the arXiv identification number, which is watermarked on the first page of the PDF.

  2. Use my script arxiv2bib, which I have written about before, to get the paper's metadata from the arXiv API. An AppleScript takes the result of the script and imports it into BibDesk.

  3. Drag the PDF onto the reference in BibDesk. BibDesk automatically renames the paper based on the metadata and moves it to a Dropbox subfolder.

Three steps is better than the ten it would take without AppleScript and the arXiv API, but why can’t the computer extract the identification number automatically?

Oh yeah, of course it can.

#!/bin/bash
pdftotext "$1" - | perl -ne 'if (/^arXiv:(\d{4}\.\d{4}v\d+)/) {print "$1\n"; last}'

The pdftotext utility comes with xpdf, which is available from Homebrew. Or you can download the binary linked at foolabs. It works as advertised.

The -n argument tells Perl to wrap the script in a while loop that processes stdin one line at a time. Here is what the Perl script would look like if I had put it in its own file.

#!/usr/bin/perl
while (<>) {
    if (/^arXiv:(\d{4}\.\d{4}v\d+)/) {
        print "$1\n";
        last;
    }
}

The regular expression looks for a line beginning with an arXiv identifier, which looks like arXiv:1203.1029v1. If it finds something, it prints the captured part, that is, the actual number. Then it exits the loop.
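
If Perl one-liners aren't your style, the same extraction is easy in Python (a sketch; it assumes pdftotext is on your PATH):

import re
import subprocess
import sys

# dump the PDF text to stdout and search it for an arXiv identifier
text = subprocess.check_output(['pdftotext', sys.argv[1], '-'])
match = re.search(r'^arXiv:(\d{4}\.\d{4}v\d+)', text, re.MULTILINE)
if match:
    print match.group(1)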

I can pipe the output of this script into arxiv2bib to fetch the metadata from the arXiv API. An AppleScript glues it all together, allowing me to select a whole bunch of PDFs and run the script. A few seconds later, and all the paper metadata is in BibDesk and the files are renamed and in the proper place.


Overzealous Gatekeeper

This week my Mac refused to open a plain text file, complaining that it was from an unidentified developer.

Gatekeeper dialog

Mac OS X 10.8 introduces a new feature called Gatekeeper, which prevents applications from running unless they are either from the App Store or signed by a registered developer.

You can turn Gatekeeper off, but I have kept it on so far. I am willing to run unsigned apps, but it is nice to be notified that they are unsigned before having to make that choice.

Don’t open that file!

I never expected to be prohibited from opening a text file. Double-clicking the file got me nowhere. Dragging the file onto BBEdit in my dock did nothing. TextEdit, no. Safari, no. Eventually I right-clicked the file and chose "Open", which is the prescribed way to get around Gatekeeper's restriction. Of course this worked, but it opened the file in Byword, which is not my default text editor. I was perplexed.

For this to happen, I found that two things were necessary. One, of course, the file had to be downloaded from the internet. Your web browser sets the “com.apple.quarantine” extended attribute, which tells the computer to be extra careful with this file of dubious origin. Two, the file must be set to open in some application other than the default application. You can change which application is set to open a file by selecting “Get Info” in the Finder, and changing which application appears under “Open with”. This information is stored in the file’s resource fork.

This is clearly a bug. Maybe the operating system would want to prevent someone from tricking me into opening a file in something other than my default application, but it should definitely allow me to open it by dragging it onto an application of my choice.

Here is a file that should make this happen on your computer: Harmless File.txt. In this case, I set the file to be opened with Safari, instead of whatever your default text editor is.

If you are curious, you can see the extended attributes by using the command xattr -l "Harmless File.txt" on the command line.

Don’t view that shell script!

There is one other situation that causes similar results. If you have an executable file with no extension, OS X identifies it as a “Unix Executable File” and by default uses Terminal to open it. Gatekeeper also prevents you from opening these files. This makes a little more sense, because opening them in the Terminal actually runs them, which is not what you should do with a random script downloaded from the internet.

What you should do instead is drag them onto a text editor and look at them. But Gatekeeper won't let you do this either. Worse, if you try to get around Gatekeeper by right-clicking and selecting "Open", the file gets executed in a Terminal window. Oops.

Unquarantine

These seem like edge cases, but they both hit me in the last week, so I created a service in Automator to clear the quarantine bit. (If you download the one I created, you will have to sidestep Gatekeeper, but for valid reasons.)

An automator service to unquarantine a file. Text is shown below.

Here is the shell script from the Automator service:

for f in "$@"
do
    xattr -d com.apple.quarantine "$f"
done

BBEdit filter: Rewrite shebang

This BBEdit filter allows you to cycle through choices for a script’s shebang line. It is a fun little script, and has been surprisingly useful to me.

BBEdit text filters are shell scripts that read from stdin and write to stdout. When you select a filter from the “Apply Text Filter” submenu of the “Text” menu, the text of the current document is fed through your shell script, and the document text is replaced by the output. If some text is selected, only that part passes through the script.

My Python script is super simple. It reads the first line of the file and determines the name of the interpreter. Then it calculates three possible shebang lines: the stock system command (e.g. /usr/bin/python), the one on my custom path (e.g. /usr/local/bin/python), and the env one (e.g. /usr/bin/env python). Then it cycles from one to the next each time you run it through the filter.

As an added bonus, the script is forgiving. So if I want a Python script, I can just write python on the first line, and the filter will change it to #!/usr/bin/python. This is good for me, because for some reason it always takes me three or four seconds to remember if it should be #! or !#. (At least only one of these makes sense. I have even worse problems remembering the difference between $! and !$ in bash.)

The Python script

 1  #!/usr/bin/python
 2  
 3  import sys
 4  import os
 5  
 6  LOCAL = "/usr/local/bin:/usr/local/python/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/texbin:/Users/grigg/bin"
 7  SYSTEM = "/usr/bin:/bin:/usr/sbin:/sbin"
 8  
 9  def which(command, path):
10      """Emulate the 'which' utility"""
11      for p in path.split(':'):
12          full_command = os.path.join(p, command)
13          if os.path.isfile(full_command) and os.access(full_command, os.X_OK):
14              return full_command
15      return ""
16  
17  transformations = [lambda s: which(s, SYSTEM),
18                     lambda s: which(s, LOCAL),
19                     lambda s: "/usr/bin/env " + s]
20  
21  # deal with the first line
22  line = original_line = next(sys.stdin).strip('\n')
23  if line[:2] == "#!":
24      line = line[2:].strip()
25  base = line.rsplit('/', 1)[-1]
26  if base[:4] == 'env ':
27      base = base[4:].strip()
28  if ' ' in base:
29      base, args = base.split(' ', 1)
30      args = ' ' + args
31  else:
32      args = ''
33  
34  # do the transformations
35  options = [T(base) for T in transformations]
36  # filter out the empty ones while appending args
37  options = [o + args for o in options if o]
38  # if the only one is the /usr/bin/env, don't do anything
39  if len(options) <= 1:
40      print original_line
41  else:
42      dedupe = list(set(options))
43      if line in dedupe:
44          dedupe.sort()
45          index = dedupe.index(line)
46          line = dedupe[(index + 1) % len(dedupe)]  # cycle
47      else:
48          # can't cycle, just use the first option
49          line = options[0]
50      print "#!" + line
51  
52  # print every other line
53  for line in sys.stdin: print line,

The possible transformations are listed beginning on line 17. The order of this list sometimes matters; it determines which transformation should be used if the current shebang doesn’t match any of the options (see line 49).

The current shebang is interpreted on lines 22 through 32. It’s pretty basic, just using the first word after the last slash as the interpreter. That should cover most use cases.

(I am very happy that when I write base[:4] in line 26, Python doesn’t complain if base has fewer than four characters. Contrast this with AppleScript, which fails if base is short and you say text 1 thru 4 of base. You can get around this by testing base begins with "env ". Sorry for the tangent.)

In lines 42 through 46, we get to the cycling. I deduplicate the list, alphabetize, look up the current shebang in the list, and cycle to the next. It is fun how the filter uses the first line of the input file to essentially save state, so it knows where to go next time it is run.


Install Python packages to a custom directory with virtualenv

Mountain Lion deleted all my Python packages, just like Lion did last year. I’m determined not to be fooled a third time, so this time I installed my packages to a custom directory using virtualenv.

Virtualenv is advertised as a way to create multiple isolated Python environments on the same computer, switch between them easily, etc. I don’t need all of that. I just want to control where Python packages are installed with no fuss. Virtualenv does that also.

Set up a virtual environment

First, install virtualenv. I recommend installing pip and running pip install virtualenv. If you think it is already installed but it isn’t working, this is probably because Mountain Lion deleted it with the rest of your packages. Reinstall virtualenv.

Second, run virtualenv /usr/local/python. This creates a new Python environment based at /usr/local/python. Of course, you can make this any path you want.

You now have two new directories (among others):

/usr/local/python/bin
/usr/local/python/lib/python2.7/site-packages

The site-packages directory is where your packages will be installed. The bin directory contains a special pip executable, which automatically installs new packages to your custom directory. It also contains a python executable, which will make these packages available to a Python session. Also, if you install any package that has a command line interface, the executable file will go in this bin directory. I added /usr/local/python/bin to (the front of) my PATH to make these easily accessible.

Make these packages available to the system’s Python

Third, create a file local.pth (name doesn’t matter, but extension does) inside the /Library/Python/2.7/site-packages folder with a single line:

/usr/local/python/lib/python2.7/site-packages

This tells the regular system Python to also load packages from my custom location. So even if I run /usr/bin/python, I will be able to import my packages.
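
As a quick sanity check (assuming the same /usr/local/python prefix as above), run /usr/bin/python and make sure the custom directory shows up on the path:

import sys

# the custom site-packages directory should appear in this list
print [p for p in sys.path if p.startswith('/usr/local/python')]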

As long as I always use the pip command that virtualenv created to install new packages, this third step is the only thing I will have to repeat next year, when OS X 10.9 “Volcano Lion” clears out the system site-packages folder again.


New line after punctuation, revisited

A while back, I wrote about ending lines at natural breaks, like at the end of phrases and sentences. In Latex documents, code documentation, git commits, and other situations where I am writing prose but “soft wrapping” is undesirable, this makes revising and rearranging the text much easier. My favorite part is that I no longer have to worry about refilling the lines every time I make a change.

In that post, I linked to a “new line after punctuation” BBEdit AppleScript I wrote. Some time later, I realized that using AppleScript to run BBEdit searches can be way more efficient than pure AppleScript or even Python alternatives. I thought I would use this to extend my new line script to also handle some math-breaking heuristics. That failed because it got too complicated for me to predict what it was going to do. The most important thing about text editor commands is that they be completely predictable so they don’t interrupt your editing flow.

Still, using BBEdit's find command simplified my new line script and made it one-third as long. I thought it would be another helpful example of how to script BBEdit, so I am posting it here. If you are looking for other examples of this kind of script, Oliver Taylor recently posted a collection of helpful cursor movement scripts, most of which use BBEdit's find function in a similar way.

The new “new line after punctuation”

tell application "BBEdit"
    -- save location of current cursor
    set cursor to selection
    set cursor_char to characterOffset of cursor

    tell document 1
        set search to find "^([ \\t]*)(.*)([.,!?;:])[ \\t]+" ¬
            searching in text 1 ¬
            options {search mode:grep, backwards:true} without selecting match

        if found of search and startLine of found object of search ¬
            is equal to startLine of cursor then
            -- found punctuation, insert return
            set replacement to grep substitution of "\\1\\2\\3\\r\\1"
            set diff to (replacement's length) ¬
                - (length of found object of search)
            set contents of found object of search to replacement
        else
            -- no punctuation, just insert a return here
            set search to find "^[ \\t]*" searching in text 1 ¬
                options {search mode:grep, backwards:true} without selecting match
            if found of search and startLine of found object of search ¬
                is equal to startLine of cursor then
                set indent to contents of found object of search
            else
                set indent to ""
            end if
            set diff to (length of indent) + 1
            set contents of character cursor_char to ¬
                ("\r" & indent & contents of character cursor_char)
        end if

        -- adjust cursor to keep it in the same relative location
        select insertion point before character (cursor_char + diff)
    end tell
end tell

We begin by telling BBEdit to get the selection object. This contains information about the cursor location if no text is selected. For convenience, we save the characterOffset of the selection, which is the character count from the beginning of the document to the cursor location.

The heavy lifting is done by the grep search in line 7. Here it is with undoubled backslashes, which had been escaped for AppleScript.

^([ \t]*)(.*)([.,!?;:])[ \t]+

The ^ anchors the search to the beginning of a line. The ([ \t]*) captures any combination of spaces and tabs to \1. The (.*) captures as many characters as possible to \2. The ([.,!?;:]) captures a single punctuation mark to \3. The final [ \t]+ requires one or more whitespace characters after the punctuation and captures this whitespace to the search result, but not to any particular group. This is so I don't insert a new line where there wasn't already whitespace, and also to throw away whatever whitespace was there.

Lines 8 and 9 search backwards from the current cursor position. If this command (and many that follow) weren’t inside a tell document 1 block, I would need text 1 of document 1 instead of text 1. If you allow BBEdit to select the match, you can see selection flashing while the script runs, which I don’t like. Also, it messes up the cursor location, which will change the result of future searches within the script.

Lines 11 and 12 make sure the script found something on the current line. If so, line 14 uses BBEdit’s grep substitution command to form a string with the whitespace, followed by the line up to the punctuation, followed by the punctuation, a return, and then the whitespace repeated. Line 17 does the replacement.

Lines 20 through 31 deal with the case that no punctuation is found. This just inserts a return at the current cursor location, matching the current line’s indentation.

Line 34 moves the cursor to a new location to make up for the text that has been added.

This script is also available as a gist. I have it assigned to the keyboard shortcut Control+Return.