Nathan Grigg

Taskpaper Inbox

Here’s my new favorite way to get tasks into TaskPaper. It’s a combination of Drafts, Dropbox, launchd, a Python script, and a shell script.

That sounds convoluted, but once each piece of the pipeline is in place, I just enter one or more tasks into Drafts on my phone, and three seconds later they are in my TaskPaper file on my Mac. It’s like iCloud, but without the mystery.

Merge new tasks into TaskPaper

I wrote a Python script to insert new tasks in the proper place in my TaskPaper file. Since TaskPaper files are just plain text, this is not too complicated.

My script reads in a text file and interprets each line as a new task. If the task has a project tag, it removes the tag, and then it groups the tasks by project. Anything without a project is assumed to be in the inbox. Next, it reads my main TaskPaper file, and figures out where each project begins and ends. Finally, it inserts each new task at the end of the appropriate project.
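
Here is a rough sketch of what that merge step might look like. To be clear, this is not the actual script; it assumes project tags are written like @project(Home), that projects are unindented lines ending in a colon, and that tasks are tab-indented lines starting with a dash.

#!/usr/bin/python
# Sketch only, under the assumptions stated above.
import re
import sys

def parse_inbox(lines):
    """Group new tasks by project; untagged tasks go to the Inbox."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        m = re.search(r"@project\((.+?)\)", line)
        project = m.group(1) if m else "Inbox"
        task = re.sub(r"\s*@project\(.+?\)", "", line)
        groups.setdefault(project, []).append("\t- " + task)
    return groups

def merge(taskpaper_lines, groups):
    """Insert each group of new tasks at the end of its project."""
    out = []
    current = None
    for line in taskpaper_lines:
        stripped = line.rstrip("\n")
        if stripped.endswith(":") and stripped[:1] not in ("\t", " "):
            # A new project is starting; flush pending tasks for the previous one.
            out.extend(t + "\n" for t in groups.pop(current, []))
            current = stripped[:-1]
        out.append(line)
    out.extend(t + "\n" for t in groups.pop(current, []))
    # Projects that don't exist yet (including the Inbox) are appended at the end.
    for project, tasks in groups.items():
        out.append(project + ":\n")
        out.extend(t + "\n" for t in tasks)
    return out

if __name__ == "__main__":
    inbox_file, taskpaper_file = sys.argv[1], sys.argv[2]
    new_tasks = parse_inbox(open(inbox_file).readlines())
    merged = merge(open(taskpaper_file).readlines(), new_tasks)
    open(taskpaper_file, "w").writelines(merged)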

A shell script calls the Python script with the correct arguments, merging my inbox.txt file into my tasks.taskpaper file, and deleting the now-redundant inbox.txt file. Update: To avoid corrupting my TaskPaper file, I use some AppleScript within this shell script to first save the file if it is open.

(Of course, the Python script could have done these last steps also, but it’s much better to make the Python script generic, so I can use it for other purposes.)

Watch inbox for changes

The next step is to automate the merging. This is where OS X’s launchd is useful. One solution would be to run the shell script on some kind of timed interval. But launchd is smarter than that.

Using the WatchPaths key, I can have the shell script run whenever my inbox.txt file is modified. Since OS X keeps an eye on all filesystem changes, this actually has a very low overhead and means that my shell script will be run within seconds of any modifications to inbox.txt.

Here is my Launch Agent definition, stored in a plist file in ~/Library/LaunchAgents.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>net.nathangrigg.taskpaper-merge-inbox</string>
    <key>Program</key>
    <string>/Users/grigg/bin/taskpaper_merge_inbox.sh</string>
    <key>StandardErrorPath</key>
    <string>/Users/grigg/Library/Logs/LaunchAgents/taskpaper_merge_inbox.log</string>
    <key>StandardOutPath</key>
    <string>/Users/grigg/Library/Logs/LaunchAgents/taskpaper_merge_inbox.log</string>
    <key>WatchPaths</key>
    <array>
        <string>/Users/grigg/Dropbox/Tasks/inbox.txt</string>
    </array>
</dict>
</plist>

Drafts and Dropbox

With the hard work out of the way, I just define a custom Dropbox action in Drafts that appends text to inbox.txt in my Dropbox folder. With no fuss, Drafts sends the new task or tasks off to Dropbox, which dutifully copies them to my Mac, which springs into action, merging them into my TaskPaper file.

With so many applications and services fighting to be the solution to all of our problems, it is refreshing to see tools that are happy solving their portion of a problem and letting you go elsewhere to solve the rest.


Automounting Time Machine

I use Time Machine to back up my home iMac to a USB external hard drive. But I don’t want the Time Machine volume mounted all of the time. It adds clutter and slows down Finder.

I’ve been using a shell script and a Launch Agent to automatically mount my Time Machine volume, back it up, and unmount it again.

Since this takes care of running Time Machine, I have Time Machine turned off in System Preferences.

Shell script

The shell script used to be more complicated, but Apple has been improving their tools. You could actually do this in three commands:

  1. Mount the volume (line 6).
  2. Start the backup (line 14). The --block flag prevents the command from exiting before the backup is complete.
  3. Eject the volume (line 16).

Everything else is either logging or logic to make sure that I only eject the volume if it wasn’t mounted to begin with. In particular, line 4 checks whether the Time Machine volume is already mounted when the script starts.

#!/bin/bash
date=$(date +"%Y-%m-%d %H:%M:%S")

if [[ -d "/Volumes/Time Machine Backups" ]]; then
    eject=false
elif diskutil quiet mount "Time Machine Backups"; then
    eject=true
else
    echo>&2 "$date Cannot mount backup volume"
    exit 1
fi

echo $date Starting backup
if tmutil startbackup --block; then
    echo $date Backup finished
    if [[ $eject = true ]]; then
        diskutil quiet eject "Time Machine Backups"
    fi
else
    echo>&2 "$date Backup failed"
    exit 1
fi

Launch Agent

Nothing complicated here. This uses launchd to run the shell script every two hours and capture the output to a log file.

I save this as “net.nathangrigg.time-machine.plist” in “/Library/LaunchDaemons”, so that it is run no matter who is logged in. If you do this, you need to use chown to set the owner to root, or it will not be run.

If you are the only one that uses your computer, you can just save it in “~/Library/LaunchAgents”, and you don’t have to worry about changing the owner.

Either way, run launchctl load /path/to/plist to load your agent for the first time. (Otherwise, it will load next time you log in to your computer.)

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>net.nathangrigg.time-machine</string>
    <key>Program</key>
    <string>/Users/grigg/bin/time-machine.sh</string>
    <key>StandardErrorPath</key>
    <string>/Users/grigg/Library/Logs/LaunchAgents/time-machine.log</string>
    <key>StandardOutPath</key>
    <string>/Users/grigg/Library/Logs/LaunchAgents/time-machine.log</string>
    <key>StartInterval</key>
    <integer>7200</integer>
</dict>
</plist>

Fstab

OS X will still mount your Time Machine volume every time you log in. You can fix this by adding one line to “/etc/fstab” (which you may need to create).

UUID=79CA38B7-BA13-4A15-A080-D3A8B568D860 none hfs rw,noauto

Replace the UUID with your drive’s UUID, which you can find using diskutil info "/Volumes/Time Machine Backups". For more detailed instructions, see this article by Topher Kessler.


LaunchControl for managing launchd jobs

Launchd is a Mac OS X job scheduler, similar to cron. One key advantage is that if your computer is asleep at a job’s scheduled time, it will run the job when your computer wakes up.

LaunchControl is a Mac app by soma-zone that helps manage launchd jobs. It aims to do “one thing well” and succeeds spectacularly. Whether you are new to writing launchd agents or you already have some system in place, go buy LaunchControl now.

(I tried to make this not sound like an advertisement, but I failed. This is not a paid advertisement.)

Complete control

At its core, LaunchControl is a launchd-specific plist editor. There is no magic. You simply drag the keys you want into your document and set their values. There is no translation layer, forcing you to guess what to type into the app to get the functionality you know launchd provides.

It is an excellent launchd reference. Every option is fully annotated, so you won’t have to search the man page or the internet to know what arguments you need to specify.

LaunchControl window

Helpful hints

LaunchControl is extremely helpful. If you specify an option that doesn’t make sense, it will tell you. If the script you want to run doesn’t exist or is not executable, it will warn you. If you are anything like me, this will save you four or five test runs as you iron out all of the details of a new job.

Debugging

LaunchControl also acts as a launchd dashboard. It lets you start jobs manually. It shows you which jobs are running, and for each job, whether the last run succeeded or failed. For jobs that fail, it offers to show you the console output. This is all information you could have found on your own, but it is very useful to have it all in one place and available when you need it.


Repeating tasks for TaskPaper

I’ve been kicking the tires of TaskPaper lately. I’m intrigued by its minimalist, flexible, plain-text approach to managing a to-do list.

I have a lot of repeating tasks, some with strange intervals. For example, once per year, I download a free copy of my credit report. But I can’t just do it every year on January 1, because if I’m busy one year and don’t do it until the 4th, I have to wait until at least the 4th the following year. You see the problem. The solution is to give myself a buffer, and plan on downloading my credit report every 55 weeks.

TaskPaper has no built-in support for repeating tasks, but its plain-text format makes it easy to manipulate using external scripts. So, for example, I can keep my repeating tasks in an external file, and then once a month have them inserted into my to-do list.

The plain-text calendar program when, which I also use to remember birthdays, seems like the perfect tool for the job. You store your calendar entries in a text file using a cron-like syntax, and you can also write more complicated patterns. For example, I put this line in my file:

!(j%385-116), Transunion credit report

The expression !(j%385-116) is true whenever the modified Julian day is equal to 116 modulo 385. This happens every 385 days, starting today.
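
If you want to check the arithmetic, here is a small sketch. It assumes that when’s j matches the standard modified Julian day, whose epoch is 1858-11-17; the offset 116 is specific to the day the entry was created.

from datetime import date, timedelta

MJD_EPOCH = date(1858, 11, 17)  # modified Julian day 0

def mjd(d):
    """Modified Julian day number of a date."""
    return (d - MJD_EPOCH).days

# Find the next few dates matching !(j%385-116), that is,
# dates whose modified Julian day is 116 modulo 385.
d, matches = date.today(), []
while len(matches) < 3:
    if mjd(d) % 385 == 116:
        matches.append(d)
    d += timedelta(days=1)
print(matches)  # three dates, each 385 days apart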

When I run when with my new calendar file, I get this output:

today      2014 Feb 22 Transunion credit report

I wrote a quick Python script to translate this into TaskPaper syntax.

#!/usr/bin/python

import argparse
from datetime import datetime
import re
import subprocess

WHEN = "/usr/local/bin/when"

def When(start, days, filename):
    command = [
            WHEN,
            "--future={}".format(days),
            "--past=0",
            "--calendar={}".format(filename),
            "--wrap=0",
            "--noheader",
            "--now={:%Y %m %d}".format(start),
            ]
    return subprocess.check_output(command)


def Translate(line):
    m = re.match(r"^\S*\s*(\d{4} \w{3} +\d+) (.*)$", line)
    try:
        d = datetime.strptime(m.group(1), "%Y %b %d")
    except (AttributeError, ValueError):
        return line
    return "    - {} @start({:%Y-%m-%d})".format(m.group(2), d)


def NextMonth(date):
    if date.month < 12:
        return date.replace(month=(date.month + 1))
    else:
        return date.replace(year=(date.year + 1), month=1)


def StartDateAndDays(next_month=False):
    date = datetime.today().replace(day=1)
    if next_month:
        date = NextMonth(date)
    days = (NextMonth(date) - date).days - 1
    return date, days


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
            description="Print calendar items in taskpaper format")
    parser.add_argument("filename", help="Name of calendar file")
    parser.add_argument("-n", "--next", action="store_true",
            help="Use next month instead of this month")
    args = parser.parse_args()

    date, days = StartDateAndDays(args.next)
    out = When(date, days, args.filename)
    for line in out.split('\n'):
        if line:
            print Translate(line)

This takes the when output, and translates it into something I can dump into my TaskPaper file:

- Transunion credit report @start(2014-02-22)

Calculating the rate of return of a 401(k)

After many years of school, I now have a Real Job. Which means I need to save for retirement. I don’t do anything fancy, just index funds in a 401(k). Nevertheless, I am curious about how my money is growing.

The trouble with caring even a little about the stock market is that all the news and charts focus on a day at a time. Up five percent, down a percent, down another two percent. I don’t care about that. I could average the price changes over longer periods of time, but that is not helpful because I’m making periodic contributions, so some dollars have been in the account longer than others.

What I really want to know is, if I put all my money into a savings account with a constant interest rate, what would that rate need to be to have the same final balance as my retirement account?

Now it’s math. A single chunk of money P with interest rate r becomes the well-known P·e^(rt) after t years. So if I invest a bunch of amounts P_i, each for a different t_i years at interest rate r, I get ∑ P_i·e^(r·t_i). I need to set this equal to the actual balance B of my account and solve for r.

At this point, I could solve the equation using something from scipy.optimize. But since I’m doing this for fun, I may as well write something myself. The nice thing about my interest function is that it increases if I increase r and decreases if I decrease r. (This is called monotonic and is a property of the exponential function, but it is also intuitively obvious.) So I can just pick values for r and plug them in, and I’ll immediately know if I need to go higher or lower. This is a textbook scenario for a binary search algorithm.

The following Python function will find when our monotonic function is zero.

from __future__ import division  # For Python 2.

def FindRoot(f, lower, upper, tolerance):
    """Find the root of a monotonically increasing function."""
    r = (lower + upper) / 2
    while abs(upper - lower) > tolerance:
        r = (lower + upper) / 2
        if f(r) > 0:
            upper = r
        else:
            lower = r
    return (lower + upper) / 2

This will look for a root between lower and upper, stopping when it gets within tolerance. At each stage of the loop, the difference between lower and upper is cut in half, which is why it is called binary search, and which means it will find the answer quickly.

Now suppose that I have a Python list transactions of pairs (amount, time), where amount is the transaction amount and time is how long ago in years (or fractions of years, in my case) the transaction happened. Also, I have the current balance stored in balance. The difference between our hypothetical savings account and our actual account is computed as follows:

import math

diff = lambda r: sum(p * math.exp(r * t) for p, t in transactions) - balance

Now I can use FindRoot to find when this is zero.

rate = FindRoot(diff, -5, 5, 1e-4)

This will go through the loop about 16 times, since the number of iterations is roughly log2((upper − lower)/tolerance).

The U.S. government mandates that interest rates be given as annual percentage yield (APY), which is the amount of interest you would earn on one dollar in one year, taking compounding into consideration. Since I have assumed interest is compounded continuously, I should convert to APY for easier comparison. In one year, one dollar compounded continuously becomes er. Subtracting the original dollar, I get the APY:

apy = math.exp(rate) - 1
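
As a self-contained check, here is the whole calculation with made-up numbers (three $1000 contributions at different times and a hypothetical current balance), using the FindRoot function from above.

import math

# Hypothetical data: (amount, years ago) pairs and the current balance.
transactions = [(1000, 2.0), (1000, 1.0), (1000, 0.5)]
balance = 3300

diff = lambda r: sum(p * math.exp(r * t) for p, t in transactions) - balance
rate = FindRoot(diff, -5, 5, 1e-4)  # continuously compounded rate
apy = math.exp(rate) - 1

print("rate = {:.4f}, APY = {:.2%}".format(rate, apy))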

Jekyll plugin to look up page by url

I have used Jekyll for this site ever since I first created it. I’ve contemplated switching to something Python and Jinja based, since I’m much more familiar with those tools than I am with Ruby. But there is something about Jekyll’s simple model that keeps me here. It’s probably for the best, since it mostly keeps me from fiddling, and there are better directions to steer my urge to fiddle.

Having said that, I couldn’t help but write one little plugin. I wrote this so I can look up a page or post by its URL. It is an excellent companion to Jekyll’s recent support for data files.

The plugin defines a new Liquid tag called assign_page, which works kind of like the built-in assign tag. If you write {% assign_page foo = '/archive.html' %}, it creates a variable called foo that refers to an object containing information about archive.html. You can then follow with {{ foo.title }} to get the page’s title.

The plugin code

Here is the code that I store in my _plugins folder.

module Jekyll
  module Tags
    class AssignPage < Liquid::Assign
      TrailingIndex = /index\.html$/

      def page_hash(context)
        reg = context.registers
        site = reg[:site]
        if reg[:page_hash].nil?
          reg[:page_hash] = Hash[ (site.posts + site.pages).collect {
            |x| [x.url.sub(TrailingIndex, ''), x]}]
        end
        return reg[:page_hash]
      end

      # Assign's Initializer stores variable name
      # in @to and the value in @from.
      def render(context)
        url = @from.render(context)
        page = page_hash(context)[url.sub(TrailingIndex, '')]
        raise ArgumentError.new "No page with url #{url}." if page.nil?
        context.scopes.last[@to] = page
        ''
      end
    end
  end
end

Liquid::Template.register_tag('assign_page', Jekyll::Tags::AssignPage)

On Line 3, you see that my AssignPage class is a subclass of Liquid’s Assign class. Assign defines an initialize method to parse the tag, storing the variable name in @to and the value in @from. By not overriding initialize, I get that functionality for free.

On Line 6, I define a function that creates a hash table associating URLs with pages. Liquid lets you store stuff in context.registers, and Jekyll stores the site’s structure in context.registers[:site]. Lines 10 and 11 create the hash table and store it in context.registers so I don’t have to recreate it for each assign_page tag. Ignoring the removal of trailing index.html, this is the same as the Python dictionary comprehension

{x.url: x for x in site.posts + site.pages}

Line 20 uses the hash table to look up the URL. The rest of the lines are pretty much copied from Assign. Line 19 evaluates @from, which lets you specify a variable containing the URL instead of just a URL. Line 22 puts the page in the proper variable. Line 23 is very important because Ruby functions return the result of the last statement. Since Liquid will print our function’s return value, we want to make sure it is blank.


Virtualenv makes upgrading simple

Apple has a history of erasing Python’s site-packages folder during operating system upgrades, leaving users without their third-party Python modules and breaking scripts everywhere. Although I’ve heard some reports that the upgrade to 10.9 leaves things alone, mine were wiped once again.

Last year when this happened, I vowed to switch everything over to virtualenv, which allows you to install packages in a custom location. With this setup, getting things working again was as easy as recreating my local.pth file:

sudo vim /Library/Python/2.7/site-packages/local.pth

with a single line containing the path to my virtualenv site packages:

/usr/local/python/lib/python2.7/site-packages
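
To confirm that the .pth file is picked up, a quick check from any Python prompt (assuming the virtualenv lives under /usr/local/python as above):

import sys
print(any(p.startswith("/usr/local/python") for p in sys.path))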

Reproducing Vim help as a fully cross-referenced PDF

It’s a long story, but for the last six months, I have been using Vim as my primary text editor. As I began to use Vim more often, I was frustrated by the lack of a tutorial that went beyond the basics. I finally found what I was looking for in Steve Losh’s Learn Vimscript the Hard Way, which is an excellent introduction to Vim’s power features. I also discovered the real reason there are no advanced tutorials, which is that everything you need to know is contained in Vim’s help files.

Vim’s documentation is incredibly complete and very useful. Unfortunately, it makes heavy use of cross references, and the cross references only work with Vim’s internal help viewer. I have no qualms about reading a reference document, but I would strongly prefer to do this kind of reading reclining on a couch with an iPad, rather than Control+F-ing my way through a read-only Vim buffer.

So I made it happen.

Choosing a format

I wanted a way to read and annotate the help files on my iPad. The files were available as HTML, but annotating HTML files is complicated. There are some apps that can annotate HTML, but there is no standard or portable way to do so.

I converted the HTML files to ePub using Calibre, but Vim’s help is very dependent on having lines that are 80 characters long. This caused problems in iBooks.

So instead, I settled on the old favorite, PDF. I can easily annotate a PDF on my iPad and then move those annotations to my computer or another device. Actually, the Vim documentation was already available in PDF format, but without the internal links.

To convert the Vim help files, which are specially formatted plain text, into a hyperlinked PDF, I started with Carlo Teubner’s HTML conversion script, which takes care of the syntax highlighting and linking. I just needed a way to programmatically make a PDF file.

Latex

Latex is clearly the wrong tool for the job. I don’t need the hyphenation or intelligent line breaking that Latex excels at. All I need is to display the text on a PDF page in a monospace font, preserving whitespace and line breaks. Latex ignores whitespace and line breaks.

But Latex is what I know, and I am very familiar with the hyperref package, which can make internal links for the cross references, so I used it anyway.

I used the fancyvrb package, which allows you to preserve whitespace and special characters, like the built-in verbatim environment does, but also allows you to use some Latex commands. This allowed me to do syntax highlighting and internal hyperlinks.

At one point, I ran into an issue where Latex was botching hyphenated urls. The good people at the Latex StackExchange site figured out how to fix it. The level at which they understand the inner workings of Tex amazes me.

The result

I produced an 11-megabyte, 2500-page monster that taught me enough to finally feel at home in Vim.


New adventures

Last month I received my mathematics Ph.D. from the University of Washington. My mother-in-law said my hat looked ridiculous, but I say tams are cool.

Graduation Photo

When I began this journey six years ago, my end goal was to become a math professor. Last year, when it was time to begin applying for jobs, I was less sure. I enjoyed the academic lifestyle, the teaching, and the learning, but research was something I did because I was supposed to. A happy academic has a burning desire to break new ground and make new discoveries in their field, but I struggled to nurture my spark.

I was scared to leave academia, thinking that either I was in short-term doldrums or that my fear of not getting a job was affecting my judgement. I applied for post docs, but as my academic future became more clear, I became more sure that I needed to do something else.

So I took the plunge, withdrew my academic applications, and started a new round of job applications. This week I started as a software engineer at a Large Tech Company. I’m excited for this next adventure!


Most common commands

I have been wanting to learn to use pyplot, but haven’t found the time. Last week I was inspired by Seth Brown’s post from last year on command line analytics, and I decided to make a graph of my most common commands.

I began using zshell on my home Mac about six months ago, and I have 15,000 lines of history since then:

$ wc -l ~/.zsh_history
   15273 /Users/grigg/.zsh_history

(Note it is also possible to get unlimited history in bash.)

I compiled a list of my top commands and made a bar chart using pyplot. Since git is never used by itself, I separated out the git subcommands. Here are the results:

top 20 commands

A couple of these are aliases: gis for git status and ipy for ipython. The lunchy command is a launchctl wrapper, j is part of autojump, and rmtex removes Latex log files.

Clearly, it is time to use bb as an alias for bbedit. I already have gic and gia set up as aliases for git commit and git add, but I need to use them more often.

Building the graph

The first step is parsing the history file. I won’t go into details, but I used Python and the Counter class, which takes a list and returns a dictionary-like object whose values are the frequency of each list item. After creating a list of commands, you count them like this:

from collections import Counter
top_commands = Counter(commands).most_common(20)
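
The parsing step glossed over here might look roughly like this. This is just a sketch, and it assumes zsh’s EXTENDED_HISTORY format, where each line looks like ": 1372604461:0;git status".

import re
from collections import Counter

commands = []
with open("/Users/grigg/.zsh_history") as f:
    for line in f:
        # Strip the extended-history prefix ": <timestamp>:<elapsed>;" if present.
        line = re.sub(r"^: \d+:\d+;", "", line).strip()
        if not line:
            continue
        words = line.split()
        cmd = words[0]
        if cmd == "git" and len(words) > 1:
            cmd = "git " + words[1]  # count git subcommands separately
        commands.append(cmd)

top_commands = Counter(commands).most_common(20)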

To make the bar chart, I mostly just copied the pyplot demo from the documentation. Here is what I did.

import matplotlib.pyplot as plt
import matplotlib
import numpy as np

width = 0.6
N = 20
ys = np.arange(N)

# change the font
matplotlib.rcParams['font.family'] = 'monospace'

# create a figure of a specific size
fig = plt.figure(figsize=(5, 5))

# create axes with grid
axes = fig.add_subplot(111, axisbelow=True)
axes.xaxis.grid(True, linestyle='-', color='0.75')

# set ymin, ymax explicitly
axes.set_ylim((-width / 2, N))

# set ticks and title
axes.set_yticks(ys + width / 2)
axes.set_yticklabels([x[0] for x in top_commands])
axes.set_title("Top 20 commands")

# put bars
axes.barh(ys, [x[1] for x in top_commands], width, color="purple")

# Without the bbox_inches, the longer labels got cut off
# 2x version. The fractional dpi is to make the pixel width even
fig.savefig('commands.png', bbox_inches='tight', dpi=160.1)

I still find pyplot pretty confusing. There are several ways to accomplish everything. Sometimes you use module functions and sometimes you create objects. Lots of functions return data that you just throw away. But it works!


Amazon S3 has two types of redirects

In the time between when I read up on S3 redirects and when I published a post on what I had learned, Amazon created a second way to redirect parts of S3 websites.

The first way redirects a single URL at a time. These are the redirects I already knew about, which were introduced last October. They are created by attaching a special piece of metadata to an S3 object.

The second way was introduced in December, and redirects based on prefix. This is probably most useful for redirecting entire folders. You can either rewrite a folder name, preserving the rest of the URL, or redirect the entire folder to a single URL. This kind of redirect is created by uploading an XML document containing all of the redirect rules. You can create and upload the XML, without actually seeing any XML, by descending through boto’s hierarchy until you find boto.s3.bucket.Bucket.configure_website.


Managing Amazon S3 Redirects

This week I put together a Python script to manage Amazon S3’s web page redirects. It’s a simple script that uses boto to compare a list of redirects to files in an S3 bucket, then upload any that are new or modified. When you remove a redirect from the list, it is deleted from the S3 bucket. The script is posted on GitHub.

I use Amazon S3 to host this blog. It is a cheap and low-maintenance way to host a static website, although these advantages come with a few drawbacks. For example, up until a few months ago you couldn’t even redirect one URL to another. On a standard web host, this is as easy as making some changes to a configuration file.

Amazon now supports redirects, but they aren’t easy to configure. To set a redirect, you upload a file to your S3 bucket and set a particular piece of metadata. The contents of the file don’t matter; usually you use an empty file. You can use Amazon’s web interface to set the metadata, but this is obviously not a good long-term solution.

Update: There are actually two types of Amazon S3 redirects. I briefly discuss the other here.

So I wrote a Python script. This was inspired partly by a conversation I had with Justin Blanton, and partly by the horror I felt when I ran across a meta refresh on my site from the days before Amazon supported redirects.

Boto

The Boto library provides a pretty good interface to Amazon’s API. (It encompasses the entire API, but I am only familiar with the S3 part.) It does a good job of abstracting away the details of the API, but the documentation is sparse.

The main Boto objects I need are the bucket object and the key object, which of course represent an S3 bucket and a key inside that bucket, respectively.

The script

The script (listed below) connects to Amazon and creates the bucket object on lines 15 and 16. Then it calls bucket.list() on line 17 to list the keys in the bucket. Because of the way the API works, the listed keys will have some metadata (such as size and md5 hash) but not others (like content type or redirect location). We load the keys into a dictionary, indexed by name.

Beginning on line 20, we loop through the redirects that we want to sync. What we do next depends on whether or not the given redirect already exists in the bucket. If it does exist, we remove it from the dictionary (line 23) so it won’t get deleted later. If on the other hand it does not exist, we create a new key. (Note that bucket.new_key on line 25 creates a key object, not an actual key on S3.) In both cases, we use key.set_redirect on line 32 to upload the key to S3 with the appropriate redirect metadata set.

Line 28 short-circuits the loop if the redirect we are uploading is identical to the one on S3. Originally I was going to leave this out, since it requires a HEAD request in the hopes of preventing a PUT request. But HEAD requests are cheaper and probably faster, and in most cases I would expect the majority of the redirects to already exist on S3, so we will usually save some requests. Also, I wanted it to be able to print out only the redirects that had changed.

At the end, we delete each redirect on S3 that we haven’t seen yet. Line 40 uses Python’s ternary if to find each key’s redirect using get_redirect, but only if the key’s size is zero. This is to prevent unnecessary requests to Amazon.

I posted a more complex version of the code on GitHub that has a command line interface, reads redirects from a file, and does some error handling.

#!/usr/bin/python
from boto.s3.connection import S3Connection

DRY_RUN = True
DELETE = True  # delete other redirects?
ACCESS = "your-aws-access-key"
SECRET = "your-aws-secret-key"
BUCKET = "name.of.bucket"
REDIRECTS = [("foo/index.html", "/bar"),
                ("google.html", "http://google.com"),
            ]
if DRY_RUN: print "Dry run"

# Download keys from Amazon
conn = S3Connection(ACCESS, SECRET)
bucket = conn.get_bucket(BUCKET)
remote_keys = {key.name: key for key in bucket.list()}

# Upload keys
for local_key, location in REDIRECTS:
    exists = bool(local_key in remote_keys)
    if exists:
        key = remote_keys.pop(local_key)
    else:
        key = bucket.new_key(local_key)

    # don't re-upload identical redirects
    if exists and location == key.get_redirect():
        continue

    if not DRY_RUN:
        key.set_redirect(location)
    print "{2:<6} {0} {1}".format(
        local_key, location, "update" if exists else "new")

# Delete keys
if DELETE:
    for key in remote_keys.values():
        # assume size-non-zero keys aren't redirects to save requests
        redirect = key.get_redirect() if key.size == 0 else None
        if redirect is None:
            continue
        if not DRY_RUN:
            key.delete()
        print "delete {0} {1}".format(key.name, redirect)

Removing Latex log and auxiliary files

(updated )

How to write a shell script to delete Latex log files. Also, why you should think about using zsh. [Update: In addition, I reveal my complete ignorance of Bash. See the note at the end.]

I try not to write a lot of shell scripts, because they get long and complicated quickly and they are a pain to debug. I made an exception recently because Latex auxiliary files were annoying me, and a zsh script seemed to be a better match than Python for what I wanted to do. Of course, by the time I was finished adding in the different options I wanted, Python may have been the better choice. Oh well.

For a long time I have had an alias named rmtex which essentially did rm *.aux *.log *.out *.synctex.gz to rid the current directory of Latex droppings. This is a dangerous alias because it assumes that all *.log files in the directory come from Latex files and are thus unimportant. But I’m careful and have never accidentally deleted anything (at least not in this way). What I really wanted, though, was a way to make rmtex recurse through subdirectories, which requires more safety.

Here is what I came up with. (I warned you it was long!) I will point out some of the key points, especially the useful things that zsh provides.

#!/usr/local/bin/zsh

# suppress error message on nonmatching globs
setopt local_options no_nomatch

USAGE='USAGE: rmtex [-r] [-a] [foo]

Argument:
    [foo]   file or folder (default: current directory)

Options:
    [-h]    Show help and exit
    [-r]    Recurse through directories
    [-a]    Include files that do not have an associated tex file
    [-n]    Dry run
    [-v]    Verbose
'

# Option defaults
folders=(.)
recurse=false
all=false
dryrun=false
verb=false
exts=(aux synctex.gz log out)

# Process options
while getopts ":ranvh" opt; do
    case $opt in
    r)
        recurse=true
        ;;
    a)
        all=true
        ;;
    n)
        dryrun=true
        verb=true
        ;;
    v)
        verb=true
        ;;
    h)
        echo $USAGE
        exit 0
        ;;
    \?)
        echo "rmtex: Invalid option: -$OPTARG" >&2
        exit 1
        ;;
    esac
done

# clear the options from the argument string
shift $((OPTIND-1))

# set the folders or files if given as arguments
if [ $# -gt 0 ]; then
    folders=$@
fi

# this function performs the rm and prints the verbose messages
function my_rm {
    if $verb; then
        for my_rm_g in $1; do
            if [ -f $my_rm_g ]; then
                echo rm $my_rm_g
            fi
        done
    fi

    if ! $dryrun; then
        rm -f $1
    fi
}

# if all, then just do the removing without checking for the tex file
if $all; then
    for folder in $folders; do
        if [[ -d $folder ]]; then
            if $recurse; then
                for ext in $exts; my_rm $folder/**/*.$ext
            else
                for ext in $exts; my_rm $folder/*.$ext
            fi
        else
            # handle the case that they gave a file rather than folder
            for ext in $exts; my_rm "${folder%%.tex}".$ext
        fi
    done

else
    # loop through folders
    for folder in $folders; do
        # set list of tex files inside folder
        if [[ -d $folder ]]; then
            if $recurse; then
                files=($folder/**/*.tex)
            else
                files=($folder/*.tex)
            fi
        else
            # handle the case that the "folder" is actually a single file
            files=($folder)
        fi
        for f in $files; do
            for ext in $exts; do
                my_rm "${f%%.tex}".$ext
            done
        done
    done
fi

# print a reminder at the end of a dry run
if $dryrun; then
    echo "(Dry run)"
fi

It starts out nice and easy with a usage message. (Always include a usage message!) Then it processes the options using getopts.

Zsh has arrays! Notice line 20 defines the default $folders variable to be an array containing only the current directory. Similarly, line 25 defines the extensions we are going to delete, again using an array.

On the subject of arrays, notice that $@ in line 59, which represents the entire list of arguments passed to rmtex, is also an array. So you don’t have to worry about writing "$@" to account for arguments with spaces, like you would have to in Bash.

Lines 63 to 75 define a function my_rm which runs rm, but optionally prints the name of each file that it is deleting. It also allows a “dry run” mode.

On to the deleting. First I handle the dangerous case, which is when the -a option is given. This deletes all files of the given extensions, like my old alias. Notice the extremely useful zsh glob in line 82. The double star means to look in all subdirectories for a match. This is one of the most useful features of zsh and keeps me away from unnecessary use of find.

In lines 93 through 117, I treat the default case. The $files variable is set to an array of all the .tex files in a given folder, optionally using the double star to recurse through subdirectories. We will only delete auxiliary files that live in the same directory as a tex file of the same name. Notice lines 98 and 100, where the arrays are defined using globs.

In line 108, I delete each file using the substitution command ${f%%.tex} which removes the .tex extension from $f so I can replace it with the extension to be deleted. This syntax is also available in Bash.

My most common use of this is as rmtex -r to clean up a tree full of class notes, exams, and quizzes that I have been working on, so that I can find the PDF files more easily. If I’m feeling especially obsessive, I can always run rmtex -r ~, which takes a couple of minutes but leaves everything squeaky clean.

[Update: While zsh is the shell where I learned how to use arrays and advanced globs, that doesn’t mean that Bash doesn’t have the same capabilities. Turns out I should have done some Bash research.

Bash has arrays too! Arrays can be defined by globs, just as in zsh. The syntax is slightly different, but works just the same. Version 4 of Bash can even use ** for recursive globbing.

Thanks to John Purnell for the very gracious email. My horizons are expanded.]


Terminal Productivity

When it comes to working at the command line, there is really no limit to the amount of customization you can do. Sometimes it is hard to know where to start.

These are the three most helpful tools I use.

Autojump

Seth Brown introduced me to Autojump, and I am extremely grateful. It tracks which directories you use and lets you jump from one to another. All you have to do is type j and part of the directory name. So j arx will take me to ~/code/BibDesk/arxiv2bib, and then j nb will take me to ~/Sites/nb. It feels like magic.

If you have homebrew, it can be installed with brew install autojump.

View man pages in BBEdit

I use this function all the time. Man pages aren’t very useful if they are clogging up the terminal. You can easily adapt this for your text editor.

function bbm() {
cmd=$(tr '[a-z]' '[A-Z]' <<< "$1")
man $1 | col -b | /usr/local/bin/bbedit --view-top --clean -t "$cmd MANUAL"
}

The second line converts the name of the man page to upper case. This is used in the third line to set a title. The --clean option makes it so BBEdit doesn’t ask you if you want to save when you close the window.

Put the function definition into ~/.bash_profile.

IPython

I tried IPython several times before it stuck, but I can’t imagine going back to the standard Python interactive interpreter.

IPython has many, many features, but it is worth it just for the tab completion. Type the name of a variable, add a dot, and press tab to see all of the object’s methods and other attributes. Also, each history item contains the entire command, as opposed to just one line, making it possible to edit that for loop that you messed up.

The thing that kept me out of IPython for so long was that I didn’t like its default settings and didn’t want to figure out how to change them.

In case this is stopping anyone from having a much better Python experience, here is a condensed version of my ipython_config.py file for my default profile.

c = get_config()
c.TerminalIPythonApp.display_banner = False     # Skip startup message
c.TerminalInteractiveShell.confirm_exit = False # Ctrl-D means quit!
c.TerminalInteractiveShell.autoindent = False   # I can indent my own lines
c.PromptManager.in_template = '>>> '  # The IPython prompt is
c.PromptManager.in2_template = '... ' # useful, but I prefer
c.PromptManager.out_template = ''     # the standard
c.PromptManager.justify = False       # prompt.

For more information about where this should go, run ipython profile list.

IPython can be installed with easy_install ipython.


Adventures in Python Packaging

After my post two weeks ago about managing my library of academic papers, someone mentioned that I should upload the arxiv2bib script that I wrote to the Python Package Index (PyPI).

I had been curious about Python packaging before, but had never really understood it. This script was the perfect candidate for me to experiment on: a single Python file with no dependencies. So I dove in.

I’ll be honest, it wasn’t easy. In the end it worked, and I was even able to use my newfound knowledge to upload my more complicated Day One Export script, which has templates, multiple files, and dependencies. But I spent more time than I wanted to screwing things up. Worst of all, I don’t see any way I could have avoided all these mistakes. It really is that convoluted.

So here is my story. This is not a tutorial, but hopefully my journey will enlighten you. The links should be helpful too.

Distutils

Python packaging is centered around the setup.py script. Python comes with the distutils package, which makes creating the script really easy, assuming that you don’t want to do anything complicated. (Caveat: you often need to do something complicated.) Without needing any extra code, the distutils package empowers setup.py to build and install python modules, upload them to PyPI, even create a bare bones Windows graphical installer.

I followed this guide from Dive into Python 3 (everything applies to Python 2). All you have to do is fill in the arguments to the setup script. Then you run python setup.py sdist to create a tar.gz containing the source. With python setup.py register, it registers the current version of the project on PyPI (you will need an account first). Finally, python setup.py upload will upload the distribution to PyPI.
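
For a script like this, the setup.py really is short. A minimal distutils version might look something like the following; the metadata values and file paths here are only illustrative.

from distutils.core import setup

setup(
    name="arxiv2bib",
    version="1.0",                 # illustrative values
    description="Turn arXiv IDs into BibTeX entries",
    author="Nathan Grigg",
    py_modules=["arxiv2bib"],      # the single module to install
    scripts=["bin/arxiv2bib"],     # copied into /usr/local/bin or similar
)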

At this point, things were working, but not as well as I wanted. First of all, I wanted my script to work with either Python 2 or Python 3. This isn’t too difficult; I just copied some install code from Beautiful Soup.

I also wanted things to work on Windows, but this was much more difficult. You can designate certain files as “scripts”, and distutils will copy them into /usr/local/bin (or similar). On Windows, it copies to C:\Python27\Scripts, but Windows won’t recognize a Python script as executable unless it ends in .py. So I made the install script rename the file if it was installing on Windows.

Because setup.py is just a Python file, it can really do just about anything. (Another reason to use virtualenv, so that you don’t have to sudo someone else’s script.) But if you find yourself doing crazy things, take a deep breath, and just use setuptools.

Setuptools

Setuptools is not part of the Python standard library, but it is almost universally installed. This is the package that brings eggs and easy_install and is able to resolve dependencies on other packages. It extends the capabilities of distutils, and makes a lot of things possible with a lot less hacking.

There are plenty of setuptools guides. For my Day One Export script, I was most interested in declaring dependencies, which is done with the install_requires argument (as in this). For both scripts, I was also interested in the entry_points argument, which allows you to make executable scripts that run both in Windows (by creating an exe wrapper) and in Unix (the usual way).
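
As a sketch, a setuptools-based setup.py using both of those arguments might look like this; the package name, requirement list, and the module path inside the entry point are all hypothetical.

from setuptools import setup

setup(
    name="dayone_export",
    version="0.1",
    packages=["dayone_export"],
    install_requires=["jinja2"],  # hypothetical dependency list
    entry_points={
        "console_scripts": [
            # Installs an executable named dayone_export that calls the
            # run() function in dayone_export/cli.py (hypothetical path).
            "dayone_export = dayone_export.cli:run",
        ],
    },
)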

If I were to do it again, I would skip distutils and just use setuptools.

One thing I did stress about was what to do for users that don’t have setuptools installed. Some packages use distutils as a fallback, while others don’t. In the end, I settled for printing a nice error message if setuptools is not installed.

Distribute?

Here is where things get really confusing. Distribute is a fork of setuptools. For the most part, you can pretend it doesn’t exist. It acts like setuptools (but with fewer bugs), so some people will technically use distribute to install the package instead of setuptools. But this doesn’t affect you.

Distribute also has Python 3 support, so all Python 3 users will be using it instead of setuptools. Again, this doesn’t affect you much, except that distribute offers some tools to automatically run 2to3 during installation.

Update: Now you have even less reason to care about distribute, because it was merged back into setuptools.

The future!

It is confusing enough to have three different packaging systems, but the Python maintainers aren’t happy enough with setuptools/distribute to bring them into the standard library. The goal is to replace all three with a better system. It is currently being developed as distutils2, but will be called packaging when it lands in the standard library. At one point, this was scheduled to happen with Python 3.3, but that has been pushed back to version 3.4.

Was it worth it?

Well, I’m glad I finally understand how these things work. And this is the sort of thing that you will never understand until you do it yourself. So in that sense, it was worth it.

The number of people who will use a script that can be installed, especially via pip or easy_install, is likely an order of magnitude more than the number of people who would use the script otherwise. So packaging is the nice thing to do.

Even for my own use, it is nice to have these scripts “installed” instead of “in ~/bin”. I can use pip to check the version number, quickly install a new version, or uninstall.