Saturday, May 21, 2011
Tuesday, May 17, 2011
I just discovered (after way too much time poking about) that Unix sort does not split fields the way I naively assumed. I had thought that it split on whitespace, the way that awk does, but it does not. It splits at the zero-length boundary between a non-blank character and the blank that follows it -- and the blanks then become the leading part of the next field.
The nasty consequence of this is that whitespace-tabulated data that have a varying number of spaces (or tabs) between fields -- as opposed to fields separated by a single space, or a single tab, or single comma if you use '-t,' -- will not sort the way you might think.
$ STRING="fibble ab  de\ngorkle  bc cd\n"
$ printf "$STRING"
fibble ab  de
gorkle  bc cd
$ printf "$STRING" | sort -k 2,2
gorkle  bc cd
fibble ab  de
That is, the two spaces in front of 'bc' make '  bc' sort ahead of ' ab'.
$ printf "$STRING" | sort -k 3,3
fibble ab  de
gorkle  bc cd
Here the two spaces in front of 'de' make '  de' sort ahead of ' cd'.
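To see where sort's default rule puts the field boundaries, one rough way is to mark every transition from a non-blank character to a blank. This is only a sketch: show_sort_fields is a made-up helper name, and the sed expression approximates the rule rather than reproducing sort's actual code. The sample lines are the two used here, with two spaces before 'de' and before 'bc'.

```shell
# Insert a '|' at each zero-width gap between a non-blank character and
# the blank that follows it, which is where sort (with no -t option)
# starts a new field. show_sort_fields is a hypothetical helper name.
show_sort_fields() {
    sed -E 's/([^[:blank:]])([[:blank:]])/\1|\2/g'
}

printf 'fibble ab  de\ngorkle  bc cd\n' | show_sort_fields
# prints:
# fibble| ab|  de
# gorkle|  bc| cd
```

Everything after a '|' up to the next '|' is one field, leading blanks included, which is why field 2 of the second line is '  bc' rather than 'bc'.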
This explains a great deal of bizarre behavior I've dealt with over the years, stuff I never had the time to drill down and deal with.
My usual fix for this sort of situation is to collapse whitespace into a single space character, sort, and then use my ~/bin/tabulate script on the end:
$ printf "$STRING" | perl -pe 's/[ \t]+/ /g' | sort -k 3,3 | ~/bin/tabulate
gorkle bc cd
fibble ab de
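Another option, which I offer as a sketch rather than a tested replacement for the pipeline above, is sort's -b flag. It tells sort to ignore leading blanks when it locates each key, so the original spacing can stay in place. Setting LC_ALL=C is my assumption here, made only so that the comparison is plain byte order; the variable name sorted is just for illustration.

```shell
# -b skips leading blanks when finding each sort key, so field 3 is
# compared as 'de' vs 'cd' instead of '  de' vs ' cd'.
# LC_ALL=C is an assumption, used to force byte-order collation.
sorted=$(printf 'fibble ab  de\ngorkle  bc cd\n' | LC_ALL=C sort -b -k 3,3)
printf '%s\n' "$sorted"
# prints:
# gorkle  bc cd
# fibble ab  de
```

The lines come out in the expected order and keep their original spacing, so no re-tabulation step is needed afterwards.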
I hope someone might find this useful. In other words, I hope I'm not the only one who took this long to understand this. :-)
From 'info sort' on Ubuntu:
`-t SEPARATOR'
`--field-separator=SEPARATOR'
     Use character SEPARATOR as the field separator when finding the sort keys in each line. By default, fields are separated by the empty string between a non-blank character and a blank character. By default a blank is a space or a tab, but the `LC_CTYPE' locale can change this.

     That is, given the input line ` foo bar', `sort' breaks it into fields ` foo' and ` bar'. The field separator is not considered to be part of either the field preceding or the field following, so with `sort -t " "' the same input line has three fields: an empty field, `foo', and `bar'. However, fields that extend to the end of the line, as `-k 2', or fields consisting of a range, as `-k 2,3', retain the field separators present between the endpoints of the range.

     To specify ASCII NUL as the field separator, use the two-character string `\0', e.g., `sort -t '\0''.
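The empty-field behavior described in that excerpt can be seen with the same kind of unevenly spaced input. This is a small illustration under my own assumptions: the variable name by_space is arbitrary, and LC_ALL=C is set only to make the ordering byte-wise.

```shell
# With -t ' ', every single space is a separator, so two adjacent spaces
# enclose an empty field:
#   'fibble ab  de' -> fibble, ab, '', de
#   'gorkle  bc cd' -> gorkle, '', bc, cd
# Field 3 is therefore '' on the first line and 'bc' on the second.
by_space=$(printf 'fibble ab  de\ngorkle  bc cd\n' | LC_ALL=C sort -t ' ' -k 3,3)
printf '%s\n' "$by_space"
# prints:
# fibble ab  de
# gorkle  bc cd
```

So an explicit `-t ' '` does not rescue data with varying numbers of spaces either: the columns just shift into different field numbers from line to line.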