Tuesday, May 17, 2011
Unix sort Insanity (Or is it Just Me?)
I just discovered (after far too much time poking about) that Unix sort does not split fields the way I naively assumed. I had thought that it split on runs of whitespace, the way that awk does, but it does not. By default it splits at the zero-length boundary between a non-blank character and the blank (space or tab) that follows it -- and then the blanks become the leading part of the next field.
The nasty consequence of this is that whitespace-tabulated data that have a varying number of spaces (or tabs) between fields -- as opposed to fields separated by a single space, or a single tab, or single comma if you use '-t,' -- will not sort the way you might think.
For example:
$ STRING="fibble ab  de\ngorkle  bc cd\n"
$ printf "$STRING"
fibble ab  de
gorkle  bc cd
$ printf "$STRING" | sort -k 2,2
gorkle  bc cd
fibble ab  de
That is, the two spaces in front of 'bc' make '  bc' sort ahead of ' ab'.
And
$ printf "$STRING" | sort -k 3,3
fibble ab  de
gorkle  bc cd
Here the two spaces in front of 'de' make '  de' sort ahead of ' cd'.
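To make the keys that sort compares visible, here is a rough sketch that prints each default field in brackets, blanks included. The show_sort_fields helper is a made-up name of mine, not part of sort, and it only approximates the default splitting rule (no -t, blanks limited to space and tab). The plain awk line is there for contrast.

```shell
# Hypothetical helper: approximate sort's default field splitting.
# Each field is a run of leading blanks plus the non-blanks that follow.
show_sort_fields() {
  awk '{
    rest = $0
    while (match(rest, /^[ \t]*[^ \t]+/)) {
      printf "[%s]\n", substr(rest, RSTART, RLENGTH)
      rest = substr(rest, RSTART + RLENGTH)
    }
  }'
}

# Fields roughly as sort sees them: the blanks stay attached to the next field.
printf 'gorkle  bc cd\n' | show_sort_fields

# Fields as awk sees them: the blanks are thrown away.
printf 'gorkle  bc cd\n' | awk '{ for (i = 1; i <= NF; i++) printf "[%s]\n", $i }'
```

The first command prints [gorkle], [  bc] and [ cd] on separate lines, while the awk version prints [gorkle], [bc] and [cd].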
This explains a great deal of bizarre behavior I've dealt with over the years, stuff I never had the time to drill down and deal with.
My usual fix for this sort of situation is to collapse whitespace into a single space character, sort, and then use my ~/bin/tabulate script on the end:
$ printf "$STRING" |
perl -pe 's/[ \t]+/ /g' |
sort -k 3,3 |
~/bin/tabulate
gorkle bc cd
fibble ab de
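Another route, as far as I can tell from the sort documentation, is the -b option, which tells sort to ignore leading blanks when it locates a key, so the input does not have to be rewritten at all. A minimal sketch using the same sample data. LC_ALL=C is my own addition to pin plain byte-order comparison, since other locales may weigh blanks differently.

```shell
# Same sample data as above: two spaces before 'bc' and before 'de'.
STRING='fibble ab  de\ngorkle  bc cd\n'

# -b given before -k applies to every key: leading blanks are skipped
# when the key is located, so 'cd' is compared against 'de'.
by_third=$(printf "$STRING" | LC_ALL=C sort -b -k 3,3)
printf '%s\n' "$by_third"

# The same idea attached to a single key with the 'b' modifier.
by_second=$(printf "$STRING" | LC_ALL=C sort -k 2b,2)
printf '%s\n' "$by_second"
```

Unlike the collapse-and-retabulate pipeline, this leaves the original spacing of each line untouched. Only the comparison changes.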
I hope someone might find this useful. In other words, I hope I'm not the only one who took this long to understand this. :-)
From 'info sort' on Ubuntu:
`-t SEPARATOR'
`--field-separator=SEPARATOR'
Use character SEPARATOR as the field separator
when finding the sort keys in each line. By
default, fields are separated by the empty
string between a non-blank character and a
blank character. By default a blank is a space
or a tab, but the `LC_CTYPE' locale can change
this.
That is, given the input line ` foo bar', `sort'
breaks it into fields ` foo' and ` bar'. The
field separator is not considered to be part of
either the field preceding or the field following,
so with `sort -t " "' the same input line has
three fields: an empty field, `foo', and `bar'.
However, fields that extend to the end of the
line, as `-k 2', or fields consisting of a range,
as `-k 2,3', retain the field separators present
between the endpoints of the range.
To specify ASCII NUL as the field separator, use
the two-character string `\0', e.g., `sort -t '\0''.
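The ` foo bar' example in that excerpt implies that -t ' ' is no cure for runs of spaces either: every single space becomes a separator, so two adjacent spaces enclose an empty field. A small sketch of that reading, using the sample data from earlier. LC_ALL=C is my addition to keep the comparison byte-wise, and cut is used only to display what field 2 contains under the same one-space-per-separator rule.

```shell
STRING='fibble ab  de\ngorkle  bc cd\n'

# With a single space as the delimiter, field 2 of the second line is empty.
# sed wraps each value in brackets so the empty field is visible.
second_fields=$(printf "$STRING" | cut -d ' ' -f 2 | sed 's/.*/[&]/')
printf '%s\n' "$second_fields"

# sort -t ' ' sees the same empty field, and an empty key sorts first.
sorted=$(printf "$STRING" | LC_ALL=C sort -t ' ' -k 2,2)
printf '%s\n' "$sorted"
```

The first command prints [ab] and then [], and the sort puts the gorkle line first because its second field is empty.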