I just discovered (after way too much time poking about) that Unix sort does not split fields the way I naively assumed. I had thought that it split on whitespace, the way that awk does, but it does not. It splits on the zero-length "character" between a non-space and space character -- and then the space characters become part of the next field.
The nasty consequence of this is that whitespace-tabulated data that have a varying number of spaces (or tabs) between fields -- as opposed to fields separated by a single space, or a single tab, or single comma if you use '-t,' -- will not sort the way you might think.
For example:
$ STRING="fibble ab  de\ngorkle  bc cd\n"
$ printf "$STRING"
fibble ab  de
gorkle  bc cd
$ printf "$STRING" | sort -k 2,2
gorkle  bc cd
fibble ab  de
That is, the two spaces in front of 'bc' make '  bc' sort ahead of ' ab'.
And
$ printf "$STRING" | sort -k 3,3
fibble ab  de
gorkle  bc cd
Here the two spaces in front of 'de' make '  de' sort ahead of ' cd'.
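To see the keys that sort is actually comparing, the default field splitting can be approximated with grep -o. This is only a sketch: the regular expression is my approximation of the rule (each field is its leading blanks plus the run of non-blanks that follows), the show_fields name is made up for illustration, and it assumes a grep that supports -o. LC_ALL=C is pinned so the comparison is byte-wise; other locales can order these keys differently.

```shell
# Two spaces before 'de' on the first line and before 'bc' on the second.
STRING='fibble ab  de\ngorkle  bc cd\n'

# Approximate sort's default splitting: each field is its leading blanks
# plus the non-blank run that follows. Brackets make the blanks visible.
show_fields() {
    grep -oE '[[:blank:]]*[^[:blank:]]+' | sed 's/.*/[&]/'
}

fields_line2=$(printf 'gorkle  bc cd\n' | show_fields)
printf '%s\n' "$fields_line2"

# Byte-wise comparison: '  bc' sorts before ' ab' because a space sorts before 'a'.
sorted=$(printf "$STRING" | LC_ALL=C sort -k 2,2)
printf '%s\n' "$sorted"
```

The bracketed output shows that the second field of the gorkle line carries both of its leading spaces, which is what pushes it ahead of ' ab'.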
This explains a great deal of bizarre behavior I've dealt with over the years, stuff I never had the time to drill down and deal with.
My usual fix for this sort of situation is to collapse whitespace into a single space character, sort, and then use my ~/bin/tabulate script on the end:
$ printf "$STRING" | perl -pe 's/[ \t]+/ /g' | sort -k 3,3 | ~/bin/tabulate
gorkle bc cd
fibble ab de
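An alternative that leaves the input untouched is sort's b key modifier, which tells sort to skip leading blanks when locating the start of a key. This is not what the pipeline above does, just a sketch of another route; pinning LC_ALL=C is an added assumption to keep the comparison byte-wise.

```shell
# Two spaces before 'de' on the first line and before 'bc' on the second.
STRING='fibble ab  de\ngorkle  bc cd\n'

# 'b' on the start position skips the blanks in front of field 3,
# so the keys compared are 'de' and 'cd' rather than '  de' and ' cd'.
sorted=$(printf "$STRING" | LC_ALL=C sort -k 3b,3)
printf '%s\n' "$sorted"
```

The original spacing survives in the output, so no re-tabulation step is needed afterwards.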
I hope someone might find this useful. In other words, I hope I'm not the only one who took this long to understand this. :-)
From 'info sort' on Ubuntu:
`-t SEPARATOR'
`--field-separator=SEPARATOR'
Use character SEPARATOR as the field separator when finding the sort keys in each line. By default, fields are separated by the empty string between a non-blank character and a blank character. By default a blank is a space or a tab, but the `LC_CTYPE' locale can change this.
That is, given the input line ` foo bar', `sort' breaks it into fields ` foo' and ` bar'. The field separator is not considered to be part of either the field preceding or the field following, so with `sort -t " "' the same input line has three fields: an empty field, `foo', and `bar'. However, fields that extend to the end of the line, as `-k 2', or fields consisting of a range, as `-k 2,3', retain the field separators present between the endpoints of the range.
To specify ASCII NUL as the field separator, use the two-character string `\0', e.g., `sort -t '\0''.
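The manual's ' foo bar' example under -t " " can be checked with cut, which also treats every single delimiter character as a field boundary. Using cut as a stand-in for sort's splitting is my own analogy, not something the manual states.

```shell
# The manual's example line: one leading space, then 'foo', a space, 'bar'.
line=' foo bar'

# With a single space as the delimiter, the leading space ends an empty first field.
f1=$(printf '%s\n' "$line" | cut -d ' ' -f 1)
f2=$(printf '%s\n' "$line" | cut -d ' ' -f 2)
f3=$(printf '%s\n' "$line" | cut -d ' ' -f 3)
printf '[%s] [%s] [%s]\n' "$f1" "$f2" "$f3"
```

The same reasoning suggests that, with an explicit space separator, a run of two spaces in tabulated data produces an empty field in between, so field numbers shift from line to line.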
6 comments:
On Ubuntu 9.10, I get different results:
STRING="fibble ab de\ngorkle bc cd\n"
printf "$STRING" | sort -k 2,2
fibble ab de
gorkle bc cd
This might be Blogger messing with the spacing, but it looks like you only have one space between each word in your STRING. If so, then that explains why you're getting different results than I am.
There are two spaces. I copy/pasted the commands from your post into my terminal.
The spaces disappeared when you pasted back into the comment, then. :-(
My guess is that you have a LANG environment variable or one or more LC_* environment variables that change how sort(1) works. If you remove those, does it still behave the same?
(I always remove LANG and LC_* from my environment because I'm an old fart who learned Unix before all these new-fangled features showed up. Also, because without removing them consistently, working in a heterogeneous environment where some OSes set them and some don't leads to serious insanity. :-))
Yep, you are right! LANG was set to en_US.UTF-8 (no LC_* was set). Unsetting it does produce results like yours! And setting LANG back to what it was produces results like mine.
Thanks for posting this! And sorry about not getting back on this earlier.
Glad to know the mystery is solved. :-)