Tuesday, May 17, 2011

Unix sort Insanity (Or is it Just Me?)

I just discovered (after way too much time poking about) that Unix sort does not split fields the way I naively assumed. I had thought that it split on whitespace, the way that awk does, but it does not. It splits on the zero-length "character" between a non-space and space character -- and then the space characters become part of the next field.

The nasty consequence of this is that whitespace-tabulated data that have a varying number of spaces (or tabs) between fields -- as opposed to fields separated by a single space, or a single tab, or single comma if you use '-t,' -- will not sort the way you might think.

For example:

$ STRING="fibble ab  de\ngorkle  bc cd\n"
$ printf "$STRING"
fibble ab  de
gorkle  bc cd

$ printf "$STRING" | sort -k 2,2
gorkle  bc cd
fibble ab  de

-- i.e. the two spaces in front of 'bc' make '  bc' sort ahead of ' ab' (a space sorts before any letter).


$ printf "$STRING" | sort -k 3,3
fibble ab  de
gorkle  bc cd

-- where the two spaces in front of 'de' make '  de' sort ahead of ' cd'.


This explains a great deal of bizarre behavior I've dealt with over the years, stuff I never had the time to drill down and deal with.

My usual fix for this sort of situation is to collapse whitespace into a single space character, sort, and then use my ~/bin/tabulate script on the end:

$ printf "$STRING" |
    perl -pe 's/[ \t]+/ /g' |
    sort -k 3,3 |
    tabulate
gorkle bc cd
fibble ab de
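
Another option, if you would rather leave the spacing alone, is the 'b' key modifier, which tells sort to skip leading blanks when it locates the start of a key. This is only a sketch: the function name sort_ignoring_blanks is made up for illustration, and LC_ALL=C is pinned so that the comparison is byte-wise.

```shell
# Hypothetical helper: sort on field N, ignoring the blanks that sort
# would otherwise treat as the beginning of that field.
sort_ignoring_blanks() {
    LC_ALL=C sort -k "${1}b,${1}"
}

# Field 2: compares 'ab' with 'bc', so the fibble line comes first.
printf 'fibble ab  de\ngorkle  bc cd\n' | sort_ignoring_blanks 2

# Field 3: compares 'de' with 'cd', so the gorkle line comes first.
printf 'fibble ab  de\ngorkle  bc cd\n' | sort_ignoring_blanks 3
```

Unlike the collapse-and-retabulate pipeline, this keeps the original column spacing in the output.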

I hope someone might find this useful. In other words, I hope I'm not the only one who took this long to understand this. :-)


From 'info sort' on Ubuntu (the description of the `-t SEPARATOR' option):

     Use character SEPARATOR as the field separator 
     when finding the sort keys in each line.  By 
     default, fields are separated by the empty 
     string between a non-blank character and a 
     blank character.  By default a blank is a space 
     or a tab, but the `LC_CTYPE' locale can change
     this.

     That is, given the input line ` foo bar', `sort'
     breaks it into fields ` foo' and ` bar'.  The 
     field separator is not considered to be part of 
     either the field preceding or the field following,
     so with `sort -t " "' the same input line has 
     three fields: an empty field, `foo', and `bar'.  
     However, fields that extend to the end of the 
     line, as `-k 2', or fields consisting of a range, 
     as `-k 2,3', retain the field separators present 
     between the endpoints of the range.

     To specify ASCII NUL as the field separator, use 
     the two-character string `\0', e.g., `sort -t '\0''.
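
Note that the excerpt also implies that '-t " "' is no cure for ragged spacing: every single space becomes a separator, so a run of two spaces produces an empty field. A small sketch of that, using the sample data from above (the function name sort_on_single_space is made up for illustration, and LC_ALL=C is pinned for byte-wise comparison):

```shell
# Hypothetical helper: sort on field N with a single space as separator.
sort_on_single_space() {
    LC_ALL=C sort -t ' ' -k "$1,$1"
}

# 'gorkle  bc cd' splits into 'gorkle', '', 'bc', 'cd', so its second
# field is empty and sorts ahead of 'ab': the gorkle line comes first.
printf 'fibble ab  de\ngorkle  bc cd\n' | sort_on_single_space 2

# 'fibble ab  de' splits into 'fibble', 'ab', '', 'de', so here its
# third field is the empty one: the fibble line comes first.
printf 'fibble ab  de\ngorkle  bc cd\n' | sort_on_single_space 3
```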


Anonymous said...

On Ubuntu 9.10, I get different results:

STRING="fibble ab de\ngorkle bc cd\n"
printf "$STRING" | sort -k 2,2
fibble ab de
gorkle bc cd

rantingnerd said...

This might be Blogger messing with the spacing, but it looks like you only have one space between each word in your STRING. If so, then that explains why you're getting different results than I am.

Anonymous said...

There are two spaces. I copy/pasted the commands from your post into my terminal.

rantingnerd said...

The spaces disappeared when you pasted back into the comment, then. :-(

My guess is that you have a LANG environment variable or one or more LC_* environment variables that change how sort(1) works. If you remove those, does it still behave the same?

(I always remove LANG and LC_* from my environment because I'm an old fart who learned Unix before all these new-fangled features showed up. Also, because without removing them consistently, working in a heterogeneous environment where some OSes set them and some don't leads to serious insanity. :-))
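
One way to test this without changing your whole environment is to override the locale for a single command. This is a sketch, assuming a POSIX shell; the function name sort_c_locale is made up for illustration.

```shell
# Hypothetical helper: run sort under the byte-wise "C" locale only.
sort_c_locale() {
    LC_ALL=C sort "$@"
}

# Under the C locale the leading blanks count, so the gorkle line is first.
printf 'fibble ab  de\ngorkle  bc cd\n' | sort_c_locale -k 2,2

# For comparison: the same data under whatever locale the session has.
echo "LANG=${LANG:-unset} LC_ALL=${LC_ALL:-unset}"
printf 'fibble ab  de\ngorkle  bc cd\n' | sort -k 2,2
```

If the two runs print the lines in a different order, the locale is what is changing the result.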

Anonymous said...

Yep, you are right! LANG was set to en_US.UTF-8 (no LC_* was set). Unsetting it does produce results like yours! And setting LANG back to what it was produces results like mine.

Thanks for posting this! And sorry about not getting back on this earlier.

rantingnerd said...

Glad to know the mystery is solved. :-)