Saturday, May 21, 2011

I Think I Figured It Out....

I realized why Republicans don't like the Health Care Individual Mandate: because it includes "man date", and that's just too gay.

Tuesday, May 17, 2011

Unix sort Insanity (Or is it Just Me?)

I just discovered (after way too much time poking about) that Unix sort does not split fields the way I naively assumed. I had thought that it split on runs of whitespace, the way that awk does, but it does not. By default it splits at the zero-length boundary between a non-blank character and a blank (space or tab) character -- and then the blank characters become part of the next field.

The nasty consequence of this is that whitespace-tabulated data that have a varying number of spaces (or tabs) between fields -- as opposed to fields separated by a single space, or a single tab, or single comma if you use '-t,' -- will not sort the way you might think.

For example:

$ STRING="fibble ab  de\ngorkle  bc cd\n"
$ printf "$STRING"
fibble ab  de
gorkle  bc cd

$ printf "$STRING" | sort -k 2,2
gorkle  bc cd
fibble ab  de

That is, the two spaces in front of 'bc' make '  bc' sort ahead of ' ab', because a space sorts before any letter.

And

$ printf "$STRING" | sort -k 3,3
fibble ab  de
gorkle  bc cd

Here the two spaces in front of 'de' make '  de' sort ahead of ' cd', even though 'cd' comes before 'de' alphabetically.
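
To see where sort puts the field boundaries in these two examples, the sketch below marks each one with a '|'. It is only a rough model built on sed (as far as I know sort has no option to print its fields), and the function name show_sort_fields is made up for this illustration.

```shell
# Rough model of sort's default field splitting: insert a '|' at every
# boundary between a non-blank character and the blank that follows it.
# show_sort_fields is a hypothetical helper, not part of sort.
show_sort_fields() {
  sed 's/\([^[:blank:]]\)\([[:blank:]]\)/\1|\2/g'
}

printf 'fibble ab  de\ngorkle  bc cd\n' | show_sort_fields
# fibble| ab|  de
# gorkle|  bc| cd
```

Everything after a '|' up to the next '|' is one field, leading blanks included, which is why '  bc' and '  de' end up with two spaces at the front of their keys.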

 

This explains a great deal of bizarre behavior I've dealt with over the years, stuff I never had the time to drill down and deal with.

My usual fix for this sort of situation is to collapse whitespace into a single space character, sort, and then use my ~/bin/tabulate script on the end:

$ printf "$STRING" | 
  perl -pe 's/[ \t]+/ /g' | 
    sort -k 3,3 | 
      ~/bin/tabulate
gorkle bc cd
fibble ab de
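
Another option, if you would rather leave the spacing alone, is the 'b' key modifier, which tells sort to skip leading blanks when it locates the start of the key. A minimal sketch follows. The sort_by_field wrapper is a name invented here, and LC_ALL=C is an assumption added only so that the comparison is byte-wise and does not vary with the locale.

```shell
STRING='fibble ab  de\ngorkle  bc cd\n'

# sort_by_field N: sort stdin on whitespace-delimited field N.
# The 'b' after the start position drops leading blanks from the key.
# (Hypothetical wrapper name; LC_ALL=C forces byte-wise comparison.)
sort_by_field() {
  LC_ALL=C sort -k "${1}b,${1}"
}

printf '%b' "$STRING" | sort_by_field 2
printf '%b' "$STRING" | sort_by_field 3
```

With the sample data, the first command should list the 'fibble' line first ('ab' before 'bc') and the second should list the 'gorkle' line first ('cd' before 'de'), with the original spacing preserved in both.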

I hope someone might find this useful. In other words, I hope I'm not the only one who took this long to understand this. :-)

 


From 'info sort' on Ubuntu:

`-t SEPARATOR'
`--field-separator=SEPARATOR'
     Use character SEPARATOR as the field separator 
     when finding the sort keys in each line.  By 
     default, fields are separated by the empty 
     string between a non-blank character and a 
     blank character.  By default a blank is a space 
     or a tab, but the `LC_CTYPE' locale can change 
     this.

     That is, given the input line ` foo bar', `sort'
     breaks it into fields ` foo' and ` bar'.  The 
     field separator is not considered to be part of 
     either the field preceding or the field following,
     so with `sort -t " "' the same input line has 
     three fields: an empty field, `foo', and `bar'.  
     However, fields that extend to the end of the 
     line, as `-k 2', or fields consisting of a range, 
     as `-k 2,3', retain the field separators present 
     between the endpoints of the range.

     To specify ASCII NUL as the field separator, use 
     the two-character string `\0', e.g., `sort -t '\0''.
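
Run against the sample data from this post, the 'sort -t " "' behaviour described in that excerpt looks like the sketch below. With a single space as the separator, the second space in a run of two produces an empty field, so field 2 of the 'gorkle' line is empty and sorts first. LC_ALL=C is an assumption added here to keep the comparison byte-wise.

```shell
# With -t ' ', "fibble ab  de" splits into: fibble, ab, (empty), de
# and "gorkle  bc cd" splits into:          gorkle, (empty), bc, cd
sorted=$(printf 'fibble ab  de\ngorkle  bc cd\n' | LC_ALL=C sort -t ' ' -k 2,2)
printf '%s\n' "$sorted"
# gorkle  bc cd
# fibble ab  de
```

So '-t " "' on its own does not fix data with a varying number of spaces; it trades keys with leading blanks for empty fields.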