Thursday, February 11, 2010

Solaris Zones + pkgserv Hose Zone Boots (Nerdity Level: High)

So SunOracle Solaris has Zones, which are a light-weight virtual server platform that we use a lot at work.

We're finally rebuilding our "chassis" servers -- the ones that host all of our virtual Solaris servers (mostly web and mail apps). Our normal zone creation script kept crapping out (at the point of first booting the new zone) with

zoneadm: zone 'ZONE': These file-systems are mounted on subdirectories of /fs/zones-1/ZONE/root:
zoneadm: zone 'ZONE':   /fs/zones-1/ZONE/root/var/sadm/install/.door
zoneadm: zone 'ZONE': call to zoneadmd failed

(I've redacted the hostname to protect my company. :-))

The first boot of the zone happens — or rather, fails to happen — after zoneadm -z ZONE install and some file copies into the new zone directory tree.

Doing Internet searches for these strings doesn't really help much.

After many hours of digging, I discovered that I could run zoneadm -z ZONE boot and get the zone to boot — but only if I waited quite a while after doing the zoneadm -z ZONE install — at least ten minutes.

I was able to see that a pkgserv process was running:

root 23107 1   0 14:55:27 ?  0:02 pkgserv -d /fs/zones-1/ZONE/root/var/sadm/install -N pkgadd

— and that took between 3 and 4 minutes to quit out.
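
To check from a script whether pkgserv is still hanging around for a particular zone, the ps line above can be matched against the zone root. This is only a minimal sketch: it assumes a POSIX-style shell (ksh or bash rather than the legacy /bin/sh), pkgserv_pids is a made-up helper name, and it reads ps -ef output on stdin. Note that ps may truncate long argument lists, in which case a very long zone path would not match.

```sh
# Hypothetical helper (not part of Solaris): print the PIDs of pkgserv
# processes whose "-d" directory is under the given zone root.
# Expects `ps -ef` style lines on stdin: UID PID PPID ... CMD
pkgserv_pids() {
    zone_root=$1
    while read -r user pid rest; do
        case "$rest" in
            # Match the command line seen in the post:
            #   pkgserv -d <zone_root>/var/sadm/install -N pkgadd
            *"pkgserv -d $zone_root/var/sadm/install"*)
                echo "$pid"
                ;;
        esac
    done
}

# Example use:
#   ps -ef | pkgserv_pids /fs/zones-1/ZONE/root
```

An empty result means no matching pkgserv was found in the ps output, which (as described below) is not by itself enough for the zone to boot.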

Now, pkgserv seems to be part of an effort to speed up building and patching zones, which can take a long time. The particular setup we're seeing appears to have shown up with the 2010-01-08 Recommended Patch Cluster (though it might have shown up earlier -- we stepped from the 2009-05-08 cluster to the 2010-01-08 cluster, so if it showed up earlier we wouldn't have seen it).

Once pkgserv finally quit, if I immediately tried zoneadm -z ZONE boot or zoneadm -z ZONE ready, it would give me the same "call to zoneadmd failed". truss showed me that a call to zone_create() was failing with EBUSY, and that was propagating up the stack. The thing that's bizarre is that it never seemed to clear. (If I left it alone, it would eventually clear [as I saw empirically] but I never actually managed to pin down how long the error would take to clear — the loop time was way too long.) I think that running zoneadm -z ZONE ready actually prolongs the error.

I finally gave up and tried

umount -f /fs/zones-1/ZONE/root/var/sadm/install/.door

(plain umount didn't work), and that magically cleared the problem. Both umount and umount -f threw errors, too:

umount: warning: /fs/zones-1/ZONE/root/var/sadm/install/.door not in mnttab
umount: /fs/zones-1/ZONE/root/var/sadm/install/.door not mounted

(The door file was never showing in /etc/mnttab or in mount output. I could never find a clean way to find the mount.)
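
For what it's worth, the workaround can be sketched as a small shell function: try the boot, and only if it fails, force-unmount the .door path and try once more. This is a rough sketch rather than a tested fix. The function name is made up, it assumes pkgserv has already exited, and it ignores the exit status of umount -f because in my case it printed errors and still cleared the problem.

```sh
# Hypothetical wrapper around the workaround described above.
#   $1 = zone name, $2 = zone root (e.g. /fs/zones-1/ZONE/root)
boot_zone_clearing_door() {
    zone=$1
    zone_root=$2
    door="$zone_root/var/sadm/install/.door"

    # First attempt: if the zone boots, there is nothing to clear.
    if zoneadm -z "$zone" boot; then
        return 0
    fi

    # Plain umount did not work in the post; umount -f did, despite
    # complaining that the path was "not in mnttab" / "not mounted".
    # Its exit status is therefore ignored here.
    umount -f "$door" || true

    # Second attempt; the function returns this command's status.
    zoneadm -z "$zone" boot
}

# Example use:
#   boot_zone_clearing_door ZONE /fs/zones-1/ZONE/root
```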

I think it's clear that the pkgserv setup is a bit buggy and needs to be fixed.

7 comments:

automaciej said...
This comment has been removed by the author.
automaciej said...

I posted this link to #opencsw on Freenode, and the feedback I got was that it might be a patching problem: patches don't get propagated from the global zone to the non-global zone.

rantingnerd said...

Yes, that seems quite possible. I hope the next patch cluster fixes it!

Anonymous said...

I'm the author of "pkgserv"; in order to create zones with the pkgserv installed, you MUST also install the additional patches in the global zone, specifically the "Live Upgrade" and the "Live Upgrade Zones Support" patches.

Without those patches, pkgserv will indeed wait 5 minutes before it exits.

Derek said...

I know it's an old post but I'm coming across the same issue now while building out our zones, and I'm also trying to track down what the issue is.

The frustrating thing is that it isn't consistent. Out of 11 hosts I just created hosts on, all built exactly the same, I get this error on 1 box.

One thing I did notice was that it doesn't show up in /etc/mnttab, same as you noticed, but I'm pretty convinced that it's because it's a hidden file. If you look into the source code for this, it accesses /etc/mnttab using an ioctl with the MNTIOC_SHOWHIDDEN option, which I think is how it is finding the mount.

So far I've come to the same conclusion, wait long enough and you can boot the zone.

meierch said...

After the zone install, you can quit the pkgserv process with the pkgadm command.
e.g.:
pkgadm sync -R /zones/ZONE/root/ -q

bloodz said...

Today I hit the same problem; solved it with the pkgadm sync command in the global zone.

daje! :)

bloodz