Thursday 18 July 2013

SSH Issue on Solaris 10 Branded Zones

So suddenly one night I get a call from the night operators telling me that they can't ssh to a particular zone. Even their existing ssh sessions are dying. When I log in (via the global zone), the message log is full of sshd core dumps:

Jul 11 20:41:08 hostname genunix: [ID 603404 kern.notice] NOTICE: core_log: ssh[2791] core dumped: /var/core/core_hostname_ssh_14247_103_1373568067_2791
Jul 11 20:41:15 hostname genunix: [ID 603404 kern.notice] NOTICE: core_log: ssh[2813] core dumped: /var/core/core_hostname_ssh_14247_103_1373568074_2813
Jul 11 20:41:26 hostname genunix: [ID 603404 kern.notice] NOTICE: core_log: ssh[2889] core dumped: /var/core/core_hostname_ssh_14247_103_1373568085_2889
Jul 11 20:44:37 hostname genunix: [ID 603404 kern.notice] NOTICE: core_log: ssh[6711] core dumped: /var/core/core_hostname_ssh_14247_103_1373568276_6711
Jul 11 20:47:54 hostname genunix: [ID 603404 kern.notice] NOTICE: core_log: ssh[11121] core dumped: /var/core/core_hostname_ssh_14247_103_1373568473_11121
Jul 11 20:59:34 hostname genunix: [ID 603404 kern.notice] NOTICE: core_log: ssh[25061] core dumped: /var/core/core_hostname_ssh_14247_103_1373569173_25061

I try a couple of things, none of which seem to particularly help, but the problem goes away after about half an hour. Then it comes back a couple of days later. And then again. And then it happens on some other containers.

Of course by this time, my call logged with Oracle has been escalated to the highest level. They come back with this:
It seems at this point that you have been hit by a known issue.
Bug 15781192 - SUNBT7156478-SOLARIS_11U1 double free in kernelSlottable.c kernel_slottable_ini
This was fixed in the S11u1 release .. but now we have started a backport CR for S10. At this point the only workaround is to disable pkcs11 engine in the sshd_conf and restart ssh.
They then gave the complete workaround:

The complete workaround requires three steps to be executed inside the Solaris 10 branded zone:

1) Uninstall the pkcs11 kernel provider:
   # cryptoadm uninstall provider='/usr/lib/security/$ISA/pkcs11_kernel.so'
2) Disable the pkcs11 engine for sshd:
   # vi /etc/ssh/sshd_config
   Add the line "UseOpenSSLEngine no" to this file (without the quotes).
3) Restart the ssh service to pick up the change:
   # svcadm restart ssh

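If you have to apply this to more than one zone, step 2 is the part worth scripting, since editing the file by hand is easy to get wrong. Here is a rough sketch of how I might do that. The function name ensure_engine_disabled is my own invention, not an Oracle tool, and it assumes that a single "UseOpenSSLEngine no" line appended to the file is enough. It refuses to touch a file that already has a different UseOpenSSLEngine setting.

```shell
#!/bin/sh
# Sketch only: ensure_engine_disabled is a hypothetical helper, not an Oracle tool.
# It automates step 2 of the workaround for the sshd_config path given as $1.
ensure_engine_disabled() {
    conf="$1"

    # The option is already set to "no": nothing to do.
    if grep -i '^UseOpenSSLEngine *no' "$conf" >/dev/null 2>&1; then
        return 0
    fi

    # Some other UseOpenSSLEngine setting exists: leave it for a human.
    if grep -i '^UseOpenSSLEngine' "$conf" >/dev/null 2>&1; then
        echo "conflicting UseOpenSSLEngine line in $conf; edit by hand" >&2
        return 2
    fi

    # Keep a backup copy, then append the line from the workaround.
    cp -p "$conf" "$conf.bak" || return 1
    echo 'UseOpenSSLEngine no' >> "$conf"
}

# Usage inside the zone, as root (steps 1 and 3 are the commands from the workaround):
#   cryptoadm uninstall provider='/usr/lib/security/$ISA/pkcs11_kernel.so'
#   ensure_engine_disabled /etc/ssh/sshd_config && svcadm restart ssh
```

Running the function twice is harmless: the second call finds the line already present and returns without changing the file.
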
EDIT: I updated to the latest patches. I'll have to take some time to reverse these workarounds and see if the problem has been fixed. I've been told it has, but I'll have to confirm for myself.