
I'm trying to provision Ubuntu 16.04 to a Soekris net6501 via Foreman. The process itself now works quite well.

One thing that doesn't quite do what we want is the step after provisioning. The idea is to be able to deploy a new image/OS to the box at any stage: have it try to PXE boot by default, and fall back to the local disk if no PXE boot happens.

When Foreman is set to build the host, PXE booting works fine (after a bit of tweaking and experimentation), but on the first reboot after provisioning has finished, the machine just hangs at:

PXE-M0F: Exiting Intel Boot Agent.

If I change the BIOS to prefer the local disk, all is well. But that's not what I want: some machines will be in a data centre, and waddling there with a laptop to play with the serial console is undesirable.

[edit 1 below]

Snippet from the logs on the Foreman server:

14:45:38 foreman dhcpd: DHCPDISCOVER from 00:00:24:d2:05:bc via eth1
14:45:38 foreman dhcpd: DHCPOFFER on 192.168.0.4 to 00:00:24:d2:05:bc via eth1
14:45:42 foreman dhcpd: Dynamic and static leases present for 192.168.0.4.
14:45:42 foreman dhcpd: Remove host declaration testkris or remove 192.168.0.4
14:45:42 foreman dhcpd: from the dynamic address pool for 192.168.0.0/23
14:45:42 foreman dhcpd: DHCPREQUEST for 192.168.0.4 (192.168.0.1) from 00:00:24:d2:05:bc via eth1
14:45:42 foreman dhcpd: DHCPACK on 192.168.0.4 to 00:00:24:d2:05:bc via eth1
14:45:43 foreman in.tftpd[15186]: tftp: client does not accept options
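
The dhcpd lines above say that 192.168.0.4 is both the fixed address of host testkris and part of the dynamic pool for 192.168.0.0/23. A minimal Python sketch of that overlap check, assuming a hypothetical pool range (the actual `range` statement from dhcpd.conf isn't shown here):

```python
import ipaddress

# Subnet from the dhcpd log.
SUBNET = ipaddress.ip_network("192.168.0.0/23")

# HYPOTHETICAL dynamic pool; replace with the "range" from your dhcpd.conf.
POOL_START = ipaddress.ip_address("192.168.0.2")
POOL_END = ipaddress.ip_address("192.168.1.200")

# Fixed addresses from host declarations (testkris/192.168.0.4 is from the log).
FIXED_HOSTS = {"testkris": ipaddress.ip_address("192.168.0.4")}


def in_pool(addr):
    """True if addr lies inside the inclusive dynamic range."""
    return POOL_START <= addr <= POOL_END


def conflicts(hosts=FIXED_HOSTS):
    """Names of hosts whose fixed address is in the subnet AND the dynamic pool."""
    return sorted(name for name, addr in hosts.items()
                  if addr in SUBNET and in_pool(addr))


if __name__ == "__main__":
    for name in conflicts():
        print(f"{name}: {FIXED_HOSTS[name]} overlaps the dynamic pool")
```

If `conflicts()` returns a host name, either shrink the dynamic range or drop the duplicate host declaration, which is what the dhcpd message itself suggests.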

And the machine's PXELINUX config file (pxelinux.cfg/01-00-00-24-d2-05-bc):

SERIAL 0 19200 0
CONSOLE 0
DEFAULT menu
PROMPT 0
MENU TITLE PXE Menu
TIMEOUT 200
TOTALTIMEOUT 6000
ONTIMEOUT local

LABEL local
     MENU LABEL (local)
     MENU DEFAULT
     LOCALBOOT 0
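
Syslinux counts `TIMEOUT` and `TOTALTIMEOUT` in tenths of a second, so this file shows the menu for 20 seconds (with an overall limit of 600 seconds that a keypress doesn't cancel) and then runs `ONTIMEOUT local`, i.e. `LOCALBOOT 0`. A small, hypothetical Python helper that pulls those values out of the file, just to make the timing explicit; it only understands the handful of directives used here:

```python
# The per-host file from the question, verbatim.
CONFIG = """\
SERIAL 0 19200 0
CONSOLE 0
DEFAULT menu
PROMPT 0
MENU TITLE PXE Menu
TIMEOUT 200
TOTALTIMEOUT 6000
ONTIMEOUT local

LABEL local
     MENU LABEL (local)
     MENU DEFAULT
     LOCALBOOT 0
"""


def summarise(text):
    """Collect the timeouts (in seconds), ONTIMEOUT, and LOCALBOOT type per label."""
    summary = {"labels": {}}
    current = None
    for raw in text.splitlines():
        parts = raw.split(None, 1)
        if not parts:
            continue  # blank line
        key = parts[0].upper()
        value = parts[1].strip() if len(parts) > 1 else ""
        if key == "LABEL":
            current = value
            summary["labels"][current] = None
        elif key == "LOCALBOOT" and current is not None:
            summary["labels"][current] = int(value)
        elif key in ("TIMEOUT", "TOTALTIMEOUT"):
            # Syslinux timeouts are in tenths of a second.
            summary[key.lower() + "_s"] = int(value) / 10
        elif key == "ONTIMEOUT":
            summary["ontimeout"] = value
        # Everything else (SERIAL, MENU ..., PROMPT) is ignored in this sketch.
    return summary


if __name__ == "__main__":
    print(summarise(CONFIG))
```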

[edit 2 below] Serial console output (garbled; I tried to clean up the escape sequences as best I could):

 Intel(R) Boot Agent GE v1.3.72
 Copyright (C) 1997-2010, Intel Corporation

 Initializing and establishing link...
 CLIENT MAC ADDR: 00 00 24 D2 05 BC
 DHCP..
 CLIENT IP: 192.168.0.4  MASK: 255.255.254.0  DHCP IP: 192.168.0.1
 GATEWAY IP: 192.168.0.1

 TFTP.
 TFTP.
!PXE entry point found (we hope) at 95D2:0106 via plan A
UNDI code segment at 95D2 len 5210
UNDI data segment at 8F97 len 63
Getting cached packet  01 02 03
My IP address seems to be C0A80004 192.168.0.4
ip=192.168.0.4:192.168.0.1:192.168.0.1:255.255.254.0
BOOTIF=01-00-00-24-d2-05-bc
TFTP prefix:
Trying to load: pxelinux.cfg/01-00-00-24-d2-05-bc                   ok
 PXELINUX 4.05 20140113  Copyright (C) 1994-2011 H. Peter Anvin et al
          +----------------------------------------------------------+
          |                         PXE Menu                         |
          +----------------------------------------------------------+
          | (local)                                                  |
          |                                                          |
          +----------------------------------------------------------+

                          Press [Tab] to edit options

                         Automatic boot in 1 second...

PXE-M0F: Exiting Intel Boot Agent.

And then it just sits there.
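
The console output also shows how PXELINUX picked this file: it printed the client address in hex (`C0A80004`) and then loaded `pxelinux.cfg/01-00-00-24-d2-05-bc`. Per the Syslinux documentation, PXELINUX tries the client UUID (when the firmware supplies one), then `01-` plus the lower-case, dash-separated MAC (`01` being the ARP type for Ethernet), then the hex IP shortened one digit at a time, and finally `default`. A minimal Python sketch of that search order, handy for checking which file a given box will pick up:

```python
import ipaddress


def pxelinux_candidates(mac, ip, uuid=None):
    """Names PXELINUX tries under pxelinux.cfg/, in the order it tries them."""
    names = []
    if uuid:
        # Only used when the PXE firmware reports a client UUID.
        names.append(uuid.lower())
    # "01" is the ARP hardware type for Ethernet; MAC is lower-case, dash-separated.
    names.append("01-" + mac.lower().replace(":", "-"))
    # Client IPv4 address as eight upper-case hex digits, then one digit shorter each time.
    hex_ip = format(int(ipaddress.IPv4Address(ip)), "08X")
    names.extend(hex_ip[:n] for n in range(len(hex_ip), 0, -1))
    # Last resort.
    names.append("default")
    return names


if __name__ == "__main__":
    # MAC and IP taken from the console output above.
    for name in pxelinux_candidates("00:00:24:d2:05:bc", "192.168.0.4"):
        print("pxelinux.cfg/" + name)
```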

tink
  • There should be more information before the `PXE-M0F: Exiting Intel Boot Agent.` text, such as something that indicates no response from the DHCP server. On the Foreman server, look in the log files for any DHCP or TFTP traffic from your Soekris node. What should happen is that when the image has finished building, it registers itself with the Foreman host as a built host, and Foreman removes the node from the list of hosts to be built; this last step includes modifying the MAC-address-specific file (e.g. `/var/lib/tftpboot/pxelinux.cfg/01-00-01-02-aa-bb-cc`) to simply boot from the 'local' disk. – Stefan Lasiewski Jun 06 '16 at 21:27
  • Thanks @StefanLasiewski, I've added some info. I'll kick off the build again and check what's missing from the PXE error when it's done. Cheers – tink Jun 06 '16 at 21:51
  • OK, this is annoying. There were no changes to the environment over the long weekend, but now I don't get the PXE-M0F message any more. The client acquires its DHCP lease on reboot; see the new output in the 2nd edit. – tink Jun 07 '16 at 01:56
  • And that, I believe, is how it's supposed to work. – Stefan Lasiewski Jun 07 '16 at 20:44
  • But the VMs happily boot off the local HDD with the same (almost identical) setup. The only difference is that I added the top two lines to the Soekris' file so I could see what it's doing. – tink Jun 07 '16 at 20:58
  • So @StefanLasiewski, you feel that the VMs are doing the wrong thing by booting off the local disk? Can you explain why that would be more appropriate? – tink Jun 08 '16 at 01:18
  • The boot order on your host is configured to boot from the network first and, if that boot fails, move on to the local disk (the next device in the boot order). In your case, the network is PXE booting, and Foreman (as a safety mechanism) says 'I know you've already been provisioned, because here you are in my registry. But you're still doing a PXE boot, so here's a PXELinux image that simply says "boot from local disk".' Then your system boots from local disk and does whatever the local disk image tells it to do. – Stefan Lasiewski Jun 08 '16 at 18:34
  • Sorry, I didn't see your comment that "And then it just sits there." This tells me that the OS on the local disk is corrupt or not working the way you expect. I'm not familiar with Soekris or whether it prefers to communicate over the serial console. But I've run into similar issues on a CentOS box: the OS would print some output to the serial port and other output to the monitor. In my experience with CentOS, if I attached a monitor to the system, I could see that the OS had failed to load, or that it was waiting for me to type my password at an fsck prompt. – Stefan Lasiewski Jun 08 '16 at 18:40
  • I found it helpful to review http://networkboot.org/fundamentals/ to understand exactly what was happening after seeing the PXELinux screen. – Stefan Lasiewski Jun 08 '16 at 18:41
  • Well, that's the catch: when I stop the boot, hop into the BIOS, and change the boot order from eth0 first to local disk first, the OS boots just fine. That makes me think the BIOS on the box, in particular the part that acts on the PXE-provided instructions, is at fault. – tink Jun 08 '16 at 18:52
  • The Soekris net6501 doesn't have any video device (or output) other than the serial console, btw. – tink Jun 08 '16 at 18:52

1 Answer


By trial and error I've learned that using the LOCALBOOT 0 command doesn't work on all hardware.

You need different disk boot commands for different hardware. Read more about hardware compatibility problems on the Syslinux wiki: http://www.syslinux.org/wiki/index.php?title=Hardware_Compatibility#LOCALBOOT

Here are the three different "hard disk" boot commands we use; no single command works on all hardware.

KERNEL chain.c32
APPEND hd0

LOCALBOOT 0

LOCALBOOT -1
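
If you generate the per-host files yourself, one way to follow this advice is to keep the boot menu identical and swap only the body of the `local` label per hardware type. A minimal Python sketch, assuming the rest of the file matches the config from the question; the model names in `METHOD_BY_MODEL` are hypothetical placeholders, not tested recommendations. As far as I know from the Syslinux documentation, `LOCALBOOT -1` makes PXELINUX report failure to the BIOS so it moves on to the next boot device, while `chain.c32` with `hd0` chain-loads the first hard disk's boot sector.

```python
# Common part of the per-host file, copied from the config in the question.
HEADER = """\
SERIAL 0 19200 0
CONSOLE 0
DEFAULT menu
PROMPT 0
MENU TITLE PXE Menu
TIMEOUT 200
TOTALTIMEOUT 6000
ONTIMEOUT local

LABEL local
     MENU LABEL (local)
     MENU DEFAULT
"""

# The three "boot from hard disk" variants from the answer.
STANZAS = {
    # Chain-load the first BIOS hard disk via the chain.c32 module.
    "chain": ["KERNEL chain.c32", "APPEND hd0"],
    # Ask PXELINUX for a normal local boot.
    "localboot0": ["LOCALBOOT 0"],
    # Report failure to the BIOS so it tries the next device in its boot order.
    "localboot-1": ["LOCALBOOT -1"],
}

# HYPOTHETICAL per-model choices; fill in whatever works on your hardware.
METHOD_BY_MODEL = {
    "model-a": "localboot0",
    "model-b": "chain",
}


def render_pxe_cfg(method):
    """Return a per-host pxelinux.cfg body using the chosen local-boot variant."""
    try:
        commands = STANZAS[method]
    except KeyError:
        raise ValueError(f"unknown local-boot method: {method!r}") from None
    return HEADER + "".join(f"     {cmd}\n" for cmd in commands)


if __name__ == "__main__":
    print(render_pxe_cfg(METHOD_BY_MODEL["model-b"]), end="")
```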

I've also found that not all Syslinux versions work equally well, so try different releases and see which best fits your hardware.

Raboo
  • Thanks for your response to this by now truly dated question. :) The question has become moot: within a few months of each other, 10 of our twenty-odd Soekris boxes wouldn't reboot any more. They're all dead, and we're replacing them with larger and, unfortunately, more expensive Supermicros; at least those also have LOM (lights-out management), which will allow a full remote reinstall. Keeping several Syslinux versions around for use with Foreman wouldn't be feasible, though. – tink Oct 24 '17 at 16:30
  • But the answer could serve others. – Raboo Oct 27 '17 at 09:43