The first packet (and all others until negotiation is completed) is always discarded.
This is true of every ISAKMP implementation that I've dealt with. I don't believe there's any technical reason it couldn't buffer the packets being discarded; rather, it's deliberate that it doesn't.
This is an extension of a conscious design decision that's used throughout the internet's routing infrastructure: Don't hold packets.
Routing systems on the internet will always discard a packet rather than delay it when they can't route it (nearly) immediately. Packet loss on the internet as a whole could easily be reduced by simply keeping each packet buffered until there's room for it. But therein lies the problem: an overloaded router running 200ms behind on a first-in, first-out queue would delay every single packet by that 200ms.
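To make that tradeoff concrete, here's a toy discrete-time simulation (the `simulate` helper and all of its numbers are hypothetical, chosen only for illustration: packets arrive every 1ms at a router that can forward one every 2ms). An unbounded FIFO delivers everything but delays every later packet more and more; a small drop-tail queue keeps delay bounded at the cost of discarding the overflow.

```python
from collections import deque

def simulate(capacity, arrivals=100, arrival_gap=1, service_time=2):
    """Return (delivered, dropped, max_delay) for a FIFO of the given capacity.

    capacity=None means an unbounded queue (buffer everything);
    a small integer models a drop-tail queue that discards when full.
    Times are in milliseconds; the router is overloaded by design
    (one arrival per ms, one departure per 2ms).
    """
    queue = deque()          # holds arrival timestamps of waiting packets
    delivered = dropped = 0
    max_delay = 0
    next_free = 0            # time at which the output link is next idle
    for i in range(arrivals):
        now = i * arrival_gap
        # forward everything the router could have sent by `now`
        while queue and next_free <= now:
            sent_at = max(next_free, queue[0])
            max_delay = max(max_delay, sent_at - queue.popleft())
            next_free = sent_at + service_time
            delivered += 1
        # enqueue the new arrival, or drop it if the queue is full
        if capacity is None or len(queue) < capacity:
            queue.append(now)
        else:
            dropped += 1
    # drain whatever is still queued after arrivals stop
    while queue:
        sent_at = max(next_free, queue[0])
        max_delay = max(max_delay, sent_at - queue.popleft())
        next_free = sent_at + service_time
        delivered += 1
    return delivered, dropped, max_delay
```

With these toy numbers, `simulate(None)` delivers all 100 packets but the worst one waits 99ms, while `simulate(5)` keeps the worst-case delay small and drops the excess instead — exactly the trade the routers are making.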
Bringing this back to the ISAKMP situation: holding a couple of pings until the path is ready to carry them is fine, but what if it's a constant stream of hundreds of thousands of UDP packets? And what if the remote system is unreachable, so the ISAKMP daemon sits there for 60 seconds waiting for negotiation message 2?
While these are not insurmountable engineering problems, the conventional wisdom in the internet engineering community is that it's far simpler to have the end systems deal with packet loss themselves, primarily through protocols that recover from loss, like TCP.