Encapsulating Security Payload (ESP) is IP protocol 50. It has a protocol number in its own right, just as ICMP, TCP and UDP do, and is arguably the right protocol to use for encrypted tunnels.
However, while TCP and UDP packets carry port numbers as well as IP addresses for both source and destination, ICMP and ESP packets carry only the addresses. It's the combination of ports and addresses that makes NAT and tunneling practical; without ports, traffic is very difficult to handle.
When a tunneling (or NAT) device has two or more input UDP streams to pass on to a single endpoint, and the responses come back from that endpoint to the tunneling device, the source port numbers of the outgoing streams, which come back as destination ports on the responses, can be used to disambiguate the two streams. With ESP, there is no port number to serve as a disambiguator, so it's hard for the tunneling device to know which of several ESP sources a given ESP response should be tunneled back to.
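To make the difference concrete, here is a minimal sketch of the bookkeeping such a device might do. Everything in it (the `ToyNat` class, its method names, the starting port 40000, the example addresses) is made up for illustration and isn't taken from any real NAT implementation; it also ignores everything in a packet except addresses and ports.

```python
from typing import Dict, List, Optional, Tuple


class ToyNat:
    """Hypothetical NAT that hides several inside hosts behind one address."""

    def __init__(self) -> None:
        self.next_port = 40000  # arbitrary starting point for public ports
        # UDP: public source port -> (inside IP, inside port)
        self.udp_map: Dict[int, Tuple[str, int]] = {}
        # ESP: remote IP -> inside hosts that have sent ESP to it
        self.esp_map: Dict[str, List[str]] = {}

    def udp_out(self, inside_ip: str, inside_port: int) -> int:
        """Record an outgoing UDP flow and return its unique public port."""
        public_port = self.next_port
        self.next_port += 1
        self.udp_map[public_port] = (inside_ip, inside_port)
        return public_port

    def udp_in(self, dst_port: int) -> Optional[Tuple[str, int]]:
        """A reply's destination port identifies exactly one inside flow."""
        return self.udp_map.get(dst_port)

    def esp_out(self, inside_ip: str, remote_ip: str) -> None:
        """Record an outgoing ESP flow; there is no port to rewrite."""
        hosts = self.esp_map.setdefault(remote_ip, [])
        if inside_ip not in hosts:
            hosts.append(inside_ip)

    def esp_in(self, remote_ip: str) -> List[str]:
        """A reply can only be matched on the remote address."""
        return list(self.esp_map.get(remote_ip, []))


nat = ToyNat()

# Two inside hosts use the same UDP source port toward the same server.
port_a = nat.udp_out("10.0.0.2", 5000)
port_b = nat.udp_out("10.0.0.3", 5000)
print(nat.udp_in(port_a))  # one unambiguous inside flow
print(nat.udp_in(port_b))  # the other one

# The same two hosts send ESP to the same server.
nat.esp_out("10.0.0.2", "203.0.113.9")
nat.esp_out("10.0.0.3", "203.0.113.9")
print(nat.esp_in("203.0.113.9"))  # two candidates, nothing to choose between them
```

With UDP, each reply maps back to exactly one inside host. With ESP, the lookup returns both hosts, which is the ambiguity described above.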
IPsec, which by default also uses ESP, some time ago codified the NAT-traversal (NAT-T) extensions, which wrap the ESP packets in UDP on port 4500 instead. I don't know that L2TP has such a mode, and without it, I fear you won't be able to do what you want to do.
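Roughly, the NAT-traversal trick amounts to putting an ordinary 8-byte UDP header in front of the ESP packet, so that NAT devices see ports again. The sketch below shows just that framing step; the function names are my own, the checksum is simply left at zero, and a real implementation would let the operating system build the UDP header rather than packing it by hand.

```python
import struct
from typing import Tuple

NAT_T_PORT = 4500  # UDP port used for NAT traversal


def udp_encapsulate_esp(esp_packet: bytes, src_port: int,
                        dst_port: int = NAT_T_PORT) -> bytes:
    """Prefix an ESP packet with a UDP header (hypothetical helper)."""
    length = 8 + len(esp_packet)  # UDP length covers header plus payload
    checksum = 0                  # assumption: left at zero for simplicity
    header = struct.pack("!HHHH", src_port, dst_port, length, checksum)
    return header + esp_packet


def udp_decapsulate_esp(datagram: bytes) -> Tuple[int, int, bytes]:
    """Split a datagram back into (src_port, dst_port, ESP packet)."""
    src_port, dst_port, length, _checksum = struct.unpack("!HHHH", datagram[:8])
    return src_port, dst_port, datagram[8:length]


# A fake ESP packet: 4-byte SPI, 4-byte sequence number, opaque payload.
esp = struct.pack("!II", 0x1234, 1) + b"encrypted-bytes"
datagram = udp_encapsulate_esp(esp, src_port=4500)
print(udp_decapsulate_esp(datagram) == (4500, 4500, esp))
```

Once the ESP traffic is inside UDP, a NAT device can rewrite the source port per inside host and tell the return traffic apart just as it does for any other UDP stream.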
I hope I'm wrong about that, and that someone else will come along and post a better answer. But in the absence of that, I thought I should at least try to explain what ESP is, and why it is a tunneling headache.