Brief answer to the question: The network interface hardware/firmware and the OS do all of that.
More precisely, several drivers are involved. The device itself usually covers OSI layers 1 (physical) and 2 (data link), i.e. translating between physical signals (electrical, radio) and bits, and grouping those bits into frames, is done by the hardware/firmware. The device driver controls the corresponding functions. Layers 3 (network) and 4 (transport) are handled by the OS, although privileged userland processes may get access to the data.
How does the OS get a hold of the frame?
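Once the network interface has a frame available, it triggers an interrupt so that the kernel can read the frame from the corresponding I/O memory region (or, with DMA-capable devices, from a buffer in main memory that the device has already filled).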
The frame header tells the kernel what to do with the frame. If the kernel (more precisely, the network driver) recognizes the type of the payload (e.g. IPv4 or IPv6, which on Ethernet is indicated by the EtherType field), the packet is passed up to the corresponding network layer driver after the layer 2 headers and trailers have been stripped off.
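To make the dispatch step concrete, here is a minimal userland sketch of what "look at the frame header and hand the payload to the right handler" means for Ethernet. It is only an illustration, not how any particular kernel implements it: the handler functions and the HANDLERS table are made-up names, and the sketch assumes an untagged Ethernet II frame (no VLAN tag) whose trailing checksum has already been removed by the hardware.

```python
import struct

# Well-known EtherType values for the two payload types mentioned above.
ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_IPV6 = 0x86DD

def handle_ipv4(packet: bytes) -> str:
    # Hypothetical stand-in for the IPv4 network layer driver.
    return f"IPv4 packet, {len(packet)} bytes"

def handle_ipv6(packet: bytes) -> str:
    # Hypothetical stand-in for the IPv6 network layer driver.
    return f"IPv6 packet, {len(packet)} bytes"

# Hypothetical dispatch table: EtherType -> network layer handler.
HANDLERS = {ETHERTYPE_IPV4: handle_ipv4, ETHERTYPE_IPV6: handle_ipv6}

def dispatch_frame(frame: bytes) -> str:
    """Strip the 14-byte Ethernet header and pass the payload upwards."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    # 6 bytes destination MAC, 6 bytes source MAC, 2 bytes EtherType,
    # all in network byte order.
    _dst, _src, ethertype = struct.unpack("!6s6sH", frame[:14])
    handler = HANDLERS.get(ethertype)
    if handler is None:
        return f"dropped: unknown EtherType 0x{ethertype:04x}"
    return handler(frame[14:])
```

Calling dispatch_frame() on a frame whose bytes 12 and 13 are 08 00 ends up in handle_ipv4() with only the layer 3 packet; the MAC addresses never reach the upper layer.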
That driver handles the peculiarities of the network layer protocol, such as fragmentation and reassembly or checksum calculation and verification, as well as network layer tasks such as routing and packet filtering. Depending on the payload (TCP, UDP, ICMP, ICMPv6, GRE, SCTP, DCCP, etc.) the packet is passed further up to the corresponding transport layer driver, which handles everything specific to the transport layer. For connection-oriented protocols like TCP this means, for instance, keeping track of the connection state, maintaining the queues, doing the bookkeeping and performing congestion control.
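As a rough illustration of two of the network layer jobs named here (checksum verification and picking the transport protocol), the following sketch parses an IPv4 header in userland. The function names and the PROTOCOLS table are my own for this example; a real stack does much more (fragment reassembly, options, routing, filtering), none of which is shown.

```python
import struct

# A few IANA protocol numbers as found in the IPv4 "protocol" field.
PROTOCOLS = {1: "ICMP", 6: "TCP", 17: "UDP", 47: "GRE", 132: "SCTP"}

def internet_checksum(data: bytes) -> int:
    """16-bit one's complement checksum as used by the IPv4 header."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    # Fold the carries back into the lower 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def parse_ipv4(packet: bytes):
    """Verify the header and return (transport protocol name, payload)."""
    if len(packet) < 20:
        raise ValueError("packet shorter than a minimal IPv4 header")
    version = packet[0] >> 4
    header_len = (packet[0] & 0x0F) * 4  # IHL is counted in 32-bit words
    if version != 4 or header_len < 20 or len(packet) < header_len:
        raise ValueError("not a valid IPv4 header")
    # A header that includes a correct checksum field sums up to zero.
    if internet_checksum(packet[:header_len]) != 0:
        raise ValueError("bad header checksum")
    total_len = struct.unpack("!H", packet[2:4])[0]
    proto = packet[9]
    name = PROTOCOLS.get(proto, f"unknown({proto})")
    # Strip the network layer header before passing the payload upwards.
    return name, packet[header_len:total_len]
```

The returned name stands in for the decision which transport layer driver gets the payload next; a packet with a damaged header is rejected before it gets that far.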
Once a packet has passed that layer, it is passed further up (again after stripping a header, this time the transport layer one), now to the userland process. The userland process sees virtually nothing of the lower-level processing.
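You can see this from the userland side with an ordinary socket. The sketch below sends a UDP datagram over the loopback interface and reads it back; the function name is made up for this example, and it assumes that 127.0.0.1 is usable on the machine. What the receiver gets is just the payload plus the sender's address, with no frame, IP or UDP header anywhere in sight.

```python
import socket

def udp_loopback_roundtrip(message: bytes):
    """Send a datagram to ourselves and return what the receiver sees."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        # Port 0 lets the kernel pick a free port for the receiver.
        receiver.bind(("127.0.0.1", 0))
        receiver.settimeout(2.0)
        sender.sendto(message, receiver.getsockname())
        # recvfrom() returns only the payload and the peer address;
        # all headers were consumed by the kernel on the way up.
        data, peer = receiver.recvfrom(65535)
        return data, peer
```

For instance, udp_loopback_roundtrip(b"hello") gives back the five payload bytes unchanged, although the kernel wrapped them in UDP and IP headers on the way out and unwrapped them again on the way in.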
Now, it is possible to access and process raw data such as frames from userland; you can do that with software like libpcap. However, doing all of that in the kernel is more efficient.