How does screenshotting work?



I'm currently writing an exam paper about in-game photography and in doing so I wish to (briefly) point out the differences between a camera and screenshotting - but I don't really know enough about the technical side of the latter to do so competently. I was hoping that some of you might know?

In other words, how does screenshotting work? Does it "freeze" pixels? Is it the graphics card or its driver that does the work? Some other hardware or software component?

Whereas in photography the light is captured by a sensor, the computer instead emits light through the screen - but I'm thinking that the screenshot itself happens way before the data has been turned into pixels?

Erica Wyrdling

Posted 2019-12-30T17:05:53.517

Reputation: 149

Question was closed 2020-01-01T10:12:21.983



but I'm thinking that the screenshot itself happens way before the data has been turned into pixels?

It happens before the data has been turned into physical pixels (if there are any), but it happens after the data has been turned into pixel values, i.e. a bitmap image.

For example, if a program is displaying text or vector graphics or 3D visuals, the screenshot process doesn't care about that at all, it only cares about the resulting image after those graphics have been rendered into a bitmap.
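To make that concrete, here is a toy sketch in Python (not any real graphics API; `rasterize_rect` and `screenshot` are made-up names). A "vector" rectangle is rendered into a grid of pixel values, and the screenshot step only ever copies that grid:

```python
# Toy model: a vector-style shape is rasterized into a bitmap,
# and the screenshot only ever sees the bitmap.
WIDTH, HEIGHT = 8, 4
BLACK, WHITE = (0, 0, 0), (255, 255, 255)

def rasterize_rect(x0, y0, x1, y1):
    """Render a filled rectangle into rows of (r, g, b) pixel values."""
    bitmap = [[BLACK for _ in range(WIDTH)] for _ in range(HEIGHT)]
    for y in range(y0, y1):
        for x in range(x0, x1):
            bitmap[y][x] = WHITE
    return bitmap

def screenshot(bitmap):
    """Copy the pixel values; no knowledge of the rectangle survives."""
    return [row[:] for row in bitmap]

frame = rasterize_rect(2, 1, 6, 3)
shot = screenshot(frame)
for row in shot:
    print("".join("#" if px == WHITE else "." for px in row))
```

The copy contains only colours per pixel. Whether they came from text, vectors or a 3D scene can no longer be told from the data.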

However, the screenshot is taken directly from the OS memory, or at worst, read back from the GPU memory – it is not captured from the actual VGA or HDMI signals.

In other words, how does screenshotting work? Does it "freeze" pixels? Is it the graphics card or its driver that does the work? Some other hardware or software component?

Depends on the OS that you're asking about. Generally, the core graphics system (the same one which lets apps put windows on screen, such as GDI on Windows or X11 on Linux) will keep an in-memory copy of all pixels on screen (i.e. the framebuffer), so that they can be sent to the GPU again whenever needed. So it simply provides functions for programs to retrieve that copy.

For example, on Windows there are the GetDC() and GetWindowDC() functions, which return a device context for the screen or a window that a program can then copy pixels out of (for example with BitBlt()). On Linux, the X11 system has somewhat similar methods such as XGetImage(). These just give the program a bitmap image that's already held somewhere in the system RAM, without any special hardware involvement.

(Although in some cases, e.g. with GNOME on Linux, the window manager actually uses the GPU to compose the screen's contents – so in order to make a screenshot it actually has to request the data back to the CPU first.)
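As a rough illustration of "the graphics system keeps a copy and hands it out on request", here is a small Python model. `ToyGraphicsSystem` and its methods are hypothetical stand-ins, not the real GDI or X11 calls; the region arguments of `get_image` are only loosely modelled on what such functions tend to accept.

```python
class ToyGraphicsSystem:
    """Hypothetical stand-in for a windowing system's framebuffer (not a real API)."""

    def __init__(self, width, height):
        self.width = width
        self.height = height
        # One 0xRRGGBB integer per pixel, stored row by row.
        self.framebuffer = [0x000000] * (width * height)

    def fill_rect(self, x, y, w, h, color):
        """What an application does when it draws: overwrite pixel values."""
        for row in range(y, y + h):
            start = row * self.width + x
            self.framebuffer[start:start + w] = [color] * w

    def get_image(self, x, y, w, h):
        """Return a copy of a screen region as a list of rows."""
        return [
            self.framebuffer[row * self.width + x: row * self.width + x + w]
            for row in range(y, y + h)
        ]

gfx = ToyGraphicsSystem(4, 3)
gfx.fill_rect(1, 1, 2, 1, 0xFF0000)   # an app draws a red strip
shot = gfx.get_image(0, 0, 4, 3)      # the "screenshot"
gfx.fill_rect(0, 0, 4, 3, 0x00FF00)   # the screen changes afterwards
print(shot[1])
```

So nothing is "frozen": the screen keeps changing, and the screenshot is simply a copy of the pixel values taken at one moment.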

As a side note, there can be some differences between what's in the framebuffer and what's actually being displayed. For example, many video games will produce very dark screenshots because they use the GPU's gamma correction feature to adjust the image brightness, and this correction is only applied as a last step when producing the video signal – so screenshots will only capture the uncorrected, dark-looking image. (Unless the game actually overrides the whole OS screenshot function with its own.)
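The gamma effect can be sketched in a few lines of Python. This is a simplified model, and the numbers are assumptions: real gamma ramps are per-channel tables set through the driver, and 2.2 is just an arbitrary brightening value.

```python
def build_gamma_ramp(gamma):
    """256-entry lookup table, similar in spirit to a GPU gamma ramp."""
    return [round(255 * (i / 255) ** (1 / gamma)) for i in range(256)]

framebuffer = [10, 40, 80, 160]   # dark-looking values the game rendered
ramp = build_gamma_ramp(2.2)      # assumed brightening ramp

# The ramp is applied only when the video signal is produced...
displayed = [ramp[value] for value in framebuffer]
# ...while the screenshot copies the framebuffer before that step.
screenshot = list(framebuffer)

print("on the monitor:", displayed)
print("in the file:   ", screenshot)
```

Every displayed value ends up brighter than the stored one, which is why the saved file looks darker than what was on the monitor.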


Posted 2019-12-30T17:05:53.517

Reputation: 283 655

So is what you're saying as simple as "the operating system (potentially) has an API which provides functionality to get whatever is currently displaying on screen" and it's just a matter of working out what OS you're on, what APIs are available, and then calling the corresponding function? – will – 2019-12-31T18:58:10.320

@will Pretty much, though it's specifically the graphics system, which might be part of the OS (as in Windows), a managing program running on top of the OS (Windows 3, X, Wayland), or a single program handling all graphics (such as games both modern and ancient). – chrylis -on strike- – 2020-01-01T03:34:26.010

On Linux the OS itself has a graphics API or two (DRI, fbdev) which Xorg itself uses, and of course in an X11 environment you'd normally use the X11 APIs provided by Xorg, but technically you could also grab an image using the privileged DRI or fbdev APIs... On Windows, I actually have no idea where the boundary between OS kernel and graphics system is at all. – user1686 – 2020-01-01T11:00:37.853

@chrylis-onstrike- Thanks. I figured there would be modules/layers/whatever they're called that handle graphics, but didn't know how to describe that aspect of the system as a whole. And I would have guessed that there would be APIs at multiple levels that would call the lower-level APIs of other systems (as (I believe?) OpenGL does with the APIs of the graphics cards). – will – 2020-01-03T18:41:36.293


One way of looking at the difference is to consider the results of the two.

A screenshot is the equivalent of the computer taking a full screen image in digital form and saving it as a file. In this manner, the digital information is as precise as it can be, based on the monitor and display adapter capability. If you have a 4K capable card and display, your screen capture will be 4K at perfect detail.

A camera snapshot of a screen, on the other hand, is a digital to analog to digital conversion. The first digital is the aforementioned information coming from the display adapter. The analog portion is the transmission of light from the display to your eyes and/or camera, while the final digital is the conversion of that light to digital via the camera digital sensor.

There is going to be a substantial difference in the quality of the image provided by the camera compared to the screen capture. The camera adds even more reduction of quality by passing the "signal" in the form of light through lenses with aberrations and losses.


Posted 2019-12-30T17:05:53.517

Reputation: 1 377

The other difference between a screenshot app and a camera is that some screenshot apps allow the user to select whether or not to record specific elements that appear on the screen such as window border, mouse cursor, pop-up notifications, etc. – karel – 2019-12-30T17:36:27.250


A camera reads data from a light sensor and stores that data in RAM or other storage. In the case of a video camera as opposed to a still one, it's doing this continuously. The "raw" data from the sensor may not be compatible with the format needed by a display device, such as a PC graphics card or the LCD on a camera, so if the device with a camera needs to display what the camera is seeing, a conversion from the camera format to display device format is needed.

A screenshot is an export of data that already exists in RAM being used by a video card or eventually destined for a display device. Typically this data is in the format a PC graphics card or other display device expects. When it's captured, it has to be converted from this format to a well-known image format.

So the main difference is one of data flow:

Camera -> RAW data -> capture (copy) to storage or RAM -> display device binary format -> display device video RAM -> display device (if what camera is seeing should be directly displayed)

Camera -> RAW data -> capture (copy) to temp storage or RAM -> convert from there to JPEG, etc. (if what camera is seeing should be saved to file)

Display device -> display device video RAM -> display device binary format -> capture (copy) to other system RAM -> convert from there to BMP, JPEG, etc. (saving what display device is using to generate picture to file)
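The last step of that third flow (raw pixel values to a well-known file format) might look like the following Python sketch. It assumes the captured data is already available as rows of (r, g, b) tuples, which is a simplification, and it writes the simplest BMP variant (uncompressed, 24-bit). `encode_bmp` is a made-up helper name.

```python
import struct

def encode_bmp(pixels):
    """Encode rows of (r, g, b) tuples as an uncompressed 24-bit BMP."""
    height = len(pixels)
    width = len(pixels[0])
    padding = (-width * 3) % 4            # rows are padded to a multiple of 4 bytes
    body = bytearray()
    for row in reversed(pixels):          # BMP stores the bottom row first
        for r, g, b in row:
            body += bytes((b, g, r))      # channel order in the file is B, G, R
        body += b"\x00" * padding
    header_size = 14 + 40                 # file header + info header
    file_header = struct.pack(
        "<2sIHHI", b"BM", header_size + len(body), 0, 0, header_size
    )
    info_header = struct.pack(
        "<IiiHHIIiiII",
        40, width, height, 1, 24, 0, len(body), 2835, 2835, 0, 0,
    )
    return file_header + info_header + bytes(body)

# A 2x2 "capture": red, green on top; blue, white below.
captured = [[(255, 0, 0), (0, 255, 0)],
            [(0, 0, 255), (255, 255, 255)]]
data = encode_bmp(captured)
print(len(data), "bytes")
```

Writing `data` to a file opened in binary mode should give something an image viewer can open. JPEG or PNG would add a compression step on top of the same idea.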


Posted 2019-12-30T17:05:53.517

Reputation: 63 487


Preface: This answer isn't meant to fully answer the question (the existing answers do that pretty well), but it's just some conceptual background that's too long for a comment.

A big portion of software engineering just boils down to designing good abstractions, system boundaries, and breaking down big problems into smaller simple modules that compose together to form the total solution. This is a perfect example of that in action.

Operating systems have two broad components in play here: some kind of GUI renderer, and some kind of output mechanism that interfaces with it. While implementation details may differ, conceptually it's really simple. A video screen is just one kind of output device, probably the most common one, but not the only one.

A remote desktop client is another. For example, Windows' remote desktop feature lets you log into a session on a computer, even while someone is physically using the computer for another session. Your session's graphics are streamed to your machine over the network, while the other user's session's graphics are displayed on the monitor as normal.

Saving to a file (producing a screenshot) is just another kind of output device.

The beauty here is that there doesn't need to be a separate system for rendering GUIs for screenshots alongside the one that renders GUIs for the screen. The same rendering can be used, and it's then interfaced with different output systems (hardware screen/RDP/screenshot/screen recorder).
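Here's a minimal Python sketch of that idea, purely conceptual: all class and function names are invented for illustration and don't correspond to any real OS component. The frame is rendered once and then handed to whichever outputs are plugged in.

```python
def render_gui():
    """Stand-in for the real renderer: returns one frame as rows of pixel values."""
    return [[0, 1, 1, 0],
            [0, 1, 1, 0]]

class MonitorSink:
    """Pretends to scan the frame out to a physical display."""
    def __init__(self):
        self.last_frame = None

    def present(self, frame):
        self.last_frame = frame

class ScreenshotSink:
    """Keeps a private copy of the frame, as a screenshot tool would."""
    def __init__(self):
        self.saved = []

    def present(self, frame):
        self.saved.append([row[:] for row in frame])

class RemoteSink:
    """Serializes the frame to bytes, as a remote-desktop stream might."""
    def __init__(self):
        self.packets = []

    def present(self, frame):
        self.packets.append(bytes(px for row in frame for px in row))

sinks = [MonitorSink(), ScreenshotSink(), RemoteSink()]
frame = render_gui()        # rendered once...
for sink in sinks:
    sink.present(frame)     # ...then handed to every output
```

The renderer never needs to know which sinks exist. Adding a screen recorder would just mean adding one more class with a `present` method.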

Ideally, the interface for systems like this should be as generic as possible, so that it's simple, and so that any implementation can come and plug itself in without much complexity.

However, there are times when complicating the interface might pay off, because it lets you do more niche things. For example:

  • Windows' RDP doesn't just consume the video output of the screen and stream it as if it were a Twitch live-stream. That uses too much bandwidth, sends too much redundant data, and has higher latency. Instead, RDP transmits drawing commands (e.g. write text "Hello World!" at px 200, 200, in 12 pt Helvetica), which the client uses to reproduce the GUI. Thus there must be a special mechanism in place to intercept GUI drawing calls before they're sent to the graphics card for rendering as usual (to a hardware screen).

    • This is in contrast with VNC, which does just stream the video output. This has the performance downsides I mentioned earlier, but it has a key benefit: because the interface is simpler/more-generic, more implementations can conform to it. VNC isn't tied to the particular GUI drawing commands of one OS or another, so it's much more broadly implemented, by more operating systems than just Windows.
  • macOS' screenshot capturing feature allows you to screenshot a window even if it's occluded by another window, or has transparency that shows what's underneath. The resulting screenshot won't be occluded (you can see it in its entirety), and won't show what's underneath the transparency. This tells us that there's some component of their GUI rendering system which allows the screenshot system to intercept the rendered output of a single window, before it's composited with the others to form the full screen's final frame.
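The per-window capture described in the last bullet can be modelled with a toy compositor in Python. This is only a guess at the general shape of such a system, not how macOS actually implements it; `Window`, `composite` and `capture_window` are invented names.

```python
class Window:
    def __init__(self, name, x, y, width, height, fill):
        self.name = name
        self.x, self.y = x, y
        # Each window renders into its own off-screen buffer.
        self.buffer = [[fill] * width for _ in range(height)]

def composite(windows, screen_w, screen_h, background="."):
    """Paint windows back to front into one full-screen frame."""
    frame = [[background] * screen_w for _ in range(screen_h)]
    for win in windows:                      # later windows cover earlier ones
        for dy, row in enumerate(win.buffer):
            for dx, px in enumerate(row):
                frame[win.y + dy][win.x + dx] = px
    return frame

def capture_window(win):
    """Copy one window's buffer before it is composited with the others."""
    return [row[:] for row in win.buffer]

back = Window("editor", 0, 0, 4, 3, "E")
front = Window("dialog", 2, 1, 3, 2, "D")

screen = composite([back, front], 6, 4)
editor_shot = capture_window(back)

print("\n".join("".join(row) for row in screen))
print()
print("\n".join("".join(row) for row in editor_shot))
```

In the composited frame the dialog hides part of the editor, but the editor's own buffer is still complete, so a screenshot taken from that buffer shows the whole window.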

Alexander - Reinstate Monica

Posted 2019-12-30T17:05:53.517

Reputation: 283

An interesting implication is that you can make screenshots without showing the image. In X11, what you see is a copy of the screen buffer. It's basically copied to the graphics card to show it. You can just not do that. And copy single frames to a screenshot. You could run a web browser in a screen buffer of arbitrary size (independent of your display), and make a screenshot of that. – Volker Siegel – 2020-01-01T02:38:43.017

Windows RDP always did have the capability to stream the raw bitmap output, although it was rarely needed -- but you could still see a window being loaded in 128x128px tiles. But nowadays more and more programs kind of require it (due to using internal renderers and not GDI), especially the "UWP" apps, so the latest RDP versions embrace it -- they added JPEG and even H.264 support into the protocol. On a slower connection, you can frequently see a window first appearing with JPEG artifacts before it's smoothed out. On Linux, Xpra does the same for X11 too. – user1686 – 2020-01-01T11:05:16.227

As for screenshotting occluded windows: that's usually called "compositing" and is a normal feature on many windowing systems these days. It's present on many Linux desktop environments (GNOME, KWin, Compiz) and Windows gained a full compositing window manager (DWM) in Vista. – user1686 – 2020-01-01T11:07:04.143

@user1686 "But nowadays more and more programs kind of require it (due to using internal renderers and not GDI), especially the "UWP" apps, so latest RDP " that's a real shame. I've always been jealous of RDP's performance exactly because it wasn't just a video stream (coming from Mac OS) – Alexander - Reinstate Monica – 2020-01-01T17:30:05.360

I don't know about that, I've still found RDP quite fast these days, even if connecting to my home desktop with its miserable upload bandwidth. Certainly a better experience than with protocols which try to pretend nothing has changed (like ssh -X). It's not just UWP apps, though – you'll see the same with many web browsers (both due to their custom UI toolkits and due to HTML rendering engines making it a necessity), you'll see the same with GTK apps ported from Linux, and so on. And of course, in the Linux world, nearly all apps use client-side text rendering since 2000s. – user1686 – 2020-01-02T05:53:42.090


One thing I didn't see mentioned yet is that "screenshots" aren't always snapshots of the current frame, or captures of a "screen" at all.

You see, modern resolutions require huge amounts of pixel data to be transferred from the graphics processor (GPU) to the monitor many times a second. Both software and hardware have evolved to not transfer repeated information, so in particular the pixels rendered by the GPU are only sent to the monitor, not the CPU, unless requested.
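To put rough numbers on "huge amounts of pixel data", here is the arithmetic in Python. It assumes 4 bytes per pixel and 60 frames per second, which are common values but not universal ones.

```python
def frame_bytes(width, height, bytes_per_pixel=4):
    """Size of one uncompressed frame in bytes."""
    return width * height * bytes_per_pixel

def stream_bytes_per_second(width, height, fps, bytes_per_pixel=4):
    """Uncompressed data rate for a continuous stream of frames."""
    return frame_bytes(width, height, bytes_per_pixel) * fps

full_hd = frame_bytes(1920, 1080)
uhd_rate = stream_bytes_per_second(3840, 2160, 60)

print(f"one 1920x1080 frame: {full_hd / 1e6:.1f} MB")
print(f"3840x2160 at 60 fps: {uhd_rate / 1e9:.2f} GB/s")
```

That works out to about 8.3 MB for a single 1080p frame and close to 2 GB per second for uncompressed 4K at 60 fps, which is why nobody wants to copy every frame back to the CPU unless something asks for it.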

One consequence of this is that for a screenshot, pixel data often has to be "reconstructed", and at the very least it has to be sent back from the GPU to the CPU, both of which can take considerable time from the moment you press the PrtScrn button.

Still, newer GPUs can often reconstruct and send back data from a recent frame to the CPU even under heavy load, but a consequence of this is that the screenshot may be slightly outdated. You'll notice this delay even more when you try to stream or record; it can be over a second on some hardware.

Once again, the reason for this is an overflow of information: the millions of pixels on the GPU first have to be reconstructed/converted/compressed/whatever before they can be transferred to the CPU at a reasonable speed and in a format the CPU can understand.

Remember that both the CPU and GPU have to communicate and spend time waiting for each-other while doing this, and have to do other stuff in the meantime as well.

We're long gone from the age of sending pixel data directly to the monitor, or even having to worry about sending pixel data at all (software today is sending textures/models/triangles instead, which can convey the same picture with much less information). We are used to multiple/movable/overlappable/transparent windows, but there are actually many complicated systems that allow for this to happen, each of which might have its own way to obtain a "screenshot" with various levels of detail. I know of at least 4 ways to obtain screenshots on my Linux machine, each of which has its own benefits and drawbacks. And importantly, NONE of these methods actually guarantees that it captures exactly what was displayed on the screen.

Some systems can screenshot normal windows without delay, but not games, and not when under high load. Some systems may request screenshots from the GPU every frame, so you can get the perfect screenshot you want even when what you were capturing was occluded. Some can only "screenshot" windows one at a time, while others don't support real-time screenshots at all.

No two systems are the same, but the one thing every screenshotting mechanism has in common is that it has to worry about receiving 1920x1080 (or more!) pixels and converting them into an image file without locking up the entire computer. For that, compromises have to be made which cameras don't have to deal with.


Posted 2019-12-30T17:05:53.517

Reputation: 101