A GPU capture pipeline on AWS was hitting a hard 39fps ceiling. The bottleneck wasn't FFmpeg configuration — it was how x11grab physically moves frame data across the PCIe bus. The research doc below covers root causes, dead ends, the fix that shipped, and the NvFBC zero-copy path for hitting 60fps.
The research doc was fed into NotebookLM to generate an audio overview. The transcript and original notes are below.
NotebookLM Audio Overview — Breaking the 39fps Cloud Capture Limit
Transcript
Speaker 1: I want you to imagine just for a second that you're trying to record the perfect ultra-smooth high-definition video of a web browser.
Speaker 2: Just a standard browser, right?
Speaker 1: Yeah, exactly. But this browser isn't on your local machine. It's running entirely in the cloud on a server, you know, hundreds of miles away.
Speaker 2: Which sounds easy on paper.
Speaker 1: Right. On paper, it sounds incredibly simple. You just hit record. I mean, the hardware backing you up in a data center is basically infinitely powerful. So, what could possibly go wrong?
Speaker 2: A lot actually.
Speaker 1: So much. Because the moment you try to actually execute this, you just slam face first into these invisible walls — like screens that flat out refuse to be larger than a very specific mathematical size.
Speaker 2: Or frame rates that just arbitrarily max out.
Speaker 1: Exactly. Frame rates randomly hard capping at exactly 39 frames per second, or hardware components that absolutely refuse to talk to each other unless you manually override them.
Speaker 2: It's really the ultimate illusion of cloud computing. We expect this unlimited power and seamless operation because, well, that's how it's marketed to you.
Speaker 1: But underneath—
Speaker 2: Underneath that polished surface, it is a chaotic battleground. It's full of highly opinionated hardware and incredibly rigid software rules that just do not want to cooperate.
Speaker 1: Which brings us to today. Welcome to this deep dive. Our mission today is an intensely specific one. We are cracking open an incredibly detailed internal engineering document all about building a GPU capture pipeline.
Speaker 2: It's a fascinating read.
Speaker 1: It really is. We're going to look at the microscopic, frustrating, and honestly just wild battles that engineers have to fight to make cloud-based screen capture work flawlessly. And just to set the stage for you, our battlefield today is an AWS server, specifically a g4dn.xlarge instance.
Speaker 2: Right. And it's running an NVIDIA Tesla T4 GPU on an operating system called Amazon Linux 2023. This is about as under the hood as it gets. And we are going to tear it all apart.
Speaker 1: It really serves as a masterclass in systems engineering. I mean, when you are dealing with hardware in the cloud, it does not behave like a standard workstation sitting on your desk.
Speaker 2: Right. No monitor plugged in.
Speaker 1: Exactly. You don't have a monitor plugged into a display port. You don't have a physical keyboard. You're trying to trick a massive data center graphics card into believing it's a standard desktop setup.
Speaker 2: And it fights back.
Speaker 1: It fights you every single step of the way because its architecture is designed for heavy compute, not for driving a display.
Speaker 1: And the first thing the hardware throws at you is right at the display resolution. Like before we can even talk about getting a buttery smooth frame rate, we have to talk about simply getting the screen to be the correct size.
Speaker 2: The absolute basics.
Speaker 1: Yeah, the absolute basics. In the notes we're looking at, this is flagged as problem one — HiDPI failure.
Speaker 2: Which is a classic modern rendering problem. They wanted to capture this incredibly crisp, high-resolution screen. I think they were aiming for something equivalent to like a 14-inch MacBook Pro Retina display.
Speaker 1: But on startup, the entire system just threw an EINVAL error.
Speaker 2: An invalid argument.
Speaker 1: Right. And it completely crashed. And looking at the logs, it's entirely because of the graphics card. The NVIDIA Tesla T4 GPU utilizes a virtual display subsystem known as VGX.
Speaker 2: And VGX does not mess around.
Speaker 1: It really doesn't. This VGX system has a hard, unbreakable driver limit. It absolutely refuses to render a screen larger than 2560×1600 pixels.
Speaker 2: Which is exactly 4,096,000 pixels.
Speaker 1: Exactly. Not a single pixel more.
Speaker 2: You have to factor in how modern HiDPI screens actually work here. A modern MacBook screen often wants to render internally at twice its physical resolution just to look perfectly crisp, and then it scales that image down.
Speaker 1: So it's asking for a massive canvas.
Speaker 2: Right. So when the capture software asked the cloud GPU to create that massive virtual viewport to support the 2x scaling, the GPU's virtual display driver looked at the memory allocation request and essentially panicked. It just killed the process.
Speaker 1: Okay, let's unpack this. It's like you are a painter, right? And you're ordering a massive custom canvas for a masterpiece.
Speaker 2: Okay, I like this.
Speaker 1: But the factory that supplies your tools flat out refuses to build an easel wider than two and a half feet.
Speaker 2: It just won't do it.
Speaker 1: Right. It doesn't matter how big your studio is or how much paint you have, the easel simply won't expand. But I have to push back here — if I'm understanding this correctly, an EDID override should just fake the monitor's digital ID, right?
Speaker 2: You would think so.
Speaker 1: Usually the operating system blindly trusts the EDID. So how is the hardware completely ignoring a software override?
Speaker 2: What's fascinating here is how deep that limitation is actually buried. Because the engineers absolutely tried to force it. I mean they threw every standard Linux display trick at it.
Speaker 1: Like what?
Speaker 2: Well, they tried modifying the xorg.conf Virtual settings. They tried using NVIDIA's MetaModes. And yeah, they tried EDID overrides — injecting a totally fake monitor profile. Nothing.
Speaker 1: None of it worked.
Speaker 1: So it's like intercepting the request before the OS even gets a vote.
Speaker 2: Precisely. The VGX driver enforces this cap at the hardware abstraction layer — way before any of those higher-level X11 software layers even boot up. It is baked right into the foundational logic of that virtual display driver. It's likely there as a way to manage strict resource allocation across a shared data center environment. The software is shouting "make it bigger," but the hardware constraint is just absolute.
Speaker 1: So how do you bypass a wall you fundamentally cannot break?
Speaker 2: You build a very clever dynamic workaround. Since they literally can't make the virtual screen bigger than 2560×1600 without triggering that EINVAL crash, they implemented a dynamic cap on the capture scale variable.
Speaker 1: Okay. What does that do?
Speaker 2: They mathematically forced the system to evaluate the requested resolution and say: okay, what is the largest multiplier I can use without hitting that 4 million pixel ceiling?
Speaker 1: Oh, I see.
Speaker 2: Yeah. So for the MacBook Pro 14-inch target, that multiplier is forced all the way down to one.
Speaker 1: Okay. So they capture the frame at the smaller native size just to keep the system from crashing, but the user still expects a massive high-res video output at the end of this.
Speaker 2: Exactly. And that's where the post-processing comes in. They hand that smaller natively captured frame off to FFmpeg and have FFmpeg artificially upscale the video using a hardware-accelerated tool called scale_cuda.
Speaker 1: So it's a bit of a compromise.
Speaker 2: It is a tiny quality trade-off. You aren't getting a true native 2x resolution capture, but it totally bypasses the hard hardware wall and most importantly it keeps the capture pipeline alive.
Speaker 1: So we've essentially tricked the hardware into giving us a smaller canvas and we are artificially stretching it in post using CUDA. But the moment you start digitally magnifying a screen like that, you are asking the system to push a massive amount of visual data very quickly.
Speaker 2: And that immediately exposes the next weak link in the chain — the sheer speed at which the computer can move those pixels. The system stops crashing on startup, but it immediately hits a brutal performance ceiling. The engineers noted the system was suddenly stuck at a hard speed limit of about 39 frames per second.
Speaker 1: Which is a complete non-starter. I mean, if you were trying to deliver a buttery smooth 60 frames per second experience to an end user, 39 just isn't going to cut it.
Speaker 2: Not at all.
Speaker 1: The documentation shows the engineers looking at the pipeline and they see it's using a tool called x11grab to pull the video frames from the X server.
Speaker 2: Standard tool.
Speaker 1: But x11grab relies on a function called XShmGetImage. And here is the fatal flaw. That function requires the system CPU to physically read the frame from the X server's shared memory buffer.
Speaker 2: Which means one CPU readback per frame.
Speaker 1: Exactly.
Speaker 2: It is a massive architectural detour.
Speaker 1: It really is. It's like having a master chef — and in our case the chef is the GPU's dedicated hardware encoder, a chip called NVENC that is specifically designed to crunch video. So this chef is standing in a world-class kitchen. They are ready to cook 100 meals an hour. But the chef is completely starved for ingredients because the waiter — which is that CPU readback process — is walking back and forth to the fridge grabbing exactly one carrot at a time.
Speaker 2: One carrot at a time. That's a great way to put it. The chef isn't the bottleneck. The delivery method is. The NVENC encoder is sitting there at only like 18 to 19% utilization. But my instinct as someone who tweaks systems is to just optimize the software — can't we just tune FFmpeg, change the thread count, maybe give it real-time priority?
Speaker 1: If we connect this to the bigger picture of how a motherboard is actually laid out, you'll see why tweaking FFmpeg does absolutely nothing here.
Speaker 2: Really? Nothing?
Speaker 1: Nothing. This isn't a software configuration issue. It is a physical hardware bottleneck. The constraint is the actual delivery rate of the frames across the PCIe bus.
Speaker 2: Because the data has to physically travel from the graphics card across the motherboard to system memory, get processed by the CPU, and then travel all the way back across the PCIe bus to the GPU for encoding. You're basically paying the transit tax twice.
Speaker 1: Wow. That's incredibly inefficient.
Speaker 2: It is. The engineers note that this CPU readback method is actually acceptable for now — at 39fps the recordings still look pretty great for normal web browsing or typical low-motion workloads. But they make it very clear that finding a true solution to hit 60fps means entirely rethinking how the frames move through the system. You basically have to fire the waiter and build a conveyor belt directly to the chef.
Speaker 1: Build the conveyor belt. I love that. But to build that conveyor belt, the engineers realize they have to navigate a completely different nightmare — the treacherous environment of cloud server operating systems and Docker containers.
Speaker 2: Yeah, this is the pivot where it goes from being a pure hardware rendering problem to a massive DevOps labyrinth. Because to implement the ultimate fix for this 39fps limit, they need access to newer capture protocols. But their host operating system, Amazon Linux 2, is going end-of-life in June 2026.
Speaker 1: And because of its age, it's permanently stuck on older NVIDIA drivers — specifically version 550.
Speaker 2: So the logical step is to upgrade the whole server infrastructure to the newer Amazon Linux 2023, right? Which gives them access to the crucial newer driver, version 580.126.16. Upgrade the OS, update the driver. Problem solved.
Speaker 1: In a traditional desktop environment, yes. In a multi-tenant cloud environment, not even close.
Speaker 2: Far from it. Because they are running this entire browser pipeline inside a Docker container. And the cloud environment handles hardware acceleration inside containers in a highly restricted way.
Speaker 1: Very restricted.
Speaker 2: Because of a specific runtime setting — NVIDIA_VISIBLE_DEVICES=void — the standard easy way of injecting the graphics driver into the container is completely disabled.
Speaker 1: It intentionally blinds the container to the host's GPU.
Speaker 2: And you know, this makes sense from an architecture standpoint. In a cloud environment, isolation is your primary security measure. By default, you do not want a containerized application having raw unfiltered access to physical hardware like a GPU because that completely breaks the sandbox.
Speaker 1: So the container is totally blind to the graphics driver. It desperately needs to encode this video. But here's where it gets really interesting. If you've ever tried to install a custom graphics driver on your own Linux machine and ended up staring at a black screen, you know how unforgiving this is.
Speaker 2: Oh yeah, it's brutal.
Speaker 1: Now imagine doing that blind on a server 500 miles away. The engineers can't just download a standard prepackaged driver like version 580.126.09 and install it inside the container.
Speaker 2: No, because they would instantly trigger a kernel panic.
Speaker 1: Exactly. Or rather, the driver would refuse to load to prevent one. The NVIDIA driver operates at ring zero — the deepest privilege level of the operating system.
Speaker 2: It has ultimate power.
Speaker 1: Right. And because it has that level of access, it performs a ruthlessly strict kernel version check. If the driver version inside the container doesn't perfectly, meticulously match the subversion of the host server's kernel, it triggers an instant fatal error. It just says "no screens found."
Speaker 2: It's essentially acting like a bouncer at an exclusive nightclub. If your ID says you are 21 years and 4 months old, but the guest list says 21 years and 5 months — you are not getting in.
Speaker 1: No exceptions. There's no negotiation. Total rejection. And in a cloud context, a kernel mismatch doesn't just crash your little application. It risks destabilizing the entire virtual machine.
Speaker 2: So standard installation is completely off the table. They're essentially forced to go full Ocean's Eleven.
Speaker 1: That's exactly what it is. A heist.
Speaker 2: It is. To get this driver into the container without triggering the security alarms, they literally have to download the massive official NVIDIA Tesla .run file. They crack it open using an extract-only command to prevent it from trying to install itself and polluting the system. And then they surgically remove one specific tiny file — nvidia_drv.so — and manually smuggle it into a highly specific non-default folder inside the container.
Speaker 1: And that folder isn't even in the Linux system's default search path. The dynamic linker has no idea that file exists. They have to write a custom configuration file just to point Xorg to this smuggled .so file. But honestly, the most absurd yet brilliant part of this whole container heist is how they handle the display server's timing.
Speaker 2: Oh, the fake-out ready signal. Explain this — when I read this part of the document, it totally blew my mind.
Speaker 1: So in a standard Linux system, when Xorg starts up, it creates a Unix domain socket. Usually the moment that communication socket appears in the filesystem, it acts as a signal to all the other processes: "I'm initialized. My drivers are loaded. Send me graphics."
Speaker 2: Okay, sounds normal.
Speaker 1: But in this weird headless containerized cloud environment, Xorg creates that socket before screen initialization is actually complete.
Speaker 2: It lies.
Speaker 1: It completely lies. It creates the socket and then a second later, it might fail its internal driver checks and just crash — leaving the rest of the system trying to send video data to a dead program.
Speaker 2: So the engineers had to build in a paranoid delay. You cannot trust the presence of the socket. You have to force the system to wait exactly 3 seconds and then interrogate the system's process ID list to confirm the Xorg server is actually alive, stable, and breathing before you send it a single frame.
Speaker 1: It really emphasizes that cloud infrastructure isn't just magic. We hit deploy on AWS and assume it works seamlessly, but underneath it requires this meticulous exact version matching, file smuggling, and completely bypassing the default behaviors of the operating system just to get a video feed.
Speaker 2: It's a house of cards. But it's glued together with incredibly robust engineering.
Speaker 1: Okay, so let's look at where we are. They've mathematically bypassed the pixel cap. They've smuggled the driver file into the container. They've bypassed the fake-out socket. All of this careful hacking is setting the stage for what the document calls Option E — the holy grail of screen capture.
Speaker 2: The zero-copy dream.
Speaker 1: Yes. Option E utilizes something called NVIDIA Frame Buffer Capture — or NvFBC. They interface with it using a tool called gpu-screen-recorder. And this creates what the engineers call a zero-copy CUDA encode path. Walk us through the mechanics of what that actually means, because it sounds like sci-fi.
Speaker 2: Remember that architectural bottleneck we discussed earlier? The waiter carrying the carrot across the PCIe bus.
Speaker 1: Yeah. The CPU reading the frame from shared memory, dragging it across the motherboard and sending it back.
Speaker 2: Right. Zero-copy architecture entirely eliminates the waiter. When the frame is rendered by the web browser, it lives in the graphics card's local memory — the frame buffer. Instead of copying that massive multi-megabyte image payload over to the CPU, the system just generates a CUDA pointer. A pointer is basically a tiny little sticky note with a memory address that says: "Hey, the image payload is located right over here."
Speaker 1: Such a reference.
Speaker 2: Exactly. The capture software takes that tiny sticky note and hands it directly to the NVENC hardware encoder. The encoder reads the sticky note, reaches directly into its own local graphics memory, grabs the frame, and encodes it into video. The actual image data never touches the CPU.
Speaker 1: It never crosses the PCIe bus. Zero copies of the data are made. That is incredibly elegant. It's just handing over an address instead of physically moving the house to a new location.
Speaker 2: Yeah.
Speaker 1: So wait — does this magical NvFBC pipeline bypass that annoying 2560×1600 pixel cap we talked about at the very beginning? Can we finally ditch the dynamic scaling and get our massive native canvas?
Speaker 2: Ah, this raises an important question about where these protocols actually live in the stack. And the answer is a heartbreaking no.
Speaker 1: Seriously? It's still capped?
Speaker 2: It is still capped. NvFBC is bound directly to the X screen resolution. And remember that VGX virtual display driver caps the resolution way down in the basement long before NvFBC ever gets a chance to see the frame buffer.
Speaker 1: Right. Because the hardware says no first.
Speaker 2: It's a completely different layer of the hardware abstraction. So that 4 million pixel limit remains absolute.
Speaker 1: Okay. Slightly tragic that we are still relying on the scale_cuda upscale workaround. But what about the speed limit? Does this fix the 39fps ceiling?
Speaker 2: That is precisely where Option E shines. Because you remove the CPU readback bottleneck entirely, that frame rate limit completely shatters. You go from a struggling 39fps straight to a buttery smooth 60fps. And realistically, the hardware could push way beyond that if it needed to. The NVENC encoder finally gets the data fast enough to run at full speed.
Speaker 1: Okay, that is a huge win for the architecture. But reading the notes, there are always caveats with this stuff. What's the trade-off here?
Speaker 2: There are a couple of notable ones. First, you run into codec limitations based on the hardware age. The Tesla T4 GPU is a little older — it does not have hardware encoding support for AV1, which is the newest bandwidth-efficient video format.
Speaker 1: Okay.
Speaker 2: So by relying entirely on the hardware encoder, you are forced to use slightly older formats like H.264 or HEVC.
Speaker 1: Which honestly isn't a dealbreaker. Those are still massive industry standards and highly compatible.
Speaker 2: True. But the much weirder caveat is a cold-boot race condition that the engineers discovered.
Speaker 1: A race condition. How does that factor into capturing a screen?
Speaker 2: It happens right when the container first boots up. There is an internal flag in the display pipeline called bInModeset. Mode setting is the process where the operating system negotiates with the graphics card to figure out the display resolution, refresh rate, and color depth. Standard startup stuff. But because this is a virtual cloud environment without a physical monitor attached, the graphics card takes a seemingly random amount of time to figure out what its monitor situation actually is.
Speaker 1: And if the capture software tries to grab a frame while the GPU is still trying to figure out its mode—
Speaker 2: It completely fails. The capture process crashes. So the software has to be programmed to sit there actively retrying for up to 30 seconds on startup, just waiting for the bInModeset flag to clear and the hardware to finally settle down.
Speaker 1: It's like waiting for a really old vacuum tube TV to warm up before you can actually watch a channel.
Speaker 2: Precisely. You have to build patience into the code. But once it warms up and that mode setting is complete, you have an unstoppable, CPU-free 60fps capture pipeline that runs beautifully.
Speaker 1: So what does this all mean for you listening to this? Why did we just spend the last 15 minutes dissecting VGX display drivers, PCIe bus bottlenecks, and Linux container configuration?
Speaker 2: Good question.
Speaker 1: It's because of this. The next time you log onto your computer and use a piece of cloud-based software, or stream a high-end video game over the internet, or record a complex remote webinar — and it all runs buttery smooth on your screen — you are experiencing a modern miracle of hidden labor. It's never just plug-and-play.
Speaker 2: Exactly. That smoothness you take for granted is the direct result of unseen engineers fighting these microscopic, maddening battles. Battles against arbitrary hardware caps, against CPU memory bottlenecks, against strict kernel version checks that threaten to bring the whole thing crashing down. We treat the cloud like it's magic, but under the surface it is held together by incredibly brilliant duct tape and smuggled .so files.
Speaker 1: Absolutely. We always say the cloud is just someone else's computer, but we forget that someone else's computer has its own stubborn opinions and security protocols. But before we wrap up today's deep dive, I want to leave you with one final thought from the deepest, most buried part of this engineering document.
Speaker 2: Oh, I love a good footnote. Lay it on us.
Speaker 1: We spent this whole time talking about how that 2560×1600 pixel cap on the Tesla T4 is an absolute, unbreakable hardware wall. How it's enforced by the VGX driver and absolutely no software override can break it.
Speaker 2: Right. The problem that forced us to use dynamic scaling in the first place.
Speaker 1: Well, there is a tiny footnote in the source material — a reference to the official AWS EC2 documentation — and it mentions something called an NVIDIA GRID driver.
Speaker 2: Okay, what is that?
Speaker 1: Apparently, applying this specific enterprise GRID driver enables a completely different operational state for the GPU called Quadro Virtual Workstation mode.
Speaker 2: Wait — what does a virtual workstation mode do to the display pipeline?
Speaker 1: That is the mystery. The sources note that it is completely undocumented whether applying this specific enterprise licensing flag suddenly unlocks that virtual display cap on this specific G4DN server.
Speaker 2: No way. Are you serious? It might just be a licensing lock? It raises a fascinating question about the very nature of the hardware we use and how it's segmented. What if the key to unlocking true native 4K resolution in the cloud isn't about buying a better, physically more powerful graphics card? What if it isn't about writing smarter capture code or bypassing CPU bottlenecks with zero-copy architecture? What if the unbreakable hardware wall we've been fighting this entire time is just an artificial software lock — waiting for the right hidden licensing flag buried deep in a corporate manual to be flipped?
Speaker 1: So to bring it all back to that perfect recording we talked about at the very beginning — you hit record, the screen looks flawless, the frame rate is a perfect 60 — and you have to wonder: did you finally beat the hardware through brilliant engineering? Or did you just finally pay the right toll to unlock the features that were there all along? That is definitely something to think about the next time you stream a video from the cloud.
Research Notes
GPU Capture Pipeline — Research & Options
Last updated: 2026-03-17
Hardware: AWS g4dn.xlarge — NVIDIA Tesla T4 (Turing, datacenter)
OS: Amazon Linux 2023 (AL2023) ECS GPU AMI — kernel 6.1.163, NVIDIA driver 580.126.16
Current State
Deployed pipeline (this branch)
Chrome (Vulkan/ANGLE, VK_KHR_xlib_surface) → headless Xorg (:99, NVIDIA DDX, container-internal) → x11grab → NVENC → H.264
Status: Working. Recordings look excellent.
The key quality improvement over the previous host-Xorg path: removing `--disable-vulkan-surface` allows Chrome to use `VK_KHR_xlib_surface` (Vulkan WSI) — frames flow from Chrome's Vulkan compositor directly into the NVIDIA driver's X11 surface handler with no application-level GPU→CPU readback on Chrome's side. x11grab's `XShmGetImage` still involves a CPU copy, but Chrome's compositing is now fully GPU-resident.
Previous pipeline (pre-migration, host Xorg)
Chrome (GLX/ANGLE) → host Xorg (:0, NVIDIA-backed, shared from ECS host) → x11grab → NVENC → H.264
Known Issues
Problem 1 — HiDPI failure (EINVAL on startup) — RESOLVED
- `captureScale = max(renderScale across outputs)` caused x11grab to request `viewport × captureScale` pixels.
- T4 VGX hard cap: 2560×1600 maximum screen size. `viewport × 2` exceeds this for any viewport taller than 800px.
- Fix (deployed): `captureScale` in GPU mode is now capped to `floor(min(2560/viewportW, 1600/viewportH))`. For mbp-14in (1512×982) this gives `captureScale=1`. Outputs at `renderScale > captureScale` are upscaled by FFmpeg `scale_cuda` — no crash, minor quality trade-off vs native 2× capture.
- Root cause of cap: the T4 VGX virtual display subsystem enforces 4,096,000px (2560×1600) as a hard driver limit regardless of xorg.conf `Virtual`, `MetaModes`, or EDID.[1][2][3] Confirmed via `xrandr`: `Screen 0: maximum 2560 x 1600`.
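The deployed cap reduces to a few lines. A minimal sketch of that logic — the function name `capped_capture_scale` is illustrative, not the deployed symbol:

```python
import math

# T4 VGX virtual display hard limit (2560×1600 = 4,096,000 px).
MAX_W, MAX_H = 2560, 1600

def capped_capture_scale(viewport_w: int, viewport_h: int, render_scale: int) -> int:
    """Largest integer scale whose scaled viewport still fits the VGX cap."""
    cap = math.floor(min(MAX_W / viewport_w, MAX_H / viewport_h))
    # Never exceed what the outputs actually asked for, and never drop below 1.
    return max(1, min(render_scale, cap))

# mbp-14in viewport at renderScale 2 is forced down to 1: the 2× request
# would exceed 2560×1600 and trigger the EINVAL crash.
print(capped_capture_scale(1512, 982, 2))  # prints 1
```

Anything the cap strips off (`renderScale > captureScale`) is then recovered, approximately, by the `scale_cuda` upscale in FFmpeg.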
Problem 2 — x11grab throughput ceiling (~39fps)
- x11grab's `XShmGetImage` pulls frames from the X server's shared memory buffer — one CPU readback per frame.
- Measured ceiling: ~39fps under real workload. NVENC sits at 18–19% GPU utilisation — it is starved, not the bottleneck.
- Not fixable by tuning FFmpeg. The constraint is frame delivery rate into x11grab's poll.
- Acceptable for now — recordings look great at typical workload fps. Addressable via NvFBC (see Proposed Next Step below).
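For intuition, the measured ceiling translates directly into a per-frame time budget (illustrative arithmetic only — the ~39fps figure is the measurement above):

```python
# At the measured ~39fps ceiling, each XShmGetImage round trip plus copy
# consumes the entire frame interval:
per_frame_ms_at_39 = 1000 / 39   # ≈ 25.6 ms actually spent per frame
budget_ms_at_60 = 1000 / 60      # ≈ 16.7 ms allowed per frame at 60fps

# The readback path overshoots the 60fps budget by ~9 ms per frame, which is
# why no amount of FFmpeg tuning on the encode side can close the gap.
print(round(per_frame_ms_at_39, 1), round(budget_ms_at_60, 1))  # 25.6 16.7
```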
Confirmed Facts
ECS AMI / Driver Versions
- AL2 ECS GPU AMI (`amzn2-ami-ecs-gpu-hvm-*-x86_64-ebs`): ships NVIDIA driver 550.163.01. This is the ceiling for AL2 — it will not receive newer drivers.
- AL2023 ECS GPU AMI (`al2023-ami-ecs-gpu-hvm-*-kernel-6.1-x86_64-ebs`): ships NVIDIA driver 580.126.16 (AMI release `20260307`, published 2026-03-10).[4]
- AL2 ECS GPU AMI EOL: June 30, 2026.[5]
- NvFBC SDK 9.0.0 requires driver ≥ 570.86.16. AL2023 at 580.126.16 meets this requirement.[6]
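The driver-floor check is a plain tuple comparison over the dotted version string — a sketch (helper name `driver_tuple` is ours):

```python
def driver_tuple(version: str) -> tuple:
    """Parse an NVIDIA driver version string like '580.126.16' for ordering."""
    return tuple(int(part) for part in version.split("."))

NVFBC_MIN = driver_tuple("570.86.16")  # NvFBC SDK 9.0.0 floor

# AL2023 AMI driver clears the floor; the AL2 ceiling does not.
assert driver_tuple("580.126.16") >= NVFBC_MIN
assert driver_tuple("550.163.01") < NVFBC_MIN
```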
ECS Container Runtime on AL2023 — NVIDIA_VISIBLE_DEVICES=void
- On AL2023 ECS, the ECS agent sets `NVIDIA_VISIBLE_DEVICES=void` (CDI-style direct device passthrough).
- With `NVIDIA_VISIBLE_DEVICES=void`, the NVIDIA Container Toolkit disables all capability injection — the `NVIDIA_DRIVER_CAPABILITIES` env var is completely ignored regardless of its value.
- GPU compute devices (`/dev/nvidia0`, `/dev/nvidiactl`) ARE mounted directly by ECS even with `void` — `nvidia-smi` and NVENC both work.
- `/dev/nvidia-modeset` is NOT automatically mounted. Must be explicitly added via ECS task definition `linuxParameters.devices`. Done — deployed in `feature-gpu-compute.tf`.
- `/dev/dri/renderD128` availability under CDI mode: unverified. Not required for current pipeline or NvFBC.
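The `/dev/nvidia-modeset` mount looks roughly like this in the task definition. A sketch expressed as a Python dict — the field names follow the ECS task-definition schema, but the exact Terraform encoding in `feature-gpu-compute.tf` differs:

```python
# Container-definition fragment exposing /dev/nvidia-modeset, which CDI-mode
# passthrough does not mount automatically (unlike /dev/nvidia0 and
# /dev/nvidiactl). Shape mirrors the ECS task-definition JSON schema.
modeset_device = {
    "linuxParameters": {
        "devices": [
            {
                "hostPath": "/dev/nvidia-modeset",
                "containerPath": "/dev/nvidia-modeset",
                "permissions": ["read", "write"],
            }
        ]
    }
}
```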
Headless Xorg with NVIDIA DDX inside Container (Ubuntu 22.04 base image)
- Ubuntu's `xserver-xorg-video-nvidia-580-server` package ships `nvidia_drv.so` version 580.126.09. The AL2023 ECS GPU AMI host kernel module is 580.126.16. The NVIDIA DDX performs a strict kernel module version check on load — any minor version mismatch causes `(EE) NVIDIA: Failed to initialize the NVIDIA kernel module` → `no screens found`. Do not use the Ubuntu package.
- Correct approach: extract `nvidia_drv.so` directly from NVIDIA's official Tesla runfile at the exact matching version (`NVIDIA-Linux-x86_64-580.126.16.run`), using `--extract-only`. Only `nvidia_drv.so` is needed — all other driver components are injected by the host runtime. Done — deployed in `docker/gpu/Dockerfile`.
- `nvidia_drv.so` must be placed at `/usr/lib/x86_64-linux-gnu/nvidia/xorg/`. This path is not in Xorg's default module search path — must be declared via `ModulePath` in the xorg.conf `Files` section.
- `nvidia-xconfig` is NOT available in the container. BusID detection uses `nvidia-smi --query-gpu=pci.bus_id` at runtime. DBDF format `DDDDDDDD:BB:DD.F` → Xorg `PCI:bus:device:function` (decimal). Confirmed BusID on g4dn: `PCI:0:30:0` (0x1e = 30).
- When Xorg is started with `-config /path/to/xorg.conf`, xorg.conf.d snippet directories are still read. The `-config` flag only overrides the primary config file, not the snippets.
- Xorg creates the Unix domain socket (`/tmp/.X11-unix/X99`) before completing screen initialization. Socket presence is not a reliable ready signal — Xorg can exit with `no screens found` after the socket is created. Liveness is confirmed via PID check after a 3s settle period.
- T4 VGX display cap: 2560×1600 hard limit. `Virtual`, `MetaModes`, and EDID overrides do not affect it — the cap is enforced by the VGX virtual display driver before any of those layers.[1][2][3] Confirmed via `xrandr`: `Screen 0: minimum 8 x 8, current 2560 x 1600, maximum 2560 x 1600`. `DVI-D-0 connected primary 2560x1600+0+0` — the NVIDIA DDX creates one virtual connected output even with `UseDisplayDevice None`.
- xorg.conf is generated at runtime (BusID varies per EC2 instance). `MetaModes` and `Virtual` set to `2560x1600` to match the hard cap.
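The runtime BusID conversion and config generation can be sketched as follows. Function names and the trimmed-down template are illustrative (the deployed generator emits more sections); the example DBDF is the one that maps to the confirmed `PCI:0:30:0` on g4dn:

```python
def xorg_bus_id(dbdf: str) -> str:
    """Convert nvidia-smi DBDF ('00000000:00:1E.0') to Xorg 'PCI:bus:device:function' (decimal)."""
    _domain, bus, dev_fn = dbdf.split(":")
    dev, fn = dev_fn.split(".")
    return f"PCI:{int(bus, 16)}:{int(dev, 16)}:{int(fn, 16)}"

def render_xorg_conf(bus_id: str) -> str:
    """Minimal xorg.conf: ModulePath for the extracted DDX, capped virtual screen."""
    return f"""\
Section "Files"
    ModulePath "/usr/lib/x86_64-linux-gnu/nvidia/xorg"
    ModulePath "/usr/lib/xorg/modules"
EndSection

Section "Device"
    Identifier "nvidia"
    Driver "nvidia"
    BusID "{bus_id}"
    Option "UseDisplayDevice" "None"
EndSection

Section "Screen"
    Identifier "screen"
    Device "nvidia"
    SubSection "Display"
        Virtual 2560 1600
    EndSubSection
EndSection
"""

conf = render_xorg_conf(xorg_bus_id("00000000:00:1E.0"))
print(xorg_bus_id("00000000:00:1E.0"))  # prints PCI:0:30:0
```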
NvFBC — Implementation Requirements (Researched, Not Yet Implemented)
- `libnvidia-fbc.so.1` is part of the NVIDIA driver, not the CUDA toolkit. With `NVIDIA_VISIBLE_DEVICES=void`, it is NOT injected. It must be extracted from the Tesla runfile during the Docker build — same approach as `nvidia_drv.so`. File: `libnvidia-fbc.so.580.126.16` inside the extracted runfile tree.
- `NvFBC.h` is the only compile-time header needed. MIT-licensed, vendored by Sunshine at `third-party/nvfbc/NvFBC.h`.[6][7]
- `libnvidia-fbc.so.1` is dlopen()'d at runtime — not a link-time dependency. `NvFBCCreateInstance` is the sole exported symbol; it populates all function pointers via `NVFBC_API_FUNCTION_LIST`.[8]
- T4 does not require the consumer GPU whitelist patch (keylase). `bIsCapturePossible` returns true natively on Tesla/Quadro/GRID hardware. The patch is GeForce-only.[9]
- NvFBC does NOT bypass the T4 VGX 2560×1600 cap. NvFBC is bound to a display head — capture resolution equals the X screen resolution, which the VGX driver caps before Xorg or NvFBC ever sees it.[8][10] HiDPI is not improved by switching to NvFBC.
- Zero-copy CUDA encode path: `nvFBCToCudaGrabFrame` → `CUdeviceptr` → `NvEncRegisterResource(NV_ENC_INPUT_RESOURCE_TYPE_CUDADEVICEPTR)` → `NvEncEncodePicture`. The frame never touches the CPU. Use `NVFBC_BUFFER_FORMAT_NV12` to skip color conversion before NVENC.[8]
- Device nodes required: `/dev/nvidia0`, `/dev/nvidiactl`, `/dev/nvidia-modeset` only. `/dev/dri/card0` and `/dev/dri/renderD128` are NOT needed. All three are already present in the container.
- The `bInModeset` flag in `NVFBC_GET_STATUS_PARAMS` detects the cold-boot race — poll with a 1s retry up to ~30s before attempting `nvFBCCreateCaptureSession`. Modeset recovery during capture: check for `NVFBC_ERR_MUST_RECREATE` on every grab; destroy and recreate the session without destroying the handle.[8]
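A minimal sketch of the bake-into-image step, assuming the runfile has already been unpacked with `--extract-only` and that the versioned libraries sit at the root of the extracted tree. `stage_lib` and the destination path are illustrative; the `.so.1` symlink is the name the runtime dlopen() call resolves.

```shell
#!/usr/bin/env bash
# Sketch: copy a versioned driver library out of an extracted Tesla runfile
# tree and create the unversioned .so.1 name that dlopen() looks up.
set -euo pipefail

stage_lib() {
  local tree=$1 name=$2 ver=$3 dest=$4
  install -D "$tree/$name.$ver" "$dest/$name.$ver"   # -D creates $dest
  ln -sfn "$name.$ver" "$dest/$name.1"               # dlopen target
}

# In the Docker build, after:
#   sh NVIDIA-Linux-x86_64-580.126.16.run --extract-only --target /tmp/nv
# stage_lib /tmp/nv libnvidia-fbc.so    580.126.16 /usr/lib/x86_64-linux-gnu
# stage_lib /tmp/nv libnvidia-encode.so 580.126.16 /usr/lib/x86_64-linux-gnu
```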
gpu-screen-recorder — Integration Requirements (Researched, Not Yet Implemented)
- Repo: https://git.dec05eba.com/gpu-screen-recorder, v5.12.5. GPL-3.0.[7]
- Auto-selects NvFBC on NVIDIA + X11 — no explicit flag needed.
- XRandR gate: calls `XRRGetScreenResources()` before NvFBC init; exits with code 51 if there are zero outputs. Confirmed not a problem — `DVI-D-0 connected` is present with `UseDisplayDevice None`. Capture target: `-w DVI-D-0`.
- Pipe mode: `-o /dev/stdout -c mkv` — FFmpeg reads the MKV stream and muxes/remuxes as needed. Do not use `-c mp4` for piped output (it requires a seekable file for the moov atom).
- Codec: `-k h264` or `-k hevc`. Do not use `-k av1` — the T4 (Turing) has no AV1 NVENC hardware.
- Quality: `-bm qp -q very_high` (QP=25, closest to CRF). No explicit preset flag; use `-tune quality`.
- No window manager or compositor required. Creates a 16×16 hidden X11 window for the GL context only.
- Build from source required on Ubuntu 22.04 — no apt package in the official repos. Meson build. Many deps (`libavcodec-dev`, `libx11-dev`, `libxrandr-dev`, `libglvnd-dev`, `libpipewire-0.3-dev`, etc.).
- `libnvidia-fbc.so.1` and `libnvidia-encode.so.1` must be present in the container at runtime — loaded via dlopen, not linked. Must be baked into the image from the Tesla runfile (same as `nvidia_drv.so`).
Proposed Next Step (Optional)
Option E — NvFBC via gpu-screen-recorder
Status: Researched, ready to implement. Not yet done — recordings are working well with the current x11grab pipeline.
Solves: the x11grab throughput ceiling (~39fps → 60fps+).
Does NOT solve: HiDPI — the VGX cap (2560×1600) applies equally to NvFBC.
When to pursue: if 60fps becomes a hard requirement, or the throughput ceiling manifests as a quality issue.
Proposed pipeline:
Chrome (Vulkan/ANGLE, VK_KHR_xlib_surface)
→ headless Xorg (:99, NVIDIA DDX)
→ gpu-screen-recorder (NvFBC capture + NVENC encode, piped)
→ FFmpeg (mux only, -c copy)
→ output MP4
Command sketch:
gpu-screen-recorder -w DVI-D-0 -f 60 -k h264 -bm qp -q very_high -c mkv -o /dev/stdout \
  | ffmpeg -i pipe:0 -c copy output.mp4

Implementation work required:
- `docker/gpu/Dockerfile` — build gpu-screen-recorder from source; extract `libnvidia-fbc.so.1` and `libnvidia-encode.so.1` from the Tesla runfile alongside `nvidia_drv.so`
- `docker/ffmpeg-nvenc/Dockerfile` — may need an FFmpeg rebuild with `libavcodec` for MKV demux if not already present
- `src/WebRecorder.ts` — spawn gpu-screen-recorder as a child process instead of x11grab in the FFmpeg command; pipe its stdout into the FFmpeg mux process
- Validate cold-boot modeset timing — the `bInModeset` window in containerised Xorg may delay the first frame; gpu-screen-recorder handles this internally via retry, but the timing needs observation
Other Options Evaluated
Option A — captureScale cap (minimal fix) — IMPLEMENTED
Capture at `captureScale = floor(min(2560/W, 1600/H))` for GPU mode. For mbp-14in (1512×982): `captureScale = 1`. FFmpeg `scale_cuda` upscales to the requested `renderScale` dimensions. No crash; minor quality trade-off vs native HiDPI capture. Deployed in `WebRecorder.ts`.
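The clamp above can be sketched as a small helper (`capture_scale` is a hypothetical name; the floor-to-at-least-1 guard is an assumption for sources already wider than the cap). Bash integer division already floors, and because floor is monotone, the min of the floored ratios equals the floored min.

```shell
#!/usr/bin/env bash
# Sketch: largest integer captureScale whose W×H output still fits inside
# the 2560×1600 VGX cap; scale_cuda then upscales to renderScale afterwards.
capture_scale() {
  local w=$1 h=$2
  local sx=$(( 2560 / w )) sy=$(( 1600 / h ))   # integer division floors
  local s=$(( sx < sy ? sx : sy ))
  echo $(( s < 1 ? 1 : s ))                     # assumption: never drop below 1
}

capture_scale 1512 982   # mbp-14in → 1
```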
Option B — EGL + Xvfb (designed, not implemented)
Chrome renders via EGL (`/dev/dri/renderD128`), Xvfb at 3840×2160. Fixes HiDPI (no VGX display cap on Xvfb). Worsens the throughput ceiling — the EGL→Xvfb blit is a CPU copy. Not pursued.
Option C — FFmpeg -f nvfbc — Dead end
Does not exist. There is no `nvfbc.c` in `libavdevice/`. Confirmed against the FFmpeg source tree.
Option D — kmsgrab (-f kmsgrab) — Dead end
Three independent blockers: `nvidia-drm.modeset=1` is almost certainly `N` on AWS ECS GPU AMIs; NVIDIA proprietary DRM PRIME export is broken (`EGL_BAD_ATTRIBUTE 0x3004`);[11] and `CAP_SYS_ADMIN` is required. Do not pursue.
Summary
| | Option A | Option B | Option C | Option D | Option E |
|---|---|---|---|---|---|
| Description | captureScale cap | EGL + Xvfb | FFmpeg -f nvfbc | kmsgrab | gpu-screen-recorder (NvFBC) |
| Fixes HiDPI | ✅ (upscaled) | ✅ (native) | — | — | ❌ (VGX cap) |
| Fixes 39fps ceiling | ❌ | ❌ (worse) | — | ❌ broken | ✅ |
| Stays in FFmpeg | ✅ | ✅ | — | ✅ | ❌ (mux only) |
| Status | ✅ Implemented | Designed | Dead end | Dead end | Ready to implement |
Open Questions
- Driver version on ECS AMI — RESOLVED. AL2023 ships 580.126.16, which meets the NvFBC minimum (≥ 570.86.16).
- gpu-screen-recorder vs custom Capture SDK app — RESOLVED: gpu-screen-recorder. `DVI-D-0 connected` confirmed via xrandr; the XRandR gate is not a problem. No custom C needed.
- Cold-boot / container-start NvFBC modeset timing — OPEN. The `bInModeset` window in containerised Xorg startup has not yet been measured. gpu-screen-recorder retries internally, but cold-start latency should be observed.
- T4 VGX cap with headless Xorg — RESOLVED (hard limit). 2560×1600 is enforced by the VGX driver. Confirmed via xrandr (`maximum 2560 x 1600`). No override is possible via xorg.conf. NvFBC is equally bound.
- 60fps requirement — DEFERRED. Recordings look excellent at the current fps. Option E is available when/if 60fps becomes a priority.
- Xorg startup failures — RESOLVED. (a) The `no screens found` root cause: `nvidia_drv.so` at the exact version 580.126.16 from the Tesla runfile. (b) `/dev/nvidia-modeset` mounted via `linuxParameters.devices`. Both deployed.
- `/dev/dri/renderD128` under AL2023 CDI mode — not required for the current pipeline or NvFBC. Relevant only if Option B (EGL) is pursued.
- XRandR outputs with `UseDisplayDevice None` — RESOLVED. `DVI-D-0 connected primary 2560x1600+0+0`. The NVIDIA DDX creates a virtual connected output regardless.
References
Footnotes
1. NVIDIA Developer Forums — "Display resolution limited to 2560x1600" (moderator generix confirms the Tesla VGX virtual head limit is driver-enforced, not EDID) — https://forums.developer.nvidia.com/t/display-resolution-limited-to-2560x1600/256723
2. NVIDIA Developer Forums — "Large headless Xorg configuration" (an M60 headless Tesla requires a GRID/Quadro vDWS license to exceed 2560×1600; `--virtual` does not override) — https://forums.developer.nvidia.com/t/large-headless-xorg-configuration/55002
3. NVIDIA Developer Forums — "Newer drivers limit resolution to 2560x1600" (post-435.21 drivers enforce a hard pixel cap of 4,096,000 px = 2560×1600 exactly) — https://forums.developer.nvidia.com/t/newer-drivers-limit-resolution-to-2560x1600/157657
4. aws/amazon-ecs-ami — release 20260307 changelog: "Update nvidia driver version al2023 to 580.126.16" — https://github.com/aws/amazon-ecs-ami/releases/tag/20260307
5. AWS ECS — AL2 to AL2023 ECS AMI transition guide — https://docs.aws.amazon.com/AmazonECS/latest/developerguide/al2-to-al2023-ami-transition.html
6. NVIDIA Capture SDK (v9.0.0) — developer download page and license — https://developer.nvidia.com/capture-sdk
7. LizardByte/Sunshine — vendored `NvFBC.h` (MIT license, API version 1.7) — https://github.com/LizardByte/Sunshine/blob/master/third-party/nvfbc/NvFBC.h
8. NVIDIA Capture SDK Programming Guide v7.1 — API architecture, `NVFBC_CREATE_PARAMS` output fields, the `NvFBCCreateInstance` dlopen pattern, the CUDA zero-copy path, `bInModeset` modeset recovery — https://developer.download.nvidia.com/designworks/capture-sdk/docs/7.1/NVIDIA_Capture_SDK_Programming_Guide.pdf
9. NVIDIA Developer Forums — "Does NvFBC support T4 GPUs" (NVIDIA staff confirmation; Tesla/Quadro/GRID do not require the consumer GPU whitelist patch) — https://forums.developer.nvidia.com/t/does-nvfbc-support-t4-gpus/79599
10. gpu-screen-recorder source — `src/capture/nvfbc.c`: capture dimensions derived from `XWidthOfScreen(DefaultScreenOfDisplay(...))` / `XHeightOfScreen(...)` — https://git.dec05eba.com/gpu-screen-recorder/tree/src/capture/nvfbc.c
11. LizardByte/Sunshine GitHub issues — DRM PRIME export broken with the NVIDIA proprietary driver: `EGL_BAD_ATTRIBUTE (0x3004)`, GBM incompatibility confirmed across issues #188, #2250, #4106 — https://github.com/LizardByte/Sunshine/issues/188