Add overclocking option -oc to work around an NVIDIA driver bug (the GPU is forced into the P2 state when using CUDA)

2023-03-17 17:17:14 +01:00
parent 8cbdb596eb
commit 93cb6593a6
20 changed files with 441 additions and 63 deletions
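
The new `src/overclock.c` in this commit reads the driver's performance-mode string, splits it on `;` into levels and on `,` into `key=value` attributes, and trims whitespace around each token. The following is a minimal standalone sketch of that parsing idea, not the code from the commit: the helper name `max_mem_transfer_rate` and the sample string in the note after the block are hypothetical, and only the `memTransferRatemax` key is taken from the diff.

```c
#include <stdio.h>
#include <string.h>

/* Trim leading/trailing spaces, tabs and newlines from the view [*str, *str + *size). */
static void strip(const char **str, size_t *size) {
    const char *start = *str;
    const char *end = start + *size;
    while(start < end && (*start == ' ' || *start == '\t' || *start == '\n'))
        ++start;
    while(end > start && (end[-1] == ' ' || end[-1] == '\t' || end[-1] == '\n'))
        --end;
    *str = start;
    *size = (size_t)(end - start);
}

/* Hypothetical helper: returns the largest memTransferRatemax value found
   in any performance level, or 0 if the key never appears. */
static int max_mem_transfer_rate(const char *modes) {
    int best = 0;
    const char *end = modes + strlen(modes);
    const char *it = modes;
    while(it < end) {
        /* One "key=value" token ends at the next ',' or ';' (or at the end of the string). */
        size_t len = strcspn(it, ",;");
        const char *token = it;
        size_t token_len = len;
        strip(&token, &token_len);

        char buf[64];
        if(token_len < sizeof(buf)) {
            memcpy(buf, token, token_len);
            buf[token_len] = '\0';
            int value = 0;
            if(sscanf(buf, "memTransferRatemax=%d", &value) == 1 && value > best)
                best = value;
        }
        it += len + 1; /* skip the delimiter */
    }
    return best;
}
```

For a made-up input such as `"perf=0, memTransferRate=810, memTransferRatemax=810 ; perf=1, memTransferRate=7600, memTransferRatemax=7600"`, `max_mem_transfer_rate` returns 7600. Note that `strip` here shrinks the size by the number of leading characters it skips, so the later `memcpy` never reads past the token.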
--- a/88-gsr-coolbits.conf
+++ b/88-gsr-coolbits.conf
@@ -0,0 +1,5 @@
+Section "Device"
+    Identifier     "Device0"
+    Driver         "nvidia"
+    Option         "Coolbits" "12"
+EndSection
--- a/README.md
+++ b/README.md
@@ -6,8 +6,8 @@ This screen recorder can be used for recording your desktop offline, for live st
 where only the last few seconds are saved.

 ## Note
-This software works only on x11.\
-If you are using a variable refresh rate monitor, then choose to record "screen-direct". This will allow variable refresh rate to work when recording fullscreen applications. Note that some applications such as mpv will not work in fullscreen mode. A fix is being developed for this.\
+This software works only on x11 and with an nvidia gpu.\
+If you are using a variable refresh rate monitor then choose to record "screen-direct". This will allow variable refresh rate to work when recording fullscreen applications. Note that some applications such as mpv will not work in fullscreen mode. A fix is being developed for this.\
 For screen capture to work with PRIME (laptops with a nvidia gpu), you must set the primary GPU to use your dedicated nvidia graphics card. You can do this by selecting "NVIDIA (Performance Mode) in nvidia settings:\
 ![](https://dec05eba.com/images/nvidia-settings-prime.png)\
 and then rebooting your laptop.
@@ -16,12 +16,18 @@ screen-direct capture has been temporary disabled as it causes issues with stutt

 # Performance
 When recording Legend of Zelda Breath of the Wild at 4k, fps drops from 30 to 7 when using OBS Studio + nvenc, however when using this screen recorder the fps remains at 30.\
-When recording GTA V at 4k on highest settings, fps drops from 60 to 23 when using obs-nvfbc + nvenc, however when using this screen recorder the fps only drops to 55. The quality is also much better when using gpu-screen-recorder.\
+When recording GTA V at 4k on highest settings, fps drops from 60 to 23 when using obs-nvfbc + nvenc, however when using this screen recorder the fps only drops to 58. The quality is also much better when using gpu-screen-recorder.\
 It is recommended to save the video to a SSD because of the large file size, which a slow HDD might not be fast enough to handle.
+## Note about optimal performance on NVIDIA
+The NVIDIA driver has a "feature" (read: bug) where it downclocks the memory transfer rate when a program uses CUDA, as GPU Screen Recorder does. To work around this bug, GPU Screen Recorder can overclock your GPU memory transfer rate back to its normal optimal level.\
+To enable overclocking for optimal performance, use the `-oc yes` option when running GPU Screen Recorder. You also need to have the "Coolbits" NVIDIA X setting set to "12". This setting is automatically installed if you run the install script (`install.sh` or `install_ubuntu.sh`); however, if you use flatpak then you need to manually run `install_coolbits.sh` and then reboot your computer.\
+Note that this only works when the Xorg server is running as root, and using this option will only give you a performance boost if the game you are recording is bottlenecked by your GPU.\
+Note: use at your own risk!

 # Installation
 If you are running an Arch Linux based distro, then you can find gpu screen recorder on aur under the name gpu-screen-recorder-git (`yay -S gpu-screen-recorder-git`).\
 If you are running an Ubuntu based distro then run `install_ubuntu.sh` as root: `sudo ./install_ubuntu.sh`. You also need to install the `libnvidia-compute` version that fits your nvidia driver to install libcuda.so to run gpu-screen-recorder and `libnvidia-fbc.so.1` when using nvfbc. But it's recommended that you use the flatpak version of gpu-screen-recorder if you use an older version of ubuntu as the ffmpeg version will be old and wont support the best quality options.\
+Note that if you use the flatpak version then you won't be able to use overclocking unless you set the "Coolbits" NVIDIA X setting (which you can do by running `install_coolbits.sh` and then rebooting).\
 If you are running another distro then you can run `install.sh` as root: `sudo ./install.sh`, but you need to manually install the dependencies, as described below.\
 You can also install gpu screen recorder ([the gtk gui version](https://git.dec05eba.com/gpu-screen-recorder-gtk/)) from [flathub](https://flathub.org/apps/details/com.dec05eba.gpu_screen_recorder).

@@ -29,7 +35,7 @@ You can also install gpu screen recorder ([the gtk gui version](https://git.dec0
 ## AMD/Intel
 `libglvnd (which provides libgl and libegl), mesa, ffmpeg (libavcodec, libavformat, libavutil, libswresample, libavfilter), libx11, libxcomposite, libpulse, libva.so`.
 ## NVIDIA
-`libglvnd (which provides libgl and libegl), ffmpeg (libavcodec, libavformat, libavutil, libswresample, libavfilter), libx11, libxcomposite, libpulse, libcuda.so`. Additionally, you need to have `libnvidia-fbc.so.1` installed when using nvfbc.
+`libglvnd (which provides libgl and libegl), ffmpeg (libavcodec, libavformat, libavutil, libswresample, libavfilter), libx11, libxcomposite, libpulse, libcuda.so`. Additionally, you need to have `libnvidia-fbc.so.1` installed when using nvfbc and `libXNVCtrl.so.0` when using the `-oc` option.

 # How to use
 Run `scripts/interactive.sh` or run gpu-screen-recorder directly, for example: `gpu-screen-recorder -w $(xdotool selectwindow) -c mp4 -f 60 -a "$(pactl get-default-sink).monitor" -o test_video.mp4` then stop the screen recorder with Ctrl+C, which will also save the recording. You can change -w to -w screen if you want to record all monitors or if you want to record a specific monitor then you can use -w monitor-name, for example -w HDMI-0 (use xrandr command to find the name of your monitor. The name can also be found in your desktop environments display settings).\
--- a/6
+++ b/6
@@ -14,4 +14,8 @@ Support fullscreen capture on amd/intel using external kms process.
 Support amf and qsv.
 Disable flipping on nvidia? this might fix some stuttering issues on some setups. See NvCtrlGetAttribute/NvCtrlSetAttributeAndGetStatus NV_CTRL_SYNC_TO_VBLANK https://github.com/NVIDIA/nvidia-settings/blob/d5f022976368cbceb2f20b838ddb0bf992f0cfb9/src/gtk%2B-2.x/ctkopengl.c.
 Replays seem to have some issues with audio/video. Why?
-Cleanup unused gl/egl functions, macro, etc.
+Cleanup unused gl/egl functions, macro, etc.
+Add option to disable overlapping of replays (the old behavior kinda. Remove the whole replay buffer data after saving when doing this).
+Set audio track name to audio device name (if not merge of multiple audio devices).
+Add support for webcam, but only really for amd/intel because amd/intel can get drm fd access to webcam, nvidia cant. This allows us to create an opengl texture directly from the webcam fd for optimal performance.
+Reverse engineer nvapi so we can disable "force p2 state" on linux too (nvapi profile api with the settings id 0x50166c5e).
--- a/build.sh
+++ b/build.sh
@@ -11,10 +11,12 @@ gcc -c src/capture/xcomposite_cuda.c $opts $includes
 gcc -c src/capture/xcomposite_drm.c $opts $includes
 gcc -c src/egl.c $opts $includes
 gcc -c src/cuda.c $opts $includes
+gcc -c src/xnvctrl.c $opts $includes
+gcc -c src/overclock.c $opts $includes
 gcc -c src/vaapi.c $opts $includes
 gcc -c src/window_texture.c $opts $includes
 gcc -c src/time.c $opts $includes
 g++ -c src/sound.cpp $opts $includes
 g++ -c src/main.cpp $opts $includes
-g++ -o gpu-screen-recorder -O2 capture.o nvfbc.o egl.o cuda.o vaapi.o window_texture.o time.o xcomposite_cuda.o xcomposite_drm.o sound.o main.o -s $libs
+g++ -o gpu-screen-recorder -O2 capture.o nvfbc.o egl.o cuda.o xnvctrl.o overclock.o vaapi.o window_texture.o time.o xcomposite_cuda.o xcomposite_drm.o sound.o main.o -s $libs
 echo "Successfully built gpu-screen-recorder"
--- a/include/capture/nvfbc.h
+++ b/include/capture/nvfbc.h
@@ -12,7 +12,8 @@ typedef struct {
    int fps;
    vec2i pos;
    vec2i size;
-    bool direct_capture; /* temporary disabled */
+    bool direct_capture;
+    bool overclock;
 } gsr_capture_nvfbc_params;

 gsr_capture* gsr_capture_nvfbc_create(const gsr_capture_nvfbc_params *params);
--- a/include/capture/xcomposite_cuda.h
+++ b/include/capture/xcomposite_cuda.h
@@ -11,6 +11,7 @@ typedef struct {
    Window window;
    bool follow_focused; /* If this is set then |window| is ignored */
    vec2i region_size; /* This is currently only used with |follow_focused| */
+    bool overclock;
 } gsr_capture_xcomposite_cuda_params;

 gsr_capture* gsr_capture_xcomposite_cuda_create(const gsr_capture_xcomposite_cuda_params *params);
--- a/include/cuda.h
+++ b/include/cuda.h
@@ -1,12 +1,15 @@
 #ifndef GSR_CUDA_H
 #define GSR_CUDA_H

+#include "overclock.h"
 #include <stddef.h>
 #include <stdbool.h>

 // To prevent hwcontext_cuda.h from including cuda.h
 #define CUDA_VERSION 11070

+#define CU_CTX_SCHED_AUTO 0
+
 #if defined(_WIN64) || defined(__LP64__)
 typedef unsigned long long CUdeviceptr_v2;
 #else
@@ -68,11 +71,12 @@ typedef struct CUDA_MEMCPY2D_st {
 } CUDA_MEMCPY2D_v2;
 typedef CUDA_MEMCPY2D_v2 CUDA_MEMCPY2D;

-#define CU_CTX_SCHED_AUTO 0
-
 typedef struct CUgraphicsResource_st *CUgraphicsResource;

 typedef struct {
+    gsr_overclock overclock;
+    bool do_overclock;
+
    void *library;
    CUcontext cu_ctx;

@@ -95,7 +99,7 @@ typedef struct {
    CUresult (*cuGraphicsSubResourceGetMappedArray)(CUarray *pArray, CUgraphicsResource resource, unsigned int arrayIndex, unsigned int mipLevel);
 } gsr_cuda;

-bool gsr_cuda_load(gsr_cuda *self);
+bool gsr_cuda_load(gsr_cuda *self, Display *display, bool overclock);
 void gsr_cuda_unload(gsr_cuda *self);

 #endif /* GSR_CUDA_H */
--- a/include/overclock.h
+++ b/include/overclock.h
@@ -0,0 +1,17 @@
+#ifndef GSR_OVERCLOCK_H
+#define GSR_OVERCLOCK_H
+
+#include "xnvctrl.h"
+
+typedef struct {
+    gsr_xnvctrl xnvctrl;
+    int num_performance_levels;
+} gsr_overclock;
+
+bool gsr_overclock_load(gsr_overclock *self, Display *display);
+void gsr_overclock_unload(gsr_overclock *self);
+
+bool gsr_overclock_start(gsr_overclock *self);
+void gsr_overclock_stop(gsr_overclock *self);
+
+#endif /* GSR_OVERCLOCK_H */
--- a/include/xnvctrl.h
+++ b/include/xnvctrl.h
@@ -0,0 +1,43 @@
+#ifndef GSR_XNVCTRL_H
+#define GSR_XNVCTRL_H
+
+#include <stdbool.h>
+#include <stdint.h>
+
+#define NV_CTRL_GPU_MEM_TRANSFER_RATE_OFFSET                            410
+#define NV_CTRL_GPU_MEM_TRANSFER_RATE_OFFSET_ALL_PERFORMANCE_LEVELS     425
+
+#define NV_CTRL_TARGET_TYPE_GPU                                         1
+
+#define NV_CTRL_STRING_PERFORMANCE_MODES                                29
+
+typedef struct _XDisplay Display;
+
+typedef struct {
+    int type;
+    union {
+        struct {
+            int64_t min;
+            int64_t max;
+        } range;
+        struct {
+            unsigned int ints;
+        } bits;
+    } u;
+    unsigned int permissions;
+} NVCTRLAttributeValidValuesRec;
+
+typedef struct {
+    Display *display;
+    void *library;
+    
+    int (*XNVCTRLQueryExtension)(Display *dpy, int *event_basep, int *error_basep);
+    int (*XNVCTRLSetTargetAttributeAndGetStatus)(Display *dpy, int target_type, int target_id, unsigned int display_mask, unsigned int attribute, int value);
+    int (*XNVCTRLQueryValidTargetAttributeValues)(Display *dpy, int target_type, int target_id, unsigned int display_mask, unsigned int attribute, NVCTRLAttributeValidValuesRec *values);
+    int (*XNVCTRLQueryTargetStringAttribute)(Display *dpy, int target_type, int target_id, unsigned int display_mask, unsigned int attribute, char **ptr);
+} gsr_xnvctrl;
+
+bool gsr_xnvctrl_load(gsr_xnvctrl *self, Display *display);
+void gsr_xnvctrl_unload(gsr_xnvctrl *self);
+
+#endif /* GSR_XNVCTRL_H */
--- a/install.sh
+++ b/install.sh
@@ -8,4 +8,6 @@ cd "$script_dir"
 ./build.sh
 install -Dm755 "gpu-screen-recorder" "/usr/local/bin/gpu-screen-recorder"
 install -Dm755 "gpu-screen-recorder" "/usr/bin/gpu-screen-recorder"
+
+./install_coolbits.sh
 echo "Successfully installed gpu-screen-recorder"
--- a/install_coolbits.sh
+++ b/install_coolbits.sh
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+script_dir=$(dirname "$0")
+cd "$script_dir"
+
+[ $(id -u) -ne 0 ] && echo "You need root privileges to run the install script" && exit 1
+
+for xorg_conf_d in "/etc/X11/xorg.conf.d" "/usr/share/X11/xorg.conf.d" "/usr/lib/X11/xorg.conf.d"; do
+    [ -d "$xorg_conf_d" ] && install -Dm644 "88-gsr-coolbits.conf" "$xorg_conf_d/88-gsr-coolbits.conf"
+done
--- a/src/capture/nvfbc.c
+++ b/src/capture/nvfbc.c
@@ -183,7 +183,7 @@ static bool ffmpeg_create_cuda_contexts(gsr_capture_nvfbc *cap_nvfbc, AVCodecCon

 static int gsr_capture_nvfbc_start(gsr_capture *cap, AVCodecContext *video_codec_context) {
    gsr_capture_nvfbc *cap_nvfbc = cap->priv;
-    if(!gsr_cuda_load(&cap_nvfbc->cuda))
+    if(!gsr_cuda_load(&cap_nvfbc->cuda, cap_nvfbc->params.dpy, cap_nvfbc->params.overclock))
        return -1;

    if(!gsr_capture_nvfbc_load_library(cap)) {
--- a/src/capture/xcomposite_cuda.c
+++ b/src/capture/xcomposite_cuda.c
@@ -222,7 +222,7 @@ static int gsr_capture_xcomposite_cuda_start(gsr_capture *cap, AVCodecContext *v
        return -1;
    }

-    if(!gsr_cuda_load(&cap_xcomp->cuda)) {
+    if(!gsr_cuda_load(&cap_xcomp->cuda, cap_xcomp->dpy, cap_xcomp->params.overclock)) {
        gsr_capture_xcomposite_cuda_stop(cap, video_codec_context);
        return -1;
    }
@@ -269,7 +269,8 @@ static void gsr_capture_xcomposite_cuda_stop(gsr_capture *cap, AVCodecContext *v

    gsr_egl_unload(&cap_xcomp->egl);
    if(cap_xcomp->dpy) {
-        XCloseDisplay(cap_xcomp->dpy);
+        // TODO: This causes a crash, why? maybe some other library dlclose xlib and that also happened to unload this???
+        //XCloseDisplay(cap_xcomp->dpy);
        cap_xcomp->dpy = NULL;
    }
 }
@@ -424,6 +425,7 @@ static int gsr_capture_xcomposite_cuda_capture(gsr_capture *cap, AVFrame *frame)
    vec2i source_size = cap_xcomp->texture_size;

    if(cap_xcomp->window_texture.texture_id != 0) {
+        while(cap_xcomp->egl.glGetError()) {}
        /* TODO: Remove this copy, which is only possible by using nvenc directly and encoding window_pixmap.target_texture_id */
        cap_xcomp->egl.glCopyImageSubData(
            window_texture_get_opengl_texture_id(&cap_xcomp->window_texture), GL_TEXTURE_2D, 0, source_pos.x, source_pos.y, 0,
--- a/src/capture/xcomposite_drm.c
+++ b/src/capture/xcomposite_drm.c
@@ -706,6 +706,7 @@ static void gsr_capture_xcomposite_drm_tick(gsr_capture *cap, AVCodecContext *vi
        #define FOURCC_NV12 842094158

        if(prime.fourcc == FOURCC_NV12) { // This happens on AMD
+            while(cap_xcomp->egl.glGetError()) {}
            while(cap_xcomp->egl.eglGetError() != EGL_SUCCESS){}

            EGLImage images[2];
@@ -902,7 +903,8 @@ static void gsr_capture_xcomposite_drm_destroy(gsr_capture *cap, AVCodecContext
        cap->priv = NULL;
    }
    if(cap_xcomp->dpy) {
-        XCloseDisplay(cap_xcomp->dpy);
+        // TODO: This causes a crash, why? maybe some other library dlclose xlib and that also happened to unload this???
+        //XCloseDisplay(cap_xcomp->dpy);
        cap_xcomp->dpy = NULL;
    }
    free(cap);
--- a/src/cuda.c
+++ b/src/cuda.c
@@ -2,8 +2,9 @@
 #include "../include/library_loader.h"
 #include <string.h>

-bool gsr_cuda_load(gsr_cuda *self) {
+bool gsr_cuda_load(gsr_cuda *self, Display *display, bool do_overclock) {
    memset(self, 0, sizeof(gsr_cuda));
+    self->do_overclock = do_overclock;

    dlerror(); /* clear */
    void *lib = dlopen("libcuda.so.1", RTLD_LAZY);
@@ -76,6 +77,13 @@ bool gsr_cuda_load(gsr_cuda *self) {
        goto fail;
    }

+    if(self->do_overclock) {
+        if(gsr_overclock_load(&self->overclock, display))
+            gsr_overclock_start(&self->overclock);
+        else
+            fprintf(stderr, "gsr warning: gsr_cuda_load: failed to load xnvctrl, failed to overclock memory transfer rate\n");
+    }
+
    self->library = lib;
    return true;

@@ -91,8 +99,13 @@ void gsr_cuda_unload(gsr_cuda *self) {
            self->cuCtxDestroy_v2(self->cu_ctx);
            self->cu_ctx = 0;
        }
-
        dlclose(self->library);
-        memset(self, 0, sizeof(gsr_cuda));
    }
+
+    if(self->do_overclock && self->overclock.xnvctrl.library) {
+        gsr_overclock_stop(&self->overclock);
+        gsr_overclock_unload(&self->overclock);
+    }
+
+    memset(self, 0, sizeof(gsr_cuda));
 }
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -630,7 +630,7 @@ static void open_video(AVCodecContext *codec_context, VideoQuality video_quality
 }

 static void usage() {
-    fprintf(stderr, "usage: gpu-screen-recorder -w <window_id|monitor|focused> [-c <container_format>] [-s WxH] -f <fps> [-a <audio_input>...] [-q <quality>] [-r <replay_buffer_size_sec>] [-k h264|h265] [-ac aac|opus|flac] [-o <output_file>]\n");
+    fprintf(stderr, "usage: gpu-screen-recorder -w <window_id|monitor|focused> [-c <container_format>] [-s WxH] -f <fps> [-a <audio_input>...] [-q <quality>] [-r <replay_buffer_size_sec>] [-k h264|h265] [-ac aac|opus|flac] [-oc yes|no] [-o <output_file>]\n");
    fprintf(stderr, "OPTIONS:\n");
    fprintf(stderr, "  -w    Window to record, a display, \"screen\", \"screen-direct\", \"screen-direct-force\" or \"focused\". The display is the display (monitor) name in xrandr and if \"screen\" or \"screen-direct\" is selected then all displays are recorded. If this is \"focused\" then the currently focused window is recorded. When recording the focused window then the -s option has to be used as well.\n"
        "        \"screen-direct\"/\"screen-direct-force\" skips one texture copy for fullscreen applications so it may lead to better performance and it works with VRR monitors when recording fullscreen application but may break some applications, such as mpv in fullscreen mode. Direct mode doesn't capture cursor either. \"screen-direct-force\" is not recommended unless you use a VRR monitor because there might be driver issues that cause the video to stutter or record a black screen.\n");
@@ -644,6 +644,7 @@ static void usage() {
        " This option has be between 5 and 1200. Note that the replay buffer size will not always be precise, because of keyframes. Optional, disabled by default.\n");
    fprintf(stderr, "  -k    Video codec to use. Should be either 'auto', 'h264' or 'h265'. Defaults to 'auto' which defaults to 'h265' on nvidia unless recording at a higher resolution than 3840x2160. On AMD/Intel this defaults to 'auto' which defaults to 'h264'. Forcefully set to 'h264' if -c is 'flv'.\n");
    fprintf(stderr, "  -ac   Audio codec to use. Should be either 'aac', 'opus' or 'flac'. Defaults to 'opus' for .mp4/.mkv files, otherwise defaults to 'aac'. 'opus' and 'flac' is only supported by .mp4/.mkv files. 'opus' is recommended for best performance and smallest audio size.\n");
+    fprintf(stderr, "  -oc   Overclock memory transfer rate to the maximum performance level. Should be either 'yes' or 'no'. This only applies to NVIDIA and exists to overcome a bug in the NVIDIA driver where the performance level is dropped when you record a game. Only needed if you are recording a game that is bottlenecked by the GPU. Works only if you have \"Coolbits\" set to \"12\" in NVIDIA X settings, see README for more information. Note: use at your own risk! Optional, disabled by default.\n");
    fprintf(stderr, "  -o    The output file path. If omitted then the encoded data is sent to stdout. Required in replay mode (when using -r). In replay mode this has to be an existing directory instead of a file.\n");
    fprintf(stderr, "NOTES:\n");
    fprintf(stderr, "  Send signal SIGINT (Ctrl+C) to gpu-screen-recorder to stop and save the recording (when not using replay mode).\n");
@@ -1064,10 +1065,11 @@ int main(int argc, char **argv) {
        { "-o", Arg { {}, true, false } },
        { "-r", Arg { {}, true, false } },
        { "-k", Arg { {}, true, false } },
-        { "-ac", Arg { {}, true, false } }
+        { "-ac", Arg { {}, true, false } },
+        { "-oc", Arg { {}, true, false } }
    };

-    for(int i = 1; i < argc - 1; i += 2) {
+    for(int i = 1; i < argc; i += 2) {
        auto it = args.find(argv[i]);
        if(it == args.end()) {
            fprintf(stderr, "Invalid argument '%s'\n", argv[i]);
@@ -1079,6 +1081,11 @@ int main(int argc, char **argv) {
            usage();
        }

+        if(i + 1 >= argc) {
+            fprintf(stderr, "Missing value for argument '%s'\n", argv[i]);
+            usage();
+        }
+
        it->second.values.push_back(argv[i + 1]);
    }

@@ -1119,6 +1126,11 @@ int main(int argc, char **argv) {
        usage();
    }

+    const char *overclock_str = args["-oc"].value();
+    if(!overclock_str)
+        overclock_str = "no";
+    const bool overclock = strcmp(overclock_str, "yes") == 0;
+
    const Arg &audio_input_arg = args["-a"];
    const std::vector<AudioInput> audio_inputs = get_pulseaudio_inputs();
    std::vector<MergedAudioInputs> requested_audio_inputs;
@@ -1221,6 +1233,10 @@ int main(int argc, char **argv) {
        usage();
    }

+    vec2i region_size = { 0, 0 };
+    Window src_window_id = None;
+    bool follow_focused = false;
+
    gsr_capture *capture = nullptr;
    if(strcmp(window_str, "focused") == 0) {
        if(!screen_region) {
@@ -1228,7 +1244,6 @@ int main(int argc, char **argv) {
            usage();
        }

-        vec2i region_size = { 0, 0 };
        if(sscanf(screen_region, "%dx%d", &region_size.x, &region_size.y) != 2) {
            fprintf(stderr, "Error: invalid value for option -s '%s', expected a value in format WxH\n", screen_region);
            usage();
@@ -1239,38 +1254,7 @@ int main(int argc, char **argv) {
            usage();
        }

-        switch(gpu_inf.vendor) {
-            case GPU_VENDOR_AMD: {
-                gsr_capture_xcomposite_drm_params xcomposite_params;
-                xcomposite_params.window = 0;
-                xcomposite_params.follow_focused = true;
-                xcomposite_params.region_size = region_size;
-                capture = gsr_capture_xcomposite_drm_create(&xcomposite_params);
-                if(!capture)
-                    return 1;
-                break;
-            }
-            case GPU_VENDOR_INTEL: {
-                gsr_capture_xcomposite_drm_params xcomposite_params;
-                xcomposite_params.window = 0;
-                xcomposite_params.follow_focused = true;
-                xcomposite_params.region_size = region_size;
-                capture = gsr_capture_xcomposite_drm_create(&xcomposite_params);
-                if(!capture)
-                    return 1;
-                break;
-            }
-            case GPU_VENDOR_NVIDIA: {
-                gsr_capture_xcomposite_cuda_params xcomposite_params;
-                xcomposite_params.window = 0;
-                xcomposite_params.follow_focused = true;
-                xcomposite_params.region_size = region_size;
-                capture = gsr_capture_xcomposite_cuda_create(&xcomposite_params);
-                if(!capture)
-                    return 1;
-                break;
-            }
-        }
+        follow_focused = true;
    } else if(contains_non_hex_number(window_str)) {
        if(gpu_inf.vendor != GPU_VENDOR_NVIDIA) {
            fprintf(stderr, "Error: recording a monitor is only supported on NVIDIA right now. Record \"focused\" instead for convenient fullscreen window recording\n");
@@ -1310,23 +1294,26 @@ int main(int argc, char **argv) {
        nvfbc_params.pos = { 0, 0 };
        nvfbc_params.size = { 0, 0 };
        nvfbc_params.direct_capture = direct_capture;
+        nvfbc_params.overclock = overclock;
        capture = gsr_capture_nvfbc_create(&nvfbc_params);
        if(!capture)
            return 1;
    } else {
        errno = 0;
-        Window src_window_id = strtol(window_str, nullptr, 0);
+        src_window_id = strtol(window_str, nullptr, 0);
        if(src_window_id == None || errno == EINVAL) {
            fprintf(stderr, "Invalid window number %s\n", window_str);
            usage();
        }
+    }

+    if(!capture) {
        switch(gpu_inf.vendor) {
            case GPU_VENDOR_AMD: {
                gsr_capture_xcomposite_drm_params xcomposite_params;
                xcomposite_params.window = src_window_id;
-                xcomposite_params.follow_focused = false;
-                xcomposite_params.region_size = { 0, 0 };
+                xcomposite_params.follow_focused = follow_focused;
+                xcomposite_params.region_size = region_size;
                capture = gsr_capture_xcomposite_drm_create(&xcomposite_params);
                if(!capture)
                    return 1;
@@ -1335,8 +1322,8 @@ int main(int argc, char **argv) {
            case GPU_VENDOR_INTEL: {
                gsr_capture_xcomposite_drm_params xcomposite_params;
                xcomposite_params.window = src_window_id;
-                xcomposite_params.follow_focused = false;
-                xcomposite_params.region_size = { 0, 0 };
+                xcomposite_params.follow_focused = follow_focused;
+                xcomposite_params.region_size = region_size;
                capture = gsr_capture_xcomposite_drm_create(&xcomposite_params);
                if(!capture)
                    return 1;
@@ -1345,8 +1332,9 @@ int main(int argc, char **argv) {
            case GPU_VENDOR_NVIDIA: {
                gsr_capture_xcomposite_cuda_params xcomposite_params;
                xcomposite_params.window = src_window_id;
-                xcomposite_params.follow_focused = false;
-                xcomposite_params.region_size = { 0, 0 };
+                xcomposite_params.follow_focused = follow_focused;
+                xcomposite_params.region_size = region_size;
+                xcomposite_params.overclock = overclock;
                capture = gsr_capture_xcomposite_cuda_create(&xcomposite_params);
                if(!capture)
                    return 1;
@@ -1874,8 +1862,10 @@ int main(int argc, char **argv) {

    gsr_capture_destroy(capture, video_codec_context);

-    if(dpy)
-        XCloseDisplay(dpy);
+    if(dpy) {
+        // TODO: This causes a crash, why? maybe some other library dlclose xlib and that also happened to unload this???
+        //XCloseDisplay(dpy);
+    }

    free(empty_audio);
    return should_stop_error ? 3 : 0;
--- a/src/overclock.c
+++ b/src/overclock.c
@@ -0,0 +1,231 @@
+#include "../include/overclock.h"
+#include <X11/Xlib.h>
+#include <stdio.h>
+#include <string.h>
+
+// HACK!!!: When a program uses cuda (including nvenc) then the nvidia driver drops to performance level 2 (memory transfer rate is dropped and possibly graphics clock).
+// Nvidia does this because in some very extreme cases of cuda there can be memory corruption when running at max memory transfer rate.
+// So to get around this we overclock memory transfer rate (maybe this should also be done for graphics clock?) to the best performance level while GPU Screen Recorder is running.
+
+// TODO: Does it always drop to performance level 2?
+// TODO: Also do the same for graphics clock and graphics memory?
+
+// Fields are 0 if not set
+
+static int min_int(int a, int b) {
+    return a < b ? a : b;
+}
+
+typedef struct {
+    int perf;
+
+    int nv_clock;
+    int nv_clock_min;
+    int nv_clock_max;
+
+    int mem_clock;
+    int mem_clock_min;
+    int mem_clock_max;
+
+    int mem_transfer_rate;
+    int mem_transfer_rate_min;
+    int mem_transfer_rate_max;
+} NVCTRLPerformanceLevel;
+
+#define MAX_PERFORMANCE_LEVELS 12
+typedef struct {
+    NVCTRLPerformanceLevel performance_level[MAX_PERFORMANCE_LEVELS];
+    int num_performance_levels;
+} NVCTRLPerformanceLevelQuery;
+
+typedef void (*split_callback)(const char *str, size_t size, void *userdata);
+static void split_by_delimiter(const char *str, size_t size, char delimiter, split_callback callback, void *userdata) {
+    const char *it = str;
+    while(it < str + size) {
+        const char *prev_it = it;
+        it = memchr(it, delimiter, (str + size) - it);
+        if(!it)
+            it = str + size;
+
+        callback(prev_it, it - prev_it, userdata);
+        it += 1; // skip delimiter
+    }
+}
+
+// Returns 0 on error
+static int xnvctrl_get_memory_transfer_rate_max(gsr_xnvctrl *xnvctrl, const NVCTRLPerformanceLevelQuery *query) {
+    NVCTRLAttributeValidValuesRec valid;
+    if(xnvctrl->XNVCTRLQueryValidTargetAttributeValues(xnvctrl->display, NV_CTRL_TARGET_TYPE_GPU, 0, 0, NV_CTRL_GPU_MEM_TRANSFER_RATE_OFFSET_ALL_PERFORMANCE_LEVELS, &valid)) {
+        return valid.u.range.max;
+    }
+
+    if(query->num_performance_levels > 0 && xnvctrl->XNVCTRLQueryValidTargetAttributeValues(xnvctrl->display, NV_CTRL_TARGET_TYPE_GPU, 0, query->num_performance_levels - 1, NV_CTRL_GPU_MEM_TRANSFER_RATE_OFFSET, &valid)) {
+        return valid.u.range.max;
+    }
+    
+    return 0;
+}
+
+static bool xnvctrl_set_memory_transfer_rate_offset(gsr_xnvctrl *xnvctrl, int num_performance_levels, int offset) {
+    bool success = false;
+
+    // NV_CTRL_GPU_MEM_TRANSFER_RATE_OFFSET_ALL_PERFORMANCE_LEVELS works (or at least used to?) without Xorg running as root
+    // so we try that first. NV_CTRL_GPU_MEM_TRANSFER_RATE_OFFSET_ALL_PERFORMANCE_LEVELS also only works with GTX 1000+.
+    // TODO: Reverse engineer NVIDIA Xorg driver so we can set this always without root access.
+    if(xnvctrl->XNVCTRLSetTargetAttributeAndGetStatus(xnvctrl->display, NV_CTRL_TARGET_TYPE_GPU, 0, 0, NV_CTRL_GPU_MEM_TRANSFER_RATE_OFFSET_ALL_PERFORMANCE_LEVELS, offset))
+        success = true;
+
+    for(int i = 0; i < num_performance_levels; ++i) {
+        success |= xnvctrl->XNVCTRLSetTargetAttributeAndGetStatus(xnvctrl->display, NV_CTRL_TARGET_TYPE_GPU, 0, i, NV_CTRL_GPU_MEM_TRANSFER_RATE_OFFSET, offset);
+    }
+
+    return success;
+}
+
+static void strip(const char **str, int *size) {
+    const char *str_d = *str;
+    int s_d = *size;
+
+    const char *start = str_d;
+    const char *end = start + s_d;
+
+    while(str_d < end) {
+        char c = *str_d;
+        if(c != ' ' && c != '\t' && c != '\n')
+            break;
+        ++str_d;
+    }
+
+    int start_offset = str_d - start;
+    while(s_d > start_offset) {
+        char c = start[s_d - 1];
+        if(c != ' ' && c != '\t' && c != '\n')
+            break;
+        --s_d;
+    }
+
+    *str = str_d;
+    *size = s_d - start_offset;
+}
+
+static void attribute_callback(const char *str, size_t size, void *userdata) {
+    if(size > 255 - 1)
+        return;
+
+    int size_i = size;
+    strip(&str, &size_i);
+
+    char attribute[255];
+    memcpy(attribute, str, size_i);
+    attribute[size_i] = '\0';
+
+    const char *sep = strchr(attribute, '=');
+    if(!sep)
+        return;
+
+    const char *attribute_name = attribute;
+    size_t attribute_name_len = sep - attribute_name;
+    const char *attribute_value_str = sep + 1;
+
+    int attribute_value = 0;
+    if(sscanf(attribute_value_str, "%d", &attribute_value) != 1)
+        return;
+
+    NVCTRLPerformanceLevel *performance_level = userdata;
+    if(attribute_name_len == 4 && memcmp(attribute_name, "perf", 4) == 0)
+        performance_level->perf = attribute_value;
+    else if(attribute_name_len == 7 && memcmp(attribute_name, "nvclock", 7) == 0)
+        performance_level->nv_clock = attribute_value;
+    else if(attribute_name_len == 10 && memcmp(attribute_name, "nvclockmin", 10) == 0)
+        performance_level->nv_clock_min = attribute_value;
+    else if(attribute_name_len == 10 && memcmp(attribute_name, "nvclockmax", 10) == 0)
+        performance_level->nv_clock_max = attribute_value;
+    else if(attribute_name_len == 8 && memcmp(attribute_name, "memclock", 8) == 0)
+        performance_level->mem_clock = attribute_value;
+    else if(attribute_name_len == 11 && memcmp(attribute_name, "memclockmin", 11) == 0)
+        performance_level->mem_clock_min = attribute_value;
+    else if(attribute_name_len == 11 && memcmp(attribute_name, "memclockmax", 11) == 0)
+        performance_level->mem_clock_max = attribute_value;
+    else if(attribute_name_len == 15 && memcmp(attribute_name, "memTransferRate", 15) == 0)
+        performance_level->mem_transfer_rate = attribute_value;
+    else if(attribute_name_len == 18 && memcmp(attribute_name, "memTransferRatemin", 18) == 0)
+        performance_level->mem_transfer_rate_min = attribute_value;
+    else if(attribute_name_len == 18 && memcmp(attribute_name, "memTransferRatemax", 18) == 0)
+        performance_level->mem_transfer_rate_max = attribute_value;
+}
+
+static void attribute_line_callback(const char *str, size_t size, void *userdata) {
+    NVCTRLPerformanceLevelQuery *query = userdata;
+    if(query->num_performance_levels >= MAX_PERFORMANCE_LEVELS)
+        return;
+
+    NVCTRLPerformanceLevel *current_performance_level = &query->performance_level[query->num_performance_levels];
+    memset(current_performance_level, 0, sizeof(NVCTRLPerformanceLevel));
+    ++query->num_performance_levels;
+    split_by_delimiter(str, size, ',', attribute_callback, current_performance_level);
+}
+
+static bool xnvctrl_get_performance_levels(gsr_xnvctrl *xnvctrl, NVCTRLPerformanceLevelQuery *query) {
+    bool success = false;
+    memset(query, 0, sizeof(NVCTRLPerformanceLevelQuery));
+
+    char *attributes = NULL;
+    if(!xnvctrl->XNVCTRLQueryTargetStringAttribute(xnvctrl->display, NV_CTRL_TARGET_TYPE_GPU, 0, 0, NV_CTRL_STRING_PERFORMANCE_MODES, &attributes)) {
+        success = false;
+        goto done;
+    }
+
+    split_by_delimiter(attributes, strlen(attributes), ';', attribute_line_callback, query);
+    success = true;
+
+    done:
+    if(attributes)
+        XFree(attributes);
+
+    return success;
+}
+
+bool gsr_overclock_load(gsr_overclock *self, Display *display) {
+    memset(self, 0, sizeof(gsr_overclock));
+    self->num_performance_levels = 0;
+
+    return gsr_xnvctrl_load(&self->xnvctrl, display);
+}
+
+void gsr_overclock_unload(gsr_overclock *self) {
+    gsr_xnvctrl_unload(&self->xnvctrl);
+}
+
+bool gsr_overclock_start(gsr_overclock *self) {
+    int basep = 0;
+    int errorp = 0;
+    if(!self->xnvctrl.XNVCTRLQueryExtension(self->xnvctrl.display, &basep, &errorp)) {
+        fprintf(stderr, "gsr warning: gsr_overclock_start: xnvctrl is not supported on your system, failed to overclock memory transfer rate\n");
+        return false;
+    }
+
+    NVCTRLPerformanceLevelQuery query;
+    if(!xnvctrl_get_performance_levels(&self->xnvctrl, &query) || query.num_performance_levels == 0) {
+        fprintf(stderr, "gsr warning: gsr_overclock_start: failed to get performance levels for overclocking\n");
+        return false;
+    }
+    self->num_performance_levels = query.num_performance_levels;
+
+    int target_transfer_rate_offset = xnvctrl_get_memory_transfer_rate_max(&self->xnvctrl, &query) / 2;
+    if(query.num_performance_levels > 3) {
+        const int transfer_rate_max_diff = query.performance_level[query.num_performance_levels - 1].mem_transfer_rate_max - query.performance_level[2].mem_transfer_rate_max;
+        if(transfer_rate_max_diff > 0 && transfer_rate_max_diff < target_transfer_rate_offset)
+            target_transfer_rate_offset = transfer_rate_max_diff;
+    }
+
+    if(xnvctrl_set_memory_transfer_rate_offset(&self->xnvctrl, self->num_performance_levels, target_transfer_rate_offset)) {
+        fprintf(stderr, "gsr info: gsr_overclock_start: successfully set memory transfer rate offset to %d\n", target_transfer_rate_offset);
+    } else {
+        fprintf(stderr, "gsr warning: gsr_overclock_start: failed to set memory transfer rate offset to %d\n", target_transfer_rate_offset);
+    }
+    return true;
+}
+
+void gsr_overclock_stop(gsr_overclock *self) {
+    xnvctrl_set_memory_transfer_rate_offset(&self->xnvctrl, self->num_performance_levels, 0);
+}
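
Note on the attribute parsing above: the following is a minimal standalone sketch, not the code from this commit, of how one comma-separated performance level entry could be trimmed and searched for a `name=value` pair. The helper names `trim_view` and `find_int_attribute` are hypothetical, and the 64-byte token limit is an assumption made for this illustration only.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: trims spaces, tabs and newlines from both ends of a
   (pointer, length) view without modifying the underlying buffer. */
static void trim_view(const char **str, size_t *size) {
    const char *begin = *str;
    const char *end = begin + *size;

    while(begin < end && (*begin == ' ' || *begin == '\t' || *begin == '\n'))
        ++begin;
    /* end points one past the last character, so inspect end[-1] */
    while(end > begin && (end[-1] == ' ' || end[-1] == '\t' || end[-1] == '\n'))
        --end;

    *str = begin;
    *size = (size_t)(end - begin);
}

/* Hypothetical helper: looks up "name=<int>" in one comma-separated entry.
   Returns 1 and stores the parsed value on success, 0 otherwise. */
static int find_int_attribute(const char *level, size_t level_size, const char *name, int *value) {
    const size_t name_len = strlen(name);
    const char *it = level;
    const char *level_end = level + level_size;

    while(it < level_end) {
        const char *comma = memchr(it, ',', (size_t)(level_end - it));
        const char *token_end = comma ? comma : level_end;

        const char *token = it;
        size_t token_size = (size_t)(token_end - it);
        trim_view(&token, &token_size);

        /* Require an exact name match followed by '=', so that "nvclock"
           does not match "nvclockmin". Tokens of 64 bytes or more are skipped. */
        if(token_size > name_len && token_size < 64 && memcmp(token, name, name_len) == 0 && token[name_len] == '=') {
            char number[64];
            const size_t number_size = token_size - name_len - 1;
            memcpy(number, token + name_len + 1, number_size);
            number[number_size] = '\0';
            return sscanf(number, "%d", value) == 1;
        }

        it = comma ? comma + 1 : level_end;
    }
    return 0;
}
```

Calling `find_int_attribute(entry, strlen(entry), "memTransferRate", &value)` on an entry such as `" perf=1, nvclock=210 , memTransferRate=1620"` (values invented for this example) would store 1620 in `value`. The trimmed length is computed as `end - begin`, so it always refers to the stripped view and not to the original buffer.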
--- a/src/vaapi.c
+++ b/src/vaapi.c
@@ -20,7 +20,7 @@ bool gsr_vaapi_load(gsr_vaapi *self) {
    };

    if(!dlsym_load_list(lib, required_dlsym)) {
-        fprintf(stderr, "gsr error: gsr_vaapi_load failed: missing required symbols in libcuda.so/libcuda.so.1\n");
+        fprintf(stderr, "gsr error: gsr_vaapi_load failed: missing required symbols in libva.so\n");
        goto fail;
    }

--- a/src/window_texture.c
+++ b/src/window_texture.c
@@ -88,6 +88,7 @@ int window_texture_on_resize(WindowTexture *self) {
    self->egl->glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    self->egl->glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);

+    while(self->egl->glGetError()) {}
    while(self->egl->eglGetError() != EGL_SUCCESS) {}

    image = self->egl->eglCreateImage(self->egl->egl_display, NULL, EGL_NATIVE_PIXMAP_KHR, (EGLClientBuffer)pixmap, pixmap_attrs);
--- a/src/xnvctrl.c
+++ b/src/xnvctrl.c
@@ -0,0 +1,46 @@
+#include "../include/xnvctrl.h"
+#include "../include/library_loader.h"
+#include <string.h>
+#include <stdio.h>
+#include <dlfcn.h>
+
+bool gsr_xnvctrl_load(gsr_xnvctrl *self, Display *display) {
+    memset(self, 0, sizeof(gsr_xnvctrl));
+    self->display = display;
+
+    dlerror(); /* clear */
+    void *lib = dlopen("libXNVCtrl.so.0", RTLD_LAZY);
+    if(!lib) {
+        fprintf(stderr, "gsr error: gsr_xnvctrl_load failed: failed to load libXNVCtrl.so.0, error: %s\n", dlerror());
+        return false;
+    }
+
+    dlsym_assign required_dlsym[] = {
+        { (void**)&self->XNVCTRLQueryExtension, "XNVCTRLQueryExtension" },
+        { (void**)&self->XNVCTRLSetTargetAttributeAndGetStatus, "XNVCTRLSetTargetAttributeAndGetStatus" },
+        { (void**)&self->XNVCTRLQueryValidTargetAttributeValues, "XNVCTRLQueryValidTargetAttributeValues" },
+        { (void**)&self->XNVCTRLQueryTargetStringAttribute, "XNVCTRLQueryTargetStringAttribute" },
+
+        { NULL, NULL }
+    };
+
+    if(!dlsym_load_list(lib, required_dlsym)) {
+        fprintf(stderr, "gsr error: gsr_xnvctrl_load failed: missing required symbols in libXNVCtrl.so.0\n");
+        goto fail;
+    }
+
+    self->library = lib;
+    return true;
+
+    fail:
+    dlclose(lib);
+    memset(self, 0, sizeof(gsr_xnvctrl));
+    return false;
+}
+
+void gsr_xnvctrl_unload(gsr_xnvctrl *self) {
+    if(self->library) {
+        dlclose(self->library);
+        memset(self, 0, sizeof(gsr_xnvctrl));
+    }
+}