GGUFNetV1: Accelerated Distilled Diffusion for Temporally Consistent Video Upscaling
Abstract:
Video upscaling inherently suffers from two major challenges: maintaining smooth temporal consistency across frames and hallucinating accurate details without shifting the original artistic intent (color and lighting). GGUFNetV1 introduces a one-step, distilled diffusion architecture designed to solve these problems. By treating video as an overlapping 3D volumetric space and separating frequency data during post-processing, GGUFNetV1 produces high-fidelity, high-resolution video streams in a fraction of traditional diffusion inference times.
Below is the technical whitepaper explaining the lifecycle of a video through the GGUFNetV1 upscaling architecture.
Pipeline Overview Matrix
| Stage | Subsystem | Action Process | Primary Objective |
|---|---|---|---|
| 1. Preparation | Resolution Bounding | Bilinear Interpolation | Establish computational boundaries. |
| 2. Encoding | Spatial Compression | Variational Autoencoding | Compress RGB data to mathematically dense latent representations. |
| 3. Generation | Distilled Diffusion | 1-Step DDIM Integration | Hallucinate missing granular detail based on text/context embeddings. |
| 4. Decoding | Temporal Decompression | 3D VAE Decoding | Reconstruct latent data back into pixel space, bonded by time. |
| 5. Post-Process | Wavelet Filter | Frequency Decomposition | Correct color shifting caused by the generation process. |
Core Architecture Breakdown
1. Pre-Processing and Dimensional Bounding
Because AI models require data to fit within strict dimensional parameters, the source video is first scaled to a baseline resolution using smooth bilinear interpolation. If a video is too small, the pipeline mathematically stretches it to a minimum baseline, ensuring each frame is at least 720 pixels high and 1280 pixels wide. Conversely, if the video is massive, exceeding the 4K UHD dimensions of 2160 * 3840, it is dynamically scaled down to prevent memory overloads and hardware crashes during generation.
Once the limits are met with the aspect ratio preserved, these dimensions are rounded down to the nearest multiple of 8. This ensures the data perfectly aligns with tensor memory blocks, allowing the neural network to process it efficiently without leaving behind unusable memory fragments.
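The bounding rules above can be sketched as a small helper. The clamp values (720 * 1280 minimum, 2160 * 3840 maximum) and the multiple-of-8 snap come from the text; the function name and exact scaling policy are illustrative assumptions:

```python
# Minimal sketch of the resolution-bounding stage, assuming the clamp
# values stated in the text. `bound_resolution` is a hypothetical name.

MIN_H, MIN_W = 720, 1280    # minimum baseline
MAX_H, MAX_W = 2160, 3840   # 4K UHD ceiling

def bound_resolution(h: int, w: int) -> tuple[int, int]:
    """Scale (h, w) into the [min, max] envelope while preserving the
    aspect ratio, then snap each side down to a multiple of 8."""
    # Upscale factor if either side falls below the baseline, else 1.0.
    up = max(MIN_H / h, MIN_W / w, 1.0)
    # Downscale factor if either side exceeds the 4K ceiling, else 1.0.
    down = min(MAX_H / h, MAX_W / w, 1.0)
    scale = up if up > 1.0 else down
    h, w = h * scale, w * scale
    # Round down to the nearest multiple of 8 for tensor alignment.
    return int(h) // 8 * 8, int(w) // 8 * 8
```

For example, a 480 * 640 clip is stretched to 960 * 1280, while an 8K 4320 * 7680 clip is halved to the 4K ceiling.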
2. Volumetric Latent Compression (The Encoder)
Standard RGB video contains an overwhelming amount of data, making it too large to run through a diffusion model efficiently. To solve this, the pipeline essentially acts as a summarizer. It utilizes a Variational Autoencoder (VAE) to compress the spatial dimensions by a factor of 8, turning each 8 * 8 pixel grid into a single high-density cell of a latent representation.
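A back-of-envelope calculation shows what the 8x spatial compression buys per frame. The latent channel count of 4 is a common VAE choice assumed here, not stated in the text:

```python
# Illustrative size comparison: pixel space vs. latent space for one
# 720p frame. The 4-channel latent is an assumption, not from the text.
H, W, C = 720, 1280, 3                 # source frame in pixel space
latent_h, latent_w = H // 8, W // 8    # each 8 * 8 pixel grid -> one latent cell
latent_c = 4                           # assumed latent channel count

pixel_values = H * W * C                          # 2,764,800 values
latent_values = latent_h * latent_w * latent_c    # 57,600 values
compression = pixel_values / latent_values        # 48x fewer values
```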
Even compressed, an entire video is still too large to summarize all at once. The system addresses this by parsing the video in 3D (width, height, and time) and reading it in smaller, sliding tiles, typically 512 * 512 blocks with a 25% overlap. To ensure the edges of these overlapping tiles don't look like a visible checkerboard when glued back together, the system applies Gaussian weights—a mathematical bell curve—that places high importance on the center of the tile and smoothly fades out the edges. This creates a seamless, continuous latent space, perfectly prepped for generation.
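The Gaussian-weighted tile blending can be sketched as follows. The 512 tile size matches the text; the sigma choice and helper names are illustrative assumptions:

```python
import numpy as np

# Sketch of Gaussian-weighted tile blending: overlapping tiles are
# accumulated with bell-curve weights, then normalised, so seams fade
# out instead of forming a visible checkerboard. Sigma is an assumption.

def gaussian_weight(tile: int = 512, sigma_frac: float = 0.3) -> np.ndarray:
    """2D bell curve: ~1.0 at the tile centre, fading toward the edges."""
    ax = np.linspace(-1.0, 1.0, tile)
    g = np.exp(-(ax ** 2) / (2 * sigma_frac ** 2))
    return np.outer(g, g)

def blend_tiles(canvas_shape, tiles, origins, tile: int = 512) -> np.ndarray:
    """Paste weighted tiles at their (y, x) origins and divide by the
    accumulated weight, crossfading every overlapping region."""
    acc = np.zeros(canvas_shape)
    wsum = np.zeros(canvas_shape)
    w = gaussian_weight(tile)
    for t, (y, x) in zip(tiles, origins):
        acc[y:y + tile, x:x + tile] += t * w
        wsum[y:y + tile, x:x + tile] += w
    return acc / np.maximum(wsum, 1e-8)
```

Because every output cell is a weighted average of the tiles covering it, two tiles that agree in their overlap reproduce that value exactly, while small disagreements blend smoothly.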
3. 1-Step Distilled Generative Upscaling (The Core)
Traditional diffusion models operate somewhat like a sculptor slowly chipping away a block of marble over 20 to 50 iterations, stepping backward from static noise to a clean image. GGUFNetV1 drastically accelerates this by using a Distilled DDIM (Denoising Diffusion Implicit Models) process. The network is pre-trained to know exactly where every detail should be instantly, allowing it to calculate the entire trajectory from pure noise to the clean target frame in a single mathematical execution guided by text and context embeddings.
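A single deterministic DDIM step can be written in closed form. The sketch below assumes a noise-predicting network, stood in for by a hypothetical `predict_eps` callable, and an illustrative terminal noise level:

```python
import numpy as np

# Minimal sketch of a one-step DDIM jump from pure noise to the clean
# estimate. `predict_eps` stands in for the distilled network (a
# hypothetical placeholder); `alpha_bar_T` is an illustrative value.

def ddim_one_step(x_T, predict_eps, alpha_bar_T=0.0064):
    """Invert the forward noising process in one call:
    x0 = (x_T - sqrt(1 - abar_T) * eps) / sqrt(abar_T)."""
    eps = predict_eps(x_T)
    return (x_T - np.sqrt(1.0 - alpha_bar_T) * eps) / np.sqrt(alpha_bar_T)
```

If the network's noise prediction is exact, this recovers the clean target in a single evaluation; distillation trains the network so that one evaluation is good enough in practice.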
To prevent the common problem of AI video stuttering, boiling, or flickering, the neural network relies on paired Transformer blocks:
- Spatial Attention: Analyzes the X and Y axes of an individual frame to hallucinate sharp textures, fine edges, and local granular details.
- Temporal Attention: Analyzes the Z axis, representing the flow of time. It actively looks at the dense latent data of adjacent frames alongside the current one, enforcing strict temporal coherence so the generated details move fluidly across seconds of video.
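The division of labor between the paired attention blocks comes down to how the volumetric latent is reshaped into token sequences; the sketch below shows shapes only, with illustrative dimensions:

```python
import numpy as np

# Shape-level sketch of the paired attention blocks on a volumetric
# latent (frames T, height H, width W, channels C). Dimensions are
# illustrative; no actual attention weights are computed here.

T, H, W, C = 16, 90, 160, 4
latent = np.zeros((T, H, W, C))

# Spatial attention: each frame is its own batch element; the tokens
# are the H * W positions within that frame (X and Y axes).
spatial_tokens = latent.reshape(T, H * W, C)

# Temporal attention: each spatial position is its own batch element;
# the tokens are the T frames at that position (the Z / time axis),
# which is what lets adjacent frames constrain each other.
temporal_tokens = latent.transpose(1, 2, 0, 3).reshape(H * W, T, C)
```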
4. Temporal Decompression (The Decoder)
Once the core network has hallucinated the high-resolution details, this expanded, ultra-dense latent summary must be translated back into standard RGB pixels so the video can be viewed on a screen. This is handled by a Temporal VAE. Because the video was processed in volumetric chunks, the Decoder is responsible for stitching these newly generated temporal blocks back together.
Processing video in chunks introduces the risk of harsh transitions; for example, frame 100 might suddenly jump in lighting compared to frame 101. To guarantee a flawless visual experience, the decoder uses sliding windows across height, width, and time. Where two temporal chunks overlap, the system calculates a linear crossfade transition, smoothly blending the overlapping frames mathematically to ensure fluid continuous motion across the final video.
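The linear crossfade over a temporal overlap can be sketched as follows; chunk and overlap sizes are illustrative assumptions:

```python
import numpy as np

# Sketch of the decoder's linear crossfade: two frame chunks of shape
# (frames, H, W, C) whose last/first `overlap` frames cover the same
# moments are joined with a linear ramp from the first to the second.

def crossfade_chunks(a: np.ndarray, b: np.ndarray, overlap: int) -> np.ndarray:
    """Blend the shared frames with weights ramping 0 -> 1 toward `b`."""
    w = np.linspace(0.0, 1.0, overlap)[:, None, None, None]  # fade-in weights
    blended = (1.0 - w) * a[-overlap:] + w * b[:overlap]
    return np.concatenate([a[:-overlap], blended, b[overlap:]], axis=0)
```

The first blended frame is entirely from the earlier chunk and the last entirely from the later one, so a sudden lighting jump between, say, frame 100 and frame 101 is spread across the whole overlap instead of landing on a single cut.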
5. Wavelet Frequency Recombination (Color Fix)
A persistent challenge with generating video via diffusion is that the AI inherently alters the global color gamut and lighting gradients. A moody, dark blue scene might accidentally be shifted to a bright, saturated purple scene by the generative network. To explicitly repair this without losing detail, GGUFNetV1 deploys a computationally lightweight Wavelet Decomposer.
The decomposer splits image attributes into two distinct layers:
- Low-Frequency Data: The general colors, gradients, overall mood, and shadows.
- High-Frequency Data: The sharp outlines, edges, and synthesized fine textures.
The algorithm extracts the perfect Low-Frequency color data from the original, low-resolution video. It then extracts the highly detailed High-Frequency data from the new AI-Upscaled video. By discarding the AI's altered colors and the original video's blurry textures, the system simply merges the original colors directly onto the AI's sharp textures. The final output is an incredibly crisp, hyper-detailed video that flawlessly preserves the artistic lighting and mood of the source material.
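The recombination can be sketched as follows, using a simple box blur as a stand-in low-pass filter for a true wavelet decomposition (an assumption made for brevity); the source frame is assumed to be pre-resized to match the upscaled frame:

```python
import numpy as np

# Sketch of the frequency recombination: low frequencies (colour, mood)
# from the source, high frequencies (edges, texture) from the upscale.
# A box blur stands in for the wavelet low-pass; names are illustrative.

def low_pass(img: np.ndarray, k: int = 15) -> np.ndarray:
    """Crude k x k box blur over H and W of an (H, W, C) image."""
    pad = k // 2
    padded = np.pad(img, ((pad, pad), (pad, pad), (0, 0)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def frequency_color_fix(upscaled: np.ndarray, source: np.ndarray) -> np.ndarray:
    """Keep the source's low-frequency colour/lighting and the upscale's
    high-frequency detail (source assumed resized to match upscaled)."""
    high = upscaled - low_pass(upscaled)   # AI's sharp detail
    return low_pass(source) + high         # original colours + new detail
```

Note that if the generator had introduced no colour shift at all, the low-pass terms would cancel and the upscaled frame would pass through unchanged.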