The Kaggle Grandmasters Playbook: 7 Battle-Tested Modeling Techniques for Tabular Data


Over hundreds of Kaggle competitions, we’ve refined a playbook that consistently lands us near the top of the leaderboard—no matter if we’re working with millions of rows, missing values, or test sets that behave nothing like the training data. This isn’t just a collection of modeling tricks—it’s a repeatable system for solving real-world tabular problems fast. 

Below are seven of our most battle-tested techniques, each one made practical through GPU acceleration. Whether you’re climbing the leaderboard or deploying models in production, these strategies can give you an edge.

We’ve included links to example write-ups or notebooks from past competitions for each technique.
Note: Kaggle and Google Colab notebooks come with free GPUs, and accelerated drop-in libraries like the ones you’ll see below come pre-installed.

Core principles: the foundations of a winning workflow

Before diving into techniques, it’s worth pausing to cover the two principles that power everything in this playbook: fast experimentation and careful validation. These aren’t optional best practices—they’re the foundation of how we approach every tabular modeling problem.

Fast experimentation

The biggest lever we have in any competition or real-world project is the number of high-quality experiments we can run. The more we iterate, the more patterns we discover—and the faster we catch when a model is failing, drifting, or overfitting—so we can course-correct early and improve faster.

In practice, that means we optimize our entire pipeline for speed, not just our model training step.

Here’s how we make it work:

  • Accelerate dataframe operations using GPU drop-in replacements for pandas or Polars to transform and engineer features at scale.
  • Train models with NVIDIA cuML or GPU backends of XGBoost, LightGBM, and CatBoost.
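For example, in a Kaggle or Colab notebook the pandas drop-in can be enabled with a single extension load before importing pandas. A minimal sketch, where the file and column names are placeholders:

# Enable the cuDF drop-in for pandas (run before importing pandas)
%load_ext cudf.pandas

import pandas as pd

# Ordinary pandas code now runs on the GPU where possible
df = pd.read_csv("train.csv")
agg = df.groupby("category")["value"].mean()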

GPU acceleration isn’t just for deep learning—it’s often the only way to make advanced tabular techniques practical at scale.

Local validation

If you can’t trust your validation score, you’re flying blind. That’s why cross-validation (CV) is a cornerstone of our workflow.

Our approach:

  • Use k-fold cross-validation, where the model trains on most of the data and tests on the part that’s held out.
  • Rotate through folds so every part of the data is tested once.

This gives a much more reliable measure of performance than a single train/validation split.

Pro tip: Match your CV strategy to how the test data is structured. 

For example:

  • Use TimeSeriesSplit for time-dependent data
  • Use GroupKFold for grouped data (like users or patients)
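As a minimal sketch (assuming a feature DataFrame X, a target Series y, and a dataframe df with a user_id column), grouped CV with scikit-learn looks like this:

from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
# Keep all rows from the same user in the same fold to avoid leakage
for fold, (train_idx, valid_idx) in enumerate(gkf.split(X, y, groups=df["user_id"])):
    X_tr, X_va = X.iloc[train_idx], X.iloc[valid_idx]
    y_tr, y_va = y.iloc[train_idx], y.iloc[valid_idx]
    # fit a model on (X_tr, y_tr), score it on (X_va, y_va), then average the fold scores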

With those foundations in place—moving fast and validating carefully—we can now dive into the techniques themselves. Each one builds on these principles and shows how we turn raw data into world-class models.

1. Start with smarter EDA, not just the basics

Most practitioners know the basics: Check for missing values, outliers, correlations, and feature ranges. Those steps are important, but they’re table stakes. To build models that hold up in the real world, you need to explore the data a little deeper. Here are two quick checks that we’ve found useful but many people miss:

Train vs. test distribution checks: Spot when evaluation data differs from training, since distribution shift can cause models to validate well but fail in deployment.

Figure 1. Comparing feature distributions between train (blue) and test (red) reveals a clear shift—test data is concentrated in a higher range, with minimal overlap. This kind of distribution shift can cause models to validate well but fail in deployment.
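One quick way to run this check is a per-feature two-sample test between train and test. A minimal sketch, assuming train and test are dataframes with matching numeric columns:

from scipy.stats import ks_2samp

# Flag features whose train and test distributions differ strongly
for col in train.select_dtypes("number").columns:
    stat, p = ks_2samp(train[col].dropna(), test[col].dropna())
    if p < 0.01:
        print(f"{col}: possible distribution shift (KS statistic {stat:.3f})")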

Analyze target variable for temporal patterns: Check for trends or seasonality, since ignoring temporal patterns can lead to models that look accurate in training but break in production.

Time series plot of target variable showing upward trend and seasonal cycles across 2023–2024.
Figure 2. Analyzing the target variable over time uncovers a strong upward trend with seasonal fluctuations and accelerating growth. Ignoring temporal patterns like these can mislead models unless time-aware validation is used.

These techniques aren’t brand new—but they’re often overlooked, and ignoring them can sink a project.

Why it matters: Skipping these checks can derail an otherwise solid workflow.

In action: In the winning solution to the Amazon KDD Cup ‘23, the team uncovered both a train—test distribution shift and temporal patterns in the target—insights that shaped the final approach. Read the full write-up >

Made practical with GPUs: Real-world datasets are often millions of rows, which can slow to a crawl in pandas. By adding GPU acceleration with NVIDIA cuDF, you can run distribution comparisons and correlations at scale in seconds. Read the technical blog >

2. Build diverse baselines, fast

Most people build a few simple baselines—maybe a mean prediction, a logistic regression, or a quick XGBoost—and then move on. The problem is that a single baseline doesn’t tell you much about the landscape of your data. 

Our approach is different: We spin up a diverse set of baselines across model types right away. Seeing how linear models, GBDTs, and even small neural nets perform side-by-side gives us far more context to guide experimentation. 

Why it matters: Baselines are your gut check—they confirm your model is doing better than guessing, set a minimum performance bar, and act as a rapid feedback loop. Re-running baselines after data changes can reveal whether you’re making progress—or uncover problems like leakage.

Diverse baselines also show you early which model families fit your data best, so you can double-down on what works instead of wasting cycles on the wrong path.

In action: In the Binary Prediction with a Rainfall Dataset competition, we were tasked with forecasting rainfall amounts from weather data. Our baselines carried us far—an ensemble of gradient-boosted trees, neural nets, and Support Vector Regression (SVR) models, without any feature engineering, was enough to earn us second place. And while exploring other baselines, we found that even a single Support Vector Classifier (SVC) baseline would have placed near the top of the leaderboard. Read the full write-up >

Made practical with GPUs: Training a variety of models can be painfully slow on CPUs. With GPU acceleration, it’s practical to try them all—cuDF for quick stats, cuML for linear/logistic regression, and GPU-accelerated XGBoost, LightGBM, CatBoost, and neural nets—so you can get better insight in minutes, not hours.
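As a rough sketch of what that looks like in practice (assuming X_train, y_train, X_valid, and y_valid are NumPy arrays that already exist), a few diverse baselines can be trained and compared side by side:

from cuml.linear_model import LogisticRegression
from cuml.svm import SVC
from sklearn.metrics import roc_auc_score
import xgboost as xgb

baselines = {
    "logistic": LogisticRegression(max_iter=1000),
    "svc": SVC(probability=True),
    "xgb": xgb.XGBClassifier(device="cuda", n_estimators=500),
}

for name, model in baselines.items():
    model.fit(X_train, y_train)
    preds = model.predict_proba(X_valid)[:, 1]
    print(name, roc_auc_score(y_valid, preds))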

3. Generate more features, discover more patterns

Feature engineering is still one of the most effective ways to boost accuracy on tabular data. The challenge: generating and validating thousands of features with pandas on CPUs is far too slow to be practical. 

Why it matters: Scaling beyond a handful of manual transformations—into hundreds or thousands of engineered features—often reveals hidden signals that models alone can’t capture. 

Example: Combining categorical columns

In one Kaggle competition, the dataset had eight categorical columns. By combining pairs of them, we created 28 new categorical features that captured interactions the original data didn’t show. Here’s a simplified snippet of the approach:

# Create pairwise combinations of categorical columns as new interaction features
for i, c1 in enumerate(CATS[:-1]):
    for c2 in CATS[i+1:]:
        n = f"{c1}_{c2}"
        train[n] = train[c1].astype('str') + "_" + train[c2].astype('str')
In action: Large-scale feature engineering powered first-place finishes in the Kaggle Backpack and Insurance competitions, where thousands of new features made the difference. 

Made practical with GPUs: With cuDF, pandas operations like groupby, aggregation, and encoding run orders of magnitude faster, making it possible to generate and test thousands of new features in days instead of months.
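A typical pattern is per-category statistical aggregations merged back as new columns; a minimal sketch (column names are placeholders) that runs the same way with pandas or the cuDF drop-in:

# Aggregate a numeric column per categorical group and merge the stats back as features
aggs = train.groupby("category_col")["numeric_col"].agg(["mean", "std", "count"])
aggs.columns = [f"numeric_col_{stat}_by_category" for stat in aggs.columns]
train = train.merge(aggs, left_on="category_col", right_index=True, how="left")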

Check out the technical blog and training course for more hands-on examples.

Combining diverse models (ensembling) boosts performance

We found that combining the strengths of different models often pushes performance beyond what any one model can achieve. Two techniques that are particularly useful are hill climbing and model stacking.

4. Hill climbing

Hill climbing is a simple, but powerful way to ensemble models. Start with your strongest single model, then systematically add others with different weights, keeping only the combinations that improve validation. Repeat until no further gains.

Why it matters: Ensembling captures complementary strengths across models, but finding the right blend is hard. Hill climbing automates the search, often squeezing out extra accuracy and outperforming single-model solutions.

In action: In the Predict Calorie Expenditure competition, we used a hill climbing ensemble of XGBoost, CatBoost, neural nets, and linear models to secure first place. Read the write-up >

Made practical with GPUs: Hill climbing itself isn’t new—it’s a common ensemble technique in competitions—but it normally becomes too slow to apply at large scale. With CuPy on GPUs, we can vectorize metric calculations (like RMSE or AUC) and evaluate thousands of weight combinations in parallel. That speedup makes it practical to test far more ensembles than would be feasible on CPUs, often uncovering stronger blends.

Here’s a simplified version of the code used to evaluate Hill Climbing ensembles on GPU:

import cupy as cp

def multiple_rmse_scores(actual, predicted):
    # Score many candidate ensembles at once: each column of `predicted` is one blend
    if len(actual.shape) == 1:
        actual = actual[:, cp.newaxis]
    rmses = cp.sqrt(cp.mean((actual - predicted)**2.0, axis=0))
    return rmses

def multiple_roc_auc_scores(actual, predicted):
    # Vectorized AUC via the rank-sum (Mann-Whitney U) formulation
    n_pos = cp.sum(actual)
    n_neg = len(actual) - n_pos
    ranked = cp.argsort(cp.argsort(predicted, axis=0), axis=0) + 1
    aucs = (cp.sum(ranked[actual == 1, :], axis=0) - n_pos*(n_pos + 1)/2) / (n_pos*n_neg)
    return aucs
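The helpers above score every candidate blend in a single vectorized call. A simplified sketch of the surrounding hill climbing loop (assuming a CuPy array oof of out-of-fold predictions with one column per model, a target vector y_true, and a fixed blending step) might look like this:

# Greedy hill climbing: keep adding the model that most improves validation RMSE
best = oof[:, 0].copy()   # start from the strongest single model
step = 0.05

for _ in range(100):
    # Every candidate blend at once: (1 - step) * current ensemble + step * each model
    candidates = (1 - step) * best[:, cp.newaxis] + step * oof
    scores = multiple_rmse_scores(y_true, candidates)
    if scores.min() >= multiple_rmse_scores(y_true, best[:, cp.newaxis])[0]:
        break   # no candidate improves the ensemble, so stop
    best = candidates[:, int(scores.argmin())]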

5. Stacking

Stacking takes ensembling a step further by training one model on the outputs of others. Instead of averaging predictions with weights (like hill climbing), stacking builds a second-level model that learns how best to combine the outputs of other models. 

Why it matters: Stacking is especially effective when the dataset has complex patterns that different models capture in different ways, like linear trends vs. nonlinear interactions.

Pro tip: Two ways to stack:

  • Residuals: Train a Stage 2 model on what Stage 1 got wrong (the residuals).
  • OOF Features: Use Stage 1 predictions as new input features for Stage 2.

Both approaches help squeeze more signal out of the data by capturing patterns that base models miss.
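A minimal sketch of the OOF-features variant (assuming a feature DataFrame X, a target y, and a list base_models of scikit-learn-compatible regressors; all names are placeholders):

import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold
from xgboost import XGBRegressor

# Stage 1: out-of-fold predictions from each base model, so Stage 2 never sees leaked labels
oof = np.zeros((len(X), len(base_models)))
kf = KFold(n_splits=5, shuffle=True, random_state=0)
for m, base in enumerate(base_models):
    for train_idx, valid_idx in kf.split(X):
        model = clone(base).fit(X.iloc[train_idx], y.iloc[train_idx])
        oof[valid_idx, m] = model.predict(X.iloc[valid_idx])

# Stage 2: a second-level model learns how best to combine the base predictions
stacker = XGBRegressor(device="cuda")
stacker.fit(np.hstack([X.values, oof]), y)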

In action: Stacking won first place in the Podcast Listening Time competition, using a three-level stack of diverse models (linear, GBDT, neural nets, and AutoML). Read the technical blog >

A flow diagram showing a three-level model stack. Level 1 includes diverse models such as NVIDIA cuML Lasso, SVR, KNN Regressor, Random Forest, neural networks (MLP, TabPFN), and gradient-boosted trees (XGBoost, LightGBM). Their predictions feed into Level 2 models, including XGBoost and MLP. Finally, Level 3 combines outputs with a weighted average to produce the final prediction.
Figure 3. The winning entry in the Kaggle April 2025 Playground competition used stacking with three levels of models, with the results of each level used in subsequent levels.

Made practical with GPUs: Stacking is a well-known ensembling technique—but deep stacks quickly become computationally expensive, requiring hundreds of model fits across folds and levels. With cuML and GPU-accelerated GBDTs, we can train and evaluate stacks an order of magnitude faster, making it realistic to explore multi-level ensembles in hours instead of days.

6. Turn unlabeled data into training signal with pseudo-labeling

Pseudo-labeling turns unlabeled data into training signal. You use your best model to infer labels on data that lacks them (for example, test data or external datasets), then fold those “pseudo-labels” back into training to boost model performance.

A flow diagram of the pseudo-labeling process. Train data is used to build an initial model (Level 0), which is validated and tested. The same model generates predictions on unlabeled data, producing pseudo-labels. These pseudo-labels are combined with the original training data to train a second-level model (Level 1), which is then validated and tested.
Figure 4. Pseudo-labeling workflow—use a trained model to generate labels for unlabeled data, then fold those pseudo-labels back into training to improve performance.

Why it matters: More data = more signal. Pseudo-labeling improves robustness, acts like knowledge distillation (student models learn from a strong teacher’s predictions), and can even help denoise labeled data by filtering out samples where models disagree. Using soft labels (probabilities instead of hard 0/1s) adds regularization and reduces noise.

Pro tips for effective pseudo-labeling:

  • The stronger the model, the better the pseudo-labels. Ensembles or multi-round pseudo-labeling usually outperform single-pass approaches.
  • Pseudo-labels can also be used for pretraining. Fine-tune on the initial data as a last step to reduce noise introduced earlier.
  • Use soft pseudo-labels. They add more signal, reduce noise, and let you filter out low-confidence samples.
  • Pseudo-labels can be used on labeled data—useful for removing noisy samples.
  • Avoid information leakage. When using k-fold, you must compute k sets of pseudo-labels so that validation data never sees labels from models trained on itself.

In action: In the BirdCLEF 2024 competition, the task was species classification from bird audio recordings. Pseudo-labeling expanded the training set with soft labels on unlabeled clips, which helped our model generalize better to new species and recording conditions. Read the full write-up >

Made practical with GPUs: Pseudo-labeling usually requires retraining pipelines multiple times (baseline > pseudo-labeled > improved pseudo-labels). This can take days on a CPU, making iteration impractical. With GPU acceleration (via cuML, XGBoost or CatBoost GPU backends), you can run several pseudo-labeling cycles in hours.
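A minimal single-round sketch (assuming NumPy arrays X_labeled, y_labeled, and X_unlabeled already exist; for simplicity it filters by confidence and uses hard labels, and the k-fold leakage caveat above still applies when labeling validation or test data):

import numpy as np
import xgboost as xgb

# Round 1: train on labeled data, then score the unlabeled rows
model = xgb.XGBClassifier(device="cuda")
model.fit(X_labeled, y_labeled)
soft_labels = model.predict_proba(X_unlabeled)[:, 1]

# Keep only confident pseudo-labels, then retrain on the combined data
keep = (soft_labels < 0.05) | (soft_labels > 0.95)
X_combined = np.vstack([X_labeled, X_unlabeled[keep]])
y_combined = np.concatenate([y_labeled, (soft_labels[keep] > 0.5).astype(int)])

model2 = xgb.XGBClassifier(device="cuda")
model2.fit(X_combined, y_combined)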

7. Squeeze out extra performance with seed ensembling and full-data retraining

Even after optimizing our models and ensembles, we found two final tweaks that can squeeze out extra performance:

  • Train with different random seeds. Changing initialization and training paths, then averaging predictions, often improves performance.
  • Retrain on 100% of the data. After finding optimal hyperparameters, fitting your final model on all training data squeezes out extra accuracy.

Why it matters: These steps don’t require new architectures—just more runs of the models you already trust. Together, they boost robustness and ensure you’re making full use of your data.

In action: In the Predicting Optimal Fertilizers challenge, ensembling XGBoost models across 100 different seeds clearly outperformed single-seed training. Retraining on the full dataset provided another leaderboard bump. Read the full write-up >

A line chart showing the benefit of ensembling 100 XGBoost models with different random seeds. The blue line (ensemble) steadily increases and stabilizes around 0.379 MAP@3, while the orange line (average of single seeds) fluctuates around 0.376, showing that seed ensembling improves performance compared to individual models.
Figure 5. Ensembling XGBoost with different random seeds (blue) steadily improves MAP@3 compared to single-seed averages (orange).

Note: MAP@3 (Mean Average Precision at 3) measures how often the correct label appears in the model’s top three ranked predictions.

Made practical with GPUs: Faster training and inference on GPUs make it feasible to rerun models many times. What might take days on CPU becomes hours on GPU—turning “extra” training into a realistic step in every project. 
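A minimal sketch of both tweaks (assuming tuned hyperparameters params and the usual train/test splits already exist):

import numpy as np
import xgboost as xgb

# Tweak 1: average predictions from models trained with different random seeds
seed_preds = []
for seed in range(10):
    model = xgb.XGBClassifier(**params, random_state=seed, device="cuda")
    model.fit(X_train, y_train)
    seed_preds.append(model.predict_proba(X_test)[:, 1])
blend = np.mean(seed_preds, axis=0)

# Tweak 2: after tuning, refit the final model on 100% of the training data
final_model = xgb.XGBClassifier(**params, random_state=0, device="cuda")
final_model.fit(X_full, y_full)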

Wrapping up: the Grandmasters’ playbook

This playbook is battle-tested, forged through years of competitions and countless experiments. It’s grounded in two principles—fast experimentation and careful validation—that we apply to every project. With GPU acceleration, these advanced techniques become practical at scale, making them just as effective for real-world tabular problems as they are for climbing leaderboards.

If you want to put these ideas into practice, here are some resources to get started with GPU acceleration in the tools you already use.



Accelerate Large-Scale LLM Inference and KV Cache Offload with CPU-GPU Memory Sharing


Large language models (LLMs) are at the forefront of AI innovation, but their sheer size can complicate inference efficiency. Models like Llama 3 70B and Llama 4 Scout 109B may require more memory than a GPU provides, especially when large context windows are included.

For example, loading the Llama 3 70B and Llama 4 Scout 109B models in half precision (FP16) requires roughly 140 GB and 218 GB of memory, respectively. During inference, these models typically need additional data structures such as the key-value (KV) cache, which grows with context length and batch size. A KV cache representing a 128K-token context window for a single user (batch size 1) consumes roughly 40 GB of memory with Llama 3 70B, and this scales linearly with the number of users. In production deployments, attempting to load a large model entirely into GPU memory can result in out-of-memory (OOM) errors.
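As a rough back-of-the-envelope check of that 40 GB figure (using commonly cited Llama 3 70B architecture values: 80 layers, 8 KV heads with grouped-query attention, and a head dimension of 128):

# Approximate KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes per value * tokens
layers, kv_heads, head_dim = 80, 8, 128   # commonly cited Llama 3 70B values (assumption)
bytes_per_value = 2                       # FP16
context_tokens = 128 * 1024               # 128K-token window, batch size 1

kv_cache_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens
print(kv_cache_bytes / 1024**3)           # ~40 GB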

The CPUs and GPUs in the NVIDIA Grace Blackwell and NVIDIA Grace Hopper architectures are connected with NVIDIA NVLink-C2C, a 900 GB/s memory-coherent interconnect that lets the CPU and GPU access and operate on the same data without explicit data transfers or redundant memory copies.

This setup makes it easier to access and process large datasets and models, even when their size exceeds traditional GPU memory limits. The high bandwidth of the NVLink-C2C connection and the unified memory architecture found in Grace Hopper and Grace Blackwell improve the efficiency of LLM fine-tuning, KV cache offload, inference, scientific computing, and more, allowing models to move data quickly and use CPU memory when there is not enough GPU memory.

Figure 1. NVLink-C2C coherency with Address Translation Services: CPU physical memory and GPU physical memory are exposed through a single system memory page table shared by both.

For example, when a model is loaded on a platform like the NVIDIA GH200 Grace Hopper Superchip, which features a unified memory architecture, it uses the 96 GB of high-bandwidth GPU memory and can access the 480 GB of LPDDR memory attached to the CPU without explicit data transfers. This expands the total available memory, making it feasible to work with models and datasets that would otherwise be too large for the GPU alone.

Code walkthrough

In this blog post, using the Llama 3 70B model and the GH200 Superchip as our example, we show how a large model can be streamed to the GPU using unified memory, illustrating the concepts discussed above.

Getting started

To get started, we need to set up our environment and obtain access to the Llama 3 70B model. Note that the following code samples are designed to run on an NVIDIA GH200 Grace Hopper Superchip machine to demonstrate the benefits of the unified memory architecture. The same techniques also work on systems based on NVIDIA Grace Blackwell.

This involves a few simple steps:

  1. Request model access on Hugging Face: Visit the Llama 3 70B model page on Hugging Face and request access.
  2. Generate an access token: Once your request is approved, create an access token in your Hugging Face account settings. This token is used to authenticate your access to the model programmatically.
  3. Install the required packages: Before you can interact with the model, install the necessary Python libraries. Open a Jupyter notebook on the GH200 machine and run the following commands:
#Install huggingface and cuda packages
!pip install --upgrade huggingface_hub
!pip install transformers
!pip install nvidia-cuda-runtime-cu12
  4. Log in to Hugging Face: After installing the packages, log in to Hugging Face using the token you generated. The huggingface_hub library provides a convenient way to do this:
#Login into huggingface using the generated token

from huggingface_hub import login
login("enter your token")

What happens when the Llama 3 70B model is loaded on the GH200?

When you try to load the Llama 3 70B model into GPU memory, its parameters (weights) are loaded into GPU memory (NVIDIA CUDA memory). In half precision (FP16), these weights require about 140 GB of GPU memory. Because the GH200 provides only 96 GB, the model cannot fully fit into the available memory, and the loading process fails with an OOM error. In the next cell, we demonstrate this behavior with a code example.

import torch
from transformers import pipeline
pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-70B") #loads the model into the GPU memory

When running the command above, we see the following error message:

Error message:
OutOfMemoryError: CUDA out of memory. Tried to allocate 896.00 MiB. GPU 0 has a total capacity of 95.00 GiB of which 524.06 MiB is free. Including non-PyTorch memory, this process has 86.45 GiB memory in use. Of the allocated memory 85.92 GiB is allocated by PyTorch, and 448.00 KiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management.

From the error message, we can see that GPU memory is maxed out. You can also confirm the state of your GPU memory by running:

!nvidia-smi

When you run the command, you should see output similar to the following figure. The output tells us that we have consumed 96.746 GB of the 97.871 GB of memory on the GPU. See this forum post to better understand how to interpret the output.

Figure 2. Output of nvidia-smi, showing 96.746 GB of the GPU’s 97.871 GB of memory consumed.

To prepare for our next steps and free up GPU memory, we will clear the remnants of this failed attempt. In the command below, replace <PID> with your Python process ID, which you can find by running the !nvidia-smi command.

!kill -9 <PID>

How do we resolve this OOM error?

This problem can be solved by using managed memory allocation, which allows the GPU to access CPU memory in addition to its own. On GH200 systems, the unified memory architecture lets the CPU (up to 480 GB) and GPU (up to 144 GB) share a single address space and access each other’s memory transparently. By configuring the RAPIDS Memory Manager (RMM) library to use managed memory, developers can allocate memory that is accessible from both the GPU and the CPU, allowing workloads to exceed the physical GPU memory limit without manual data transfers.

import rmm
import torch
from rmm.allocators.torch import rmm_torch_allocator
from transformers import pipeline

rmm.reinitialize(managed_memory=True)  # enable managed (unified) memory so allocations can spill into CPU memory
# Instruct PyTorch to route all allocations through the RMM managed-memory allocator
torch.cuda.memory.change_current_allocator(rmm_torch_allocator)

pipe = pipeline("text-generation", model="meta-llama/Llama-3.1-70B")

Running the model-loading command again, we no longer hit an OOM error, since we now have access to a larger memory space.

You can now use the pipeline to send a prompt to the LLM and receive a response.

pipe("Which is the tallest mountain in the world?")

Conclusion

As model sizes continue to grow, loading model parameters onto the GPU has become a significant challenge. In this blog, we explored how a unified memory architecture helps overcome this limitation by allowing access to both CPU and GPU memory without explicit data transfers, making it easier to work with state-of-the-art LLMs on modern hardware.

To learn more about managing CPU and GPU memory, see the RAPIDS Memory Manager documentation.



How to Build AI Systems In House with Outerbounds and DGX Cloud Lepton


It’s easy to underestimate how many moving parts a real-world, production-grade AI system involves. Whether you’re building an agent that combines internal data with external LLMs or a service that generates anime on demand, the system must orchestrate multiple models and dynamic data across online and offline components.

Many AI services, from LLMs to vector databases, are readily accessible via off-the-shelf APIs, enabling rapid prototyping and quick demos. As product requirements evolve and API wrappers become increasingly commoditized, differentiated AI products rely more on proprietary data, thoughtfully designed code and agents, and fine-tuned models. This shift often motivates companies to own and operate key components in-house, which also helps alleviate concerns around security, privacy, and compliance.

In this post, we walk through a realistic use case that demonstrates the benefits of operating the stack in-house. We build a Reddit post stylizer and subreddit recommender powered by tens of thousands of vector indices and an online LLM component. Beyond the application itself, we highlight the infrastructure requirements and show how to leverage the new NVIDIA DGX Cloud Lepton for flexible GPU access. We also demonstrate how to use open-source Metaflow—available as a managed service by NVIDIA Inception program partner Outerbounds—to orchestrate the entire system end-to-end.

How Outerbounds helps build differentiated AI products and services

A key challenge to in-sourcing AI components is the operational cost and complexity involved. Nearly all components—including training, inference, and RAG systems—depend on GPUs and require a sophisticated software stack to run efficiently and at scale. The AI stack is deep: from efficient GPU-centric datacenters, such as Nebius, to optimized models and inference runtimes available as NVIDIA NIM microservices. Then there’s orchestration with developer-friendly APIs, which is where Outerbounds comes in.

Outerbounds provides a secure, cloud-native platform for developing and operating AI systems in your own environment. Built on open source Metaflow, it equips developers with powerful, composable APIs to build, orchestrate, and continuously improve AI products at scale.

How to build AI systems with NVIDIA DGX Cloud Lepton

The GPU cloud landscape has evolved significantly since the early days of the current AI boom. Today, a diverse range of providers, both large and small, offer GPU resources with varying geographic reach and stack depth. Navigating this landscape can be complex, particularly as these clouds must work with your existing hyperscaler infrastructure.

A key benefit of Outerbounds is easy access to diverse compute resources, which removes a major obstacle to building differentiated AI products. From the start, Outerbounds has integrated with NVIDIA Cloud Functions (NVCF) and, more recently, has partnered with Nebius, an NVIDIA Cloud Partner. 

Outerbounds is now enabling early access to NVIDIA DGX Cloud Lepton, which expands access to a growing pool of GPUs through a unified interface.

The following diagram illustrates the new setup in the context of a demo application, featured below.

Figure 1. NVIDIA DGX Cloud Lepton, integrated with the AI stack on Outerbounds and GPUs through Nebius.

A common obstacle to adopting new GPU clouds is the tight coupling of a company’s existing infrastructure, developer operations (DevOps) practices, and security policies to existing cloud environments. Outerbounds integrates with DGX Cloud Lepton and NVIDIA Cloud Partners, including Nebius, which allows you to bring your own policies and run existing code seamlessly alongside your home cloud without migration. It minimizes the risk and effort involved in getting access to new infrastructure.

Develop a Reddit Agent with DGX Cloud Lepton

To illustrate the benefits of the complete stack and to highlight the intricacies of real-world AI, let’s walk through a fun demo application: an agent that helps you choose the most suitable groups and style when posting on Reddit. A screenshot is worth a thousand words:

Screenshot of a Reddit Agent tool. At the top, a text box contains the user’s prompt: “I think ion thrusters are a good option for future Mars missions.” Below, under “Suggested Subreddits,” three subreddit cards are shown: r/ArtemisProgram, r/SpaceXLounge, and r/IsaacArthur. Each card has a short paragraph post tailored to that subreddit, discussing ion thrusters for Mars missions in contexts such as NASA’s Solar Electric Propulsion, pairing with nuclear power, and their role in space logistics.
Figure 2. Example output from the Reddit Agent tool. Each suggestion includes a short, tailored post highlighting the relevance of ion thrusters to that community’s interests.

Although Reddit data is public, we used a preprocessed dataset available on Hugging Face consisting of nearly 100 million posts and comments. (Note that many real-world applications involve private or proprietary data.) In such cases, it is beneficial—and often necessary—to build and operate your own end-to-end stack, including Retrieval-Augmented Generation (RAG), to ensure data privacy and maintain full control over the system, as demonstrated by our example.

The following outlines the system’s high-level architecture and operation:

Diagram of Reddit Agent architecture. At the top, a “Prompt” box leads to databases that match subreddits and comments, then format the content into responses. This process is supported by NVIDIA DGX Cloud Lepton, which contains four components: Embeddings model, Update vector indices, Retrieval model, and Agent deployment. Output flows back to generate the final response. The system is deployed in the cloud and is powered by Nebius.
Figure 3. System architecture of the Reddit Agent deployed by Outerbounds.

Here’s what happens when you enter a prompt in the demo app:

  1. The system converts a prompt to an embedding using the nv-embedqa-e5-v5 model, a part of the NVIDIA NeMo Retriever collection, deployed as an NVIDIA NIM container through DGX Cloud Lepton.
  2. The embedding is matched against a GPU-accelerated vector database called FAISS, which contains centroids for all subreddits.
  3. The embedding is then matched against subreddit-specific vector databases for the top subreddits to retrieve topical samples.
  4. The original prompt and topical samples are then passed to a large LLM, llama-3_1-nemotron-70b-instruct (also deployed as a NIM container), to reformat the prompt to match the style of the chosen subreddits.
  5. The agent itself is deployed as a container over DGX Cloud Lepton.
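A heavily simplified sketch of the retrieval steps (1-3), assuming a helper embed() that calls the deployed embedding endpoint, plus prebuilt FAISS indices centroid_index and subreddit_indices; all of these names are placeholders, not the actual implementation:

import faiss
import numpy as np

def recommend_subreddits(prompt, k=3, samples_per_subreddit=5):
    # Step 1: embed the prompt with the deployed embedding model (placeholder helper)
    query = np.asarray(embed(prompt), dtype="float32").reshape(1, -1)

    # Step 2: match the embedding against the subreddit centroid index
    _, top_subreddits = centroid_index.search(query, k)

    # Step 3: pull topical samples from each matching subreddit-specific index
    samples = {}
    for sid in top_subreddits[0]:
        _, sample_ids = subreddit_indices[sid].search(query, samples_per_subreddit)
        samples[sid] = sample_ids[0]
    return samples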

Additionally, a workflow is scheduled to update vector indices. Thanks to an integration between DGX Cloud and Metaflow, you can execute a task responsible for building the indices as a part of a Metaflow workflow by adding the following decorators:

@conda(packages={'faiss-gpu-cuvs': '1.11.0'}, python='3.11')
@nvidia(gpu=1, gpu_type='NEBIUS_H100')
@step
def build_indices(self):
    ...

Notably, as illustrated by the @conda decorator above, you can take care of the software supply chain efficiently, ensuring that all necessary dependencies, including NVIDIA CUDA drivers, are available for the tasks—no matter what execution environment you choose to target.

Produce lightning fast embeddings and vector indices

Our indexing workflow starts with a dataset containing nearly 100 million posts and comments. After removing comments with fewer than 10 tokens and subreddits with fewer than 100 posts, the dataset contains 50 million passages, spread over 30,000 subreddits.

As a special feature of this example, instead of building a single vector database, the system constructs a separate vector database for each subreddit—over 30,000 vector databases in total—matching samples specific to the style of each community. In addition, the system builds a database for centroids of each community to find the most suitable communities for the prompt.

Due to the large scale of the dataset, the system needs to:

  1. Produce a large set of embeddings in a reasonable amount of time as a batch process.
  2. Index the embeddings quickly, producing tens of thousands of database shards.
  3. Produce an embedding and matching entries with low latency during prompting.

A major benefit of DGX Cloud Lepton is that it provides access to a deep pool of GPU resources across environments. Taking advantage of this feature, the system can parallelize the processing of embeddings—orchestrated by a workflow on Outerbounds—hitting the embedding model across multiple NVIDIA H100 GPUs. The service is able to handle parallel workers, scaling almost linearly:

Figure 4. Embeddings throughput as a function of the number of parallel workers.

Check out this site for further benchmark results using the nv-embedqa-e5-v5 model, as well as other embedding models from NVIDIA on a variety of GPU infrastructures. The resulting dataset of 50 million 1024-dimensional embeddings is nearly 200GB, so Metaflow’s optimized IO path comes in handy when moving the matrix around.

The system achieves very high performance by leveraging the new NVIDIA cuVS-accelerated FAISS library running on an NVIDIA H100 GPU: It can index 10 million embeddings in 80 seconds. In this case, producing 30,000 indices, many of which are small, was 2.5x faster on a single H100 compared to a massive CPU instance, r5.24xlarge, leveraging up to 60 CPU cores in parallel.
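As a minimal sketch of building one such index on the GPU with FAISS (assuming embeddings and query_embeddings are float32 NumPy arrays with 1,024 columns; in the real workflow this is repeated for each subreddit shard):

import faiss

d = 1024                                  # embedding dimensionality
res = faiss.StandardGpuResources()        # GPU resources reused across index builds

index = faiss.GpuIndexFlatL2(res, d)      # exact-search index built and held on the GPU
index.add(embeddings)                     # index this shard's embeddings

distances, neighbors = index.search(query_embeddings, 5)   # top-5 nearest passages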

Thanks to Nebius, the GPU-accelerated version—using a single H100—is over 2x faster while being 2x cheaper than the CPU instance.

How to assemble building blocks into production-ready AI systems with Outerbounds

The Reddit Recommender Agent illustrates the structure of a typical AI system, spanning:

  • Various LLMs: In this case, an embedding and a retrieval model.
  • Agent deployments: Stateful workers that call LLMs and take actions accordingly.
  • Batch processing: Such as building vector indices and data processing.

You need to orchestrate and operate all these components as a cohesive system, safely and securely deployed within your governance boundary. Importantly, your development workflows and DevOps practices must support safe iteration across the entire system, enabling A/B testing of models, agent versions, and datasets, with detailed tracking of all assets, observation, and evaluation of the results.

Outerbounds addresses these needs by enabling both online agents and offline workflows on a single platform. You can build AI systems with state-of-the-art components, like NIM containers and GPU-accelerated vector indices, while accessing the latest accelerated computing through direct integrations with providers like Nebius or accessing a deep pool of resources via DGX Cloud Lepton. 

Crucially, you can access these resources through simple Python APIs, making the experience as easy as calling off-the-shelf APIs. That helps keep simple things simple while also making sophisticated solutions possible.

To give you an idea, here’s what a live deployment of a particular version of the Reddit Agent looks like on Outerbounds:

Screenshot of the Outerbounds platform showing the “Reddit Recommender” deployment page. The agent is active and deployed to an NVIDIA H100 GPU compute pool in Nebius, using NVIDIA NIM MessageFormatter and Embeddings models. The interface lists components for Code, Data, and Model, along with 2/64 active workers. A console log displays recent subreddit suggestions for example prompts, such as recommending r/ArtemisProgram, r/Spaceflight, and r/IsaacArthur for a Mars ion thruster discussion. The left sidebar contains navigation links for project assets, components, deployments, workflows, and platform settings.
Figure 5. Outerbounds deployment interface for the Reddit Agent.

As shown in Figure 5 above, Outerbounds keeps track of all the key assets, including code, data, and models that form the end-to-end solution. This is especially useful if you have multiple people working together (or multiple AI co-pilots), as it allows you to safely deploy any number of concurrent variants, each with their own assets, as isolated branched deployments.

Because of these tracking capabilities, you can easily evaluate variants against each other to, for instance, compare the performance of off-the-shelf APIs to custom models.

How to develop differentiated AI systems with full ownership

Building differentiated AI products requires a complete stack from scalable GPU compute to a developer-friendly software layer. Enterprise deployments also need to account for factors like geography, compliance, and data residency, making infrastructure choices important.

DGX Cloud Lepton offers a unified interface to multiple GPU providers, allowing you to match compute demand to the needs of your use case. Outerbounds builds on this foundation, providing the tools to develop and operate AI applications efficiently and reliably.

If you ask the Reddit Agent to highlight the above value proposition in the style of r/dailybargains, which is a popular subreddit for deal hunters, you may get this answer about a promotion Outerbounds is running:

Outerbounds is offering free credits to run workloads on NVIDIA H100 GPUs via DGX Cloud Lepton. You also get access to its enterprise-ready AI platform that helps you build, deploy, and iterate on custom models and agents in your own cloud.

To start testing these capabilities in your environment, get started at Outerbounds. And claim free GPU credits on Nebius’s infrastructure to power your trial.

You can also go deeper with DGX Cloud Lepton in NVIDIA’s Developer Forums or learn more about the NVIDIA Inception program to see how NVIDIA supports AI startups all over the world.



NVIDIA Blackwell Ultra Sets New Inference Records in MLPerf Debut


As large language models (LLMs) grow larger, they get smarter, with open models from leading developers now featuring hundreds of billions of parameters. At the same time, today’s leading models are also capable of reasoning, which means that they generate many intermediate reasoning tokens before delivering a final response to the user. The combination of these two trends—larger models that think using more tokens—drives the need for significantly higher compute performance. 

Delivering the highest performance on production workloads takes a state-of-the-art technology stack—spanning chips, systems, and software—and an expansive developer ecosystem that is constantly building on that stack. 

MLPerf Inference v5.1 is the latest version of the MLPerf Inference industry standard benchmark. With benchmark rounds held twice per year, the benchmark features many tests of AI inference performance and is regularly updated with new models and scenarios. This round features:

  • DeepSeek-R1 – a popular 671-billion parameter mixture-of-experts (MoE) reasoning model, developed by DeepSeek. In the server scenario, the time-to-first-token (TTFT) threshold is 2 seconds with a 12.5 tokens/second/user (TPS/user) target. All TPS/user targets are 99th percentile, meaning that 99% of tokens meet or exceed that TPS/user speed.
  • Llama 3.1 405B – MLPerf Inference v5.1 adds a new interactive scenario for the largest of the Llama 3.1 series of models, providing a faster 12.5 TPS/user threshold with a shorter 4.5 second TTFT requirement compared to the existing server scenario. 
  • Llama 3.1 8B – an 8-billion parameter member of the Llama 3.1 series of models with offline, server (2 second TTFT, 10 TPS/user), and interactive (0.5 second TTFT, 33 TPS/user) scenarios. This replaces the GPT-J benchmark used in prior rounds. 
  • Whisper – a popular speech recognition model that recently saw nearly 5 million downloads in a month on HuggingFace. This replaces RNN-T, which was featured in prior editions of the MLPerf Inference benchmark suite. 

This round, NVIDIA submitted the first results using the new Blackwell Ultra architecture, announced in March. It came just six months after Blackwell made its debut in the available category in MLPerf Inference v5.0, setting new inference performance records. Additionally, the NVIDIA platform set new performance records on all newly added benchmarks this round—DeepSeek-R1, Llama 3.1 405B, Llama 3.1 8B, and Whisper—and continues to hold per-GPU performance records on all other MLPerf inference benchmarks.

MLPerf Inference Per-Accelerator Records

| Benchmark | Offline | Server | Interactive |
| --- | --- | --- | --- |
| DeepSeek-R1 | 5,842 tokens/second/GPU | 2,907 tokens/second/GPU | ** |
| Llama 3.1 405B | 224 tokens/second/GPU | 170 tokens/second/GPU | 138 tokens/second/GPU |
| Llama 2 70B 99.9% | 12,934 tokens/second/GPU | 12,701 tokens/second/GPU | 7,856 tokens/second/GPU |
| Llama 2 70B 99% | 13,015 tokens/second/GPU | 12,701 tokens/second/GPU | 7,856 tokens/second/GPU |
| Llama 3.1 8B | 18,370 tokens/second/GPU | 16,099 tokens/second/GPU | 15,284 tokens/second/GPU |
| Stable Diffusion XL | 4.07 samples/second/GPU | 3.59 queries/second/GPU | ** |
| Mixtral 8x7B | 16,099 tokens/second/GPU | 16,131 tokens/second/GPU | ** |
| DLRMv2 99% | 87,228 samples/second/GPU | 80,515 samples/second/GPU | ** |
| DLRMv2 99.9% | 48,666 samples/second/GPU | 46,259 queries/second/GPU | ** |
| Whisper | 5,667 tokens/second/GPU | ** | ** |
| R-GAT | 81,404 samples/second/GPU | ** | ** |
| Retinanet | 1,875 samples/second/GPU | 1,801 queries/second/GPU | ** |

Table 1. Performance records per GPU based on submissions powered by the NVIDIA platform.

MLPerf Inference v5.0 and v5.1, Closed Division. Results retrieved from www.mlcommons.org on September 9, 2025. NVIDIA platform results from the following entries: 5.0-0072, 5.1-0007, 5.1-0053, 5.1-0079, 5.1-0028, 5.1-0062, 5.1-0086, 5.1-0073, 5.1-0008, 5.1-0070, 5.1-0046, 5.1-0009, 5.1-0060, 5.1-0072, 5.1-0071, 5.1-0069. Per-chip performance derived by dividing total throughput by the number of reported chips. Per-chip performance is not a primary metric of MLPerf Inference v5.0 or v5.1. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

NVIDIA also made extensive use of NVFP4 acceleration across all DeepSeek-R1 and Llama model submissions using the Blackwell and Blackwell Ultra architectures. 

In this post, we take a closer look at these performance results and the full-stack technologies that enabled them. 

Blackwell Ultra sets reasoning records in MLPerf debut

This round, NVIDIA submitted results in the available category using the GB300 NVL72 rack-scale system, the first-ever MLPerf submissions using the Blackwell Ultra architecture. Blackwell Ultra builds upon the many advances in the NVIDIA Blackwell architecture, with several key enhancements:

  • 1.5x higher peak NVFP4 AI compute
  • 2x higher attention-layer compute
  • 1.5x higher HBM3e capacity

Compared to the GB200 NVL72 submission, GB300 NVL72 delivered up to 45% higher performance per GPU, setting the standard on the new DeepSeek-R1 benchmark. And compared to unverified results collected on a Hopper-based system, Blackwell Ultra delivered about 5x higher throughput per GPU—translating into significantly higher AI factory throughput and much lower cost per token.

DeepSeek-R1 Performance

| Architecture | Offline | Server |
| --- | --- | --- |
| Hopper | 1,253 tokens/second/GPU | 556 tokens/second/GPU |
| Blackwell Ultra | 5,842 tokens/second/GPU | 2,907 tokens/second/GPU |
| Blackwell Ultra advantage | 4.7x | 5.2x |

Table 2. Per-GPU performance on DeepSeek-R1.

MLPerf Inference v5.1, Closed. Blackwell Ultra results based on results in entry 5.1-0072. Hopper results not verified by MLCommons Association. Per-GPU performance is not a primary metric of MLPerf Inference v5.1 and is calculated by dividing reported throughput by the number of reported accelerators. Verified results retrieved from www.mlcommons.org on September 9, 2025. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

These results were enabled by the world-class architectural capabilities of Blackwell Ultra and the highly optimized and versatile NVIDIA inference stack. Here are some of the key technologies powering the NVIDIA Blackwell Ultra submissions on DeepSeek-R1:

Extensive use of NVFP4

The base DeepSeek-R1 model incorporates weights stored in FP8 precision. Using a quantization recipe developed by NVIDIA and included as part of the NVIDIA TensorRT Model Optimizer library, the majority of the DeepSeek-R1 weights were successfully quantized to NVFP4, a four-bit floating point format developed by NVIDIA and accelerated by Blackwell and Blackwell Ultra Tensor Cores. This optimization led to reduced model size and the ability to use the higher-throughput NVFP4 compute built into Blackwell and the even higher throughput in Blackwell Ultra—all while meeting the strict target accuracy of the benchmark. 

FP8 key-value cache 

In the base DeepSeek-R1 model, the key-value (KV) cache is stored in the BF16 data format. Once again, using both TensorRT Model Optimizer and TensorRT-LLM inference libraries, the KV-cache was quantized to FP8 precision, significantly reducing its memory footprint and enabling higher performance. 

New parallelism techniques

The unique architecture of the DeepSeek-R1 model means that traditional tensor parallel and pipeline parallel techniques used for multi-GPU execution were insufficient for maximum performance. For the NVIDIA DeepSeek-R1 submissions, expert parallelism was used for the MoE portion of model execution, and data parallelism was used for the attention mechanism. This required redesigned MoE and attention kernels, as well as new communication kernels to perform gather and scatter operations. 

With this new parallelism technique, balancing the context query workload across all GPUs is critical. The challenge is maintaining both high overall throughput and low first-token latency. We developed Attention Data Parallelism Balance (ADP Balance), a technique that intelligently distributes context queries to optimize for both of these metrics. This ensures every GPU remains productive, preventing bottlenecks and delivering a responsive, high-speed experience for all users. For a detailed technical explanation, please refer to our TensorRT-LLM GitHub page.

CUDA Graphs

During iterations of the inference process that were decode-only, NVIDIA submissions use CUDA Graphs to record and replay GPU operations using a single CPU operation. This reduces CPU overhead, leading to higher performance. 
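Conceptually, this follows the standard PyTorch CUDA Graphs capture-and-replay pattern; a minimal, generic sketch (not the actual submission code) looks like this:

import torch

model = torch.nn.Linear(1024, 1024).cuda()
static_input = torch.randn(8, 1024, device="cuda")

# Warm up on a side stream before capture
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        static_output = model(static_input)
torch.cuda.current_stream().wait_stream(s)

# Capture one decode-style step into a graph, then replay it with a single CPU-side launch
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_output = model(static_input)

static_input.copy_(torch.randn(8, 1024, device="cuda"))  # update inputs in place
g.replay()  # re-runs the captured GPU work without re-launching each kernel from the CPU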

Disaggregated serving: Blackwell performance on Llama 3.1 405B Interactive

The newly added interactive scenario for the Llama 3.1 405B benchmark introduces more stringent TTFT and TPS/user constraints compared to the server scenario, at more than 2x the output token rate and 1.3x faster TTFT. Delivering strong performance on this challenging new benchmark scenario required the application of many state-of-the-art technologies in the NVIDIA Blackwell platform and NVIDIA inference software stack. 

For serving very large models like Llama 3.1 405B at interactive token rates, sharding the model across many GPUs brings more aggregate compute to bear, enabling optimal throughput while meeting latency requirements. To support the immense communication needs of large-model, multi-GPU inference, both the NVIDIA Blackwell and Blackwell Ultra platforms support all-to-all communication via the NVLink fabric at 1,800 GB/s between 72 GPUs, for a total aggregate bandwidth of 130 TB/s.

To meet these requirements while delivering maximum throughput, NVIDIA submissions using the GB200 NVL72 rack-scale system on this benchmark also employed disaggregated serving. This implementation contributed significantly to the nearly 1.5x increase in throughput per GPU compared to traditional aggregated serving using in-flight batching on a DGX B200 system. That’s a greater than 5x cumulative improvement compared to in-flight batching results collected on a DGX H200 system.  

On the left is an image of a GB200 NVL72 server rack, with subsets of the compute trays highlighted, some for decode and some for prefill. NVLink Switch trays are also highlighted. On the right, a bar chart shows that Blackwell with Dynamo delivers more than 5x throughput per GPU compared to Hopper without Dynamo. The Blackwell with Dynamo result is in the open division, while the Hopper without Dynamo result is in the closed division.
Figure 1. Blackwell with disaggregated serving delivers more than 5x Hopper performance on Llama 3.1 405B interactive.

Hopper results from the 8-GPU HGX H200 submission in entry 5.1-0075. Blackwell baseline from the result in entry 5.1-0069 using DGX B200 with 8 GPUs. Blackwell with disaggregated serving using GB200 NVL72 with 72 GPUs from entry 5.1-0071. Performance is per GPU, calculated by dividing total reported throughput by accelerator count. Performance per GPU is not a primary metric of MLPerf Inference. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

Traditional LLM deployments typically co-locate the two main stages of inference—context and generation—on the same GPU or node. However, these phases have fundamentally different characteristics: Context is token-parallel and compute-intensive, while generation is autoregressive and latency-sensitive. They also operate under distinct service level agreements—TTFT for context, and intertoken latency (ITL) for generation—which call for different model parallelism strategies. Co-locating them often results in inefficient resource use, particularly for long input sequences.

Disaggregated serving decouples context and generation across separate GPUs or nodes, enabling independent optimization for each phase. This approach allows different parallelism techniques and flexible GPU allocation, improving overall system efficiency.

The NVIDIA Dynamo inference framework also provides support for disaggregated serving. The latest release of Dynamo also features many additional capabilities for inference deployments beyond disaggregated serving, including SLA-based autoscaling, real-time LLM observability metrics, and fault tolerance. Learn more here. 

Key takeaways

NVIDIA continues to demonstrate leading inference performance across a breadth of AI models and scenarios, with outstanding results on both newly added and existing benchmarks. The debut submission of the GB300 NVL72 rack-scale system based on the Blackwell Ultra GPU architecture delivered a large boost for reasoning inference just six months after the first available-category submission of the Blackwell-based GB200 NVL72.

Additionally, the Llama 3.1 405B interactive submission using disaggregated serving demonstrated how state-of-the-art serving techniques can yield significant increases in inference throughput.

To reproduce the great results from this blog, check out the MLPerf Inference v5.1 GitHub repository here. 

And to further accelerate inference performance, NVIDIA also unveiled Rubin CPX—a processor purpose-built to accelerate long context processing. To learn more about this new Rubin CPX, see this technical blog.



NVIDIA Rubin CPX Accelerates Inference Performance and Efficiency for 1M+ Token Context Workloads

[ad_1]

Inference has emerged as the new frontier of complexity in AI. Modern models are evolving into agentic systems capable of multi-step reasoning, persistent memory, and long-horizon context, enabling them to tackle complex tasks across domains such as software development, video generation, and deep research. These workloads place unprecedented demands on infrastructure, introducing new challenges in compute, memory, and networking that require a fundamental rethinking of how inference is scaled and optimized.

Among these challenges, processing massive context for certain classes of workloads has become increasingly critical. In software development, for example, AI systems must reason over entire codebases, maintain cross-file dependencies, and understand repository-level structure, transforming coding assistants from autocomplete tools into intelligent collaborators. Similarly, long-form video and research applications demand sustained coherence and memory across millions of tokens. These requirements are pushing the limits of what current infrastructure can support.

To address this shift, the NVIDIA SMART framework provides a path forward, optimizing inference across scale, multidimensional performance, architecture, ROI, and the broader technology ecosystem. It emphasizes full-stack, disaggregated infrastructure that enables efficient allocation of compute and memory resources. Platforms such as NVIDIA Blackwell and NVIDIA GB200 NVL72, combined with NVFP4 for low-precision inference and open source software such as NVIDIA TensorRT-LLM and NVIDIA Dynamo, are redefining inference performance across the AI landscape.

This post explores the next evolution in disaggregated inference infrastructure and introduces NVIDIA Rubin CPX, a purpose-built GPU designed to meet the demands of long-context AI workloads with greater efficiency and ROI.

Disaggregated inference: a scalable approach to AI complexity

Inference consists of two distinct phases, the context phase and the generation phase, each placing fundamentally different demands on infrastructure. The context phase is compute-bound, requiring high-throughput processing to ingest and analyze large volumes of input data and produce the first output token. In contrast, the generation phase is memory-bandwidth-bound, relying on fast memory transfers and high-speed interconnects, such as NVLink, to sustain token-by-token output performance.

Disaggregated inference enables these phases to be processed independently, allowing targeted optimization of compute and memory resources. This architectural shift improves throughput, reduces latency, and enhances overall resource utilization (Figure 1).

Diagram of a disaggregated inference pipeline. Documents, databases, and video feed a context processor; its output goes into a key-value cache that is read by a generation node to produce results. A note indicates that GPU A is optimized for long-context processing, while GPU B delivers strong TCO for both context and generation.
Figure 1. Optimizing inference by aligning GPU capabilities with context and generation workloads

However, disaggregation introduces a new layer of complexity, requiring precise coordination across low-latency KV cache transfers, LLM-aware routing, and efficient memory management. NVIDIA Dynamo serves as the orchestration layer for these components, and its capabilities played a key role in the latest MLPerf inference results. Learn how disaggregation with Dynamo on GB200 NVL72 set new performance records.

To capitalize on the benefits of disaggregated inference, particularly in the compute-intensive context phase, dedicated acceleration is essential. To address this need, NVIDIA is introducing the Rubin CPX GPU, a purpose-built solution designed to deliver high-throughput performance for high-value, long-context inference workloads while integrating seamlessly into disaggregated infrastructure.

Rubin CPX: built to accelerate long-context processing

The Rubin CPX GPU is designed to boost long-context performance, complementing existing infrastructure while delivering scalable efficiency and maximizing ROI in context-aware inference deployments. Built on the Rubin architecture, Rubin CPX delivers breakthrough performance for the compute-intensive context phase of inference. It features 30 petaflops of NVFP4 compute, 128 GB of GDDR7 memory, hardware support for video decoding and encoding, and 3x attention acceleration (compared with NVIDIA GB300 NVL72).

Optimized for processing long sequences efficiently, Rubin CPX is essential for high-value inference use cases such as software application development and HD video generation. Designed to complement existing disaggregated inference architectures, it boosts throughput and responsiveness while maximizing ROI for large-scale generative AI workloads.

Rubin CPX works in concert with NVIDIA Vera CPUs and Rubin GPUs for generation-phase processing, forming a complete, high-performance serving solution for long-context use cases. The NVIDIA Vera Rubin NVL144 CPX rack integrates 144 Rubin CPX GPUs, 144 Rubin GPUs, and 36 Vera CPUs to deliver 8 exaflops of NVFP4 compute (7.5x more than GB300 NVL72), alongside 100 TB of high-speed memory and 1.7 PB/s of memory bandwidth in a single rack.

Using NVIDIA Quantum-X800 InfiniBand or Spectrum-X Ethernet, paired with NVIDIA ConnectX-9 SuperNICs and orchestrated by the NVIDIA Dynamo platform, Vera Rubin NVL144 CPX is built to power the next wave of long-context AI workloads.

At scale, the platform can deliver 30x to 50x return on investment, translating into as much as $5 billion in revenue from a $100 million capex investment, setting a new benchmark for inference economics. By combining disaggregated infrastructure, dedicated acceleration, and full-stack orchestration, Vera Rubin NVL144 CPX redefines what is possible for enterprises building the next generation of generative AI applications.
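
The ROI claim is simple arithmetic, sketched below with the figures quoted above.

# Revenue implied by the quoted ROI multiples on a $100M capex investment.
capex_usd = 100e6
for multiple in (30, 50):
    revenue = multiple * capex_usd
    print(f"{multiple}x ROI on $100M capex -> ${revenue / 1e9:.0f}B revenue")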

The image on the left shows the NVIDIA Vera Rubin NVL144 CPX rack, which integrates 144 Rubin CPX GPUs to accelerate context-phase processing, 144 Rubin GPUs connected via NVLink for generation-phase processing, and 36 Vera CPUs, all housed in a single Oberon rack. The image on the right shows a single tray from the rack, containing 2 Vera CPUs, 4 Rubin GPUs, and 8 Rubin CPX processors, illustrating the modular and scalable system design.
Figure 2. NVIDIA Vera Rubin NVL144 CPX rack and tray featuring Rubin context GPUs (Rubin CPX), Rubin GPUs, and Vera CPUs

Summary

The NVIDIA Rubin CPX GPU and NVIDIA Vera Rubin NVL144 CPX rack exemplify the SMART platform philosophy: delivering scalability, multi-dimensional performance, and ROI through architectural innovation and ecosystem integration. Powered by NVIDIA Dynamo and built for massive-context workloads, they set a new standard for full-stack AI infrastructure, creating new possibilities for workloads including advanced software coding and generative video.

Learn more about NVIDIA Rubin CPX.

[ad_2]


How to Connect Distributed Data Centers into Large AI Factories with Scale-Across Networking

[ad_1]

Scaling AI is extraordinarily complex, and new techniques in training and inference continually demand more from data centers. While data center capabilities are scaling rapidly, data center infrastructure is subject to fundamental physical limitations that do not affect algorithms and models. Power availability, cooling capacity, and space constraints place limits on the physical footprint of an AI factory. To keep growing, new data centers are built, and long-distance connectivity becomes a factor in pooling these resources to work together on a single training or disaggregated inference workload.

Traditionally, when connecting data centers over long distances with Ethernet built from off-the-shelf merchant silicon, the primary goal was simply to ensure that data successfully made it to its destination. Because distances can be long and latency high, the likelihood of congestion is also high, and its impact can be extreme.

To mitigate these challenges and prevent packets from being dropped, off-the-shelf Ethernet vendors created solutions that use deep packet buffers capable of absorbing large bursts of network traffic. While these deep-buffer switches are a workable solution for service providers and long-haul telecommunications, they introduce problems for AI.

Specifically, switches with deep buffers inherently suffer from higher latency. In addition, when a buffer starts to fill, it must eventually drain. For AI workloads, these events are unpredictable, causing significant jitter, or variance in data delivery. The high latency and unpredictability of this shock-absorber technique are problematic for training and disaggregated inference performance, which are synchronous in nature and require predictable performance from the network.

This post explains how NVIDIA Spectrum-XGS Ethernet technology for scale-across networking enables inter-data-center connectivity with the high performance required for AI.

What is scale-across networking?

Scale-across networking is a new category of AI compute fabric connectivity that can be thought of as a new dimension, orthogonal to the existing scale-up and scale-out connectivity options. With Spectrum-XGS Ethernet for scale-across networking, multiple data centers of varying sizes and distances can be unified into one giant AI factory. For the first time, the network can deliver the performance needed for a single, large-scale AI training or inference job to span geographically separated data centers.

Diagram showing multiple data centers connected together with scale-up, scale-out, and scale-across networking.
Figure 1. The three types of networking required for AI are scale-up, scale-out, and scale-across

How does NVIDIA Spectrum-XGS Ethernet enable scale-across networking?

NVIDIA Spectrum-XGS Ethernet is a new technology addition to the NVIDIA Spectrum-X Ethernet platform. It is based on the same hardware combination of Spectrum-X Ethernet switches and ConnectX-8 SuperNICs, and it leverages the same software stack and libraries used for scale-out connectivity within the data center.

With Spectrum-XGS Ethernet, the connectivity is between AI factories over long distances; that is, greater than 500 meters. This can mean connectivity between buildings on a campus, or across tens or hundreds of miles, spanning cities or even states and countries. To make scale-across connectivity viable, the algorithms responsible for ensuring effective bandwidth and performance isolation must evolve.

What is the role of distance-aware algorithms in scale-across networking?

One of the challenges with moving data over long distances is the implication of increased latency, even for data traversing optical fiber in the form of light. Data propagates across a strand of glass at a rate of 5 nanoseconds per meter. That means traveling 1 kilometer takes 5 microseconds. These numbers may seem small in absolute terms, but for GPU-to-GPU communication, every microsecond counts.
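
The arithmetic is straightforward, as the short sketch below shows.

# Propagation delay in optical fiber: roughly 5 ns per meter of glass.
NS_PER_METER = 5

def fiber_delay_us(distance_km: float) -> float:
    return distance_km * 1000 * NS_PER_METER / 1000  # one-way, in microseconds

for km in (1, 10, 100):
    print(f"{km:>4} km of fiber = {fiber_delay_us(km):,.0f} µs one-way")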

Spectrum-XGS Ethernet features modified telemetry-based congestion control and adaptive routing algorithms that are optimized around the distance between communicating devices. Whenever a connection is initiated, the network notes whether the two devices are located together inside the same data center or not.

This helps the switches determine the best approach to load balancing for adaptive routing, and informs the SuperNICs how to manage injection rates for congestion control. At the network level, this allows Spectrum-XGS Ethernet to handle communication holistically without incurring additional latency.
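
Conceptually (this is an illustration only, not Spectrum-XGS internals, and the parameter names are made up), the behavior resembles the sketch below: the fabric classifies a peer as local or long-haul and selects different pacing parameters for each case.

# Conceptual illustration: classify a link by measured RTT and pick a
# congestion-control profile accordingly. Names and values are hypothetical.
def classify_link(rtt_us: float, intra_dc_threshold_us: float = 50.0) -> str:
    return "intra-dc" if rtt_us < intra_dc_threshold_us else "scale-across"

def pacing_profile(rtt_us: float) -> dict:
    if classify_link(rtt_us) == "intra-dc":
        return {"injection_rate": "aggressive", "telemetry_window_us": 10}
    # Longer feedback loops call for more conservative injection and a wider
    # telemetry averaging window so bursts do not overrun in-flight capacity.
    return {"injection_rate": "conservative", "telemetry_window_us": 200}

print(pacing_profile(rtt_us=12))    # devices in the same building
print(pacing_profile(rtt_us=500))   # devices tens of kilometers apart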

Some of the key benefits of Spectrum-XGS Ethernet technology for scale-across networking include:

  • Unified network architecture: Both scale-out Spectrum-X Ethernet and scale-across Spectrum-XGS Ethernet are based on the same hardware, software, and libraries. This leads to a unified approach to workload management and network operations that is not possible with off-the-shelf Ethernet.
  • End-to-end telemetry-based congestion control: The unified architecture also enables a global approach to network visibility. With comprehensive telemetry data from the network both inside and outside the data center, telemetry-based congestion management can be handled without the need for deep-buffer switching.
  • Intelligent, auto-adjusting load balancing: The Spectrum-X Ethernet AI fabric is AI-workload-aware and NVIDIA Collective Communications Library (NCCL)-aware, with the ability to account and compensate for network traffic patterns that can vary by site, dynamically adjusting thresholds and limits to ensure the highest performance.
  • Minimized latency for scale-across workloads: Spectrum-XGS Ethernet is tuned to deliver predictable results. This allows the network to account and compensate for data flows traveling over long distances, further reducing latency penalties without introducing the risk of jitter from deep buffers.
  • Elastic scale capacity: Because the same hardware can be used for both scale-out and scale-across, network resources can be reallocated to support intra- or inter-data-center traffic. Off-the-shelf shallow-buffer Ethernet switches cannot be repurposed for long-distance connectivity.

What are the performance benefits of NVIDIA Spectrum-XGS Ethernet?

To demonstrate the impact of NVIDIA Spectrum-XGS Ethernet on scale-across performance, NVIDIA engineers ran NCCL primitives across multiple sites at a distance of 10 km and compared the results with off-the-shelf Ethernet. The results, shown in Figure 2 below, are significant:

Graph comparing NCCL all-reduce performance between Spectrum-XGS Ethernet and off-the-shelf Ethernet across message sizes from 128 KB to 16 GB, showing up to 1.9x better performance with Spectrum-XGS Ethernet.
Figure 2. NVIDIA Spectrum-XGS Ethernet improves performance by up to 1.9x compared to off-the-shelf Ethernet

NVIDIA Spectrum-XGS Ethernet delivers up to 1.9x higher NCCL all-reduce bandwidth than off-the-shelf Ethernet. The largest speedups occur at larger message sizes, which are the most common with AI training workloads. These NCCL performance improvements translate into faster job completion times for AI applications.
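
To see how a faster collective translates into job time, the sketch below applies a 1.9x all-reduce speedup to a training step whose compute/communication split is a made-up assumption for illustration.

# Back-of-the-envelope effect of a faster all-reduce on per-step time.
def step_time(compute_s: float, comm_s: float, comm_speedup: float) -> float:
    return compute_s + comm_s / comm_speedup

compute_s, comm_s = 0.6, 0.4  # hypothetical 60/40 split per training step
baseline = step_time(compute_s, comm_s, 1.0)
improved = step_time(compute_s, comm_s, 1.9)
print(f"step time: {baseline:.3f}s -> {improved:.3f}s "
      f"({baseline / improved:.2f}x faster end to end)")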

How does scale-across networking improve ROI for AI factories?

NVIDIA Spectrum-XGS Ethernet increases the flexibility of AI infrastructure. By introducing technology that allows data centers to communicate over any distance without performance degradation, Spectrum-XGS Ethernet creates a common architecture shared between scale-out and scale-across networking.

Data centers built on Spectrum-XGS Ethernet can be seamlessly combined to operate as a single system, no matter how far apart they are. This allows mission-critical AI infrastructure to pool resources and consistently deliver value for advanced AI workloads.

To learn more about the technical innovations underlying NVIDIA Spectrum-X Ethernet, see the NVIDIA Spectrum-X networking platform architecture.

[ad_2]


Maximizing Low-Latency Networking Performance for Financial Services with NVIDIA Rivermax and NEIO FastSocket

[ad_1]

Ultra-low latency and reliable packet delivery are critical requirements for modern applications in sectors such as the financial services industry (FSI), cloud gaming, and media and entertainment (M&E). In these domains, microseconds of delay or a single dropped packet can have a significant impact—causing financial losses, degraded user experiences, or visible glitches in media streams.

Why low-latency and dropless packet delivery matter

The following are common use cases where low-latency solutions are required:

  • FSI: Algorithmic trading and market data distribution demand deterministic, low-latency networking. Delays or packet losses can result in missed trading opportunities or incorrect decision-making.
  • Cloud gaming: Cloud gaming platforms must deliver real-time rendering and input feedback. High latency or packet drops lead to lag, poor responsiveness, and user dissatisfaction, which is especially problematic given the rapid growth of the cloud gaming market.
  • M&E: Professional live video production and broadcast workflows (e.g., SMPTE ST 2110) require precise timing and zero packet loss to avoid visible artifacts and ensure compliance with industry standards.

For these use cases, achieving high packet rates, sustaining bandwidth at line rates, and minimizing or eliminating packet drops are essential. Traditional networking stacks struggle to meet these demands, particularly as network speeds scale to 10/25/50/100/200 GbE and beyond.

NVIDIA Rivermax: a high-performance streaming solution

NVIDIA Rivermax is a highly optimized IP-based cross-platform software library designed to deliver exceptional performance for media and data streaming applications. By using advanced NVIDIA GPU-accelerated computing technologies and high-performance network interface cards (NICs), Rivermax achieves a unique combination of ultra-high throughput, precise packet pacing in hardware, minimal latency, and low CPU utilization. This makes it ideal for demanding workloads where efficiency and responsiveness are critical.

A block diagram of the layers supporting NVIDIA Rivermax and CUDA-based products. The foundation is built upon NVIDIA hardware, including GPUs, NICs, DPUs, and CPUs. The next layer highlights the core services provided by this hardware, such as GPUDirect, timing, and networking services. Positioned above these are the NVIDIA CUDA and Rivermax SDKs. The top layer illustrates various low-latency solution markets that leverage these underlying technologies.
Figure 1. Rivermax software stack overview

Rivermax’s innovative architecture is built on several key technologies:

  • Kernel bypass: By bypassing the traditional OS kernel, it minimizes overhead and enables direct data transfer between user-space memory and the NIC. This reduces latency and maximizes throughput for high-performance streaming.
  • Zero-copy architecture: Rivermax eliminates unnecessary memory copies by transferring data directly between the GPU and NIC. This approach reduces PCIe transactions, lowers CPU usage, and accelerates data processing.
  • GPU acceleration: Using NVIDIA GPUDirect technology, Rivermax facilitates data movement between the GPU and NIC without the CPU. This offloading mechanism ensures efficient resource utilization while maintaining high throughput.
  • Hardware-based packet pacing: Rivermax ensures precise timing for data streams by implementing packet pacing directly in hardware. This is essential for applications requiring strict compliance with standards like SMPTE ST 2110-21 for professional media workflows.

The image illustrates the Rivermax kernel bypass architecture. The control path for managing the connection between the Rivermax software and the NVIDIA network card uses a standard method, involving the socket API, the kernel's network stack, and the network card's kernel driver. In contrast, the data path completely bypasses the kernel.
Figure 2. Rivermax kernel bypass architecture

NEIO FastSocket based on Rivermax technology: reliable low-latency sockets

As network speeds have rapidly increased, traditional socket-based communication struggles to keep pace, especially at 10/25 GbE and higher. FastSockets from NEIO Systems Ltd. is a flexible middleware library designed for high-performance UDP and TCP communications, overcoming these limitations. Its key focus is to deliver dropless technology with the lowest latency and highest bandwidth/throughput.

The image contrasts the data flow paths between kernel-based I/O and FastSockets based on the Rivermax kernel bypass architecture, highlighting how bypassing the kernel's processing layers streamlines data flow for improved performance.
Figure 3. Traditional networking and FastSockets accelerated comparison 

Using NVIDIA ConnectX adapters, FastSockets leverages Rivermax technologies, enabling kernel bypass techniques that deliver data directly from the NIC to the application, minimizing latency and maximizing packet rates.

Ensuring dropless User Datagram Protocol reception for high-performance networking

In modern networking applications, where speed and efficiency are paramount, reliable data transmission is critical. The User Datagram Protocol (UDP) is widely used for scenarios that require low-latency data transfer, such as video streaming in machine vision and financial market data distribution.

A key characteristic of UDP is that it is connectionless and does not guarantee reliable delivery, unlike protocols like TCP. While this design enables faster data transmission, it also introduces the risk of packet loss. In time-sensitive applications, achieving dropless UDP reception is essential for optimal performance.

Preventing retransmissions and reducing latency

UDP does not include built-in mechanisms for packet recovery, so any lost data must be managed by the application itself. If packet loss occurs, it can trigger manual retransmissions or create data gaps. When retransmissions are required, they can introduce significant delays, directly impacting latency-sensitive applications. For instance, FastSockets media extensions support the GigE Vision (GVA) protocol for machine vision, where even minor packet loss can cause visible glitches or buffering delays.

Algorithmic trading systems are another example, where millisecond delays can lead to lost opportunities or incorrect decisions. Retransmitted data may arrive too late to be useful. Latency is therefore critical. FastSockets delivers packets directly from the NIC to the application, minimizing latency by leveraging the foundational features provided by Rivermax.

Maximizing throughput and minimizing system overhead

The system overhead of kernel-based sockets cannot keep up with the highest packet rates, even when optimizations like CPU binding and enlarged socket buffers are applied. As packet rates increase, the kernel becomes the limiting factor, leading to packet drops. Kernel bypass techniques, as enabled by Rivermax, place data directly into application buffers, supporting dynamic buffer sizes and a zero-copy approach that eliminates unnecessary data copies. Lower overhead also means reduced serialization delays, with more packets being distributed efficiently.
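
For reference, the conventional tuning mentioned above looks like the snippet below, which requests a large kernel receive buffer on a UDP socket; the point of the text is that even this approach eventually becomes the bottleneck, which is what kernel bypass avoids.

import socket

# Conventional kernel-socket tuning: ask the OS for a large UDP receive buffer.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 32 * 1024 * 1024)  # request 32 MB
actual = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"kernel granted receive buffer: {actual} bytes")  # may be capped by OS limits
sock.close()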

Benchmarking

This section presents benchmarks that highlight the superior performance achieved by leveraging Rivermax technology. FastSockets is available for both Linux and Windows; the focus here is on Windows performance, where Rivermax offers unique advantages. Note that the RIO benchmark is limited in scope, as RIO capabilities are constrained for comprehensive networking performance evaluation.

Metrics and methodology

The benchmarks evaluate three key networking performance metrics: sustained throughput, average packet rate, and end-to-end latency. These metrics are critical for applications requiring high throughput with minimal delay, such as financial trading, cloud gaming, and professional media workflows. Comparisons are made between traditional sockets, Registered I/O (RIO), and FastSockets through Rivermax using NVIDIA ConnectX-6 adapters operating at 25 GbE. Evaluation with RIO is limited, reflecting the restricted functionality provided by RIO in this context.

Sustained throughput

Sustained throughput measures the maximum data transfer rate that can be consistently maintained between the NIC and the application. Achieving line-rate throughput is essential for high-performance streaming and real-time data delivery. As shown in Figure 4, FastSockets using Rivermax achieves a sustained line-rate throughput, while traditional sockets fall significantly short.

The image compares sustained throughput for three different technologies: traditional sockets, Microsoft Registered I/O (RIO) sockets, and FastSockets based on Rivermax technology.
Figure 4. Sustained throughput comparison 

Average packet rate

The average packet rate reflects the number of packets processed per second, a crucial measure for workloads involving frequent, small data transfers. Higher packet rates reduce serialization delays for timely data delivery. In Figure 5, FastSockets via Rivermax delivers a dramatic increase in average packet rate, outperforming both sockets and RIO by a wide margin.

The image shows a comparison of the average packet rate of traditional sockets, Microsoft Registered I/O (RIO) sockets, and FastSockets using Rivermax technology, which reaches 3,350,000 pps.
Figure 5. Comparison of the average packet rate

Latency

Latency measures the time taken for data to travel from the NIC to the application and back, directly impacting responsiveness in real-time applications. In this context, latency can be defined as half round-trip times, which provides a practical measure of the one-way delay experienced by packets. Lower latency is critical for use cases such as algorithmic trading and live media streaming. As shown in Figure 6, FastSockets demonstrate significantly lower minimum, mean, median, and maximum latency compared to traditional sockets, making it ideal for latency-sensitive environments.
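
A minimal sketch of that convention, using placeholder RTT samples rather than benchmark data:

import statistics

# One-way latency approximated as half the measured round-trip time.
rtt_us = [8.4, 8.1, 9.0, 8.3, 12.7, 8.2, 8.5]  # placeholder RTT samples
one_way = [r / 2 for r in rtt_us]

print(f"min    {min(one_way):.2f} µs")
print(f"mean   {statistics.mean(one_way):.2f} µs")
print(f"median {statistics.median(one_way):.2f} µs")
print(f"max    {max(one_way):.2f} µs")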

The image compares the latency of traditional sockets and FastSockets, with lower values being more desirable.
Figure 6. Latency comparison

Serialization delay

Serialization delay refers to the time required to place a packet onto the network medium, which directly impacts the rate at which data can be transmitted from the application to the network. Lower serialization delay is crucial for improving overall throughput and reducing end-to-end latency, especially in high-performance and real-time applications. As shown in Figure 7, FastSockets via Rivermax achieves a substantially lower packet serialization delay compared to traditional sockets, further enhancing its suitability for demanding networking environments.
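
Note that the figures above measure the software path; the physical lower bound on wire serialization is simply frame size divided by line rate, as the sketch below computes for a 25 GbE link.

# Lower bound on wire serialization delay: frame size divided by line rate.
def serialization_delay_us(frame_bytes: int, line_rate_gbps: float) -> float:
    return frame_bytes * 8 / (line_rate_gbps * 1e9) * 1e6

for frame in (64, 512, 1500):
    print(f"{frame:>5}-byte frame at 25 GbE: "
          f"{serialization_delay_us(frame, 25):.3f} µs on the wire")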

The image shows a comparison of packet serialization delay between traditional sockets and FastSockets based on Rivermax technology. FastSockets exhibit a much lower delay, at approximately 0.25 μs, roughly 8 times faster at serializing packets than traditional sockets.
Figure 7. Packet serialization delay comparison

What’s next in GPUDirect technology?

GPUDirect technology is poised to improve the performance of trading systems by enabling direct memory access between NICs and GPUs, bypassing the CPU to reduce latency. With high-frequency market data received from exchanges, GPUDirect enables this data to stream directly into GPU memory, enabling rapid execution of AI models to detect critical patterns, such as sudden price movements or order book imbalances.

By accelerating this data pipeline, the system can make faster inferences, enabling trading software direct access to advanced quoting algorithms (pause/cancel/widening markets) during periods of high risk or volume, all without burdening the CPU.

AI models deployed for these use cases are carefully optimized for ultra-low-latency inference directly on GPUs, using technologies such as GPUDirect. These models generally include:

  • Anomaly detection models (autoencoders, Isolation Forests, VAEs) to identify abnormal patterns that may precede volatility or manipulation, such as sudden changes in order book dynamics.
  • Time series forecasting models (LSTM, TCNs, transformer-based models) to predict short-term market movements and trigger responses if sharp price moves are anticipated.
  • Classification models for event detection (CNNs, gradient-boosted trees, simple neural nets) to classify market states and halt quoting during risky or abnormal events.
  • Reinforcement learning agents (DQN, policy gradient, actor-critic) that adaptively learn optimal actions (quote, adjust, stop) based on evolving markets to maximize returns or minimize risk.

Feature engineering is performed on real-time order book snapshots, order flow imbalances, trade statistics, and other relevant data. Inference is further optimized using ONNX, NVIDIA TensorRT, and NVIDIA CUDA, with models distilled and quantized for minimal size and latency.
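
A minimal sketch of the ONNX export step in that flow, using a toy network in place of a real anomaly-detection or forecasting model (the file name and tensor names are illustrative):

import torch
import torch.nn as nn

# Toy stand-in for a market-data model; real models would be the networks
# described above (autoencoders, LSTMs, gradient-boosted trees, and so on).
class TinyNet(nn.Module):
    def __init__(self, n_features: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, 64), nn.ReLU(), nn.Linear(64, 3))

    def forward(self, x):
        return self.net(x)

model = TinyNet().eval()
example = torch.randn(1, 32)  # one order-book feature vector (placeholder shape)
torch.onnx.export(model, example, "market_model.onnx",
                  input_names=["features"], output_names=["signal"],
                  dynamic_axes={"features": {0: "batch"}})
# The resulting ONNX graph can then be compiled with TensorRT (for example,
# via trtexec) and quantized for low-latency GPU inference.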

With Rivermax and GPUDirect powering zero-copy access, market data is streamed directly from high-speed NICs into GPU memory, eliminating PCIe bottlenecks. This architecture enables AI models to process and respond to market changes almost instantaneously, critical for deciding when to quote or pull out during volatile periods.

As these AI and GPU acceleration technologies continue to evolve, their integration with high-performance networking solutions like Rivermax will unlock new levels of speed, intelligence, and adaptability, transforming not only trading but any latency-sensitive domain.

Get started with Rivermax and FastSockets for your ultra-low-latency, zero-packet-loss applications.

[ad_2]


Developers Can Now Get CUDA Directly from Their Favorite Third-Party Platforms

[ad_1]

Building and deploying applications can be challenging for developers, requiring them to navigate complex relationships between hardware and software capabilities and compatibility. Ensuring that every underlying software component is not only installed correctly but also matched to the required versions to avoid conflicts can be a time-consuming task, often leading to deployment delays and operational inefficiencies in application workflows.

That is why NVIDIA is making it easy for developers to use the CUDA software stack across a variety of operating systems (OS) and package managers.

The company is working with an ecosystem of distribution platforms to enable CUDA redistribution. OS providers Canonical, CIQ, and SUSE, and the developer environment manager Flox, which enables the Nix package manager, will redistribute CUDA software directly. They can now embed CUDA into their package feeds, simplifying installation and dependency resolution. This is especially beneficial for incorporating GPU support into complex applications like PyTorch and libraries like OpenCV.

This effort expands CUDA access and ease of use for all developers. It adds to the ways they already have access by letting them get all the software they need in one place. Additional distributors are coming soon.

Each distribution platform that redistributes CUDA will commit to a few essentials to help developers and enterprises stay aligned with the CUDA software NVIDIA distributes.

  • Consistent CUDA Toolkit naming: Third-party packages will match NVIDIA naming conventions to avoid confusion in documentation and tutorials.
  • Timely CUDA updates: Third-party packages will be updated promptly after official NVIDIA releases to ensure compatibility and reduce QA overhead.
  • Continued free access: CUDA itself will remain freely available, even when packaged within paid software. Distributors may charge for access to their packages or software but will not monetize CUDA specifically.
  • Comprehensive support options: You can access support through the distributors and can also find help through the NVIDIA forums or the NVIDIA developer site, as usual.

Getting CUDA software from NVIDIA has always been free, and all current avenues for getting CUDA remain in place (these include downloading the CUDA Toolkit, pulling CUDA containers, and installing with Python using pip or Conda).

But the ability for distribution platforms to package CUDA within larger enterprise deployments and software applications lets us ensure your experience as a developer is simple. You download and install your application, and under the covers, the correct version of CUDA is installed as well.

Working with the ecosystem in this way marks an important milestone in our mission to reduce friction in GPU software deployment. By collaborating with key players across the OS and package management landscape, NVIDIA is ensuring that CUDA remains accessible, consistent, and easy to use, no matter where or how developers choose to build.

Stay tuned for further updates as additional third-party platforms are announced and the CUDA ecosystem continues to grow.

[ad_2]


Deploying Scalable AI Inference with NVIDIA NIM Operator 3.0.0

[ad_1]

AI models, inference engine backends, and distributed inference frameworks continue to evolve in architecture, complexity, and scale. With this rapid pace of change, efficiently deploying and managing the AI inference pipelines that support these advanced capabilities becomes a critical challenge.

The NVIDIA NIM Operator is designed to help you scale intelligently. It enables Kubernetes cluster administrators to operate the software components and services needed to run NVIDIA NIM microservices for the latest LLM and multimodal AI models, spanning reasoning, retrieval, vision, speech, biology, and more.

The latest NIM Operator 3.0.0 release introduces expanded capabilities that simplify and optimize the deployment of NVIDIA NIM microservices and NVIDIA NeMo microservices across Kubernetes environments. NIM Operator 3.0.0 supports efficient resource utilization and integrates seamlessly with existing Kubernetes infrastructure, including KServe deployments.

NVIDIA customers and partners have been using the NIM Operator to efficiently manage inference pipelines for a variety of AI applications and agents, including chatbots, RAG agents, and virtual drug discovery.

NVIDIA recently collaborated with Red Hat to enable NIM deployment on KServe with the NIM Operator. "Red Hat contributed to the open source NIM Operator GitHub repo to enable deploying NIM on KServe," said Red Hat director of engineering Babak Mozaffari. This feature lets the NIM Operator deploy NIM microservices that benefit from KServe lifecycle management, and it simplifies scalable NIM deployment using the NIM Service. Native KServe support in the NIM Operator also lets users benefit from NIM caching and from NVIDIA NeMo microservices.

This post describes the new capabilities in the NIM Operator 3.0.0 release, including flexible NIM deployments, efficient GPU utilization with DRA, and seamless deployment on KServe.

Graphic showing the NIM Operator architecture with horizontal layers (top to bottom): NVIDIA generative AI examples; NeMo microservices and NIM microservices; NIM Operator; infrastructure services; Kubernetes; Linux distribution.
Figure 1. NIM Operator architecture

Flexible NIM deployment: multi-LLM compatible and multi-node

NIM Operator 3.0.0 adds support for quick and easy NIM deployments. You can use it with domain-specific NIM microservices, such as those for biology, speech, or retrieval, or with a range of NIM deployment options, including multi-LLM compatible or multi-node deployments.

  • Multi-LLM compatible NIM deployment: Deploy a wide range of models with custom weights from sources such as NVIDIA NGC, Hugging Face, or local storage. Use the NIM Cache custom resource definition (CRD) to download the weights to a PVC and the NIM Service CRD to manage deployment, scaling, and ingress.
  • Multi-node NIM deployment: Addresses the challenge of deploying large LLMs that cannot fit on a single GPU or need to run across multiple GPUs, and potentially across multiple nodes. The NIM Operator supports caching for multi-node NIM deployments using the NIM Cache CRD, and deploys them using the NIM Service CRD on Kubernetes with LeaderWorkerSets (LWS).

Note that multi-node NIM deployments without GPUDirect RDMA can result in frequent restarts of the LWS leader and worker pods because of model shard load times. Using fast network connectivity such as IPoIB or RoCE is strongly recommended and can be easily configured through the NVIDIA Network Operator.

Figure 2 shows the deployment of a large language model (LLM) from the Hugging Face library on Kubernetes using the NVIDIA NIM Operator as a multi-LLM NIM deployment. It specifically shows deploying the Llama 3 8B Instruct model, including service and pod status verification, followed by a curl command to send a request to the service.

Animated GIF of a computer screen showing the multi-LLM deployment of the Llama 3 8B Instruct model using the NIM Operator.
Figure 2. Multi-LLM deployment of the Llama 3 8B Instruct model using the NIM Operator

Efficient GPU utilization with DRA

Dynamic Resource Allocation (DRA) is a built-in Kubernetes feature that simplifies GPU management by replacing traditional device plugins with a more flexible and extensible approach. DRA lets users define GPU device classes, request GPUs based on those classes, and filter them according to workload and business needs.

NIM Operator 3.0.0 supports DRA as a technology preview by configuring ResourceClaim and ResourceClaimTemplate objects in NIM pods through the NIM Service CRD and NIM Pipeline CRD. You can create and attach your own claims or let the NIM Operator create and manage them automatically.

NIM Operator DRA support includes:

  • Full GPU and MIG usage
  • GPU sharing via time-slicing by assigning the same claim to multiple NIM services

Note: This feature is currently available as a technology preview, with full support coming soon.

Figure 3 shows the deployment of the Llama 3 8B Instruct NIM using Kubernetes DRA with the NIM Operator. Users can specify resource claims in the NIM Service to request specific hardware attributes such as GPU architecture and memory, and interact with the deployed LLM using curl.

Animated GIF of a computer screen showing the deployment of the Llama 3 8B Instruct NIM using Kubernetes DRA with the NIM Operator.
Figure 3. Deployment of the Llama 3 8B Instruct NIM using Kubernetes DRA with the NIM Operator

Seamless deployment on KServe

KServe is a widely adopted open source inference serving platform used by many partners and customers. NIM Operator 3.0.0 supports both raw and serverless deployments on KServe by configuring the InferenceService custom resource to manage NIM deployment, upgrades, and autoscaling. The NIM Operator simplifies the deployment process by automatically configuring all the required environment variables and resources in the InferenceService custom resource.

This integration provides two additional benefits:

  • Intelligent caching with NIM Cache to reduce initial inference time and autoscaling latency, resulting in faster, more responsive deployments.
  • NeMo microservices support for evaluation, guardrails, and fine-tuning to improve AI systems for latency, accuracy, cost, and compliance.

Figure 4 shows the deployment of the Llama 3.2 1B Instruct NIM on KServe using the NIM Operator. Two different deployment methodologies are shown: RawDeployment and Serverless. The serverless deployment incorporates autoscaling functionality through Kubernetes annotations. Both strategies use a curl command to test the NIM response.

Animated GIF of a computer screen showing the deployment of the Llama 3.2 1B Instruct NIM on KServe using the NIM Operator.
Figure 4. Deployment of the Llama 3.2 1B Instruct NIM on KServe using the NIM Operator with the RawDeployment and Serverless methodologies

Start scaling AI inference with NIM Operator 3.0.0

NVIDIA NIM Operator 3.0.0 makes deploying scalable AI inference easier than ever. Whether you are working with multi-LLM compatible or multi-node NIM deployments, optimizing GPU usage with DRA, or deploying on KServe, this release lets you build high-performance, flexible, and scalable AI applications.

By automating the deployment, scaling, and lifecycle management of both NVIDIA NIM and NVIDIA NeMo microservices, the NIM Operator makes it easy for enterprise teams to adopt AI workflows. This effort aligns with making AI workflows easy to deploy with NVIDIA AI Blueprints, enabling rapid movement to production. The NIM Operator is part of NVIDIA AI Enterprise, providing enterprise support, API stability, and proactive security patching.

Get started through NGC or from the NVIDIA/k8s-nim-operator open source GitHub repo. For technical questions about installation, usage, or issues, file an issue on the NVIDIA/k8s-nim-operator GitHub repo.

[ad_2]


Accelerate Protein Structure Inference by More Than 100x with the NVIDIA RTX Pro 6000 Blackwell Server Edition

[ad_1]

The race to understand protein structures has never been more critical. From accelerating drug discovery to preparing for future pandemics, the ability to predict how proteins fold determines our capacity to solve humanity's most pressing biological challenges. Since the release of AlphaFold2, AI inference for determining protein structures has skyrocketed. Tools that are not optimized for protein structure inference can cost millions in lost research time and prolonged compute utilization.

The new NVIDIA RTX Pro 6000 Blackwell Server Edition GPU fundamentally changes this. Despite the breakthrough of AlphaFold2, CPU-bound multiple sequence alignment (MSA) generation and inefficient GPU inference have remained the rate-limiting steps. Building on earlier collaborative efforts, new accelerations developed by NVIDIA digital biology research labs enable faster protein structure inference than ever before using OpenFold, with no loss of accuracy compared to AlphaFold2.

In this post, we show how to run large-scale protein analysis using the RTX Pro 6000 Blackwell Server Edition GPU, delivering unprecedented protein structure inference performance for software platforms, cloud providers, and research institutions.

Video 1. The NVIDIA RTX Pro 6000 Blackwell Server Edition GPU sets a new benchmark for protein structure inference

Why do speed and scale matter in protein structure prediction?

Protein folding sits at the intersection of the most computationally demanding workloads in computational biology. Modern drug discovery pipelines require analyzing thousands of protein structures. At the same time, enzyme engineering projects demand rapid iteration cycles to optimize biological function, and agricultural biotech applications require screening vast protein libraries to develop climate-resilient crops.

The computational challenge can be enormous: a single protein structure prediction can involve metagenomic-scale MSAs, iterative refinement steps, and ensemble calculations that typically require substantial compute time. When scaled across entire proteomes or drug target libraries, these workloads become prohibitively time-consuming on CPU-based infrastructure.

For example, in a direct comparison of multiple sequence alignment tools, MMseqs2-GPU completed alignments 177x faster on a single NVIDIA L40S than CPU-based JackHMMER on a 128-core CPU, and up to 720x faster when distributed across eight NVIDIA L40S GPUs. These speedups highlight how GPU acceleration dramatically reduces computational bottlenecks in protein bioinformatics.

How does NVIDIA enable the fastest protein structure AI available?

Building on recent releases such as cuEquivariance and the Boltz-2 NIM microservice, the NVIDIA digital biology research lab validated breakthrough performance improvements for OpenFold using the RTX Pro 6000 Blackwell Server Edition and NVIDIA TensorRT across industry-standard benchmarks (Figure 1).

This graphic illustrates the process of passing an amino acid sequence to MMseqs2-GPU to generate a multiple sequence alignment, which is then passed to the OpenFold2 AI model to predict the protein structure.
Figure 1. Protein structure prediction with MMseqs2-GPU and OpenFold2

Leveraging new instructions and TensorRT, MMseqs2-GPU and OpenFold on the RTX Pro 6000 Blackwell deliver transformational performance for protein structure prediction, executing folds more than 138x faster than AlphaFold2 and roughly 2.8x faster than ColabFold, while maintaining identical TM-scores.

First, faster inference speed is enabled by MMseqs2-GPU on the RTX Pro 6000 Blackwell, which runs roughly 190x faster than JackHMMER and HHblits on a dual-socket AMD 7742 CPU. In addition, bespoke TensorRT optimizations targeting OpenFold improve inference speed 2.3x compared to the initial OpenFold. Validated on 20 CASP14 protein targets, these benchmarks establish the RTX Pro 6000 Blackwell as a breakthrough solution for end-to-end protein structure prediction.
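
To put those speedups in wall-clock terms, the sketch below assumes a hypothetical AlphaFold2 baseline of 60 minutes per structure; the baseline is illustrative, and only the speedup factors come from the benchmarks above.

# Projected per-structure runtimes from the reported speedup factors.
# The 60-minute AlphaFold2 baseline is a hypothetical reference point.
baseline_min = 60.0
speedups = {
    "OpenFold + TensorRT on RTX Pro 6000 Blackwell": 138,  # vs AlphaFold2
    "ColabFold": 138 / 2.8,                                # implied by the 2.8x gap
}
for name, speedup in speedups.items():
    print(f"{name}: ~{baseline_min / speedup:.1f} min per structure")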

Eliminating memory bottlenecks

In addition, 96 GB of high-bandwidth memory (1.6 TB/s) allows the RTX Pro 6000 Blackwell to fold entire protein ensembles and large MSAs, enabling the full workflow to stay GPU-resident. Multi-instance GPU (MIG) functionality lets a single RTX Pro 6000 Blackwell act like four GPUs, each powerful enough to outperform an NVIDIA L4 Tensor Core GPU. This allows many users or workflows to share a server without sacrificing speed or accuracy.

Below is a complete example showing how to leverage the performance of the RTX Pro 6000 for fast protein structure prediction. The first step is deploying the OpenFold2 NIM on your local machine.

# See https://build.nvidia.com/openfold/openfold2/deploy for
# instructions to configure your docker login, NGC API Key, and
# environment for running the OpenFold NIM on your local system.

# Run this in a shell, providing the username below and your NGC API Key
$ docker login nvcr.io
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>

export NGC_API_KEY=<your personal NGC key>

# Configure local NIM cache directory so the NIM model download can be reused
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
sudo chmod 0777 -R "$LOCAL_NIM_CACHE"

# Then launch the NIM container, in this case using GPU device ID 0.
docker run -it \
    --runtime=nvidia \
    --gpus='"device=0"' \
    -p 8000:8000 \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE":/opt/nim/.cache \
    nvcr.io/nim/openfold/openfold2:latest

# It can take some time to download all model assets on the initial run.
# You can check the status using the built-in health check.  This will
# return {"status": "ready"} when the NIM endpoint is ready for inference.
curl http://localhost:8000/v1/health/ready

Once the NIM is deployed locally, you can build an inference request and use the local endpoint to generate a protein structure prediction.

#!/usr/bin/env python3

import requests
from pathlib import Path

# ----------------------------
# parameters
# ----------------------------
output_file = Path("output1.json")
selected_models = [1, 2]

# SARS-CoV-2 proteome example
# Spike protein (1273 residues) — critical for vaccine development
sequence = (
    "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFAST"
    "EKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNL"
    "REFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVD"
    "CALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLC"
    "FTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCY"
    "FPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITP"
    "CSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARS"
    "VASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQV"
    "KQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWT"
    "FGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLD"
    "KVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHD"
    "GKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQK"
    "EIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT"
)

data = {
    "sequence": sequence,
    "selected_models": [1, 2],
    "relax_prediction": False,
}
print(data)

# ---------------------------------------------------------
# Submit
# ---------------------------------------------------------
url = "http://localhost:8000/biology/openfold/openfold2/predict-structure-from-msa-and-template"
print("Making request...")
response = requests.post(url=url, json=data)

# ---------------------------------------------------------
# View response
# ---------------------------------------------------------
if response.status_code == 200:
    output_file.write_text(response.text)
    print(f"Response output to file: {output_file}")

else:
    print(f"Unexpected HTTP status: {response.status_code}")
    print(f"Response: {response.text}")

Get started accelerating protein AI workflows

Whereas AlphaFold2 once required heterogeneous high-performance compute nodes, NVIDIA's accelerations for protein structure prediction, including modular components in cuEquivariance, TensorRT, and MMseqs2-GPU running on the RTX Pro 6000 Blackwell, enable folding on a single server. This makes proteome-scale folding accessible to any lab or software platform, with the fastest time-to-prediction to date.

Whether you are developing software platforms for drug discovery, building agricultural biotech solutions, or conducting pandemic preparedness research, the unprecedented performance of the RTX Pro 6000 Blackwell will transform your computational biology workflows. The power of the RTX Pro 6000 Blackwell Server Edition is available today in NVIDIA RTX Pro servers from global system makers as well as in cloud instances from leading cloud service providers.

Ready to get started? Find a partner for the NVIDIA RTX Pro 6000 Blackwell Server Edition and experience protein folding at unprecedented speed and scale.

Acknowledgments

We'd like to thank the researchers from NVIDIA, the University of Oxford, and Seoul National University who contributed to this research, including Christian Dallago, Alejandro Chacon, Kieran Didi, Prashant Sohani, Fabian Berressem, Alexander Nesterovskiy, Robert Ohannessian, Mohamed Elbalkini, Jonathan Cogan, Ania Kukushkina, Anthony Costa, Arash Vahdat, Bertil Schmidt, Milot Mirdita, and Martin Steinegger.

[ad_2]
