
NVIDIA CUDA-Q

GPU-accelerated quantum computing. Simulate large circuits orders of magnitude faster than CPU simulators — free and open source.

Open Source · GPU Accelerated · C++ Core · Python & C++

What is CUDA-Q?

NVIDIA CUDA-Q (formerly CUDA Quantum) is an open-source platform for hybrid quantum-classical computing. Its GPU-accelerated simulators can run circuits of 30+ qubits hundreds to thousands of times faster than CPU-based simulators. CUDA-Q supports both Python and C++, and is free to use on any CUDA-capable NVIDIA GPU.

CUDA-Q requires an NVIDIA GPU for full acceleration, but it also runs on CPU-only machines via its CPU simulation targets. Alternatively, you can try it for free on Google Colab with a T4 GPU runtime.

Installation

terminal
# Option 1: pip (recommended for Python users)
pip install cudaq

# Option 2: Docker (for full CUDA environment)
docker pull nvcr.io/nvidia/nightly/cuda-quantum:latest
docker run --gpus all -it nvcr.io/nvidia/nightly/cuda-quantum:latest

# Option 3: Google Colab (free GPU!)
# In a Colab cell with T4 GPU runtime:
# !pip install cudaq

Writing Kernels with the @kernel Decorator

CUDA-Q's core concept is the @cudaq.kernel decorator — it marks Python functions as quantum kernels that are compiled and executed on GPU.

cudaq_kernel.py
import cudaq

# Define a quantum kernel - compiled for the GPU
@cudaq.kernel
def bell_state():
    # Allocate 2 qubits
    qvec = cudaq.qvector(2)
    # Apply gates
    h(qvec[0])
    cx(qvec[0], qvec[1])
    mz(qvec)  # Measure all

# Sample the kernel - runs on GPU
counts = cudaq.sample(bell_state, shots_count=10000)
print(counts)                  # { 00:4998 11:5002 }
print(counts.most_probable())  # '00' or '11'

# Get statevector
state = cudaq.get_state(bell_state)
print(state)  # [(0.707+0j), 0j, 0j, (0.707+0j)]
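The statevector shown above can be cross-checked without CUDA-Q at all. A minimal NumPy sketch of the same two-gate circuit (H on qubit 0, then CNOT), treating qubit 0 as the left-hand factor of the Kronecker product:

```python
import numpy as np

# Single-qubit Hadamard and identity
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
I2 = np.eye(2, dtype=complex)

# CNOT with qubit 0 as control, qubit 1 as target
# (basis order |00>, |01>, |10>, |11>)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=complex)

psi = np.zeros(4, dtype=complex)
psi[0] = 1.0                # start in |00>
psi = np.kron(H, I2) @ psi  # H on qubit 0
psi = CNOT @ psi            # entangle

print(np.round(psi, 3))     # amplitudes ~ [0.707, 0, 0, 0.707]
```

The equal amplitudes on |00⟩ and |11⟩ are exactly why the sampled counts split roughly 50/50 between those two bitstrings.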

Parameterized Kernels for VQE

cudaq_vqe.py
import cudaq
from cudaq import spin
from scipy.optimize import minimize

@cudaq.kernel
def ansatz(theta: float):
    q = cudaq.qvector(2)
    x(q[0])  # |10⟩ initial state
    ry(theta, q[0])
    cx(q[0], q[1])

# Define Hamiltonian using Pauli operators
hamiltonian = (
    5.907 * spin.z(0)
    + 2.151 * spin.z(1)
    + 5.907 * spin.z(0) * spin.z(1)
    + 0.219 * spin.x(0) * spin.x(1)
    + 0.219 * spin.y(0) * spin.y(1)
)

def cost(theta_list):
    # cudaq.observe computes ⟨ψ|H|ψ⟩ analytically on GPU
    exp_val = cudaq.observe(ansatz, hamiltonian, theta_list[0])
    return exp_val.expectation()

# Minimize the energy
result = minimize(cost, x0=[0.0], method='COBYLA', options={'maxiter': 200})
print(f"Ground state energy: {result.fun:.6f}")
print(f"Optimal theta: {result.x[0]:.4f}")

Multi-GPU & Asynchronous Execution

cudaq_multigpu.py
import cudaq

@cudaq.kernel
def ghz_state(n: int):
    qvec = cudaq.qvector(n)
    h(qvec[0])
    for i in range(n - 1):
        cx(qvec[i], qvec[i + 1])
    mz(qvec)

# Select the GPU backend explicitly (before launching work)
cudaq.set_target("nvidia")         # Single GPU
# cudaq.set_target("nvidia-mgpu")  # Multi-GPU (needs cuQuantum)

# Asynchronous batch execution: each sample_async call returns
# a future immediately, so all four simulations run concurrently
sizes = [4, 8, 12, 16]
futures = [cudaq.sample_async(ghz_state, n, shots_count=1000) for n in sizes]

# .get() blocks until the corresponding result is ready
for n, future in zip(sizes, futures):
    print(f"GHZ({n}): {future.get().most_probable()}")
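For small n, the ideal GHZ output can be verified classically: only the all-zeros and all-ones bitstrings carry amplitude, which is why `most_probable()` returns one of those two. A NumPy sketch, taking qubit 0 as the most significant bit:

```python
import numpy as np

def apply_cnot(psi, control, target, n):
    # CNOT as an amplitude permutation over basis indices
    out = psi.copy()
    cmask = 1 << (n - 1 - control)
    tmask = 1 << (n - 1 - target)
    for idx in range(len(psi)):
        if idx & cmask:                  # control bit set: flip target
            out[idx] = psi[idx ^ tmask]
    return out

def ghz_statevector(n):
    H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
    psi = np.zeros(2**n, dtype=complex)
    psi[0] = 1.0                                               # |0...0>
    psi = np.kron(H, np.eye(2**(n - 1), dtype=complex)) @ psi  # H on qubit 0
    for i in range(n - 1):                                     # CNOT chain
        psi = apply_cnot(psi, i, i + 1, n)
    return psi

psi = ghz_statevector(4)
nonzero = np.flatnonzero(np.abs(psi) > 1e-9)
print(nonzero)  # [ 0 15] -> only |0000> and |1111> have amplitude
```

Dense vectors like this blow up as 2^n, which is exactly the regime where CUDA-Q's GPU statevector simulation pays off.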
💡

Also available via HLQuantum

Want to run the same circuit on multiple backends without rewriting your code? HLQuantum abstracts this SDK (and 5 others) behind a single unified API.

python
import hlquantum as hlq

qc = hlq.Circuit(2)
qc.h(0).cx(0, 1).measure_all()

# One line to switch between any backend
result = hlq.run(qc, shots=1024)                   # auto-detect
result = hlq.run(qc, shots=1024, backend="cudaq")  # explicit