# Research: Qwen3.5-Omni Plus

**Date:** 2026-03-31
**Status:** Complete — VERDICT: WAIT. Indian language speech solved via IndicConformerASR + IndicF5 on Panda instead.
**Relevance:** Potential replacement/complement for Annie's voice pipeline (Nemotron Nano + Kokoro TTS + Whisper STT)

---

## Critical Finding: Naming Disambiguation

There is an important naming distinction in the Qwen ecosystem that must be understood first:

1. **Qwen3.5-Plus** = The hosted API version of Qwen3.5-397B-A17B with extra features (1M context, built-in tools, Auto mode). This is a **text-only LLM** (no speech generation). Available via Alibaba Cloud Bailian API.

2. **Qwen3.5-Omni** = A separate multimodal model family announced March 30-31, 2026 with Thinker-Talker architecture supporting real-time speech I/O. Comes in three sizes: **Plus, Flash, and Light**.

3. **Qwen3.5-Omni-Plus** = The largest variant of the Qwen3.5-Omni family. This is what this research focuses on.

The Qwen3.5-Omni series is distinct from the text-only Qwen3.5 series (0.8B through 397B). They share the Hybrid-Attention MoE architecture but the Omni variants add the Thinker-Talker dual-component design for native speech generation.

---

## 1. Model Parameters & Architecture

### Parameter Counts

| Variant | Total Params | Active Params | Architecture |
|---------|-------------|---------------|--------------|
| Qwen3.5-Omni-Plus | **Not publicly disclosed** (likely 397B-class based on Thinker reuse) | ~17B (estimated) | Hybrid-Attention MoE |
| Qwen3.5-Omni-Flash | Not disclosed (likely 35B-class) | ~3B (estimated) | Hybrid-Attention MoE |
| Qwen3.5-Omni-Light | Not disclosed (likely 9B-class) | ~3B or less (estimated) | Hybrid-Attention MoE |

**Note:** As of March 31, 2026 (release day), Qwen has not published exact parameter counts for the Omni variants. The predecessor Qwen3-Omni was 30B total / 3B active. The "Plus" naming maps to the 397B-A17B tier in the text-only Qwen3.5 lineup, suggesting the Thinker may share or derive from that architecture.

### Architecture: Thinker-Talker Design

```
Input (text/audio/image/video)
    |
    v
[THINKER] -- Hybrid-Attention MoE
  - Gated DeltaNet linear attention (3:1 ratio with full attention)
  - TMRoPE positional encoding
  - Processes all input modalities
  - Outputs text tokens + hidden representations
    |
    v
[INTERVENTION POINT] -- RAG, safety filters, function calls can intercept here
    |
    v
[TALKER] -- Hybrid-Attention MoE
  - RVQ (Residual Vector Quantization) encoding
  - ARIA technology for dynamic text-speech alignment
  - Generates streaming speech tokens in real time
    |
    v
Output (text + speech)
```

### MoE Configuration (from Qwen3.5-397B-A17B base)

- **512 total experts**, 10 routed + 1 shared per token
- Expert intermediate dimension: 1,024
- Hidden dimension: 4,096
- 60 layers: 15 x (3 x [Gated DeltaNet -> MoE] -> 1 x [Full Attention -> MoE])
- Gated DeltaNet: 64 linear attention heads (V), 16 for QK, head dim 128
- Full Attention: 32 Q heads, 2 KV heads, head dim 256, RoPE dim 64

### Key Architectural Innovation

Gated DeltaNet achieves near **O(n) linear** complexity vs standard attention's O(n^2), enabling efficient long-context processing. The 3:1 hybrid ratio (3 linear attention blocks per 1 full attention block) was validated by NVIDIA engineering as an effective balance.
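The 3:1 interleaving can be sketched as a simple layer schedule. This is an illustrative sketch only — the block names are placeholders, not real module classes:

```python
# Illustrative sketch of the 60-layer hybrid schedule described above:
# 15 repetitions of (3 x Gated DeltaNet block, then 1 x full-attention
# block), each followed by an MoE FFN (omitted here). Names are placeholders.

def hybrid_schedule(repeats: int = 15, linear_per_group: int = 3) -> list[str]:
    layers = []
    for _ in range(repeats):
        layers += ["gated_deltanet"] * linear_per_group  # O(n) linear attention
        layers += ["full_attention"]                     # O(n^2) global attention
    return layers

schedule = hybrid_schedule()
print(len(schedule))                     # 60 layers total
print(schedule.count("gated_deltanet"),  # 45 linear
      schedule.count("full_attention"))  # 15 full -> the 3:1 ratio
```

Only 15 of 60 layers pay the quadratic attention cost, which is where the long-context efficiency comes from.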

---

## 2. NVFP4 Quantization

### Official NVIDIA NVFP4 Model (Text-only Qwen3.5)

NVIDIA has published **`nvidia/Qwen3.5-397B-A17B-NVFP4`** on HuggingFace:
- Quantized with Model Optimizer v0.42.0
- Weights and activations of MoE linear operators quantized to FP4
- E2M1 FP4 codebook with blockwise FP8 (E4M3) scaling over 16-element micro-blocks
- Calibrated on CNN DailyMail + Nemotron-Post-Training-Dataset-v2
- License: Apache 2.0
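To make the E2M1-plus-blockwise-scaling scheme concrete, here is a toy sketch of the idea — not NVIDIA's kernel. Real NVFP4 also stores the per-block scale in FP8 (E4M3); that step is omitted for clarity:

```python
# Toy illustration of NVFP4-style blockwise FP4 quantization: each 16-element
# block gets one scale, and values round to the nearest representable E2M1
# magnitude {0, 0.5, 1, 1.5, 2, 3, 4, 6}. The real format additionally stores
# the scale itself in FP8 (E4M3), which this sketch skips.

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # representable FP4 magnitudes

def quantize_block(block):
    scale = max(abs(x) for x in block) / 6.0 or 1.0  # map block max to 6.0
    out = []
    for x in block:
        mag = min(E2M1, key=lambda v: abs(abs(x) / scale - v))
        out.append(mag * scale * (1 if x >= 0 else -1))
    return out

block = [0.9, -0.1, 0.02, 0.5, -1.2, 0.3, 0.7, -0.05,
         0.0, 0.15, -0.4, 0.6, 1.1, -0.8, 0.25, 0.33]
deq = quantize_block(block)
err = max(abs(a - b) for a, b in zip(block, deq))
print(f"max abs error: {err:.3f}")
```

The per-block scale is why 16-element micro-blocks matter: one outlier only degrades the 15 values that share its scale, not the whole tensor.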

### NVFP4 Variants Available on HuggingFace

| Model | Source | Status |
|-------|--------|--------|
| nvidia/Qwen3.5-397B-A17B-NVFP4 | Official NVIDIA | Available |
| Sehyo/Qwen3.5-122B-A10B-NVFP4 | Community | Available |
| txn545/Qwen3.5-122B-A10B-NVFP4 | Community (ModelOpt) | Available, tested on DGX Spark |
| Sehyo/Qwen3.5-35B-A3B-NVFP4 | Community | Available |
| kaitchup/Qwen3.5-27B-NVFP4 | Community | Available |
| AxionML/Qwen3.5-9B-NVFP4 | Community | Available |

### Qwen3.5-Omni NVFP4: NOT AVAILABLE

**No NVFP4 quantization exists for any Qwen3.5-Omni variant as of March 31, 2026.** The Omni models were just announced and open weights have not been released yet. NVFP4 quantization would require:
1. Open weights release
2. Handling both Thinker AND Talker components
3. Preserving speech generation quality through quantization

### NVFP4 Benchmark (Text-only 397B)

| Precision | MMLU Pro | GPQA Diamond | LiveCodeBench V6 | SciCode | AIME 2025 | IFBench |
|-----------|----------|--------------|-------------------|---------|-----------|---------|
| FP8 | 0.883 | 0.871 | 0.837 | 0.467 | 0.918 | 0.782 |
| **NVFP4** | **0.880** | **0.871** | **0.843** | **0.479** | **0.922** | **0.785** |

NVFP4 matches or exceeds FP8 on five of the six benchmarks; the only regression (MMLU Pro) is under 1%.

---

## 3. Modalities

### Input Modalities
- **Text**: 201 languages, 250K vocabulary
- **Audio**: Speech recognition for **113 languages and dialects**; processes **10+ hours** of audio in a single context
- **Image**: Native vision understanding (not bolt-on)
- **Video**: Up to **400 seconds of 720p (1 FPS)** with audio, or up to **~1 hour** of video

### Output Modalities
- **Text**: Standard text generation with thinking/reasoning
- **Speech**: Real-time speech generation in **36 languages and dialects** (up from 10 in Qwen3-Omni)

### Speech Features (new in 3.5-Omni)
- **Voice cloning**: Custom voice identity creation
- **Semantic interruption**: Distinguishes genuine user interruption from background noise
- **Voice control**: Adjustable volume, speed, emotional tone
- **ARIA technology**: Dynamic text-speech alignment for natural prosody
- **Streaming**: Real-time speech token generation

### Emergent Capability: "Audio-Visual Vibe Coding"
Qwen3.5-Omni can generate Python code or frontend prototypes from visual + audio instructions -- an emergent capability not specifically trained for.

---

## 4. VRAM Requirements

### Qwen3.5-Omni: No Official Numbers Yet

Since open weights are not released, exact VRAM numbers are unavailable. Estimates based on the text-only Qwen3.5 lineup and Qwen3-Omni (30B/3B) predecessor:

| Model | Precision | Estimated VRAM | Notes |
|-------|-----------|----------------|-------|
| Qwen3.5-Omni-Plus | BF16 | ~800 GB | If 397B-class Thinker + Talker overhead |
| Qwen3.5-Omni-Plus | INT4/GPTQ | ~200-250 GB | Rough estimate |
| Qwen3.5-Omni-Plus | NVFP4 | ~100-130 GB | If quantization becomes available |
| Qwen3.5-Omni-Flash | BF16 | ~70 GB | If 35B-class |
| Qwen3.5-Omni-Flash | INT4 | ~20-25 GB | Potentially fits DGX Spark |
| Qwen3.5-Omni-Light | BF16 | ~18 GB | If 9B-class |
| Qwen3.5-Omni-Light | INT4 | ~5-8 GB | Could run on consumer GPU |

### Reference: Text-only Qwen3.5 VRAM (measured)

| Model | BF16 | INT4/GPTQ | NVFP4 |
|-------|------|-----------|-------|
| 397B-A17B | ~807 GB | ~220 GB | ~100 GB (4x B200) |
| 122B-A10B | ~234 GB | ~65 GB | ~75.6 GB |
| 35B-A3B | ~70 GB | ~20 GB | ~18 GB |
| 27B (dense) | ~54 GB | ~15 GB | - |
| 9B | ~18 GB | ~5 GB | - |
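A back-of-envelope check on the table above: weight memory is roughly parameter count times bytes per parameter. Measured footprints differ because of tied embeddings, layers left in higher precision, and runtime buffers:

```python
# First-order weight-memory estimate: parameters x bytes per parameter.
# Measured numbers in the table above deviate in both directions (unquantized
# layers push them up; shared/tied weights pull them down).

GB = 1e9

def weight_gb(params_b: float, bits: float) -> float:
    """Weights-only memory in GB for params_b billion parameters at `bits` precision."""
    return params_b * 1e9 * bits / 8 / GB

print(f"397B @ BF16 : {weight_gb(397, 16):.0f} GB")   # ~794 GB (measured ~807)
print(f"122B @ BF16 : {weight_gb(122, 16):.0f} GB")   # ~244 GB (measured ~234)
print(f"35B  @ 4-bit: {weight_gb(35, 4):.1f} GB")     # ~17.5 GB
```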

### DGX Spark (128 GB unified) Feasibility

- **Omni-Plus**: Will NOT fit on single DGX Spark even with NVFP4
- **Omni-Flash**: Likely fits in BF16 (~70 GB) or easily in INT4 (~20-25 GB)
- **Omni-Light**: Fits easily in BF16 (~18 GB)

---

## 5. Performance Benchmarks

### Qwen3.5-Omni-Plus: 215 SOTA Results

Qwen3.5-Omni-Plus achieved **215 state-of-the-art results** across:
- Audio understanding and reasoning
- Audio-video understanding
- Speech recognition (113 languages)
- Speech translation
- Interactive dialogue

**Claims**: Surpasses Gemini-3.1 Pro in general audio understanding, reasoning, recognition, translation, and dialogue.

### Text-only Qwen3.5-397B Benchmarks (for reference)

| Benchmark | Qwen3.5-397B | GPT-5.2 | Claude Opus 4.5 | Gemini 3 Pro |
|-----------|-------------|---------|-----------------|--------------|
| OmniDocBench | **90.8** | 85.7 | 87.7 | 88.5 |
| IFBench | **76.5** | - | - | - |
| Tau2-Bench | 86.7 | - | - | - |
| AIME 2026 | 91.3 | - | - | - |
| SWE-bench Verified | **76.4** | - | - | - |
| MMMU | 85.0 | - | - | - |
| VideoMME | 87.5 | - | - | - |

### Predecessor: Qwen3-Omni (30B-A3B) Results

- SOTA on **22 of 36** audio/video benchmarks
- ASR and audio understanding comparable to Gemini 2.5 Pro
- Open-source SOTA on 32 of 36 benchmarks

---

## 6. Context Window

- **Native context**: 256K tokens (262,144)
- **Extended context**: Up to **1M tokens** via YaRN scaling (available in hosted Plus API)
- **Audio context**: Over 10 hours of audio input
- **Video context**: Up to 400 seconds of 720p (1 FPS) or ~1 hour of video
- **Multi-Token Prediction**: Reduces inference costs by 10-60%

Prefill at 256K context is reportedly **19x faster** than the predecessor Qwen3-Max, and standard 32K workflows are **8.6x faster**.
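The 256K-to-1M extension via YaRN is typically expressed as a `rope_scaling` config; the keys below follow the convention Qwen documents for Qwen3, and are an assumption for Qwen3.5:

```python
# Sketch of a YaRN context-extension config, following the Qwen3 convention
# (exact keys for Qwen3.5 are an assumption): extended context =
# original_max_position_embeddings x factor.

rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 262_144,  # native 256K context
}

extended = int(rope_scaling["original_max_position_embeddings"]
               * rope_scaling["factor"])
print(extended)  # 1048576 tokens, i.e. the 1M hosted-API context
```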

---

## 7. Tool Calling

### Confirmed Support
- **WebSearch**: Built-in web search capability
- **Function Calls**: Complex function calling supported
- **Code Interpreter**: Can execute code (in hosted API)
- **"Auto" mode** (hosted API): Adaptive tool use without manual prompting

### Framework Support for Tool Calling
- **SGLang**: `--tool-call-parser qwen3` or `qwen3_coder`
- **vLLM**: `--enable-auto-tool-choice --tool-call-parser qwen3_xml`
- **Ollama**: Supported via Qwen3 template

### Caveat
Tool calling reliability drops when reasoning/thinking mode is disabled. Best used with thinking enabled.
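For reference, the shape of a tool-calling request against a vLLM server launched with the flags above. The model name and helper function are illustrative; this only constructs the payload, and the `enable_thinking` kwarg reflects the caveat about keeping thinking on:

```python
# Shape of an OpenAI-compatible tool-calling request for a vLLM server started
# with --enable-auto-tool-choice --tool-call-parser qwen3_xml. Model name and
# the get_weather tool are illustrative placeholders.

import json

payload = {
    "model": "Qwen/Qwen3.5-35B-A3B",
    "messages": [{"role": "user", "content": "What's the weather in Bengaluru?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "tool_choice": "auto",
    # Per the caveat above: keep thinking enabled for reliable tool calls.
    "chat_template_kwargs": {"enable_thinking": True},
}

print(json.dumps(payload)[:80], "...")
```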

---

## 8. Serving Options

### Current Status (as of March 31, 2026)

| Framework | Qwen3.5 (text) | Qwen3-Omni (speech) | Qwen3.5-Omni |
|-----------|----------------|---------------------|--------------|
| vLLM | Full support | vLLM-Omni project | Not yet (no weights) |
| SGLang | Full support | Limited | Not yet |
| TensorRT-LLM | Via NVIDIA NIM | Unknown | Not yet |
| Ollama | Full support (GGUF) | No (speech not supported) | Not yet |
| Transformers | Full support | Full support | Not yet |
| KTransformers | Supported | Unknown | Not yet |

### vLLM-Omni (for speech models)

vLLM-Omni is a separate project that adds speech I/O support:
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \
  --omni --port 8091 \
  --stage-configs-path /path/to/stage_configs_file
```

### SGLang (text-only Qwen3.5)
```bash
python -m sglang.launch_server \
  --model-path Qwen/Qwen3.5-35B-A3B \
  --port 8000 --tp-size 4 \
  --context-length 262144 \
  --reasoning-parser qwen3
```

### NVIDIA NIM
Qwen3.5-397B-A17B is available as an NVIDIA NIM at `build.nvidia.com`.

### Ollama
```bash
ollama run qwen3.5  # text-only, no speech
```
Ollama does NOT support speech generation modality. Qwen3.5-Omni speech output cannot run through Ollama.

---

## 9. License

- **Qwen3.5 (text-only)**: **Apache 2.0** -- unrestricted commercial use, modification, distribution
- **Qwen3-Omni (predecessor)**: **Apache 2.0**
- **Qwen3.5-Omni**: License **not yet confirmed** as weights are not released. Expected to follow Apache 2.0 based on Qwen's pattern, but not guaranteed.
- **nvidia/Qwen3.5-397B-A17B-NVFP4**: Apache 2.0

---

## 10. aarch64 / DGX Spark Compatibility

### CRITICAL: NVFP4 Bug on ARM64 (FIXED)

**Bug**: vLLM issue [#35519](https://github.com/vllm-project/vllm/issues/35519) -- NVFP4 models crashed on ARM64 GB10 DGX Spark with "CUDA illegal instruction" during generation.

**Root cause**: `cvt.rn.satfinite.e2m1x2.f32` PTX instruction used for NVFP4 activation quantization is **SM100-only**. DGX Spark's GB10 is **SM121/SM121a**, which lacks this instruction.

**Status**: **FIXED** in vLLM PR #35947 -- implemented software E2M1 conversion for SM12x. Validated on DGX Spark: 18.3 tok/s decode (vs 15.8 with Marlin backend).

### DGX Spark Deployment Requirements

For running Qwen3.5 NVFP4 on DGX Spark (text-only models):

**Required environment variables**:
```bash
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_TEST_FORCE_FP8_MARLIN=1
```

**Required patches**:
1. **MoE Gate BF16 Lock** -- Router gate must stay full precision (qwen3_next.py line 256)
2. **MARLIN backend** -- Default FLASHINFER_CUTLASS uses unsupported MX format on SM121
3. **Unified memory tuning** -- `vm.swappiness=1`, `vm.dirty_bytes=268435456`
4. **fastsafetensors safety** -- Keep `gpu-memory-utilization <= 0.76`
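One way to assemble the launch environment above from Python (the env vars and the 0.76 ceiling come from the list; the serve invocation itself is illustrative):

```python
# Assembling the DGX Spark launch environment described above. Env vars and
# the gpu-memory-utilization ceiling mirror the requirements list; the model
# is the community NVFP4 build tested on DGX Spark.

import os
import shlex

env = {
    **os.environ,
    "VLLM_USE_FLASHINFER_MOE_FP4": "0",   # avoid MX-format CUTLASS on SM121
    "VLLM_NVFP4_GEMM_BACKEND": "marlin",  # MARLIN backend requirement
    "VLLM_TEST_FORCE_FP8_MARLIN": "1",
}

cmd = shlex.split(
    "vllm serve txn545/Qwen3.5-122B-A10B-NVFP4 "
    "--gpu-memory-utilization 0.76"       # fastsafetensors safety ceiling
)
print(" ".join(cmd))
```

This would then be launched via `subprocess.run(cmd, env=env)`; the `vm.swappiness` / `vm.dirty_bytes` sysctls from the list still need to be applied at the OS level.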

### Measured Performance on DGX Spark (Qwen3.5-122B-A10B-NVFP4)

| Metric | Value |
|--------|-------|
| Prompt processing (pp1024) | 1,553-2,014 tok/s |
| Token generation (tg128) | 14.3-15.0 tok/s |
| Time to first token | ~162 ms |
| Vision inference | ~9 tok/s |
| Model load time | ~11 minutes |
| VRAM (NVFP4) | 75.6 GB of 128 GB |
| KV cache headroom | ~52 GB (~405K tokens) |

### Qwen3.5-Omni on DGX Spark

**Unknown** -- no weights released yet. Key concerns:
- Talker component adds VRAM overhead beyond Thinker
- Speech codec (RVQ) may have additional memory requirements
- Audio processing pipeline may need torchaudio, which has [known ARM64 installation issues](https://forums.developer.nvidia.com/t/support-for-qwen3-tts-on-dgx-spark-gb10-torchaudio-installation-failure-on-arm64/359663)

---

## 11. Comparison with Nemotron Nano and Nemotron Super

### Architecture Comparison

| Feature | Qwen3.5-Omni-Plus | Nemotron 3 Nano 30B-A3B | Nemotron 3 Super 120B-A12B |
|---------|-------------------|------------------------|---------------------------|
| Total params | ~397B (est.) | 30B | 120B |
| Active params | ~17B (est.) | 3B | 12B |
| Architecture | Hybrid-Attention MoE (Gated DeltaNet) | Mamba2-Transformer MoE | Mamba2-Transformer MoE |
| Expert count | 512 | 128 | 128 |
| Context window | 256K (1M hosted) | 256K (1M supported) | 1M |
| Modalities IN | Text + Audio + Image + Video | Text + Image | Text + Image |
| Modalities OUT | Text + **Speech** | Text only | Text only |
| Speech generation | Native (36 languages) | No (needs separate TTS) | No (needs separate TTS) |
| Speech recognition | Native (113 languages) | No (needs separate ASR) | No (needs separate ASR) |
| License | TBD (likely Apache 2.0) | Open weights | Open weights |

### Voice/TTS/STT Use Cases

| Capability | Qwen3.5-Omni-Plus | Nemotron Nano + Pipeline | Annie Current Stack |
|------------|-------------------|-------------------------|-------------------|
| STT | Built-in, 113 languages | Nemotron Speech ASR (0.6B, <24ms) or Whisper | Whisper STT (custom PyTorch) |
| TTS | Built-in, 36 languages, voice cloning | Separate TTS model (magpie_tts, Kokoro) | Kokoro GPU (~30ms) |
| End-to-end latency | Single model inference | ASR + LLM + TTS cascade | STT + LLM + TTS cascade |
| Voice cloning | Native support | Not available in Nano | Not available |
| Semantic interruption | Native support | Must implement in pipeline | Custom in Pipecat |
| Streaming speech | Native | Depends on TTS model | Kokoro supports streaming |
| VRAM (total pipeline) | ~100-130 GB (NVFP4, est.) | ~20 GB (Nano) + ~1 GB (ASR) + ~2 GB (TTS) = **~23 GB** | ~20 GB (all components) |
| Multilingual | 36 TTS languages | English primary | English primary |

**Key insight**: Qwen3.5-Omni-Plus replaces the entire cascade (STT + LLM + TTS) with a single model, but at roughly **4-6x the VRAM cost** of the current Annie pipeline.
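Checking that ratio against the table's own estimates:

```python
# VRAM trade-off arithmetic using the table's estimates (GB).

cascade = 20 + 1 + 2           # Nemotron Nano LLM + ASR + TTS = 23 GB
omni_plus = (100, 130)         # estimated NVFP4 range for Omni-Plus

lo, hi = (x / cascade for x in omni_plus)
print(f"{lo:.1f}x - {hi:.1f}x")  # ~4.3x - 5.7x the cascade's footprint
```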

### Text Generation and Reasoning

| Benchmark | Qwen3.5-397B (text) | Nemotron 3 Nano 30B | Nemotron 3 Super 120B |
|-----------|---------------------|--------------------|-----------------------|
| SWE-bench Verified | **76.4%** | - | 60.47% |
| GPQA | **88.4%** | - | 82.70% |
| MMLU-Pro | **86%+** | - | 83.73% |
| HLE | **25.30%** | - | 18.26% |
| RULER 1M (long ctx) | Not reported | - | **91.75%** |
| Throughput | ~152 tok/s (122B) | **3.3x higher** than Qwen | **458-484 tok/s** |

**Trade-off**: Qwen3.5 wins on accuracy across all benchmarks. Nemotron wins on throughput (2.2-7.5x faster) and long-context retrieval.

### Tool Calling Accuracy

| Model | Tool Calling | Notes |
|-------|-------------|-------|
| Qwen3.5-397B | Supported via function calling | Works best with thinking enabled |
| Nemotron 3 Nano | Basic tool calling | Optimized for speed, not accuracy |
| Nemotron 3 Super | 97% tool accuracy (claimed) | +22 SWE-Bench vs Nano; drops when reasoning disabled |

Nemotron Super has the edge in agentic tool calling for production use. Qwen3.5 has higher raw benchmark scores but tool calling reliability has caveats.

### VRAM Footprint

| Model | BF16 | INT4 | NVFP4 | Fits DGX Spark (128 GB)? |
|-------|------|------|-------|--------------------------|
| Qwen3.5-Omni-Plus | ~800 GB (est.) | ~200 GB (est.) | ~100 GB (est.) | Only NVFP4 (tight) |
| Qwen3.5-Omni-Flash | ~70 GB (est.) | ~20 GB (est.) | ~18 GB (est.) | Yes, comfortably |
| Qwen3.5-Omni-Light | ~18 GB (est.) | ~5 GB (est.) | - | Yes, easily |
| Nemotron Nano 30B-A3B | ~60 GB | ~17 GB | ~15 GB | Yes |
| Nemotron Super 120B-A12B | ~240 GB | ~65 GB | ~40 GB | NVFP4 only |

---

## Summary: Relevance to her-os / Annie

### What Qwen3.5-Omni-Plus Offers

1. **Single-model voice pipeline** -- eliminates STT/TTS cascade complexity
2. **Voice cloning** -- Annie could have a consistent, cloned voice identity
3. **113-language ASR** -- massive multilingual advantage (Kannada for mom!)
4. **36-language TTS** -- covers Indian languages potentially
5. **Semantic interruption** -- built-in, no Pipecat custom logic needed
6. **Audio-visual understanding** -- can process video with audio natively

### Why NOT to Adopt (Yet)

1. **Not released as open weights** -- API-only via Alibaba Cloud Bailian as of today
2. **No NVFP4 quantization** -- cannot run locally on Titan
3. **VRAM prohibitive** -- Plus variant likely needs 100+ GB even quantized, leaving no room for other models
4. **No vLLM/SGLang support** -- Omni variant serving infrastructure not ready
5. **Latency unknown** -- single-model may be slower than optimized cascade for real-time voice
6. **Current Annie pipeline works well** -- Nemotron Nano (48-65 tok/s) + Kokoro (~30ms TTS) + Whisper STT at ~23 GB total

### Recommendation

**WAIT. Do not adopt Qwen3.5-Omni-Plus now.** Monitor these milestones:

1. **Open weights release** on HuggingFace (expected weeks to months)
2. **NVFP4 quantization** by NVIDIA or community
3. **vLLM-Omni support** for Qwen3.5-Omni
4. **DGX Spark compatibility** testing
5. **Qwen3.5-Omni-Flash** or **Light** variants -- these could fit Titan's VRAM budget alongside other models

**The Flash variant** (likely ~35B/3B) is the most promising for Annie:
- Could fit in ~20 GB (INT4) alongside other models
- Native speech I/O eliminates cascade latency
- If NVFP4 works, could be ~15-18 GB

**When open weights drop**, the research priority should be:
1. Run Qwen3.5-Omni-Flash on Titan in INT4
2. Benchmark latency vs current cascade (STT + Nano + Kokoro)
3. Test Kannada ASR quality (113 languages claimed)
4. Test voice cloning with Annie's voice identity
5. If competitive, plan migration from cascade to single-model

---

## 12. Qwen3-TTS — Standalone TTS Model (Separate from Omni)

**Important distinction**: Qwen3-TTS is a standalone text-to-speech model, separate from the Qwen3.5-Omni Thinker-Talker architecture. There is **no Qwen3.5-TTS** — the latest standalone TTS is Qwen3-TTS (released January 2026).

### Model Variants

| Model | Params | VRAM | Use Case |
|-------|-------:|-----:|----------|
| Qwen3-TTS-12Hz-1.7B | 1.7B | ~2-7 GB | Flagship — best quality, voice cloning |
| Qwen3-TTS-12Hz-0.6B | 0.6B | ~1-5 GB | Lightweight — edge/real-time |

Sub-variants: Base (standard TTS), CustomVoice (3-second voice cloning).

### Key Features

- **Languages**: 10 (Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian)
- **No Indian languages** (no Kannada, Hindi, Tamil, etc.)
- **Voice cloning**: 3-second reference audio → cloned voice
- **First-packet latency**: 97ms (streaming)
- **Natural language voice control**: "speak warmly, slowly" style instructions
- **Training data**: 5M+ hours of speech
- **Architecture**: Qwen3-TTS-Tokenizer-12Hz (12 tokens/sec) + dual-track LM (non-DiT)
- **License**: Apache 2.0
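The 12 Hz tokenizer sets the streaming budget: every second of output audio costs 12 speech tokens, so real-time playback needs the LM to sustain at least 12 tokens/sec, plus headroom for the 97ms first packet. A trivial sketch of that arithmetic:

```python
# Token budget implied by the 12 Hz speech tokenizer: audio seconds map to
# speech tokens at 12 tokens/sec, which is also the minimum sustained LM
# generation rate for real-time streaming.

TOKENS_PER_SEC = 12  # Qwen3-TTS-Tokenizer-12Hz rate

def speech_tokens(audio_seconds: float) -> int:
    return round(audio_seconds * TOKENS_PER_SEC)

print(speech_tokens(1.0))   # 12 tokens per second of audio
print(speech_tokens(30.0))  # 360 tokens for a 30-second utterance
```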

### vs Kokoro (Annie's Current TTS)

| Feature | Qwen3-TTS 0.6B | Kokoro |
|---------|----------------|--------|
| Params | 0.6B | 82M |
| First-packet latency | ~97ms | ~30ms |
| VRAM | ~2-5 GB | ~2 GB |
| Voice cloning | Yes (3-sec) | No |
| Languages | 10 | English primary |
| Kannada/Hindi | No | No |
| Quality | Claimed SOTA (vs ElevenLabs) | Good |
| DGX Spark aarch64 | **Broken** (torchaudio) | **Works** (patched) |

### DGX Spark Blocker

Same torchaudio ARM64 issue as Qwen3.5-Omni. Active NVIDIA forum thread with no clean resolution yet. Qwen3-TTS depends on torchaudio for audio processing.

### Verdict for Annie

**Don't switch from Kokoro yet.** Hard blockers:
1. torchaudio ARM64 bug on DGX Spark
2. No Kannada/Hindi (neither Kokoro nor Qwen3-TTS helps for "talk to mom")

When torchaudio is fixed: the 0.6B CustomVoice variant (~2 GB) would be a near-drop-in Kokoro replacement with voice cloning + multilingual + SOTA quality upgrades.

---

## Sources

- [nvidia/Qwen3.5-397B-A17B-NVFP4 -- HuggingFace](https://huggingface.co/nvidia/Qwen3.5-397B-A17B-NVFP4)
- [Qwen3.5 Collection -- HuggingFace](https://huggingface.co/collections/Qwen/qwen35)
- [Qwen3-Omni GitHub](https://github.com/QwenLM/Qwen3-Omni)
- [vLLM Issue #35519 -- NVFP4 ARM64 Bug](https://github.com/vllm-project/vllm/issues/35519)
- [DGX Spark Qwen3.5-122B NVFP4 Forum Thread](https://forums.developer.nvidia.com/t/qwen3-5-122b-a10b-nvfp4-quantized-for-dgx-spark-234gb-75gb-runs-on-128gb/361819)
- [Qwen3.5 Lineup & Architecture -- StableLearn](https://stable-learn.com/en/qwen35-native-multimodal-agent-model/)
- [Qwen3.5-Omni 215 SOTA Benchmarks -- ToolMesh](https://www.toolmesh.ai/news/qwen3-5-omni-model-released-sota-vibe-coding)
- [Qwen3.5-Omni Multimodal Voice Launch -- Aihola](https://aihola.com/article/qwen35-omni-multimodal-voice-launch)
- [Qwen3.5: Towards Native Multimodal Agents -- Simon Willison](https://simonwillison.net/2026/Feb/17/qwen35/)
- [Nemotron 3 Super vs Qwen 3.5 -- BestAIFor](https://www.bestaifor.com/blog/nemotron-3-super-vs-qwen-3-5-when-speed-and-accuracy-point-in-opposite-directions)
- [Qwen3.5-35B vs Nemotron 3 Nano -- AwesomeAgents](https://awesomeagents.ai/tools/qwen-3-5-35b-a3b-vs-nemotron-3-nano/)
- [NVIDIA Nemotron Voice Agent Blueprint](https://github.com/NVIDIA-AI-Blueprints/nemotron-voice-agent)
- [Qwen3.5 License -- GitHub](https://github.com/QwenLM/Qwen3.5/blob/main/LICENSE)
- [Qwen3-Omni Technical Report -- arXiv](https://arxiv.org/abs/2509.17765)
- [GPU VRAM Guide for Qwen3.5 -- ApXML](https://apxml.com/posts/qwen-3-5-system-requirement-vram-guide)
- [Qwen3.5 VRAM Breakdown -- Kaitchup](https://kaitchup.substack.com/p/qwen35-9b-4b-2b-and-08b-gpu-requirements)
- [NVIDIA NIM Qwen3.5-397B](https://build.nvidia.com/qwen/qwen3.5-397b-a17b/modelcard)
- [DGX Spark torchaudio ARM64 Issue](https://forums.developer.nvidia.com/t/support-for-qwen3-tts-on-dgx-spark-gb10-torchaudio-installation-failure-on-arm64/359663)
- [Nemotron 3S vs Qwen3.5 Medium](https://agentnativedev.medium.com/nemotron-3s-qwen3-5-one-gpu-with-120b-12b-parameters-e3cbd10d32e6)
- [Qwen3-TTS GitHub](https://github.com/QwenLM/Qwen3-TTS)
- [Qwen3-TTS Technical Report -- arXiv](https://arxiv.org/abs/2601.15621)
- [Qwen3-TTS HuggingFace Demo](https://huggingface.co/spaces/Qwen/Qwen3-TTS)
- [Qwen3-TTS Performance Benchmarks & Hardware Guide](https://qwen3-tts.app/blog/qwen3-tts-performance-benchmarks-hardware-guide-2026)
- [Qwen3-TTS 1.7B CustomVoice -- HuggingFace](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice)
- [Qwen3-TTS 0.6B Base -- HuggingFace](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-0.6B-Base)
- [Qwen Blog: Qwen3-TTS Open Source Announcement](https://qwen.ai/blog?id=qwen3tts-0115)
