Generating Furina-Style Speech With CosyVoice3 On Arc B50 And Quadro M4000

The reliable path for Furina-style CosyVoice3 speech on Arc Pro B50 is Ubuntu 26.04, the official PyTorch XPU wheel index, a small CosyVoice patch that replaces CUDA assumptions with CUDA/XPU device helpers, the newest matching torch/torchaudio XPU pair, and a hard Transformers cap below 4.53. For a Quadro M4000, the newest working CUDA stack is not the newest CUDA stack overall: PyTorch 2.11.0 with cu126 works because it still ships Maxwell-compatible kernels, while newer CUDA 12.8 and CUDA 13 wheels do not.

This started as a simple automation idea: collect LLVM and HPC news, summarize it, and have a Furina-style voice read the result aloud.

The text side was not the interesting part. The hard part was making CosyVoice3 generate that audio on the right GPU.

There were two very different targets:

  • Intel Arc Pro B50, where the goal was to use the current upstream PyTorch XPU path instead of Intel Extension for PyTorch or an older Intel-specific fork.
  • NVIDIA Quadro M4000, an old Maxwell GPU, where the goal was to find the newest CUDA stack that still includes kernels for compute capability 5.2.

Both runs had to end with a real WAV/OGG file, and both had to prove that the core PyTorch model inference happened on the GPU rather than silently falling back to CPU.

This post is the final runbook. It assumes a fresh Ubuntu 26.04 system, internet access, Docker or normal host install privileges, and one of these GPUs installed. The Arc B50 path is the primary path. The M4000 path is included because old NVIDIA cards have a different failure mode: the newest driver still supports the card, but a wheel's user-space CUDA stack may no longer contain code for the GPU.

One practical note before the commands: the verification used a public Furina English sample as the prompt voice. That is fine for a personal lab run. If you publish generated audio, replace the prompt with voice data you have the right to use.

A diagram shows one text input flowing into CosyVoice3, then splitting into an Intel Arc B50 XPU lane and an NVIDIA M4000 CUDA lane, and finally converging on GPU evidence, WAV, OGG, and ASR sanity checks.

Figure 1. The problem was not just text-to-speech. The working system had to choose the correct GPU stack, patch or avoid framework assumptions, generate audio, and prove that model inference used the accelerator.

The Artifacts In This Post

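The exact final engineering bundle lives outside this blog repository, but this post includes the small reusable pieces needed to reproduce the run: the CosyVoice XPU patch, the XPU stack resolver, two requirements files, two device checks, and helper scripts for model download, prompt-voice setup, generation, and the ASR sanity check.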

The two device-check attachments, check_xpu.py and check_cuda.py, are intentionally included as source files in this page bundle so the verification commands below can be run directly after the download loop.

Download those files into a local assets directory:

export POST_BASE="https://blog.ouankou.com/2026/06/05/furina-voice-cosyvoice3-on-arc-b50-and-quadro-m4000"
mkdir -p ~/Projects/furina-voice-assets
cd ~/Projects/furina-voice-assets

for f in \
  cosyvoice-arc-xpu.patch \
  resolve_xpu_stack.py \
  requirements-xpu-latest.in \
  requirements-cuda-cosyvoice.in \
  check_xpu.py \
  check_cuda.py \
  download_model.py \
  setup_furina_voice.py \
  generate_furina.py \
  asr_sanity_check.py
do
  curl -fsSLO "$POST_BASE/$f"
done

chmod +x *.py
export FURINA_ASSETS="$PWD"

If you are reading the source checkout instead of the deployed site, set FURINA_ASSETS to the page-bundle directory that contains those files.
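Because `curl -fsSLO` runs quietly, it is worth confirming that every file actually arrived before moving on. The following is a minimal sketch, not part of the published bundle: the helper name `missing_assets` is hypothetical, and the file names are copied from the download loop above.

```python
import os
from pathlib import Path

# File names copied from the download loop in this post.
EXPECTED_ASSETS = [
    "cosyvoice-arc-xpu.patch",
    "resolve_xpu_stack.py",
    "requirements-xpu-latest.in",
    "requirements-cuda-cosyvoice.in",
    "check_xpu.py",
    "check_cuda.py",
    "download_model.py",
    "setup_furina_voice.py",
    "generate_furina.py",
    "asr_sanity_check.py",
]


def missing_assets(assets_dir):
    """Return expected files that are absent or empty (hypothetical helper)."""
    root = Path(assets_dir)
    return [
        name for name in EXPECTED_ASSETS
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


if __name__ == "__main__":
    # FURINA_ASSETS is the variable exported after the download loop.
    missing = missing_assets(os.environ.get("FURINA_ASSETS", "."))
    print("missing=" + (",".join(missing) if missing else "none"))
```

An empty result means the later commands can rely on `$FURINA_ASSETS`.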

Common Ubuntu 26.04 Setup

Start with a physical Ubuntu 26.04 host. Install the common tools first:

sudo apt-get update
sudo apt-get install -y --no-install-recommends \
  build-essential ca-certificates curl ffmpeg git git-lfs pciutils \
  python3 python3-dev python3-pip python3-venv wget

Install Docker if you want to reproduce the container verification path:

sudo apt-get update
sudo apt-get install -y docker.io
sudo usermod -aG docker "$USER"

Log out and back in so group membership applies, then verify Docker:

docker run --rm hello-world

The rest of the post uses normal host virtual environments. The verified engineering bundle also ran these paths in Ubuntu 26.04 Docker containers, with /dev/dri passed through for Intel and /dev/nvidia* plus host driver libraries passed through for NVIDIA.

Arc Pro B50: Host Runtime

On Ubuntu 26.04, the Intel user-space runtime packages are available directly from the distro repositories. Install them:

sudo apt-get update
sudo apt-get install -y --no-install-recommends \
  clinfo intel-ocloc intel-opencl-icd libgomp1 libsndfile1 \
  libze-intel-gpu1 libze1 ocl-icd-libopencl1

sudo usermod -aG render,video "$USER"

Log out and back in, or reboot. Then verify that the GPU device and permissions are visible:

lspci -nn | grep -Ei 'intel|vga|display|3d'
ls -l /dev/dri
groups
clinfo | grep -E 'Platform Name|Device Name|Device Type|Driver Version' | head -40

The important facts are:

  • /dev/dri/renderD* exists.
  • groups includes render.
  • clinfo shows an Intel GPU device.
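The first two facts can also be checked from Python, which is handy inside a container where `groups` may not reflect what the process actually sees. This is a sketch under assumptions: the function names are hypothetical, and it only looks at device nodes and group membership, not at whether Level Zero can open the device.

```python
import glob
import grp
import os


def render_nodes(dri_dir="/dev/dri"):
    """List DRM render nodes such as /dev/dri/renderD128."""
    return sorted(glob.glob(os.path.join(dri_dir, "renderD*")))


def current_group_names():
    """Names of the groups the current process belongs to."""
    names = set()
    for gid in os.getgroups():
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # A numeric gid without a name entry; skip it.
            pass
    return names


if __name__ == "__main__":
    print("render_nodes=" + ",".join(render_nodes()))
    print("in_render_group=" + str("render" in current_group_names()))
```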

On the verified host, PyTorch later reported:

xpu_device_name=Intel(R) Arc(TM) Pro B50 Graphics

Arc Pro B50: Clone And Patch CosyVoice

Create a clean workspace:

mkdir -p ~/Projects
cd ~/Projects
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git furina-cosyvoice-b50
cd ~/Projects/furina-cosyvoice-b50
git checkout ace7c47f41bbd303aa6bf1ea80e6f9fbd595cd40
git submodule update --init --recursive
git apply "$FURINA_ASSETS/cosyvoice-arc-xpu.patch"

The patch is the key B50-specific change. CosyVoice is mostly CUDA-oriented upstream. The patch does not try to make every optional accelerator feature work on Intel GPUs. It does something narrower:

  • choose xpu when CUDA is unavailable and torch.xpu.is_available() is true;
  • replace CUDA-only autocast, stream, synchronize, cache, and seed calls with device-aware helpers;
  • keep TensorRT and vLLM disabled on non-CUDA devices;
  • keep ONNX frontend sessions on CPU where needed;
  • move the PyTorch CosyVoice model weights and tensors to XPU.

That split matters. Text normalization, ONNX frontend work, file I/O, and WAV writing can remain CPU-side. The required condition is that the main PyTorch model inference runs on XPU.
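To make the idea of a device-aware helper concrete, here is a minimal sketch of the pattern. It is not the contents of cosyvoice-arc-xpu.patch: the function names are hypothetical, and only `torch.cuda.is_available()`, `torch.xpu.is_available()`, `torch.cuda.synchronize()`, and `torch.xpu.synchronize()` are real PyTorch calls.

```python
def pick_device(cuda_available, xpu_available):
    """Prefer CUDA, then XPU, then CPU."""
    if cuda_available:
        return "cuda"
    if xpu_available:
        return "xpu"
    return "cpu"


def detect_device():
    """Query PyTorch for the flags. torch is imported lazily so the
    selection logic above stays testable without it."""
    import torch

    xpu = getattr(torch, "xpu", None)  # absent on very old builds
    xpu_ok = bool(xpu is not None and xpu.is_available())
    return pick_device(torch.cuda.is_available(), xpu_ok)


def synchronize(device_type):
    """Device-aware stand-in for a bare torch.cuda.synchronize() call."""
    import torch

    if device_type == "cuda":
        torch.cuda.synchronize()
    elif device_type == "xpu":
        torch.xpu.synchronize()
    # CPU work is already synchronous, so there is nothing to do.
```

Keeping the choice in one function means the rest of the code asks "which device?" once instead of assuming CUDA at every call site.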

Arc Pro B50: Python And PyTorch XPU

Install uv and use Python 3.12 for the verified Ubuntu 26.04 lane:

curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
uv python install 3.12

cd ~/Projects/furina-cosyvoice-b50
uv venv --seed --python 3.12 .venv
.venv/bin/python -m pip install --upgrade pip "setuptools<82" wheel

Resolve the newest matching torch and torchaudio pair from the official PyTorch XPU wheel index:

.venv/bin/python "$FURINA_ASSETS/resolve_xpu_stack.py" \
  --output /tmp/xpu-stack.txt
cat /tmp/xpu-stack.txt

.venv/bin/pip install -r /tmp/xpu-stack.txt
.venv/bin/pip install --upgrade --prefer-binary \
  -r "$FURINA_ASSETS/requirements-xpu-latest.in"

The resolver matters because the latest torch wheel is not enough by itself. CosyVoice uses the PyTorch audio stack, so torch and torchaudio should be version-aligned.

During verification on June 3, 2026, the XPU index had this shape:

torch latest: 2.12.0+xpu
torchaudio latest: 2.11.0+xpu
selected matching pair: 2.11.0+xpu

So the correct choice was not “install the biggest torch version number.” The correct choice was “install the newest matching torch/torchaudio XPU pair.”
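The selection rule itself is small. The sketch below is an illustration of that rule, not the code in resolve_xpu_stack.py: it assumes the available versions have already been read from the index, treats "matching" as "same release number", and ignores anything that is not a plain X.Y.Z release. The function names are hypothetical.

```python
import re

_RELEASE = re.compile(r"(\d+)\.(\d+)\.(\d+)")


def release_tuple(version):
    """Turn '2.11.0+xpu' into (2, 11, 0); return None for other shapes."""
    match = _RELEASE.fullmatch(version.split("+", 1)[0])
    return tuple(int(group) for group in match.groups()) if match else None


def select_matching_pair(torch_versions, torchaudio_versions):
    """Return the newest release present in both lists, or None."""
    torch_set = {release_tuple(v) for v in torch_versions}
    audio_set = {release_tuple(v) for v in torchaudio_versions}
    common = (torch_set & audio_set) - {None}
    if not common:
        return None
    return ".".join(str(number) for number in max(common))


if __name__ == "__main__":
    torch_versions = ["2.10.0+xpu", "2.11.0+xpu", "2.12.0+xpu"]
    torchaudio_versions = ["2.10.0+xpu", "2.11.0+xpu"]
    print(select_matching_pair(torch_versions, torchaudio_versions))
```

With the index shape shown above, the newest torch release has no torchaudio partner, so the intersection picks 2.11.0.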

A decision diagram shows the PyTorch XPU index publishing torch 2.12 but torchaudio 2.11, causing the resolver to select the matching 2.11 pair, while Transformers 4.53 and newer are rejected by ASR checks.

Figure 2. The B50 path is conservative only where evidence forced it. PyTorch XPU stays on the newest matching domain-wheel pair, while Transformers is capped because newer versions produced wrong speech despite valid GPU markers.

Now prove XPU is usable before loading CosyVoice:

cd ~/Projects/furina-cosyvoice-b50
PYTORCH_ENABLE_XPU_FALLBACK=0 .venv/bin/python "$FURINA_ASSETS/check_xpu.py"

Required markers:

PYTORCH_ENABLE_XPU_FALLBACK=0
xpu_available=True
probe_tensor_device=xpu:0
xpu_memory_after=...

PYTORCH_ENABLE_XPU_FALLBACK=0 is important. It disables silent unsupported-op fallback from XPU to CPU. If the later generation works with this set, the run is much harder to misread.
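If the check runs unattended, the markers are easy to verify mechanically. This is a sketch that assumes the script prints one `key=value` pair per line, as in the listing above. The function names are hypothetical, and because the memory value is elided in this post, the sketch only requires that marker to be present and non-empty.

```python
def parse_markers(output):
    """Collect key=value lines from a check script's stdout."""
    markers = {}
    for line in output.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key:
            markers[key] = value
    return markers


def failed_markers(markers):
    """Return the names of required markers that are missing or wrong."""
    required = {
        "PYTORCH_ENABLE_XPU_FALLBACK": lambda v: v == "0",
        "xpu_available": lambda v: v == "True",
        "probe_tensor_device": lambda v: v.startswith("xpu"),
        "xpu_memory_after": lambda v: v != "",
    }
    return [
        name for name, is_ok in required.items()
        if name not in markers or not is_ok(markers[name])
    ]
```

An empty list from `failed_markers` means every required marker was present with an acceptable value.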

Do not use plain pip install torch torchaudio from PyPI for the B50 setup. In the Ubuntu 26.04 negative control, plain PyPI installed a CUDA build:

torch=2.12.0+cu130
cuda_available=False
has_xpu=True
xpu_available=False

The conventional upstream PyTorch path for this Intel GPU is the official XPU wheel index:

https://download.pytorch.org/whl/xpu
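A cheap guard against the wrong wheel is to look at the local tag in the version string, the part after the `+`. The sketch below assumes the version carries a tag such as `+xpu` or `+cu130`, as in the outputs shown in this post. The function name is hypothetical.

```python
def wheel_backend(torch_version):
    """Classify a torch version string by its local tag.

    '2.11.0+xpu' -> 'xpu', '2.12.0+cu130' -> 'cuda'.
    """
    _, sep, local = torch_version.partition("+")
    if not sep:
        return "unknown"  # no local tag to inspect
    if local.startswith("cu"):
        return "cuda"
    if local.startswith("xpu"):
        return "xpu"
    if local.startswith("cpu"):
        return "cpu"
    return "unknown"


if __name__ == "__main__":
    print(wheel_backend("2.12.0+cu130"))
    print(wheel_backend("2.11.0+xpu"))
```

The tag only says how the wheel was built. It does not replace the `xpu_available=True` and tensor-probe checks, which confirm that the runtime can actually reach the GPU.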

Arc Pro B50: Model And Prompt Voice

Download the CosyVoice3 model:

cd ~/Projects/furina-cosyvoice-b50
.venv/bin/python "$FURINA_ASSETS/download_model.py" \
  --repo-id FunAudioLLM/Fun-CosyVoice3-0.5B-2512 \
  --local-dir pretrained_models/Fun-CosyVoice3-0.5B

The model is about 9 GB.

Extract the prompt voice:

cd ~/Projects/furina-cosyvoice-b50
.venv/bin/python \
  "$FURINA_ASSETS/setup_furina_voice.py" \
  --repo-root "$PWD"

The helper uses Hugging Face dataset NaruseShiroha/Genshin-Furina-English, parquet file data/train-00000-of-00002.parquet, row 20. It writes:

voices/furina_en/prompt.wav
voices/furina_en/prompt.txt
voices/furina_en/prompt_cosyvoice3.txt
voices/furina_en/metadata.json

CosyVoice3 generation uses prompt.wav and prompt.txt directly. There is no spk2info.pt speaker registration step in this runbook.

Arc Pro B50: Generate Audio

Generate from inline text:

cd ~/Projects/furina-cosyvoice-b50
mkdir -p outputs

PYTORCH_ENABLE_XPU_FALLBACK=0 .venv/bin/python \
  "$FURINA_ASSETS/generate_furina.py" \
  --backend xpu \
  --repo-root "$PWD" \
  --text "Welcome to Fontaine. The compiler stage is ready, and the GPU will speak." \
  --speed 0.9 \
  --output outputs/furina_test.wav

Generate from a text file:

cd ~/Projects/furina-cosyvoice-b50
cat > /tmp/furina_tts_text.txt <<'EOF'
Good morning, Traveller. LLVM has taken the stage again, and the offloading world has more than a little drama to report.
EOF

PYTORCH_ENABLE_XPU_FALLBACK=0 .venv/bin/python \
  "$FURINA_ASSETS/generate_furina.py" \
  --backend xpu \
  --repo-root "$PWD" \
  --text-file /tmp/furina_tts_text.txt \
  --speed 0.9 \
  --output outputs/furina_from_file.wav

Convert the WAV to a Telegram-friendly OGG:

ffmpeg -y -i outputs/furina_test.wav \
  -c:a libvorbis -q:a 4 outputs/furina_test.ogg

The generation run must print these GPU markers:

PYTORCH_ENABLE_XPU_FALLBACK=0
xpu_available=True
probe_tensor_device=xpu:0
model_device=xpu:0
output=...
sample_rate=24000

On the verified B50 Docker lane, the evidence was:

torch=2.11.0+xpu
xpu_available=True
xpu_device_count=1
xpu_device_name=Intel(R) Arc(TM) Pro B50 Graphics
probe_tensor_device=xpu:0
xpu_memory_after_probe=2097152
model_device=xpu:0
sample_rate=24000
duration_seconds=5.540
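The sample rate and duration can be cross-checked against the file itself, independently of what the generator printed. This is a minimal sketch using only the standard library. It assumes the output is an uncompressed integer PCM WAV, which is the only kind the `wave` module reads. The function name is hypothetical.

```python
import wave


def wav_info(path):
    """Return (sample_rate, duration_seconds) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
    return rate, frames / rate


if __name__ == "__main__":
    import sys

    rate, seconds = wav_info(sys.argv[1])
    print(f"sample_rate={rate}")
    print(f"duration_seconds={seconds:.3f}")
```

A zero-length or implausibly short file is a useful early warning even before the content check.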

The Transformers Trap

The most deceptive failure in this project was not a hard crash.

Several dependency combinations produced all the right GPU markers and generated an audio file, but the speech was unrelated to the input text. It sounded a little like the target voice, but it was semantically broken.

That is why the working requirement file keeps most dependencies loose but caps Transformers:

transformers>=4.51.3,<4.53
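Since a loose reinstall can quietly lift the cap, a generation script may want to assert it at startup. The following is a sketch with a hypothetical function name. It assumes plain dotted release strings and does not handle pre-release suffixes.

```python
def version_tuple(version):
    """'4.52.4' -> (4, 52, 4); assumes a plain dotted release string."""
    return tuple(int(part) for part in version.split("."))


def within_transformers_cap(version):
    """True when version satisfies >=4.51.3,<4.53."""
    return (4, 51, 3) <= version_tuple(version) < (4, 53)


if __name__ == "__main__":
    for candidate in ("4.51.3", "4.52.4", "4.53.0", "4.57.6"):
        print(candidate, within_transformers_cap(candidate))
```

In a real script the input would come from `importlib.metadata.version("transformers")`.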

The boundary test looked like this:

transformers==4.52.4: pass, ASR matched requested text
transformers==4.53.0: fail, generated unrelated speech
transformers==4.53.1: fail, generated unrelated speech
transformers==4.53.2: fail, generated unrelated speech
transformers==4.53.3: fail, generated unrelated speech
transformers==4.54.0: fail, generated unrelated speech
transformers==4.54.1: fail, generated unrelated speech
transformers==4.56.2: fail, generated unrelated speech
loose current transformers==4.57.6: fail, generated unrelated speech

The lesson is simple: for generative audio, “the GPU ran” is necessary evidence, but it is not sufficient evidence. You also need a content sanity check.

Run Whisper ASR on the generated WAV:

cd ~/Projects/furina-cosyvoice-b50
.venv/bin/python "$FURINA_ASSETS/asr_sanity_check.py" \
  --audio outputs/furina_test.wav \
  --expected "Welcome to Fontaine. The compiler stage is ready, and the GPU will speak."

The verified good run transcribed as:

expected=Welcome to Fontaine. The compiler stage is ready, and the GPU will speak.
transcript=Welcome to Fontaine. The compiler stage is ready and the GPU will speak.
expected_keywords=welcome fontaine compiler stage ready gpu will speak
passed=True

The failed loose-dependency run transcribed as unrelated text. That made the regression obvious even though the GPU markers were valid.
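The comparison step does not need to be exact string equality, because ASR output drops or changes punctuation. Here is one way a keyword-overlap check could look. It is a sketch, not the logic in asr_sanity_check.py: the stop-word list and the 0.8 threshold are assumptions, chosen so that the example reproduces the `expected_keywords` line above.

```python
import re

# Hypothetical stop-word list; an assumption for this sketch.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def keywords(text):
    """Lowercase, strip punctuation, drop stop words, keep first-seen order."""
    seen = []
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        if word not in STOP_WORDS and word not in seen:
            seen.append(word)
    return seen


def content_matches(expected, transcript, min_ratio=0.8):
    """Pass when enough expected keywords appear in the transcript."""
    wanted = keywords(expected)
    if not wanted:
        return False
    heard = set(keywords(transcript))
    hits = sum(1 for word in wanted if word in heard)
    return hits / len(wanted) >= min_ratio
```

A ratio threshold tolerates one misheard word in a long sentence while still rejecting speech that is unrelated to the request.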

A proof ladder starts with device visibility, then a tensor probe, then model parameter device, then generated WAV and OGG, then ASR content match. The diagram marks device availability alone as insufficient.

Figure 3. A reliable audio run needs two kinds of evidence: accelerator execution and semantic audio content. Device visibility alone is not enough.

Quadro M4000: Why The Newest CUDA Stack Fails

The M4000 path is shorter because CosyVoice does not need a source patch on CUDA. The hard part is choosing a wheel that still supports Maxwell.

The Quadro M4000 is compute capability 5.2. The host can run a recent R580 driver, and nvidia-smi may report a high CUDA compatibility level, but that does not mean every PyTorch wheel contains sm_52 kernels.

The tested stack matrix was:

| Stack | Result |
| --- | --- |
| Plain PyPI torch 2.12 / CUDA 13 packages | Fails: wheel supports newer architectures, M4000 gets "no kernel image is available" |
| torch==2.11.0+cu128, torchaudio==2.11.0+cu128 | Fails: supports sm_75+, M4000 is sm_52 |
| torch==2.10.0+cu128, torchaudio==2.10.0+cu128 | Fails: supports sm_70+, M4000 is sm_52 |
| torch==2.11.0+cu126, torchaudio==2.11.0+cu126 | Passes: arch list includes sm_50 |
| torch==2.10.0+cu126 | Passes, but older than the selected 2.11 pair |

So the selected M4000 stack is:

torch==2.11.0+cu126
torchaudio==2.11.0+cu126
torchcodec>=0.13,<0.14
transformers>=4.51.3,<4.53

torchcodec is capped because the latest TorchCodec line tried during the investigation pulled CUDA 13-linked libraries and failed against the cu126 stack. Prompt WAV decoding does not need GPU execution, so a compatible TorchCodec line is enough.
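The architecture test behind the stack matrix can be sketched as a small function. It relies on the CUDA compatibility rule that a binary kernel built for `sm_XY` runs on devices with the same major version and an equal or higher minor version, which is why `sm_50` kernels serve a compute capability 5.2 card. The function name is hypothetical. Entries with suffixes such as `sm_90a` are skipped for simplicity, and PTX entries (`compute_XY`) are treated as usable on any equal or newer capability.

```python
def arch_list_supports(arch_list, capability):
    """Check whether a wheel's arch list can serve a device capability."""
    major, minor = capability
    for entry in arch_list:
        kind, _, digits = entry.partition("_")
        if not digits.isdigit():
            continue  # skip suffixed or malformed entries
        number = int(digits)
        arch = (number // 10, number % 10)
        if kind == "sm" and arch[0] == major and arch[1] <= minor:
            return True  # binary kernels run within the same major version
        if kind == "compute" and arch <= (major, minor):
            return True  # PTX can be JIT-compiled for newer devices
    return False


if __name__ == "__main__":
    cu126 = ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"]
    print(arch_list_supports(cu126, (5, 2)))
```

In a live environment the two inputs would come from `torch.cuda.get_arch_list()` and `torch.cuda.get_device_capability()`.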

Quadro M4000: Host Setup

Install base tools:

sudo apt-get update
sudo apt-get install -y --no-install-recommends \
  ca-certificates curl git git-lfs pciutils ubuntu-drivers-common wget

Install an R580 NVIDIA driver:

ubuntu-drivers devices
sudo apt-get install -y nvidia-driver-580 || sudo ubuntu-drivers autoinstall
sudo reboot

After reboot:

nvidia-smi
lspci -nn | grep -Ei 'nvidia|vga|3d'
ls -l /dev/nvidia*

Expected facts:

GPU: Quadro M4000
Compute capability: 5.2
Driver branch: R580
/dev/nvidia0, /dev/nvidiactl, and /dev/nvidia-uvm exist

Quadro M4000: Install And Generate

Clone CosyVoice:

mkdir -p ~/Projects
cd ~/Projects
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git furina-cosyvoice-m4000
cd ~/Projects/furina-cosyvoice-m4000
git checkout ace7c47f41bbd303aa6bf1ea80e6f9fbd595cd40
git submodule update --init --recursive

Create the Python environment:

curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
uv python install 3.10

cd ~/Projects/furina-cosyvoice-m4000
uv venv --seed --python 3.10 .venv
.venv/bin/python -m pip install --upgrade pip "setuptools<82" wheel

cat > /tmp/m4000-pytorch-cu126.txt <<'EOF'
--extra-index-url https://download.pytorch.org/whl/cu126
torch==2.11.0+cu126
torchaudio==2.11.0+cu126
EOF

.venv/bin/pip install -r /tmp/m4000-pytorch-cu126.txt
.venv/bin/pip install --upgrade --prefer-binary \
  -r "$FURINA_ASSETS/requirements-cuda-cosyvoice.in"

Verify CUDA:

cd ~/Projects/furina-cosyvoice-m4000
.venv/bin/python "$FURINA_ASSETS/check_cuda.py"

Required markers:

"cuda_available": true
"device_name": "Quadro M4000"
"capability": [5, 2]
"arch_list": ["sm_50", ...]
"probe_tensor_device": "cuda:0"

Download the model and extract the same prompt voice:

cd ~/Projects/furina-cosyvoice-m4000
.venv/bin/python "$FURINA_ASSETS/download_model.py" \
  --repo-id FunAudioLLM/Fun-CosyVoice3-0.5B-2512 \
  --local-dir pretrained_models/Fun-CosyVoice3-0.5B

.venv/bin/python "$FURINA_ASSETS/setup_furina_voice.py" \
  --repo-root "$PWD"

Generate audio:

cd ~/Projects/furina-cosyvoice-m4000
mkdir -p outputs

.venv/bin/python "$FURINA_ASSETS/generate_furina.py" \
  --backend cuda \
  --repo-root "$PWD" \
  --text "Welcome to Fontaine. The compiler stage is ready, and the GPU will speak." \
  --speed 0.9 \
  --output outputs/furina_test.wav

ffmpeg -y -i outputs/furina_test.wav \
  -c:a libvorbis -q:a 4 outputs/furina_test.ogg

Required generation markers:

cuda_available=True
cuda_device_name=Quadro M4000
cuda_capability=(5, 2)
probe_tensor_device=cuda:0
model_device=cuda:0
output=...
sample_rate=24000

On the verified M4000 Docker lane, the CUDA probe reported:

{
  "torch": "2.11.0+cu126",
  "cuda_runtime": "12.6",
  "cuda_available": true,
  "device_name": "Quadro M4000",
  "capability": [5, 2],
  "arch_list": ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86", "sm_90"],
  "probe_tensor_device": "cuda:0",
  "nvidia_smi": "0, Quadro M4000, 5.2, 580.159.03, 8192 MiB"
}

The generation verifier reported:

sample_rate=24000
duration_seconds=5.54
cuda_available=True
cuda_device_name=Quadro M4000
cuda_capability=(5, 2)
probe_tensor_device=cuda:0
model_device=cuda:0

The ASR sanity check matched the requested text:

transcript=Welcome to Fontaine. The compiler stage is ready and the GPU will speak.
passed=True

Why The Proof Is Solid

For both GPUs, the verifier uses a ladder of checks:

  1. The driver exposes a GPU device.
  2. PyTorch reports the accelerator as available.
  3. A real tensor operation runs on the accelerator.
  4. The loaded CosyVoice model parameters are on the accelerator.
  5. The generated WAV and OGG are written after those checks.
  6. Whisper ASR confirms that the spoken content matches the requested text.
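The ladder is ordered on purpose: a later rung means little if an earlier one failed. One way to express that is a runner that stops at the first failing check. This is a sketch with hypothetical names. The real verifier's structure is not shown in this post, and the lambdas below are placeholders for the actual checks.

```python
def run_ladder(checks):
    """Run (name, callable) checks in order and stop at the first failure.

    Returns (passed_names, failed_name). failed_name is None when
    every rung passed.
    """
    passed = []
    for name, check in checks:
        try:
            ok = bool(check())
        except Exception:
            ok = False  # a crashing rung counts as a failed rung
        if not ok:
            return passed, name
        passed.append(name)
    return passed, None


if __name__ == "__main__":
    # Placeholder rungs; real ones would wrap the device, probe, and ASR checks.
    ladder = [
        ("device_visible", lambda: True),
        ("framework_available", lambda: True),
        ("tensor_probe", lambda: True),
    ]
    print(run_ladder(ladder))
```

Reporting the first failed rung tells you which layer to debug: driver, wheel, model placement, or content.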

The model-device check is the critical one:

model_device=xpu:0
model_device=cuda:0

That value is read from real CosyVoice model parameters after loading the model. It is not a wish from an environment variable.

The tensor probe is also real:

probe_tensor_device=xpu:0
probe_tensor_device=cuda:0

That comes from allocating tensors on the target device, running matrix multiplication, synchronizing, and copying the result back.

For Intel, PYTORCH_ENABLE_XPU_FALLBACK=0 makes this stricter by preventing unsupported XPU operators from quietly falling back to CPU.

Conclusion

The final Arc B50 setup is surprisingly clean once the failure modes are separated:

  • Ubuntu 26.04 has the Intel runtime packages needed for Level Zero/XPU visibility.
  • The upstream PyTorch XPU wheel index is the right PyTorch source.
  • The newest matching XPU torch/torchaudio pair is safer than the newest torch alone.
  • CosyVoice3 needs a small CUDA-assumption patch to run its PyTorch model path on XPU.
  • transformers<4.53 is required for semantic audio correctness, not GPU availability.
  • GPU proof needs model-device and tensor-probe evidence, not just xpu_available=True.

The M4000 result is the opposite lesson. CUDA was already the native path for CosyVoice, but old Maxwell hardware cannot use the newest CUDA wheels just because the host driver is new. The newest working stack found here is PyTorch 2.11 with cu126, because it still includes Maxwell-compatible kernels.

So the practical rule is:

  • for Arc B50, use upstream PyTorch XPU and patch CosyVoice’s CUDA assumptions;
  • for Quadro M4000, use the newest CUDA wheel family that still contains sm_50 kernels, which also run on the M4000’s compute capability 5.2;
  • for both, keep the Transformers cap until a full ASR-backed generation run proves a newer version speaks the requested text.

That is the difference between “it produced a sound” and “it produced the right voice, saying the right words, on the GPU I intended to use.”

References