Running DeepSeek-V3-0324 (685B parameters) with 512 GB RAM on a Single 16 GB GPU Using ktransformers

You can watch the video here: Running DeepSeek-V3-0324 with 512 GB RAM on a Single 16 GB GPU

Here's my guide on how I set up ktransformers (pronounced "quick transformers") to run inference on DeepSeek-V3-0324 with a 20-core/40-thread CPU, 512 GB of DDR5 RAM, and just a single NVIDIA RTX A4000 16 GB GPU.
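
To give a flavor of what the full guide walks through, here is a minimal sketch of the kind of launch command involved. The paths are placeholders, and the flag names follow the examples in the ktransformers README, so verify them against the version you actually install.

```python
# Minimal sketch of launching ktransformers' local chat for DeepSeek-V3-0324.
# Paths are placeholders; flag names follow the ktransformers README examples,
# so double-check them against your installed version.
import subprocess

subprocess.run(
    [
        "python", "-m", "ktransformers.local_chat",
        "--model_path", "deepseek-ai/DeepSeek-V3-0324",   # HF repo for config/tokenizer
        "--gguf_path", "/path/to/DeepSeek-V3-0324-GGUF",  # local quantized weights
        "--cpu_infer", "20",                              # roughly one thread per physical core
        "--max_new_tokens", "1000",
    ],
    check=True,
)
```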

I think there needs to be a proper guide to using ktransformers with large MoE models. A lot of resources (especially on YouTube) show how to run these very large models on other inference engines like Ollama or llama.cpp. There are some interesting build guides too, like running DeepSeek on an EPYC CPU with nothing but lots of RAM.

But one video caught my attention: Jesse's (createthis) video on his setup, a dual EPYC system with 768 GB RAM and just a single 24 GB GPU (an RTX 3090). After seeing him use ktransformers and the inference speed he achieved with it, I decided I had to do the same with my system.

My system isn't as powerful, but by the eye test it gets close to the performance Jesse showed in his video.

My desktop workstation setup:

  • Intel Xeon w5-3535X (20 cores/40 threads)
  • 512 GB DDR5 RDIMM at 5600 MT/s (8 × 64 GB Samsung DIMMs; see the bandwidth sketch after this list)
  • NVIDIA RTX A4000 16 GB GPU (Ampere generation)
  • 4.0 TB Crucial SSD to store the model
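
Why do the eight DDR5 channels matter? For MoE inference the expert weights stream out of system RAM on every token, so memory bandwidth puts a hard ceiling on generation speed. A back-of-the-envelope calculation, using my own rough assumptions rather than measured numbers:

```python
# Rough memory-bandwidth ceiling for CPU-side token generation.
# Assumptions (mine, not measured): 8 DDR5 channels at 5600 MT/s, 8 bytes per
# transfer, ~37B active parameters per token, and ~4.5 bits per weight
# (roughly a Q4-style quant).
channels, mts, bytes_per_transfer = 8, 5600e6, 8
bandwidth_gbs = channels * mts * bytes_per_transfer / 1e9   # ~358 GB/s peak

active_params = 37e9     # DeepSeek-V3 activates ~37B of its params per token
bits_per_weight = 4.5
gb_per_token = active_params * bits_per_weight / 8 / 1e9    # ~21 GB read per token

print(f"peak bandwidth: ~{bandwidth_gbs:.0f} GB/s")
print(f"upper bound:    ~{bandwidth_gbs / gb_per_token:.0f} tokens/s")
```

Real throughput lands below that ceiling, but it explains why fast multi-channel RAM is the part of this build that matters most.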

In total, you can purchase this same system for less than a Mac Studio M3 Ultra or a single RTX 6000 Ada GPU.

But notice the GPU that I have: an older Ampere-generation NVIDIA card with just 16 GB of VRAM. That's what I consider amazing about all of this. With the ktransformers approach, running big MoE models becomes genuinely usable, and the rough arithmetic below shows why the model fits this box at all.
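
Again, these are my assumptions rather than measured figures:

```python
# Why a 685B-parameter MoE fits on this machine (rough assumptions, not measured).
total_params = 685e9     # full model size, per the Hugging Face model card
bits_per_weight = 4.5    # roughly a Q4-style quant
model_gb = total_params * bits_per_weight / 8 / 1e9

print(f"quantized model: ~{model_gb:.0f} GB -> fits in 512 GB of system RAM")
# ktransformers keeps the hot, dense parts (attention and shared layers, plus
# the KV cache) on the GPU while the sparse routed-expert weights stay in
# system RAM, which is why a 16 GB card is enough.
```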

👉 Link to Guide on GitHub 👈