
Issues running gguf on GPU

#17
by lucaelin - opened

I had great success running the torch version on both CPU and GPU, but the GGUF seems to only work on CPU for me. Whenever I run llama.cpp with CUDA or Vulkan on the GPU, the model usually generates no tokens, and sometimes only a few, causing either an exception during decoding or just a very short noise. This happens in both streaming and non-streaming mode, with both the q4 and q8 quants; I also exported an f16 GGUF, which shows the exact same error.
Did anyone get this setup to work?
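For anyone trying to reproduce this, a rough sketch of the invocations involved (the `-ngl` flag and `llama-cli` binary are stock llama.cpp; the model filename and prompt are placeholders, not from this thread):

```shell
# Placeholder model path; substitute your exported q4/q8/f16 GGUF.
MODEL=./model-q8_0.gguf

# CPU-only inference (-ngl 0 offloads no layers): works per the report above.
./llama-cli -m "$MODEL" -ngl 0 -p "Hello"

# GPU offload on a CUDA or Vulkan build (-ngl 99 offloads all layers):
# this is where generation stalls or emits only a few tokens.
./llama-cli -m "$MODEL" -ngl 99 -p "Hello"
```

Running both commands with the same quant isolates the problem to GPU offload rather than the quantization itself.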

following, same problem
