/u/__JockY__'s posts in /r/LocalLLaMA
With Llama-3.1 70B at long contexts (8,000+ tokens), the llama.cpp server takes 26 seconds to process the prompt before returning the first token, while TabbyAPI/exllamav2 responds almost instantly. Is it my fault, llama.cpp's fault, neither, a bit of both, or something else entirely?
49 upvotes
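
One way to put a number on the gap is to stream from the llama.cpp server's OpenAI-compatible endpoint and time the first content delta. The sketch below is a minimal reproduction of that time-to-first-token measurement; the URL, port, model name, and prompt are assumptions, not details from the post, so adjust them to match your own llama-server setup.

```python
import json
import time

import requests

# Assumed local llama.cpp server (llama-server) endpoint; change host/port as needed.
URL = "http://127.0.0.1:8080/v1/chat/completions"

payload = {
    "model": "llama-3.1-70b",   # placeholder name; substitute whatever your server reports
    "messages": [{"role": "user", "content": "paste a long (8k+ token) prompt here ..."}],
    "stream": True,             # stream so the first token can be timed separately
    "max_tokens": 64,
}

start = time.time()
first_token_at = None

with requests.post(URL, json=payload, stream=True, timeout=600) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        # Server-sent events arrive as lines of the form "data: {...}".
        if not line or not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0]["delta"].get("content")
        if delta and first_token_at is None:
            first_token_at = time.time()
            print(f"time to first token: {first_token_at - start:.1f}s")

print(f"total time: {time.time() - start:.1f}s")
```

Pointing the same script at TabbyAPI's OpenAI-compatible endpoint (different port) would give a like-for-like comparison of prompt-processing latency between the two backends.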
