With Llama-3.1 70B at long contexts (8000+ tokens), llama.cpp server is taking 26 seconds to process the context before responding with the first token. TabbyAPI/exllamav2 is instant. Is it my fault, llama.cpp's fault, neither, a bit of both, or something else entirely?
by /u/__JockY__ in /r/LocalLLaMA
Upvotes: 49
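To pin down where the gap comes from, one way to quantify the difference is to measure time-to-first-token (TTFT) against both servers with the same long prompt. The sketch below is a minimal, hypothetical harness: it assumes llama.cpp's server on its default port 8080 and TabbyAPI on port 5000 (adjust to your setup), and uses the OpenAI-compatible streaming /v1/chat/completions route that both expose. The 8k-token prompt is a crude stand-in, not real data.

```python
import json
import time

import requests

# Hypothetical endpoints -- adjust to your own setup. llama.cpp's server
# defaults to port 8080; TabbyAPI commonly listens on port 5000. Both expose
# an OpenAI-compatible /v1/chat/completions route.
ENDPOINTS = {
    "llama.cpp": "http://localhost:8080/v1/chat/completions",
    "tabbyapi": "http://localhost:5000/v1/chat/completions",
}

# Crude stand-in for a real 8000+ token context.
LONG_PROMPT = "word " * 8000


def time_to_first_token(url: str, prompt: str) -> float:
    """Send a streaming chat request and return seconds until the first token arrives."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
        "max_tokens": 32,
    }
    start = time.time()
    with requests.post(url, json=payload, stream=True, timeout=600) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            # Server-sent events arrive as lines prefixed with "data: ".
            if not line or not line.startswith(b"data: "):
                continue
            chunk = line[len(b"data: "):]
            if chunk == b"[DONE]":
                break
            delta = json.loads(chunk)["choices"][0].get("delta", {})
            if delta.get("content"):
                # First generated token seen -- everything before this was prompt processing.
                return time.time() - start
    raise RuntimeError("stream ended before any token was produced")


if __name__ == "__main__":
    for name, url in ENDPOINTS.items():
        print(f"{name}: TTFT = {time_to_first_token(url, LONG_PROMPT):.1f}s")
```

Running this back-to-back against both backends with an identical prompt isolates prompt-processing (prefill) time from generation speed, which is what the 26-second figure above is describing.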