/r/LocalLLaMA
With Llama-3.1 70B at long contexts (8000+ tokens), llama.cpp server is taking 26 seconds to process the context before responding with the first token. TabbyAPI/exllamav2 is instant. Is it my fault, llama.cpp's fault, neither, a bit of both, or something else entirely?
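One thing worth ruling out before blaming llama.cpp: TabbyAPI/exllamav2 reuses the KV cache for a shared prompt prefix by default, while llama.cpp's server (depending on version) only reuses it when the request asks for it via `cache_prompt`. A rough way to check, assuming `llama-server` is running on its default port 8080 and with a placeholder prompt standing in for the real 8000-token context, is to send the same long prompt twice and compare wall-clock times; flash attention (`-fa`) and a larger `--ubatch-size` can also speed up the initial prompt pass.

```python
import time
import requests  # assumes llama-server is running locally on port 8080

LONG_PROMPT = "lorem ipsum " * 4000  # stand-in for the real 8000+ token context

def time_completion(prompt: str) -> float:
    """Send one /completion request and return wall-clock seconds until the reply arrives."""
    start = time.perf_counter()
    resp = requests.post(
        "http://127.0.0.1:8080/completion",
        json={
            "prompt": prompt,
            "n_predict": 16,       # only a few output tokens; we care about prompt processing
            "cache_prompt": True,  # ask the server to keep the prompt's KV cache around
        },
        timeout=600,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

# The first call pays the full prompt-processing cost; the second should be much
# faster if the cached prefix is actually being reused.
print(f"cold: {time_completion(LONG_PROMPT):.1f}s")
print(f"warm: {time_completion(LONG_PROMPT):.1f}s")
```

If the warm request is still ~26 seconds, the cache isn't being hit (e.g. the prefix changes between requests) and the slowdown is plain prompt-processing throughput rather than anyone's "fault".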
Stupid question perhaps, but can anyone "clamp" a model to think it's the Golden Gate Bridge or whatever, or is that something only experts can do?
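Not a stupid question: hobbyists do this under the name "activation steering". The Golden Gate Claude demo clamped a sparse-autoencoder feature, but a much cruder version just adds a direction, computed by contrasting prompts, to one layer's activations. A rough sketch below, with GPT-2 standing in for a Llama model purely so it runs without gated weights; the layer index and scale are arbitrary choices you'd tune by hand.

```python
# Rough sketch of contrastive activation steering; model, layer, and scale are
# arbitrary assumptions, not the method used for Golden Gate Claude itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in; a Llama model would hook model.model.layers[LAYER] instead
LAYER = 6        # which transformer block to steer (assumption)
SCALE = 8.0      # steering strength (assumption; tune by hand)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def mean_hidden(text: str) -> torch.Tensor:
    """Average hidden state at the output of block LAYER (crude proxy for a concept direction)."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Direction = "about the bridge" minus "about nothing in particular".
direction = mean_hidden("The Golden Gate Bridge spans the bay in San Francisco.") \
          - mean_hidden("The weather today is fairly ordinary.")
direction = direction / direction.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0] + SCALE * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
prompt = tok("Tell me about yourself.", return_tensors="pt")
print(tok.decode(model.generate(**prompt, max_new_tokens=40, do_sample=False)[0]))
handle.remove()
```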
GPT-4o vs Llama. For analytics, they seem to be fairly evenly matched! Amazing! GPT has been able to do math for a while now, but Llama 3.1 405B really catches up!