/r/LocalLLaMA
With Llama-3.1 70B at long contexts (8000+ tokens), llama.cpp server is taking 26 seconds to process the context before responding with the first token. TabbyAPI/exllamav2 is instant. Is it my fault, llama.cpp's fault, neither, a bit of both, or something else entirely?
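One thing worth ruling out before blaming llama.cpp: TabbyAPI/exllamav2 reuses the KV cache for a shared prompt prefix by default, while llama.cpp's server (depending on version) only reuses it when the request asks for it via `cache_prompt`. A rough way to check, assuming `llama-server` is running on its default port 8080 and with a placeholder prompt standing in for the real 8000-token context, is to send the same long prompt twice and compare wall-clock times; flash attention (`-fa`) and a larger `--ubatch-size` can also speed up the initial prompt pass.

```python
import time
import requests  # assumes llama-server is running locally on port 8080

LONG_PROMPT = "lorem ipsum " * 4000  # stand-in for the real 8000+ token context

def time_completion(prompt: str) -> float:
    """Send one /completion request and return wall-clock seconds until the reply arrives."""
    start = time.perf_counter()
    resp = requests.post(
        "http://127.0.0.1:8080/completion",
        json={
            "prompt": prompt,
            "n_predict": 16,       # only a few output tokens; we care about prompt processing
            "cache_prompt": True,  # ask the server to keep the prompt's KV cache around
        },
        timeout=600,
    )
    resp.raise_for_status()
    return time.perf_counter() - start

# The first call pays the full prompt-processing cost; the second should be much
# faster if the cached prefix is actually being reused.
print(f"cold: {time_completion(LONG_PROMPT):.1f}s")
print(f"warm: {time_completion(LONG_PROMPT):.1f}s")
```

If the warm request is still ~26 seconds, the cache isn't being hit (e.g. the prefix changes between requests) and the slowdown is plain prompt-processing throughput rather than anyone's "fault".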
Stupid question perhaps, but can anyone "clamp" a model to think it's the Golden Gate Bridge or whatever, or is that something only experts can do?
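Not a stupid question: hobbyists do this under the name "activation steering". The Golden Gate Claude demo clamped a sparse-autoencoder feature, but a much cruder version just adds a direction, computed by contrasting prompts, to one layer's activations. A rough sketch below, with GPT-2 standing in for a Llama model purely so it runs without gated weights; the layer index and scale are arbitrary choices you'd tune by hand.

```python
# Rough sketch of contrastive activation steering; model, layer, and scale are
# arbitrary assumptions, not the method used for Golden Gate Claude itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # stand-in; a Llama model would hook model.model.layers[LAYER] instead
LAYER = 6        # which transformer block to steer (assumption)
SCALE = 8.0      # steering strength (assumption; tune by hand)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def mean_hidden(text: str) -> torch.Tensor:
    """Average hidden state at the output of block LAYER (crude proxy for a concept direction)."""
    with torch.no_grad():
        out = model(**tok(text, return_tensors="pt"), output_hidden_states=True)
    return out.hidden_states[LAYER + 1].mean(dim=1).squeeze(0)

# Direction = "about the bridge" minus "about nothing in particular".
direction = mean_hidden("The Golden Gate Bridge spans the bay in San Francisco.") \
          - mean_hidden("The weather today is fairly ordinary.")
direction = direction / direction.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; the hidden states are the first element.
    hidden = output[0] + SCALE * direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)
prompt = tok("Tell me about yourself.", return_tensors="pt")
print(tok.decode(model.generate(**prompt, max_new_tokens=40, do_sample=False)[0]))
handle.remove()
```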
GPT-4o vs Llama. For analytics, they seem to be fairly evenly matched! Amazing! GPT has been able to do math for a while now, but Llama 3.1 405B really catches up!