Elon Musk's AI assistant Grok boasted that the billionaire had the "potential to drink piss better than any human in history," among other absurd claims.
Yeah, you do want more contextual intelligence than an 8B for this.
Oh yeah, I’m sure. I may peek at it this weekend. I’m trying to decide if Santa is going to bring me a new graphics card, so I need to see what the price:performance curve looks like.
Massive understatement!
I think I stopped actively using image generation a little bit after LoRAs and IP Adapters were invented. I was trying to edit a video (random meme gif) to change the people in the meme to have the faces of my family, but it was very hard to have consistency between frames. Since there is generated video, it seems like someone solved this problem.
Video generation/editing is very GPU heavy though.
I dunno what card you have now, but with text LLMs (or image+text input LLMs), hybrid CPU+GPU inference is the trend days.
As an example, I can run GLM 4.6, a 350B LLM, with measurably low quantization distortion on a 3090 + 128GB CPU RAM, at like 7 tokens/s. If you would’ve told me that 2-4 years ago, my head would have exploded.
You can easily run GLM Air (or other good MoE models) on like a 3080 + system RAM, or even a lesser GPU. You just need the right software and quant.
Thanks a ton, saves me having to navigate the slopped up search results (‘AI’ as a search term is SEOd to death and back a few times)
I dunno what card you have now, but hybrid CPU+GPU inference is the trend days.
That system has the 3080 12GB and 64GB RAM but I have another 2 slots so I could go up to 128GB. I don’t doubt that there’s a GLM quant model that’ll work.
Is ollama for hosting the models and LM Studio for chatbot work still the way to go? Doesn’t seem like there’s much to improve in that area once there’s software that does the thing.
And IMO… your 3080 is good for ML stuff. It’s very well supported. It’s kinda hard to upgrade, in fact, as realistically you’re either looking at a 4090 or a used 3090 for an upgrade that’s actually worth it.
Oh no, you got it backwards. The software is everything, and ollama is awful. It’s enshittifying: don’t touch it with a 10 foot pole.
Speeds are basically limited by CPU RAM bandwidth. Hence you want to be careful doubling up RAM, and doubling it up can the max speed (and hence cut your inference speed).
Anyway, start with this. Pick your size, based on how much free CPU RAM you want to spare:
The “dense” parts will live on your 3080 while the “sparse” parts will run on your CPU. The backend you want is this, specifically the built-in llama-server:
Regular llama.cpp is fine too, but it’s quants just aren’t quite as optimal or fast.
It has two really good built-in web UIs: the “new” llama.cpp chat UI, and mikupad, which is like a “raw” notebook mode more aimed at creative writing. But you can use LM Studio if you want, or anything else; there are like a bazillion frontends out there.
Massive understatement!
Yeah, you do want more contextual intelligence than an 8B for this.
Actually SDXL is still used a lot! Especially for the anime stuff. It just got so much finetuning and tooling piled on.
Oh yeah, I’m sure. I may peek at it this weekend. I’m trying to decide if Santa is going to bring me a new graphics card, so I need to see what the price:performance curve looks like.
I think I stopped actively using image generation a little bit after LoRAs and IP Adapters were invented. I was trying to edit a video (random meme gif) to change the people in the meme to have the faces of my family, but it was very hard to have consistency between frames. Since there is generated video, it seems like someone solved this problem.
Oh yes, it has come a LOONG way. Some projects to look at are:
https://github.com/ModelTC/LightX2V
https://github.com/deepbeepmeep/Wan2GP
And for images: https://github.com/nunchaku-tech/nunchaku
Video generation/editing is very GPU heavy though.
I dunno what card you have now, but with text LLMs (or image+text input LLMs), hybrid CPU+GPU inference is the trend days.
As an example, I can run GLM 4.6, a 350B LLM, with measurably low quantization distortion on a 3090 + 128GB CPU RAM, at like 7 tokens/s. If you would’ve told me that 2-4 years ago, my head would have exploded.
You can easily run GLM Air (or other good MoE models) on like a 3080 + system RAM, or even a lesser GPU. You just need the right software and quant.
Thanks a ton, saves me having to navigate the slopped up search results (‘AI’ as a search term is SEOd to death and back a few times)
That system has the 3080 12GB and 64GB RAM but I have another 2 slots so I could go up to 128GB. I don’t doubt that there’s a GLM quant model that’ll work.
Is ollama for hosting the models and LM Studio for chatbot work still the way to go? Doesn’t seem like there’s much to improve in that area once there’s software that does the thing.
And IMO… your 3080 is good for ML stuff. It’s very well supported. It’s kinda hard to upgrade, in fact, as realistically you’re either looking at a 4090 or a used 3090 for an upgrade that’s actually worth it.
Oh no, you got it backwards. The software is everything, and ollama is awful. It’s enshittifying: don’t touch it with a 10 foot pole.
Speeds are basically limited by CPU RAM bandwidth. Hence you want to be careful doubling up RAM, and doubling it up can the max speed (and hence cut your inference speed).
Anyway, start with this. Pick your size, based on how much free CPU RAM you want to spare:
https://huggingface.co/ubergarm/GLM-4.5-Air-GGUF
The “dense” parts will live on your 3080 while the “sparse” parts will run on your CPU. The backend you want is this, specifically the built-in llama-server:
https://github.com/ikawrakow/ik_llama.cpp/
Regular llama.cpp is fine too, but it’s quants just aren’t quite as optimal or fast.
It has two really good built-in web UIs: the “new” llama.cpp chat UI, and mikupad, which is like a “raw” notebook mode more aimed at creative writing. But you can use LM Studio if you want, or anything else; there are like a bazillion frontends out there.