While OpenAI and Google race to build bigger and bigger models, a Chinese lab just released one that goes the other direction - and it's actually good.
GLM-4.7-Flash, from Z.ai Organization, is a 30-billion-parameter model that only activates 3 billion at a time. It runs on consumer GPUs. It fits in 12GB of RAM with quantization. And on coding benchmarks, it's punching way above its weight class.
On SWE-bench Verified - a test of real-world software engineering tasks - GLM-4.7-Flash scores 59.2%. That destroys Qwen3-30B-A3B (22.0%) and beats GPT-OSS-20B (34.0%). For agentic tasks, it hits 79.5% on τ²-Bench, compared to 49% and 47.7% for those same competitors.
This is the architecture you build when you can't afford infinite compute. Mixture-of-Experts (MoE) means most of the model stays dormant for any given task. Only the relevant expert pathways activate. Less computation, faster inference, same capability.
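The core idea is simple enough to show in a few lines. Here's a toy sketch of MoE routing with top-k gating, assuming NumPy: a router scores every expert, but only the top-k actually run for a given token. The expert count, dimensions, and k here are made up for illustration - they are not GLM-4.7-Flash's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

N_EXPERTS, TOP_K, D = 8, 2, 16  # toy sizes, not the real model's

router_w = rng.normal(size=(D, N_EXPERTS))
experts = [rng.normal(size=(D, D)) for _ in range(N_EXPERTS)]

def moe_forward(x):
    """x: (D,) token embedding -> (D,) output, touching only TOP_K experts."""
    scores = x @ router_w              # router logits, one per expert
    top = np.argsort(scores)[-TOP_K:]  # indices of the k highest-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()           # softmax over just the chosen experts
    # Only TOP_K of N_EXPERTS matmuls happen; the other experts stay dormant.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

out = moe_forward(rng.normal(size=D))
print(out.shape)  # (16,)
```

With 2 of 8 experts firing per token, this layer does a quarter of a dense layer's expert compute - the same proportional trick that lets a 30B-parameter model run with 3B active.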
The local AI community is going nuts over it. Within hours of release, developers had llama.cpp support merged, quantized versions published, and were running it on everything from gaming rigs to MacBook Pros.
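For a sense of what "running it on a gaming rig" looks like in practice, here's a hedged sketch using llama.cpp's CLI. The repo and filename are illustrative placeholders - actual quantized uploads vary by publisher - and the flags shown (`-ngl` for GPU layer offload, `-c` for context size) are standard llama.cpp options.

```shell
# Download a quantized GGUF build (repo/filename are illustrative, not verified)
huggingface-cli download some-uploader/GLM-4.7-Flash-GGUF \
    glm-4.7-flash-Q4_K_M.gguf --local-dir ./models

# Run it, offloading as many layers as fit on the GPU
llama-cli -m ./models/glm-4.7-flash-Q4_K_M.gguf \
    -ngl 99 -c 8192 -p "Write a Python function that parses a CSV file."
```

A Q4_K_M quant of a 30B model lands around the 17-19GB range on disk, which is what makes the "fits on consumer hardware" claim plausible once layers are split between GPU and system RAM.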
One user called it "the strongest model in the 30B class" for agentic workflows - meaning AI that can actually use tools, make plans, and execute multi-step tasks without falling apart. That's huge for local deployment where you need reliability, not just benchmarks.
The American approach to AI has been "scale at all costs." Bigger models, more parameters, more data, more compute. GPT-5 reportedly used training runs costing hundreds of millions. Gemini is reportedly in the same ballpark.