juxt/allium: A language for sharpening intent alongside implementation is a very exciting project: a specification language for the behavior of software systems, with no runtime environment or compiler other than the LLM itself, implemented as Agent Skills. It's especially exciting because it also includes a Distill mode, which lets you analyze existing software and retroactively build specifications, or work out a specification in an interview process with the AI, one that an LLM can understand far more precisely than general English. A funny detail on the side: I tried Allium with Qwen3-Coder-Next, my current favorite model for local hosting, in pi.dev. I couldn't install the Allium binary (a syntax checker and linter) with Homebrew, so pi.dev simply downloaded the binary and installed it itself.
scitrera/cuda-containers: Scitrera builds of various CUDA containers for version consistency, starting primarily with NVIDIA DGX Spark containers. I'm currently a big fan of eugr/vllm-node as a base image because it always provides up-to-date vLLM versions, but if I want to play around with SGLang at some point, this is probably the closest equivalent. I'm particularly interested in EAGLE-3 speculative decoding: a very small draft model generates candidate tokens, and the main model checks which ones fit and accepts them, generating a token itself only where necessary. That way, a good share of the tokens can often come from a much faster, simple model in the <3B range, while the large model mostly just verifies instead of generating everything itself.
thushan/olla: High-performance lightweight proxy and load balancer for LLM infrastructure. Intelligent routing, automatic failover, and unified model discovery across local and remote inference backends. Might be the better choice after the LiteLLM debacle (a hacked supply chain with a data exfiltrator in the package). For me, it's definitely interesting because I simply want to run two models and make them available under a single endpoint, and all the other packages are significant overkill for that.
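I don't know Olla's configuration format offhand, so here is only a sketch of the core idea I'm after, routing by model name to one of two backends behind a single endpoint. The URLs and model names are made-up examples.

```python
# Minimal sketch of model-name routing: one endpoint in front, two
# backends behind it. URLs and model names are invented examples.

BACKENDS = {
    "gemma-3-12b": "http://localhost:8001/v1",
    "qwen2.5vl-7b": "http://localhost:8002/v1",
}

def route(model_name):
    """Pick the backend that serves the requested model."""
    try:
        return BACKENDS[model_name]
    except KeyError:
        raise ValueError(f"no backend serves model {model_name!r}")

print(route("gemma-3-12b"))  # http://localhost:8001/v1
```

A real proxy like Olla adds health checks and failover on top, but the routing decision itself is essentially this lookup on the `model` field of each incoming request.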
Running Mistral Small 4 119B NVFP4 on NVIDIA DGX Spark (GB10) - DGX Spark / GB10 User Forum - NVIDIA Developer Forums - a lifesaver of a discussion in the NVIDIA forums. With what's in there, I got Mistral Small 4 running smoothly: 150K context and 100 tokens/second in generation. Wow. This is the first time I've really felt the power of this machine.
Introducing Mistral Small 4 | Mistral AI is another interesting candidate for the ASUS Ascent GX10, especially since I don't need side-car models for vision there: the model itself already comes with vision capabilities built in. And as an MoE model, it should also deliver good speed.
nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 · Hugging Face will likely be the first large model on the ASUS Ascent GX10, because it was trained natively in NVFP4 and therefore suffers no "dumbing down" from quantization; it behaves exactly as expected. It is also optimized for agentic workflows, which should benefit OpenClaw, as should the 1M context, which the model can probably actually make use of (it has a different architecture than classic transformer-based models).
The ASUS Ascent GX10 is arriving in the next few days. An AI powerhouse that will allow me to run larger models locally and, for example, operate an OpenClaw agent autonomously at home without needing any subscriptions. I'm really looking forward to seeing what's possible with it.
Cognee is also something I’ll keep an eye on for later. Basically a knowledge graph controlled by an LLM to make memory available for another LLM. Certainly exciting to play around with when I have good local hardware to run larger models on. But for now, just a memory keeper.
Docker Model Runner Adds vLLM Support on macOS | Docker - just noting this for now; it could become interesting later because it lets me run models via Docker with vLLM while still using Apple Silicon. The nice part is that it comes as a ready-to-use Docker image, so I don't have to fiddle with setup. I'm currently working more with my own rfc1437/MLXServer: a simple MLX based server for small models to run locally, simply because I only need it for offline operation, but vllm-metal could be very exciting later.
rfc1437/MLXServer: a simple MLX based server for small models to run locally is a tool that I built (with AI assistance) to run small models directly locally, without heavy overhead. It doesn't consume much memory, has a built-in local chat for personal experiments, and feels significantly more practical to me compared to the big alternatives—fewer knobs to adjust, but consequently less confusion. I just want to run a small model locally for my on-the-road blog.
mlx-community/Qwen3.5-9B-MLX-4bit · Hugging Face is another nice, small model — larger than the others, thus slightly more consistent in execution, but still pretty fast. And that's the upper limit of what you can run on a MacBook Air M4 with 16GB RAM without crashing the computer.
google/gemma-3-4b-it · Hugging Face is a pretty nice model that has been trained on many European languages and is therefore well suited for local translations. It loads in under 4 GB of memory and occupies approximately 6.5 GB during inference. And it has vision capability, so it can also be used to get image descriptions. Ideal, for example, for using bDS locally when you want to be offline on the go. And significantly smaller than mlx-community/gemma-3-12b-it-4bit · Hugging Face, which was borderline on my MacBook Air.
Inferencer | Run and Deeply Control Local AI Models is an interesting tool that allows you to run LLMs locally. Of course, LM Studio or Ollama or vllm-mlx can do this as well. But Inferencer has a feature called "Model streaming" that's pretty cool: it can run models that are actually too large for memory. Of course, you're trading time for memory, but for a local model for image captioning or similar smaller tasks, you could definitely use it. However, I have the feeling that the model becomes somewhat more fragile this way - for example, it suddenly doesn't use tools correctly anymore (I tried it with gemma3 12b, which is just scratching the memory limit of my laptop).
OpenClaw Memory Masterclass: The complete guide to agent memory that survives • VelvetShark - interesting compilation of the memory system and the pitfalls of compaction in OpenClaw. The agent is meant to run for a long time, but there is always the risk that compaction strikes right in the middle of a complex situation. And since OpenClaw runs autonomously, you want to be sure it keeps going without interruption.
unum-cloud/USearch - the name says it all: a library that offers an index for vectors, which can come from embeddings, for example, and can find semantically similar texts. Not similar in wording, but in meaning. Interesting topic: the models required for this are related to LLMs, but small rather than large. They don't need to fully understand and generate, because they only produce vectors that can then be compared against each other, and the higher the vector similarity, the closer the texts are in content. A cool little feature for bDS.
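The core mechanic (embed, index, compare) is simple enough to sketch without the library. A pure-Python illustration with hand-made 3-dimensional "embeddings"; real ones have hundreds of dimensions, come from an embedding model, and would be indexed by something like USearch rather than a plain dict.

```python
import math

# Toy illustration of semantic search via vector similarity.
# The 3-dimensional "embeddings" are hand-made; in practice an
# embedding model produces them and a library does the indexing.

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

index = {
    "a recipe for onion soup": (0.9, 0.1, 0.0),
    "how to braise leeks":     (0.8, 0.2, 0.1),
    "quarterly tax deadlines": (0.0, 0.1, 0.9),
}

def search(query_vec, k=2):
    """Return the k texts whose vectors are most similar to the query."""
    ranked = sorted(index, key=lambda t: cosine(index[t], query_vec),
                    reverse=True)
    return ranked[:k]

# A cooking-flavored query vector retrieves the two cooking texts,
# even though they share no words with each other.
print(search((0.85, 0.15, 0.05)))
```

Libraries like USearch replace the linear scan with an approximate nearest-neighbor index, which is what makes this fast over millions of vectors.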
waybarrios/vllm-mlx: OpenAI and Anthropic compatible server for Apple Silicon. I use this to run mlx-community/gemma-3-12b-it-4bit on my MacBook Air. It works very well: a small shell script starts the server, and then I'm autonomous. Not as comfortable as Ollama, but it fully supports Apple's MLX and thus makes good use of Apple Silicon.
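Since vllm-mlx speaks the OpenAI API, any standard client works against it. A minimal sketch of the request shape; the port, base URL, and the choice of gemma-3-12b-it-4bit as model name are assumptions from my setup, so the code only assembles the request rather than sending it.

```python
import json

# Sketch of an OpenAI-compatible chat request against a local vllm-mlx
# server. Host/port and model name are assumptions; adjust to your setup.

def build_chat_request(prompt, model="mlx-community/gemma-3-12b-it-4bit",
                       base_url="http://localhost:8000/v1"):
    """Assemble the URL and JSON body for a /chat/completions call."""
    url = f"{base_url}/chat/completions"
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }
    return url, body

url, body = build_chat_request("Describe this tool in one sentence.")
print(url)
print(json.dumps(body, indent=2))
# To actually send it: requests.post(url, json=body).json()
```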
mlx-community/gemma-3-12b-it-4bit · Hugging Face is currently the best model for local operation, allowing me to implement image captioning and even local chat. It's not the fastest, as it's quite large, but it's absolutely suitable for offline operation if I come up with a few mechanisms for batch processing of images, etc. This could be super exciting for vacation times. An image description might take a minute, but hey, no dependencies.
Models.dev — An open-source database of AI models is a very practical site that collects key data for all kinds of providers and all kinds of LLMs, including API prices and technical parameters such as input/output token limits.
Ollama - a runtime environment for LLMs that allows models to be run locally. My favorite model at the moment: qwen2.5vl:7b-q4_K_M. At only 6.6 GB, it runs smoothly on a MacBook Air M4 and still leaves enough memory and capacity to run other programs alongside it. The model is surprisingly usable in chat and, above all, has excellent vision capabilities. Ideal for generating titles, alt text, or summaries for images without having to pay big providers for it. And an important building block for bringing bDS back to fully offline.
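Ollama's HTTP API takes base64-encoded images alongside the prompt, which makes local alt-text generation a small script. A sketch that only assembles the request body; actually sending it assumes a running Ollama instance on the default localhost:11434.

```python
import base64
import json

# Build an image-captioning request for Ollama's /api/generate endpoint.
# qwen2.5vl:7b-q4_K_M is the vision model mentioned above; a running
# Ollama on localhost:11434 is assumed when you actually send it.

def build_caption_request(image_bytes, model="qwen2.5vl:7b-q4_K_M"):
    """Assemble the JSON body: prompt plus base64-encoded image."""
    return {
        "model": model,
        "prompt": "Write a short alt text for this image.",
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,
    }

body = build_caption_request(b"\x89PNG...fake image bytes")
print(json.dumps(body)[:120])
# To send:
# requests.post("http://localhost:11434/api/generate", json=body).json()["response"]
```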
mistralai/mistral-vibe: Minimal CLI coding agent by Mistral - alongside AI Studio - Mistral AI, there is also Vibe, an open-source coding interface to Devstral. Very nice, because they make a good pair. I'll definitely try it out, even if I'll probably still reach for the powerhouses (Opus 4.6) for larger projects.
AI Studio - Mistral AI - as the situation in the USA becomes a bit more tense again, and simply because one should always check what is happening outside the USA, here is a link to a European alternative to the major US operators. Mistral offers a coding model, Mistral 2, that is not only open weights (i.e., freely available and runnable if you have the necessary hardware), but also quite affordable when used via Mistral itself. Performance sits slightly above Claude Haiku 4.5 and below Sonnet 4.5, but not by much. So quite usable, and my first experiments were not bad. Unfortunately, no vision capability, so not well suited to experiments with images (and therefore not ideal for my bDS), but still interesting enough to keep an eye on.
If you, like me, want an overview of UI integration for LLMs and are wondering how A2UI and MCP Apps compare and what they offer, Agent UI Standards Multiply: MCP Apps and Google's A2UI - Richard MacManus helps. I have implemented A2UI in bDS so that the LLM can use visual elements in the internal chat as well, and I really like it. The idea of embedding parts of my UI into external agents is also fascinating. Granted, "local HTML/JS in an iframe" sounds like a hack at first, but much of the LLM ecosystem gives me that feeling right now, simply because everything is pushed through a plain text stream and you hope the LLMs stick to the formats (even A2UI works this way).
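That "hope the LLMs stick to the formats" step concretely means a validation layer on your side. A generic sketch of the pattern, not the actual A2UI schema; the required field names here are invented for illustration.

```python
import json

# Generic sketch of pulling a structured UI payload out of an LLM's text
# output and validating it. The required fields are invented for
# illustration; A2UI / MCP Apps define their own schemas.

REQUIRED_FIELDS = {"type", "content"}

def extract_ui_payload(llm_text):
    """Find the first JSON object in the text and check required fields."""
    start = llm_text.find("{")
    end = llm_text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        payload = json.loads(llm_text[start:end + 1])
    except json.JSONDecodeError:
        return None
    if not REQUIRED_FIELDS <= payload.keys():
        return None  # model ignored the format: fall back to plain text
    return payload

reply = 'Sure! {"type": "card", "content": "Hello"} Anything else?'
print(extract_ui_payload(reply))
```

The important design choice is the fallback: when the model breaks the format, you degrade to plain text instead of breaking the UI.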