Salvatore Sanfilippo тоже увлёкся нейросетями.
https://github.com/antirez/iris.c:
Iris is an inference pipeline that generates images from text prompts using open weights diffusion transformer models. It is implemented entirely in C, with zero external dependencies beyond the C standard library. MPS and BLAS acceleration are optional but recommended. Under macOS, a BLAS API is part of the system, so nothing is required.
The name comes from the Greek goddess Iris, messenger of the gods and personification of the rainbow.
Supported model families:
- FLUX.2 Klein (by Black Forest Labs):
- 4B distilled (4 steps, auto guidance set to 1, very fast).
- 4B base (50 steps for max quality, or less. Classifier-Free Diffusion Guidance, much slower but more generation variety).
- 9B distilled (4 steps, larger model, higher quality. Non-commercial license).
- 9B base (50 steps, CFG, highest quality. Non-commercial license).
- Z-Image-Turbo (by Tongyi-MAI):
- 6B (8 NFE / 9 scheduler steps, no CFG, fast).
https://github.com/antirez/qwen-asr:
This is a C implementation of the inference pipeline for Qwen3-ASR speech-to-text models (both 0.6B and 1.7B). It has zero external dependencies beyond the C standard library and a BLAS implementation (Accelerate on macOS, OpenBLAS on Linux). Tokens stream to stdout as they are generated. The implementation runs at speed multiple of the file length even in very modest hardware, like low end Intel or AMD processor.
Important: this implementation explicitly avoids implementing support for MPS. Transcription systems are very important pieces of infrastructure, and are often run on remote Linux servers. Adding the MPS target would focus the efforts too much on Apple hardware, so for now I’m skipping it. The code runs very well anyway on Apple hardware (NEON optimized). Please, don’t send pull requests about this feature, fork the code instead, in order to add MPS support. I’ll add it much later when the other optimizations are already mature.
Supported modes and models
Both normal (offline) and streaming (online) modes are supported. Normal mode defaults to full offline decode (
-S 0), so the whole audio is encoded at once. Streaming mode processes audio in 2-second chunks with prefix rollback (it keeps the last few decoded tokens as context for the decoder/LLM when transcribing the next chunk).
Important practical note: in this implementation, interactive
--streamprioritizes incremental token stability over throughput and can be much slower than normal mode when you process an already-recorded file end-to-end.
Audio can be piped from stdin (
--stdin), making it easy to transcode and transcribe any format via ffmpeg. Language is usually auto-detected from audio, and can be forced with--language. A system prompt can bias the model toward specific terms or spellings.
Both the 0.6B and 1.7B parameters models are supported. While the 1.7B model is generally more powerful, the 0.6B model seems the sweet spot for CPU inference, however the speed difference is not huge, so you may want to try both and decide what to use depending on your use case.
https://github.com/antirez/voxtral.c:
This is a C implementation of the inference pipeline for the Mistral AI’s Voxtral Realtime 4B model. It has zero external dependencies beyond the C standard library. The MPS inference is decently fast, while the BLAS acceleration is usable but slow (it continuously convert the bf16 weights to fp32).
Audio processing uses a chunked encoder with overlapping windows, bounding memory usage regardless of input length. Audio can also be piped from stdin (
--stdin), or captured live from the microphone (--from-mic, macOS), making it easy to transcode and transcribe any format via ffmpeg. A streaming C API (vox_stream_t) lets you feed audio incrementally and receive token strings as they become available.
More testing needed: please note that this project was mostly tested against few samples, and likely requires some more work to be production quality. However the hard part, to understand the model inference and reproduce the inference pipeline, is here, so the rest likely can be done easily. Testing it against very long transcriptions, able to stress the KV cache circular buffer, will be a useful task.
Motivations (and some rant)
Thank you to Mistral for releasing such a great model in an Open Weights fashion. However, the author of this project believes that limiting the inference to a partnership with vLLM, without providing a self-contained reference implementation in Python, limits the model’s actual reach and the potential good effects it could have. For this reason, this project was created: it provides both a pure C inference engine and a simple, self-contained Python reference implementation (
python_simple_implementation.py) that anyone can read and understand without digging through the vLLM codebase.
https://github.com/antirez/gguf-tools:
This is a work in progress library to manipulate GGUF files. While the library aims to be useful, one of the main goals is to provide an accessible code base that as a side effect documents the GGUF files used by the awesome llama.cpp project: GGUF files are becoming increasingly more used and central in the local machine learning scene, so to have multiple implementations of parsers and files generators may be useful.
The program gguf-tools uses the library to implement both useful and useless stuff, to show the library usage in the real world. For now the utility implements the following subcommands:
gguf-tools show file.ggufshows detailed info about the GGUF file. This will include all the key-value pairs, including arrays, and detailed tensors informations. Tensor offsets will be relative to the start of the file (so they are actually absolute offsets), not the start of the data section like in the GGUF format.
gguf-tools compare file1.gguf file2.ggufThis tool is useful to understand if two LLMs (or other models distributed as GGUF files) are related, for instance if one is the finetune of another, or if both are fine-tuned from the same parent model.
For each matching tensor (same name and parameters count), the command computes the average weights difference (in percentage, so that a random distribution in the interval -N, +N would be on average 100% different than another random distribution in the same interval). This is useful to see if a model is a finetune of another model, how much it was finetuned, which layers were frozen while finetuning and so forth. Note that because of quantization, even tensors that are functionally equivalent may have some small average difference.
gguf-tools inspect-tensor file.gguf tensor.name [count]Show all (if count is not specified, otherwise only the first count) weights values of the specified tensor. This is useful for low level stuff, like checking if quantization is working as expected, see the introduced error, model fingerprinting and so forth.
gguf-tools split-mixtral 65230776370407150546470161412165 mixtral.gguf out.ggufExtracts a 7B model
out.gguffrom Mixtral 7B MoE using the specified MoE ID for each layer (there are 32 digits in the sequence 652…).
Note that
split-mixtralis quite useless as models obtained in this way will not perform any useful work. This is just an experiment and a non trivial task to show how to use the library. Likely it will be removed soon, once I have more interesting and useful examples to show, like models merging.
https://github.com/antirez/ds4:
DwarfStar is a small native inference engine optimized first for DeepSeek V4 Flash, with support for DeepSeek V4 PRO on very high-memory machines. It is intentionally narrow: not a generic GGUF runner, not a wrapper around another runtime: it is completely self-contained. Other than running the model in a correct and fast way, the project goal is to provide DS4 specific loading, prompt rendering, tool calling, KV state handling (RAM and on-disk), server API and integrated coding agent, all ready to work with coding agents or with the provided CLI interface. There are also tools for GGUF and imatrix generation, and for quality and speed testing.
We support the following backends:
- Metal is our primary target. Starting from MacBooks with 96GB of RAM.
- NVIDIA CUDA with special care for the DGX Spark.
- AMD ROCm is only supported in the rocm branch. It is kept separate from main since I (antirez) don’t have direct hardware access, so the community rebases the branch as needed.
This project would not exist without llama.cpp and GGML, make sure to read the acknowledgements section, a big thank you to Georgi Gerganov and all the other contributors.






