Zyphra introduced Zamba2-7B, a new SLM that surpasses competitors like Mistral-7B, Gemma-7B, and Llama3-8B in both performance & efficiency. It is designed for on-device use, consumer GPUs, and enterprise applications.
Zamba2-7B:
> Achieves top benchmark results with faster inference: 25% faster time to first token and 20% more tokens per second compared to similar models.
> Uses Mamba2 blocks w/ two shared attention blocks interleaved in an ABAB pattern, enhancing cross-sequence dependencies (see the sketch after this list).
> Trained on 3T tokens, combining proprietary Zyda data w/ high-quality, deduplicated datasets, followed by a specialized annealing phase to improve model precision.
> Achieves high throughput and low memory usage by eliminating the need for KV-caches & leveraging efficient Mamba blocks optimized for parallel hardware (see the back-of-envelope comparison below).
> Model was trained on 128 H100 GPUs over 50 days and will be available open-source under Apache 2.0, w/ integrations on Hugging Face and PyTorch.
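
To make the ABAB layout concrete, here is a minimal PyTorch sketch of the interleaving idea: only *two* attention modules exist, and they are reused across depth in alternating order between runs of Mamba2-style blocks. This is an illustration under assumptions, not Zyphra's implementation; the Mamba2 block is stubbed with an MLP, and all class names and dimensions are made up for the example.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Self-attention block whose parameters are SHARED across layers."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class Mamba2BlockStub(nn.Module):
    """Stand-in for a real Mamba2 SSM block (an MLP for illustration only)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        return x + self.mlp(self.norm(x))

class ABABBackbone(nn.Module):
    """Runs of Mamba2 blocks with two shared attention blocks in ABAB order."""
    def __init__(self, d_model: int, mamba_per_segment: int = 6, n_segments: int = 4):
        super().__init__()
        # Only two attention modules are ever created; depth reuses them.
        self.shared_attn = nn.ModuleList([AttentionBlock(d_model) for _ in range(2)])
        self.segments = nn.ModuleList([
            nn.Sequential(*[Mamba2BlockStub(d_model) for _ in range(mamba_per_segment)])
            for _ in range(n_segments)
        ])

    def forward(self, x):
        for i, segment in enumerate(self.segments):
            x = segment(x)
            x = self.shared_attn[i % 2](x)  # alternate A, B, A, B, ...
        return x

x = torch.randn(1, 16, 512)        # (batch, seq_len, d_model)
print(ABABBackbone(512)(x).shape)  # torch.Size([1, 16, 512])
```

Sharing the two attention blocks keeps the parameter count close to a pure-SSM stack while still giving every segment access to full cross-sequence attention.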
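
And on the KV-cache point: the reason SSM-style blocks save memory is that attention must cache keys and values for every past token at every layer, while a Mamba-style block carries a fixed-size recurrent state. A back-of-envelope sketch (all dimensions assumed, not Zamba2's real config):

```python
# Assumed illustrative dimensions, fp16 storage.
d_model, n_layers, seq_len = 4096, 32, 8192
state_dim = 16   # assumed per-channel SSM state size
bytes_per = 2    # fp16

# Attention: keys + values for every token, every layer -> grows with seq_len.
kv_cache = 2 * n_layers * seq_len * d_model * bytes_per

# SSM: one fixed-size recurrent state per layer -> constant in seq_len.
ssm_state = n_layers * d_model * state_dim * bytes_per

print(f"KV-cache:  {kv_cache / 1e9:.2f} GB (grows with context length)")
print(f"SSM state: {ssm_state / 1e6:.2f} MB (constant in context length)")
```

The gap widens as context grows, which is where the throughput and memory claims for Mamba-based models come from.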