s5eeo (@s5eeo)
It's not new, but a while back Meta released multiple versions of SPIRIT-LM, a multimodal language model that mixes text and speech. Sharing it here because I didn't know it was out there. And it's open-source.

"Large language models are frequently used to build text-to-speech pipelines, wherein speech is transcribed by automatic speech recognition (ASR), then synthesized by an LLM to generate text, which is ultimately converted to speech using text-to-speech (TTS). However, this process compromises the expressive aspects of the speech being understood and generated. In an effort to address this limitation, we built Meta Spirit LM, our first open source multimodal language model that freely mixes text and speech."

They also have a demo page with samples: speechbot.github.io/spiritlm/

Sounds like there is room for improvement, but this is the worst it will ever be.
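For context, the cascaded pipeline they're describing looks roughly like this. A minimal sketch with placeholder stubs, not the SPIRIT-LM API or any real model calls, just the shape of the ASR, then LLM, then TTS data flow and where expressivity gets dropped:

```python
# Sketch of a cascaded speech pipeline (ASR -> LLM -> TTS).
# All three functions are stand-in stubs, not real models.

def transcribe(audio: bytes) -> str:
    """ASR step: speech in, plain text out (stub)."""
    return "hello there"  # tone, emotion, prosody are already gone here

def generate_text(prompt: str) -> str:
    """LLM step: text in, text out (stub)."""
    return f"Reply to: {prompt}"

def synthesize(text: str) -> bytes:
    """TTS step: text in, speech out (stub)."""
    return text.encode("utf-8")

def cascaded_pipeline(audio: bytes) -> bytes:
    # Only the transcript reaches the LLM, so expressive cues in the
    # original speech never influence the reply; that's the limitation
    # SPIRIT-LM targets by mixing speech and text tokens in one model.
    return synthesize(generate_text(transcribe(audio)))

if __name__ == "__main__":
    print(cascaded_pipeline(b"\x00\x01"))
```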