NAM-to-Speech Conversion with Multitask-Enhanced Autoregressive Models

Anonymous submission to InterSpeech 2025

Abstract:

We propose an alignment-free, end-to-end Non-Audible Murmur (NAM)-to-Speech conversion model. Existing methods rely on a large number of NAM-text pairs per speaker to generate high-quality alignments for training non-autoregressive models. However, alignment quality deteriorates when these models are trained on multi-speaker data, which limits their ability to generalize and to use the available training data effectively. To address this, we introduce a streamlined autoregressive approach that eliminates the need for explicit alignment learning. By leveraging multi-speaker samples, synthetic training pairs, and multitask character recognition training, our method reduces the word error rate (WER) by 59.19% compared to the state-of-the-art (SOTA) on two public datasets. We demonstrate the model’s zero-shot capability and validate the effectiveness of multitask training through ablation studies.
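For readers unfamiliar with the metric, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance between a reference transcript and a recognized transcript, divided by the number of reference words. The function names, the lowercase-and-split normalization, and the `relative_reduction` helper are illustrative assumptions rather than the evaluation pipeline behind the reported numbers. The abstract does not say whether 59.19% is a relative or an absolute reduction, and the helper shows only the relative form.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Hypothetical helper: fractional WER reduction relative to a baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer
```

For example, `word_error_rate("ask him to give you a kingdom", "ask him to give a kingdom")` counts one deleted word against seven reference words, giving 1/7.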

Proposed Method

...
Illustration of the end-to-end NAM-to-Speech conversion framework. The framework comprises: (1) a Transformer-based NAM-to-unit translation (N2UT) model with a NAM encoder and a discrete speech decoder, (2) multitask networks conditioned on both the encoder and the decoder, and (3) a vocoder trained to convert the predicted units into speech.

Comparing samples while training on all data and synthetic NAM data.

Database | Speaker ID | Ground-truth Text | Input NAM | Ours | StethoSpeech
--- | --- | --- | --- | --- | ---
StethoText | s01_neilnam_IITM_296 | then the crane put the fish down and ate it. | (audio) | (audio) | (audio)
StethoText | s01_neilnam_IITM_658 | a robber must be cruel and have no pity. | (audio) | (audio) | (audio)
StethoText | s01_neilnam_IITM_918 | otherwise, you will disturb his meditation and bring upon you his wrath. | (audio) | (audio) | (audio)
StethoText | s01_neilnam_IITM_1486 | at once, the two lion cubs got ready to attack the elephant. | (audio) | (audio) | (audio)
StethoText | s01_neilnam_IITM_1874 | but he was hungry, so he said, very well, hold tight. | (audio) | (audio) | (audio)
MultiNAM | 1_9 | or in what is purchased with that produce from other nations. | (audio) | (audio) | (audio)
MultiNAM | 3_31 | but if they had all wrought separately and independently. | (audio) | (audio) | (audio)
MultiNAM | 4_2 | is almost always divided among a great number of hands. | (audio) | (audio) | (audio)
MultiNAM | 42_17 | of being a greater than most of his neighbours. | (audio) | (audio) | (audio)
MultiNAM | 48_14 | such taxes, in proportion to what they bring into the public treasury of the state. | (audio) | (audio) | (audio)
MultiNAM | 52_16 | applicable to local and provincial purposes. | (audio) | (audio) | (audio)
MultiNAM | s2_8_17 | there were nine wards in all on the female side, one of them in the attic. | (audio) | (audio) | (audio)
MultiNAM | s2_15_4 | and which had the prescription of long usage. | (audio) | (audio) | (audio)
MultiNAM | s2_17_26 | he was charged with moving something which should not be touched. | (audio) | (audio) | (audio)

Comparing samples on StethoText corpus in a zero-shot setting.

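Ground-truth Text | Input NAM | Ours | StethoSpeech
--- | --- | --- | ---
How old are you, mother. | (audio) | (audio) | (audio)
Let us approach someone holy and knowledgeable. | (audio) | (audio) | (audio)
Ask him to give you a kingdom. | (audio) | (audio) | (audio)
It was believed that he had become a thief. | (audio) | (audio) | (audio)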