We propose an alignment-free, end-to-end Non-Audible Murmur (NAM)-to-Speech conversion model. Existing methods rely on large numbers of NAM-text pairs per speaker to generate high-quality alignments for training non-autoregressive models. However, alignment quality deteriorates when these methods are trained on multi-speaker data, which limits their ability to generalize and to make effective use of the available training data. To address this, we introduce a streamlined autoregressive approach that eliminates the need for explicit alignment learning. By leveraging multi-speaker samples, synthetic training pairs, and multitask character recognition training, our method reduces the word error rate (WER) by 59.19% compared to the state-of-the-art (SOTA) on two public datasets. We demonstrate the model’s zero-shot capability and validate the effectiveness of multitask training through ablation studies.
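
For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate and a relative reduction between two systems could be computed. It is not the evaluation code behind the numbers above: the function names `word_error_rate` and `relative_reduction` are hypothetical, the tokenization is a plain whitespace split without punctuation normalization, and it assumes the reported reduction is relative, which the abstract does not state.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Relative WER reduction in percent (assumed definition)."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Illustrative transcripts only; the hypothesis string is made up.
wer = word_error_rate("ask him to give you a kingdom", "ask him give you a king")
print(f"WER = {wer:.4f}")
```

In this sketch a transcript that drops one word and substitutes another in a seven-word reference scores 2/7; a real evaluation would typically also normalize punctuation and casing before scoring.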
Database | Speaker ID | Ground-truth Text | Input NAM | Ours | StethoSpeech |
---|---|---|---|---|---|
StethoText | s01_neilnam_IITM_296 | then the crane put the fish down and ate it. | |||
StethoText | s01_neilnam_IITM_658 | a robber must be cruel and have no pity. | |||
StethoText | s01_neilnam_IITM_918 | otherwise, you will disturb his meditation and bring upon you his wrath. | |||
StethoText | s01_neilnam_IITM_1486 | at once, the two lion cubs got ready to attack the elephant. | |||
StethoText | s01_neilnam_IITM_1874 | but he was hungry, so he said, very well, hold tight. | |||
MultiNAM | 1_9 | or in what is purchased with that produce from other nations. | |||
MultiNAM | 3_31 | but if they had all wrought separately and independently. | |||
MultiNAM | 4_2 | is almost always divided among a great number of hands. | |||
MultiNAM | 42_17 | of being a greater than most of his neighbours. | |||
MultiNAM | 48_14 | such taxes, in proportion to what they bring into the public treasury of the state. | |||
MultiNAM | 52_16 | applicable to local and provincial purposes. | |||
MultiNAM | s2_8_17 | there were nine wards in all on the female side, one of them in the attic. | |||
MultiNAM | s2_15_4 | and which had the prescription of long usage. | |||
MultiNAM | s2_17_26 | he was charged with moving something which should not be touched. | |||

Ground-truth Text | Input NAM | Ours | StethoSpeech |
---|---|---|---|
How old are you, mother. | |||
Let us approach someone holy and knowledgeable. | |||
Ask him to give you a kingdom. | |||
It was believed that he had become a thief. |