

MrGreen (格林先生) arXiv Daily Academic Digest 2022-12-17





【1】 Style-Label-Free: Cross-Speaker Style Transfer by Quantized VAE and Speaker-wise Normalization in Speech Synthesis

Authors: Chunyu Qiang, Peng Yang, Hao Che, Xiaorui Wang, Zhongyuan Wang
Affiliations: Kwai, Beijing, P.R. China
Comments: Published at ISCSLP 2022
Abstract: Cross-speaker style transfer in speech synthesis aims to transfer a style from a source speaker to synthesised speech in a target speaker's timbre. Most previous approaches rely on data with style labels, but manually annotated labels are expensive and not always reliable. In response to this problem, we propose Style-Label-Free, a cross-speaker style transfer method that transfers style from a source speaker to a target speaker without style labels. First, a reference encoder based on a quantized variational autoencoder (Q-VAE) and a style bottleneck is designed to extract discrete style representations. Second, a speaker-wise batch normalization layer is proposed to reduce source speaker leakage. To improve the style extraction ability of the reference encoder, a style-invariant and contrastive data augmentation method is proposed. Experimental results show that the method outperforms the baseline. We provide a website with audio samples.
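
The abstract does not describe the internals of the speaker-wise normalization layer, so the following is only a minimal sketch of the general idea: normalization statistics are computed separately for each speaker, which removes speaker-dependent offset and scale from the features. The function name `speaker_wise_normalize`, the toy data, and the absence of learnable parameters are assumptions made for illustration, not the authors' implementation.

```python
from collections import defaultdict
import math


def speaker_wise_normalize(features, speaker_ids, eps=1e-5):
    """Normalize each feature dimension with statistics from the same speaker only.

    Hypothetical helper: `features` is a list of equal-length float vectors and
    `speaker_ids` gives the speaker of each vector.
    """
    groups = defaultdict(list)
    for index, speaker in enumerate(speaker_ids):
        groups[speaker].append(index)

    normalized = [None] * len(features)
    for indices in groups.values():
        rows = [features[i] for i in indices]
        dim = len(rows[0])
        # Per-speaker mean and (biased) variance for every feature dimension.
        means = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
        variances = [
            sum((r[d] - means[d]) ** 2 for r in rows) / len(rows)
            for d in range(dim)
        ]
        for i in indices:
            normalized[i] = [
                (features[i][d] - means[d]) / math.sqrt(variances[d] + eps)
                for d in range(dim)
            ]
    return normalized


# Toy data (assumed): speaker "b" has a much larger offset and scale than "a".
features = [[1.0], [3.0], [10.0], [30.0]]
speakers = ["a", "a", "b", "b"]
normalized = speaker_wise_normalize(features, speakers)
print(normalized)
```

After normalization the two speakers occupy the same range, so whatever distinguishes them at this point is no longer their overall offset or scale.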

【2】 Towards trustworthy phoneme boundary detection with autoregressive model and improved evaluation metric

Authors: Hyeongju Kim, Hyeong-Seok Choi
Affiliations: Supertone, Inc., Seoul National University
Comments: 5 pages, submitted to ICASSP 2023
Abstract: Phoneme boundary detection has been studied due to its central role in various speech applications. In this work, we point out that this task needs to be addressed not only through the algorithm but also through the evaluation metric. To this end, we first propose a state-of-the-art phoneme boundary detector that operates in an autoregressive manner, dubbed SuperSeg. Experiments on the TIMIT and Buckeye corpora demonstrate that SuperSeg outperforms existing models by a significant margin in identifying phoneme boundaries. Furthermore, we note a limitation of the popular evaluation metric, R-value, and propose new evaluation metrics that prevent each boundary from contributing to the evaluation multiple times. The proposed metrics reveal the weaknesses of non-autoregressive baselines and establish a reliable criterion suited to evaluating phoneme boundary detection.
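
The limitation mentioned above can be illustrated with a small sketch. It compares a lenient count, where a boundary is credited whenever any counterpart lies within the tolerance, with a greedy one-to-one matching in which each boundary is used at most once. The abstract does not specify the proposed metrics, so the matching rule, the function names, and the toy boundaries (in milliseconds) are assumptions; only the R-value formula follows its standard definition.

```python
import math


def lenient_counts(ref, pred, tol):
    """Count hits where a single boundary may be credited several times."""
    pred_hits = sum(any(abs(p - r) <= tol for r in ref) for p in pred)
    ref_hits = sum(any(abs(p - r) <= tol for p in pred) for r in ref)
    return pred_hits, ref_hits


def one_to_one_hits(ref, pred, tol):
    """Greedy left-to-right matching; every boundary is used at most once."""
    ref, pred = sorted(ref), sorted(pred)
    i = j = hits = 0
    while i < len(ref) and j < len(pred):
        if abs(ref[i] - pred[j]) <= tol:
            hits += 1
            i += 1
            j += 1
        elif pred[j] < ref[i]:
            j += 1  # unmatched prediction (over-segmentation)
        else:
            i += 1  # missed reference boundary
    return hits


def r_value(precision, recall):
    """R-value from precision and recall (hit rate), both non-zero fractions."""
    over_segmentation = recall / precision - 1.0
    r1 = math.sqrt((1.0 - recall) ** 2 + over_segmentation ** 2)
    r2 = (-over_segmentation + recall - 1.0) / math.sqrt(2.0)
    return 1.0 - (abs(r1) + abs(r2)) / 2.0


# Toy example (assumed): three predictions cluster around one reference boundary.
ref = [100, 500]
pred = [90, 105, 110, 900]
tol = 20

pred_hits, ref_hits = lenient_counts(ref, pred, tol)
lenient_r = r_value(pred_hits / len(pred), ref_hits / len(ref))

hits = one_to_one_hits(ref, pred, tol)
strict_r = r_value(hits / len(pred), hits / len(ref))

print(f"lenient R-value: {lenient_r:.3f}, one-to-one R-value: {strict_r:.3f}")
```

In this toy case the lenient count credits all three clustered predictions against the same reference boundary, while the one-to-one matching credits only one of them, so the score drops once repeated credit is disallowed.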

【3】 Jointly Learning Visual and Auditory Speech Representations from Raw Data

Authors: Alexandros Haliassos, Pingchuan Ma, Rodrigo Mira, Stavros Petridis, Maja Pantic
Affiliations: Imperial College London, Meta AI
Comments: 22 pages
Abstract: We present RAVEn, a self-supervised multi-modal approach to jointly learn visual and auditory speech representations. Our pre-training objective involves encoding masked inputs, and then predicting contextualised targets generated by slowly-evolving momentum encoders. Driven by the inherent differences between video and audio, our design is asymmetric w.r.t. the two modalities' pretext tasks: whereas the auditory stream predicts both the visual and auditory targets, the visual one predicts only the auditory targets. We observe strong results in low- and high-resource labelled data settings when fine-tuning the visual and auditory encoders resulting from a single pre-training stage, in which the encoders are jointly trained. Notably, RAVEn surpasses all self-supervised methods on visual speech recognition (VSR) on LRS3, and combining RAVEn with self-training using only 30 hours of labelled data even outperforms a recent semi-supervised method trained on 90,000 hours of non-public data. At the same time, we achieve state-of-the-art results in the LRS3 low-resource setting for auditory speech recognition (as well as for VSR). Our findings point to the viability of learning powerful speech representations entirely from raw video and audio, i.e., without relying on handcrafted features. Code and models will be made public.
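
Two ingredients of the abstract can be sketched compactly: a momentum encoder whose weights slowly track the trained encoder, and the asymmetric choice of prediction targets. The sketch below assumes the common exponential-moving-average form of a momentum update; the momentum value, the flat weight lists, and the names `ema_update` and `loss_terms` are illustrative assumptions rather than details taken from the paper.

```python
# Which targets each student stream predicts, as stated in the abstract:
# the auditory stream predicts both modalities, the visual stream only audio.
PREDICTION_TARGETS = {
    "audio": ("video", "audio"),
    "video": ("audio",),
}


def ema_update(teacher, student, momentum=0.999):
    """Move teacher weights a small step towards the student weights.

    Assumed update rule: teacher <- m * teacher + (1 - m) * student.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def loss_terms():
    """List the (student stream, target stream) pairs that would receive a loss."""
    return [
        (student, target)
        for student, targets in PREDICTION_TARGETS.items()
        for target in targets
    ]


# Toy weights (assumed): the teacher drifts slowly towards a fixed student.
teacher = [0.0, 0.0]
student = [1.0, -1.0]
for _ in range(3):
    teacher = ema_update(teacher, student, momentum=0.9)

print(teacher)
print(loss_terms())
```

A momentum close to 1 makes the targets change slowly, which matches the "slowly-evolving" description; the value 0.9 in the toy loop is only chosen so the drift is visible after a few steps.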


【1】 Towards deep generation of guided wave representations for composite materials

Authors: Mahindra Rautela, J. Senthilnath, Armin Huber, S. Gopalakrishnan
Affiliations: J. Senthilnath is with the Institute for Infocomm Research
Abstract: Laminated composite materials are widely used in most fields of engineering. Wave propagation analysis plays an essential role in understanding the short-duration transient response of composite structures. Forward physics-based models are used to map from the elastic property space to wave propagation behavior in a laminated composite material. Due to the high-frequency, multi-modal, and dispersive nature of guided waves, physics-based simulations are computationally demanding. This makes property prediction, generation, and material design problems more challenging. In this work, a forward physics-based simulator, the stiffness matrix method, is used to collect group velocities of guided waves for a set of composite materials. A variational autoencoder (VAE)-based deep generative model is proposed for the generation of new and realistic polar group velocity representations. The deep generator is able to reconstruct unseen representations with very low mean square reconstruction error. Global Monte Carlo and directional equally-spaced samplers are used to sample the continuous, complete and organized low-dimensional latent space of the VAE. Each sampled point is fed into the trained decoder to generate new polar representations. The network has shown exceptional generation capabilities. The latent space also forms a conceptual space in which different directions and regions show inherent patterns related to the generated representations and their corresponding material properties.
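
The two latent-space samplers named in the abstract can be sketched as follows, under stated assumptions: the global Monte Carlo sampler is taken to draw from a standard normal prior, and the directional sampler is taken to place equally spaced points along a chosen unit direction through the origin. The latent dimension, the sampling range, and the function names are hypothetical, and the trained decoder that would turn each point into a polar representation is omitted.

```python
import math
import random


def monte_carlo_samples(count, dim, seed=0):
    """Draw `count` latent points from a standard normal prior (assumed prior)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def directional_samples(direction, start, stop, count):
    """Place `count` equally spaced points along `direction` through the origin.

    `start` and `stop` are signed distances from the origin; `count` must be >= 2.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    unit = [c / norm for c in direction]
    step = (stop - start) / (count - 1)
    return [[(start + k * step) * u for u in unit] for k in range(count)]


# Toy usage (assumed 2-D latent space); each point would be fed to the decoder.
global_points = monte_carlo_samples(count=4, dim=2, seed=42)
line_points = directional_samples([3.0, 4.0], start=-2.0, stop=2.0, count=5)

print(global_points)
print(line_points)
```

Sampling along a fixed direction is what makes it possible to inspect how generated representations change gradually from one region of the latent space to another.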




