Human Language Technology


Speech Recognition

Speech is the most natural form of human communication. It carries information about words, meaning, speaker identity, emotion, and other properties. We study techniques and algorithms for automatic speech recognition, speaker recognition, language recognition, and the recognition of other paralinguistic properties of speech. We specialize in enabling technologies for multi-lingual, code-switching, and conversational speech applications. We have participated in the NIST speaker recognition and language recognition evaluations since 2006, the ASVspoof anti-spoofing challenge series since 2015, the Computational Paralinguistics ChallengE (ComParE) since 2019, and the Fearless Steps Challenge (FS-I) in 2019. In 2020, we organized the “Far-Field Speaker Verification Challenge 2020” and a special session on “The Attacker’s Perspective on Automatic Speaker Verification” at INTERSPEECH 2020.

Speech Synthesis

Speech synthesis is a significant aspect of artificial intelligence. We study novel algorithms for text-to-speech synthesis and voice conversion. In text-to-speech synthesis, we focus on expressive rendering, prosodic quality, and multi-lingual synthesis. In voice conversion, we focus on the disentanglement and manipulation of voice identity, emotion, and accent. We have published in INTERSPEECH, ICASSP, ASRU, Speech Communication, and IEEE/ACM TASLP. We have contributed to and participated in a number of international technology evaluations, achieving good results in Blizzard Challenge 2018, ZeroSpeech Challenge 2019, Blizzard Challenge 2020, and Voice Conversion Challenge 2020. The team co-organized the Voice Conversion Challenge 2020 and delivered tutorials on voice conversion at INTERSPEECH 2018 and APSIPA ASC 2020.

Neuromorphic Computing

The human brain processes electrical impulses (spikes) using massively connected spiking neurons. Spiking neural networks (SNNs) are biologically inspired and grounded in a solid scientific framework, and they aim to mimic and achieve some of the advantages of biological neural systems. Neuromorphic computing, as a non-von Neumann computing paradigm, realizes the event-driven computation of biological neural systems with SNNs in silicon. Emerging neuromorphic computing architectures leverage massively parallel, low-power computing units to support spike-based information processing, which has great potential to improve the energy and computational efficiency of today’s digital computers in perceptual and cognitive tasks. We study neural coding, precise-spike-driven learning for synaptic plasticity, neuron modeling for analog memory element characteristics, and memory and information retrieval. We specialize in developing ultra-low-power, brain-inspired computing systems using emerging neuromorphic computing technologies to tackle challenging perceptual and cognitive tasks. We participated in the 2018 and 2019 International Collegiate Competition for Brain-inspired Computing and won the first and second prize, respectively.
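As a minimal illustration of the spike-based, event-driven computation described above, the sketch below simulates a single leaky integrate-and-fire (LIF) neuron, a common textbook neuron model. It is not a model taken from our systems: the function name `simulate_lif`, its parameters, and their default values are assumptions chosen only for illustration.

```python
def simulate_lif(input_current, dt=1.0, tau=20.0,
                 v_rest=0.0, v_reset=0.0, v_threshold=1.0):
    """Simulate one leaky integrate-and-fire neuron (illustrative sketch).

    input_current: sequence of input values, one per time step.
    Returns (spike_times, voltage_trace), where spike_times holds the
    indices of the time steps at which the neuron fired.
    """
    v = v_rest
    spike_times = []
    voltage_trace = []
    for t, i_in in enumerate(input_current):
        # Forward-Euler step of dv/dt = (-(v - v_rest) + i_in) / tau:
        # the membrane potential leaks toward v_rest and integrates input.
        v += dt * (-(v - v_rest) + i_in) / tau
        if v >= v_threshold:
            # Threshold crossing: emit a spike (an event) and reset.
            spike_times.append(t)
            v = v_reset
        voltage_trace.append(v)
    return spike_times, voltage_trace


if __name__ == "__main__":
    # A constant drive above threshold produces a regular spike train;
    # a drive below threshold produces no output events at all.
    strong, _ = simulate_lif([2.0] * 100)
    weak, _ = simulate_lif([0.5] * 100)
    print("spikes (strong input):", strong)
    print("spikes (weak input):", weak)
```

The neuron produces output only when its membrane potential crosses the threshold, so downstream units have work to do only when a spike arrives. This sparsity of events is what the energy-efficiency argument for neuromorphic hardware rests on.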

Human-Robot Interaction

Spoken language is the primary mode of Human-Robot Interaction (HRI). We explore novel algorithms that integrate machine listening intelligence into robotic audition and collaborative dialogue interaction. In robotic audition, we study audio-visual sound localization to provide accurate direction and distance information; speaker recognition, so that robots respond only to intended speakers; and integrated audition solutions for far-field speech acquisition. We also investigate novel end-to-end speech-to-intent recognition that allows users to speak to robots in free-form speech, and even in different languages. To establish collaborative dialogue interaction between humans and robots through speech, we address the issues involved in deploying a domain-specific speech recognition engine in noisy industrial environments, which are challenging in practice.

Selective Auditory Attention

Speech is easily corrupted by interfering speakers, environmental noise, and reverberation, and the performance of speech technologies degrades significantly under these adverse conditions. We develop algorithms that emulate the human ability of selective auditory attention: to locate and separate speech sources, to extract a target speaker from a multi-talker cocktail party, and to remove background noise and reverberation. We have developed state-of-the-art speaker extraction systems, such as SpEx. By applying speaker extraction algorithms, we greatly improve the performance of downstream speech applications, such as speech recognition and speaker recognition. We also study algorithms that detect a listener’s auditory attention, connecting human brains with auditory attention technologies for innovative applications such as smart hearing aids.

Natural Language Processing

Natural language processing (NLP) is an interdisciplinary field of linguistics, computer science, and artificial intelligence. It concerns the interactions between computers and human language, in particular how to build intelligent machines that process and analyze large amounts of unstructured text data. We specialize in both foundational and application-oriented research, such as cross-lingual embedding, language modeling, and dialogue systems. We study challenging NLP research problems, such as how to align text in different languages into a meaningful semantic space and how to automatically evaluate the quality of open-domain chatbots. Through active participation in relevant workshops and challenges, such as JSALT 2020, and a growing presence at top-tier NLP conferences, we aim to stay at the forefront of NLP research.

Multimedia Processing

Humans naturally interact with each other and the environment through visual and auditory media. We believe it is vital to study this interaction using multiple modalities. Auditory interactions include different media such as speech, music, and environmental sounds. We study techniques and algorithms for controlled audio and music synthesis, music information retrieval, and multimedia speech perception. We specialize in enabling technologies for applications such as singing voice evaluation, lyrics-to-audio alignment and transcription, multimedia speech emotion understanding, multimedia speech enhancement, and speaker recognition and diarization. We have participated in the MIREX lyrics-to-audio alignment evaluation since 2018, and our system was the best-performing entry in 2019. Our research in singing voice evaluation has led to the incorporation of the NUS spin-off MuSigPro Pte. Ltd., funded by the NUS Graduate Research Innovation Program (GRIP).