Text-To-Speech
Our Text-To-Speech component processes text and plays audio with matching facial animation.
Approach
The realtime rig is set up by the loader class and imports all FACS and blendshapes from the didimo package. We use a third-party service (Amazon Polly) to generate the audio together with timing data for each phoneme. The viseme player uses this timing data to select the matching viseme while the animation plays. Amazon Polly also accepts SSML (Speech Synthesis Markup Language) input, which lets you control aspects of speech such as pronunciation, volume, pitch, and speech rate.
For more information, see Generating Speech from SSML Documents in the Amazon Polly documentation.
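As an illustration of the timing data mentioned above, the following sketch assumes it arrives in Amazon Polly's speech-mark format: one JSON object per line, with a `time` in milliseconds, a `type`, and a `value`. The `SpeechMark` and `SpeechMarkParser` names are hypothetical and not part of the Didimo SDK, and `System.Text.Json` is used here only to keep the example self-contained; a Unity project may prefer a different JSON parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical container for one speech mark; not an SDK type.
public sealed class SpeechMark
{
    [JsonPropertyName("time")]
    public int TimeMs { get; set; }          // milliseconds from the start of the audio

    [JsonPropertyName("type")]
    public string Type { get; set; } = "";   // e.g. "viseme", "word", "sentence"

    [JsonPropertyName("value")]
    public string Value { get; set; } = "";  // for visemes: the symbol, e.g. "p" or "sil"
}

public static class SpeechMarkParser
{
    // Parses line-delimited JSON and keeps only the viseme entries.
    public static List<SpeechMark> ParseVisemes(string jsonLines)
    {
        var visemes = new List<SpeechMark>();
        foreach (string line in jsonLines.Split('\n'))
        {
            string trimmed = line.Trim();
            if (trimmed.Length == 0)
            {
                continue; // skip blank lines
            }

            SpeechMark mark = JsonSerializer.Deserialize<SpeechMark>(trimmed);
            if (mark != null && mark.Type == "viseme")
            {
                visemes.Add(mark);
            }
        }
        return visemes;
    }
}
```

Each returned entry pairs a viseme symbol with the time at which it starts, which is the information a viseme player needs to pick a pose for a given playback time.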
Visemes
We use Amazon's visemes, which are based on a mesh called Rin.
Rin has 1 silence pose, 16 viseme poses (plus one more that is not used), a smile, and a frown.
Frame | Viseme |
---|---|
0 | silence |
8 | p |
16 | t |
24 | SS |
32 | TT |
40 | f |
48 | k |
56 | i |
64 | r |
72 | s |
80 | u |
88 | amper |
96 | a |
104 | e |
112 | EE |
120 | o |
128 | OO |
NOTE: The frame positions in the table were used by the deprecated animation system. The realtime rig now handles the visemes directly.
Phonemes
The visemes map to the facial rig's phonemes as follows:
Viseme | Phoneme |
---|---|
p | phoneme_p_b_m |
t | phoneme_d_t_n |
SS | phoneme_s_z |
TT | phoneme_d_t_n |
f | phoneme_f_v |
k | phoneme_k_g_ng |
i | phoneme_ay |
r | phoneme_r |
s | phoneme_s_z |
u | phoneme_ey_eh_uh |
amper | phoneme_p_b_m |
a | phoneme_aa |
e | phoneme_ae_ax_ah |
EE | phoneme_ey_eh_uh |
o | phoneme_ao |
OO | phoneme_aw |
Setup Of The Viseme Player
The instantiation tool handles the setup of the realtime rig and the realtime rig avatar, which the viseme player uses to play text-to-speech animations.
Currently, the realtime rig holds 23 blendshapes to match the phonemes referenced in the section above.
The viseme player can then control the active viseme through the rig.
    // Viseme Player Class
    private void Update()
    {
        // Reset the pose on update. If we have other animations playing (e.g. idle animations), this won't override them
        if (audioSource != null && audioSource.isPlaying && visemes != null && isDeprecatedTemplate)
        {
            ResetPose();
        }

        // Drive the realtime rig from the current audio playback time.
        // The null check guards against a missing audio source.
        if (didimoIsSpeaking && audioSource != null && audioSource.isPlaying && !isDeprecatedTemplate)
        {
            UpdatePoseForTimeImpl(audioSource.time);
        }
    }

    void UpdatePoseForTimeImpl(float time)
    {
        // Collect the visemes (and their weights) that are active at this time
        List<InterpolationViseme> visemesToInterpolate = GetVisemesForInterpolation(time + visemeOffset);
        realtimeRig.ResetAll();
        foreach (InterpolationViseme interpolationViseme in visemesToInterpolate)
        {
            PlayMatchingRealtimeRigVisemeFromPhoneme(interpolationViseme.animation, interpolationViseme.weight);
        }
    }

    public void PlayMatchingRealtimeRigVisemeFromPhoneme(string phoneme, float weight)
    {
        string viseme_name;
        // Map between the name of the phoneme and the name of the viseme in the FACS we support in the realtime rig
        // (string switch statements are compiled to hash-based lookups)
        switch (phoneme)
        {
            case "sil": viseme_name = ""; break;
            case "p": viseme_name = "phoneme_p_b_m"; break;
            case "t": viseme_name = "phoneme_d_t_n"; break;
            case "SS": viseme_name = "phoneme_s_z"; break;
            case "TT": viseme_name = "phoneme_d_t_n"; break;
            case "f": viseme_name = "phoneme_f_v"; break;
            case "k": viseme_name = "phoneme_k_g_ng"; break;
            case "i": viseme_name = "phoneme_ay"; break;
            case "r": viseme_name = "phoneme_r"; break;
            case "s": viseme_name = "phoneme_s_z"; break;
            case "u": viseme_name = "phoneme_ey_eh_uh"; break;
            case "&": viseme_name = "phoneme_aa"; break;
            case "@": viseme_name = "phoneme_aa"; break;
            case "a": viseme_name = "phoneme_ae_ax_ah"; break;
            case "e": viseme_name = "phoneme_ey_eh_uh"; break;
            case "EE": viseme_name = "phoneme_ao"; break;
            case "o": viseme_name = "phoneme_ao"; break;
            case "OO": viseme_name = "phoneme_aw"; break;
            default: viseme_name = ""; break;
        }
        realtimeRig.SetBlendshapeWeightsForFac(viseme_name, weight, true);
    }
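The listing above relies on `GetVisemesForInterpolation`, whose implementation is not shown here. As a rough sketch of one way such a lookup could work, the example below linearly cross-fades from the most recent viseme to the next one. This is an assumption for illustration only: `TimedViseme`, `WeightedViseme`, and `VisemeBlend` are hypothetical names, and the SDK's actual interpolation may differ.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types for illustration; the SDK's InterpolationViseme may differ.
public readonly struct TimedViseme
{
    public readonly float Time;   // seconds from the start of the audio
    public readonly string Name;  // viseme symbol, e.g. "p"
    public TimedViseme(float time, string name) { Time = time; Name = name; }
}

public readonly struct WeightedViseme
{
    public readonly string Name;
    public readonly float Weight; // 0..1
    public WeightedViseme(string name, float weight) { Name = name; Weight = weight; }
}

public static class VisemeBlend
{
    // Returns the visemes active at `time`, cross-fading linearly from the
    // current viseme to the next one. `visemes` must be sorted by Time.
    public static List<WeightedViseme> At(IReadOnlyList<TimedViseme> visemes, float time)
    {
        var result = new List<WeightedViseme>();
        if (visemes.Count == 0 || time < visemes[0].Time)
        {
            return result; // nothing is active yet
        }

        // Find the last viseme that starts at or before `time`.
        int current = 0;
        while (current + 1 < visemes.Count && visemes[current + 1].Time <= time)
        {
            current++;
        }

        // After the final viseme there is nothing to blend towards.
        if (current + 1 == visemes.Count)
        {
            result.Add(new WeightedViseme(visemes[current].Name, 1f));
            return result;
        }

        float start = visemes[current].Time;
        float end = visemes[current + 1].Time;   // strictly greater than `time`
        float t = (time - start) / (end - start);
        result.Add(new WeightedViseme(visemes[current].Name, 1f - t));
        result.Add(new WeightedViseme(visemes[current + 1].Name, t));
        return result;
    }
}
```

With visemes starting at 0.0 s ("sil"), 0.1 s ("p"), and 0.3 s ("a"), a query at 0.2 s falls halfway between "p" and "a", so both are returned with roughly equal weight. Each pair could then be passed to a method such as `PlayMatchingRealtimeRigVisemeFromPhoneme`.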
We include a fully working user interface in the Unity SDK. To learn more, explore the sample scene, which shows how to use our scripts to connect to the Didimo API, instantiate didimos, and play text-to-speech animations.
Go to Exploring the sample scenes
Last updated on 2020-10-06