Auto segmentation for Malay speech corpus
| Main Authors: | , |
|---|---|
| Format: | Conference or Workshop Item |
| Published: | 2012 |
| Online Access: | http://eprints.utm.my/36446/ |
| Summary: | Abstract: This paper deals with the automatic segmentation of a Malay continuous speech database. Automatic segmentation is the process of dividing speech into a sequence of discrete segments, each with particular characteristics that remain constant within it. In terms of quality, hand-crafted segmentation would be the best method. However, for a large database, manual speech segmentation and labeling become a tremendous task that is time consuming and error prone. Moreover, even if the database is segmented by an expert, the segmentation rules may be subjective and not reproducible, and different linguistic experts may produce inconsistent results. Thus, an automated segmentation rule was drawn up to segment the large-scale database consistently with a satisfactory level of quality. Automated segmentation of Malay syllables is not a difficult task because all syllables in Malay are pronounced almost equally and, like English, Malay is not a tonal language. Identifying and manipulating the segment boundaries in Malay is straightforward and easy to understand. For the segmentation, an HMM-based approach with an adapted Viterbi forced alignment technique is used. A composite HMM with Baum-Welch re-estimation was used to ease the process of phonetic segmentation. All data from the database were fed directly into the segmentation tool, without any prior training samples for pre-training. For sentence coverage, the database script consists of 1000 sentences: 620 sentences were selected from a primary school Malay Language textbook and 380 sentences were generated using the 70% highest-frequency words that appear in a 10-million-word online digital text. This script configuration provides a phonetically balanced database that covers all the vowels and consonants. An objective evaluation method is used to assess performance. The automatic segmentation results were verified to determine their accuracy and overall quality. The results were also tested perceptually and found to be of satisfactorily high quality. |
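
The summary names Viterbi forced alignment as the core segmentation step. The following is a minimal sketch of that general technique, not the paper's tool. It assumes the per-frame log-likelihoods of each transcription unit are already available (in practice they would come from trained HMMs), uses one state per unit, and ignores transition probabilities. The names `forced_align` and `boundaries` are hypothetical.

```python
import math


def forced_align(log_likes):
    """Align frames to a known left-to-right sequence of units.

    log_likes[t][s] is the (assumed, precomputed) log-likelihood of frame t
    under the s-th unit of the known transcription. Returns one unit index
    per frame; the path starts at unit 0, ends at the last unit, and at each
    frame either stays in the current unit or advances to the next one.
    """
    n_frames = len(log_likes)
    n_units = len(log_likes[0])
    if n_frames < n_units:
        raise ValueError("need at least one frame per unit")

    neg_inf = -math.inf
    score = [[neg_inf] * n_units for _ in range(n_frames)]
    back = [[0] * n_units for _ in range(n_frames)]
    score[0][0] = log_likes[0][0]

    for t in range(1, n_frames):
        for s in range(n_units):
            stay = score[t - 1][s]
            move = score[t - 1][s - 1] if s > 0 else neg_inf
            if move > stay:
                best, prev = move, s - 1
            else:
                best, prev = stay, s
            if best > neg_inf:  # skip states that are not reachable yet
                score[t][s] = best + log_likes[t][s]
                back[t][s] = prev

    # Backtrace from the final unit at the final frame.
    path = [n_units - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


def boundaries(path):
    """Frame indices at which the aligned unit changes."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]
```

Because the unit sequence is fixed by the transcription, the search only decides where each unit begins and ends, which is what makes the resulting boundaries reproducible across runs.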