Speech coding

Speech coding is an application of data compression to digital audio signals containing speech. It combines speech-specific parameter estimation, based on audio signal processing techniques that model the speech signal, with generic data compression algorithms that represent the resulting model parameters in a compact bitstream.[1]

Common applications of speech coding include mobile telephony and voice over IP (VoIP).[2] The most widely used speech coding technique in mobile telephony is linear predictive coding (LPC), while the most widely used techniques in VoIP applications are LPC and the modified discrete cosine transform (MDCT).

The techniques employed in speech coding are similar to those used in audio data compression and audio coding, where knowledge of psychoacoustics is used to transmit only data that is relevant to the human auditory system. For example, in voiceband speech coding, only information in the frequency band 400 Hz to 3500 Hz is transmitted, but the reconstructed signal is still adequate for intelligibility.

Speech coding differs from other forms of audio coding in that speech is a simpler signal than most other audio signals, and much more statistical information is available about its properties. As a result, some auditory information that is relevant in audio coding can be unnecessary in the speech coding context. In speech coding, the most important criterion is preservation of the intelligibility and "pleasantness" of speech with a constrained amount of transmitted data.[3]

In addition, most speech applications require low coding delay, as long coding delays interfere with speech interaction.[4]

Categories

Speech coders are of two types:[5]

  1. Waveform coders
  2. Vocoders

Sample companding viewed as a form of speech coding

From this point of view, the A-law and μ-law algorithms (G.711) used in traditional PCM digital telephony can be seen as an early precursor of speech encoding, requiring only 8 bits per sample but giving effectively 12 bits of resolution.[6] The logarithmic companding laws are consistent with human hearing perception in that low-amplitude noise is heard alongside a low-amplitude speech signal but is masked by a high-amplitude one. Although this would generate unacceptable distortion in a music signal, the peaky nature of speech waveforms, combined with the simple frequency structure of speech as a periodic waveform having a single fundamental frequency with occasional added noise bursts, makes these very simple instantaneous compression algorithms acceptable for speech.
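The idea of logarithmic companding can be illustrated with a minimal Python sketch. It uses the continuous μ-law formula with μ = 255 followed by a uniform 8-bit quantizer. Both choices are assumptions made for illustration: G.711 itself specifies a segmented, piecewise-linear approximation, so this sketch is not bit-exact with the standard, and the function names are hypothetical.

```python
import math

MU = 255  # assumed value of the mu parameter; not taken from the article


def mulaw_compress(x: float, mu: int = MU) -> float:
    """Map x in [-1, 1] to [-1, 1], expanding small amplitudes logarithmically."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mulaw_expand(y: float, mu: int = MU) -> float:
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def encode(x: float) -> int:
    """Clip, compress, then quantize uniformly to an 8-bit code in 0..255."""
    y = mulaw_compress(max(-1.0, min(1.0, x)))
    return int(round((y + 1.0) / 2.0 * 255))


def decode(code: int) -> float:
    """Map an 8-bit code back to an approximate sample value."""
    return mulaw_expand(code / 255 * 2.0 - 1.0)


if __name__ == "__main__":
    for sample in (0.01, 0.1, 0.9):
        restored = decode(encode(sample))
        print(sample, encode(sample), abs(restored - sample))
```

In this sketch the absolute round-trip error grows with the amplitude of the sample, so quiet passages are quantized more finely than loud ones, which is the property the paragraph above relies on.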

A wide variety of other algorithms were tried at the time, mostly delta modulation variants, but after careful consideration, the A-law/μ-law algorithms were chosen by the designers of the early digital telephony systems. At the time of their design, their 33% bandwidth reduction (8 bits per sample in place of 12) at very low complexity made an excellent engineering compromise. Their audio performance remains acceptable, and there has been no need to replace them in the fixed telephone network.

In 2008, the G.711.1 codec, which has a scalable structure, was standardized by ITU-T. Its input sampling rate is 16 kHz.

Modern speech compression

Much of the later work in speech compression was motivated by military research into digital communications for secure military radios, where very low data rates were required to allow effective operation in a hostile radio environment. At the same time, far more processing power was available, in the form of VLSI circuits, than had been available for earlier compression techniques. As a result, modern speech compression algorithms could use far more complex techniques than were available in the 1960s to achieve far higher compression ratios.

These techniques were available through the open research literature to be used for civilian applications, allowing the creation of digital mobile phone networks with substantially higher channel capacities than the analog systems that preceded them.

The most widely used speech coding algorithms are based on linear predictive coding (LPC).[7] In particular, the most common speech coding scheme is the LPC-based code-excited linear prediction (CELP) coding, which is used, for example, in the GSM standard. In CELP, the modelling is divided into two stages: a linear predictive stage that models the spectral envelope, and a codebook-based model of the residual of the linear predictive model. The linear prediction coefficients are computed and quantized, usually as line spectral pairs (LSPs). In addition to the actual speech coding of the signal, it is often necessary to use channel coding for transmission, to avoid losses due to transmission errors. Usually, speech coding and channel coding methods have to be chosen in pairs, with the more important bits in the speech data stream protected by more robust channel coding, in order to get the best overall coding results.
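The linear predictive stage can be sketched in Python as follows. The sketch estimates predictor coefficients from one frame using the autocorrelation method and the Levinson-Durbin recursion, then computes the prediction residual that a codebook stage would go on to model. The choice of this estimation method is an assumption (the text above does not say which one a given codec uses), the function names are hypothetical, and quantization, conversion to LSPs, and the codebook search are all left out.

```python
def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one frame (no windowing applied)."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor
    x_hat[n] = sum(a[k] * x[n - 1 - k] for k in range(order)).
    Returns (a, err), where err is the remaining prediction error energy."""
    a = [0.0] * order
    err = r[0]
    for i in range(order):
        # Reflection coefficient for model order i + 1.
        acc = r[i + 1] - sum(a[j] * r[i - j] for j in range(i))
        k = acc / err
        updated = a[:]
        updated[i] = k
        for j in range(i):
            updated[j] = a[j] - k * a[i - 1 - j]
        a = updated
        err *= 1.0 - k * k
    return a, err


def residual(frame, a):
    """Prediction error e[n] = x[n] - x_hat[n], treating samples before the frame as zero."""
    out = []
    for n in range(len(frame)):
        pred = sum(a[k] * frame[n - 1 - k]
                   for k in range(len(a)) if n - 1 - k >= 0)
        out.append(frame[n] - pred)
    return out


if __name__ == "__main__":
    # Synthetic frame: a decaying exponential, which a first-order predictor fits.
    frame = [0.5 ** n for n in range(50)]
    coeffs, err = levinson_durbin(autocorrelation(frame, 2), 2)
    print(coeffs, err)
    print(residual(frame, coeffs)[:4])
```

For the synthetic frame in the example, nearly all of the energy ends up in the first residual sample, which illustrates why coding the residual separately from the spectral envelope can be economical.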

The modified discrete cosine transform (MDCT), a type of discrete cosine transform (DCT) algorithm, was adapted into a speech coding algorithm called LD-MDCT, used for the AAC-LD format introduced in 1999.[8] MDCT has since been widely adopted in voice-over-IP (VoIP) applications, such as the G.729.1 wideband audio codec introduced in 2006,[9] Apple's FaceTime (using AAC-LD) introduced in 2010,[10] and the CELT codec introduced in 2011.[11]
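As a rough illustration of the transform itself, the following Python sketch evaluates the MDCT and its inverse directly from their defining sums and reconstructs a signal by windowed overlap-add. The sine window, the block size, and the function names are assumptions chosen for illustration. Codecs such as AAC-LD or CELT use fast algorithms, their own window shapes, and quantize the coefficients, none of which is shown here.

```python
import math


def sine_window(N):
    """Sine window of length 2N; it satisfies w[n]**2 + w[n + N]**2 == 1."""
    return [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]


def mdct(block, N):
    """Direct O(N^2) MDCT: 2N input samples give N coefficients."""
    return [sum(block[n] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                for n in range(2 * N))
            for k in range(N)]


def imdct(coeffs, N):
    """Inverse MDCT: N coefficients give 2N time-aliased samples.
    The 2/N scale matches a window pair whose squares sum to one."""
    return [2.0 / N * sum(coeffs[k] * math.cos(math.pi / N * (n + 0.5 + N / 2) * (k + 0.5))
                          for k in range(N))
            for n in range(2 * N)]


def analyze_synthesize(signal, N):
    """Transform 50%-overlapping blocks and overlap-add the inverse transforms."""
    w = sine_window(N)
    pad = (-len(signal)) % N
    # Zero padding ensures every signal sample is covered by two blocks.
    padded = [0.0] * N + list(signal) + [0.0] * (pad + N)
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - 2 * N + 1, N):
        block = [padded[start + n] * w[n] for n in range(2 * N)]
        coeffs = mdct(block, N)  # N coefficients per hop of N samples
        recon = imdct(coeffs, N)
        for n in range(2 * N):
            out[start + n] += recon[n] * w[n]
    return out[N:N + len(signal)]


if __name__ == "__main__":
    signal = [math.sin(0.2 * n) for n in range(30)]
    restored = analyze_synthesize(signal, 8)
    print(max(abs(a - b) for a, b in zip(signal, restored)))
```

Each block of 2N samples yields only N coefficients, yet the aliasing introduced by one block is cancelled by its neighbours during overlap-add, so the overlapped transform does not increase the number of values to be coded.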

Opus is a free software speech coder. It combines both the MDCT and LPC audio compression algorithms.[12] It is widely used for VoIP calls in WhatsApp.[13][14][15] The PlayStation 4 video game console also uses the CELT/Opus codec for its PlayStation Network system party chat.[16]

Codec2 is another free software speech coder, which achieves very low bit rates, down to 700 bit/s.[17]

Sub-fields

Wideband audio coding
Narrowband audio coding

References

  1. M. Arjona Ramírez and M. Minami, "Low bit rate speech coding," in Wiley Encyclopedia of Telecommunications, J. G. Proakis, Ed., New York: Wiley, 2003, vol. 3, pp. 1299-1308.
  2. M. Arjona Ramírez and M. Minami, “Technology and standards for low-bit-rate vocoding methods,” in The Handbook of Computer Networks, H. Bidgoli, Ed., New York: Wiley, 2011, vol. 2, pp. 447–467.
  3. P. Kroon, "Evaluation of speech coders," in Speech Coding and Synthesis, W. Bastiaan Kleijn and K. K. Paliwal, Ed., Amsterdam: Elsevier Science, 1995, pp. 467-494.
  4. J. H. Chen, R. V. Cox, Y.-C. Lin, N. S. Jayant, and M. J. Melchner, A low-delay CELP coder for the CCITT 16 kb/s speech coding standard. IEEE J. Select. Areas Commun. 10(5): 830-849, June 1992.
  5. Soo Hyun Bae, ECE 8873 Data Compression & Modeling, Georgia Institute of Technology, 2004.
  6. N. S. Jayant and P. Noll, Digital coding of waveforms. Englewood Cliffs: Prentice-Hall, 1984.
  7. Gupta, Shipra (May 2016). "Application of MFCC in Text Independent Speaker Recognition" (PDF). International Journal of Advanced Research in Computer Science and Software Engineering. 6 (5): 805-810 (806). ISSN 2277-128X. Retrieved 18 October 2019.
  8. Schnell, Markus; Schmidt, Markus; Jander, Manuel; Albert, Tobias; Geiger, Ralf; Ruoppila, Vesa; Ekstrand, Per; Grill, Bernhard (October 2008). MPEG-4 Enhanced Low Delay AAC - A New Standard for High Quality Communication (PDF). 125th AES Convention. Fraunhofer IIS. Audio Engineering Society. Retrieved 20 October 2019.
  9. Nagireddi, Sivannarayana (2008). VoIP Voice and Fax Signal Processing. John Wiley & Sons. p. 69. ISBN 9780470377864.
  10. Daniel Eran Dilger (June 8, 2010). "Inside iPhone 4: FaceTime video calling". AppleInsider. Retrieved June 9, 2010.
  11. Presentation of the CELT codec by Timothy B. Terriberry (65 minutes of video, see also presentation slides in PDF)
  12. Valin, Jean-Marc; Maxwell, Gregory; Terriberry, Timothy B.; Vos, Koen (October 2013). High-Quality, Low-Delay Music Coding in the Opus Codec. 135th AES Convention. Audio Engineering Society. arXiv:1602.04845.
  13. Leyden, John (27 October 2015). "WhatsApp laid bare: Info-sucking app's innards probed". The Register. Retrieved 19 October 2019.
  14. Hazra, Sudip; Mateti, Prabhaker (September 13–16, 2017). "Challenges in Android Forensics". In Thampi, Sabu M.; Pérez, Gregorio Martínez; Westphall, Carlos Becker; Hu, Jiankun; Fan, Chun I.; Mármol, Félix Gómez (eds.). Security in Computing and Communications: 5th International Symposium, SSCC 2017. Springer. pp. 286–299 (290). doi:10.1007/978-981-10-6898-0_24. ISBN 9789811068980.
  15. Srivastava, Saurabh Ranjan; Dube, Sachin; Shrivastaya, Gulshan; Sharma, Kavita (2019). "Smartphone Triggered Security Challenges: Issues, Case Studies and Prevention". In Le, Dac-Nhuong; Kumar, Raghvendra; Mishra, Brojo Kishore; Chatterjee, Jyotir Moy; Khari, Manju (eds.). Cyber Security in Parallel and Distributed Computing: Concepts, Techniques, Applications and Case Studies. Cyber Security in Parallel and Distributed Computing. John Wiley & Sons. pp. 187–206 (200). doi:10.1002/9781119488330.ch12. ISBN 9781119488057.
  16. "Open Source Software used in PlayStation®4". Sony Interactive Entertainment Inc. Retrieved 2017-12-11.
  17. "GitHub - Codec2". November 2019.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.