Humans, Technology, and Language Identification
Updated: Feb 27
Previously in this blog, we talked about how humans recognize and identify spoken language. We can take it a step further and look at how technology can do the same. Technologies that currently rely on language identification include automatic audio and text translation tools such as Google Translate, speech recognition devices such as Alexa and Google Home, and multilingual search functions for documents or media files. However, language identification is not an easy task for technology. A system must be trained to recognize and analyze the distinctive features of many languages, and that training requires a large amount of language data collected from humans. The effort is worthwhile: language identification technology can make communication easier, give people access to resources in different languages, and break down global language barriers.
NameThatLanguage, a game on the Lingo Boingo website, attempts to collect data on how accurately humans can identify languages. It includes clips in which the spoken language is known (confirmed by expert language annotation) and clips in which the spoken language is suspected but not confirmed. The initial version of the game included at least 80 known audio clips in 13 languages and approximately 600 suspected clips in each of 9 of those languages. For the known clips, the participants’ answers show whether people reliably agree with the expert language annotation. For the suspected clips, the participants’ answers either agree or disagree with the suspected language.
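To make the idea of agreement concrete, here is a minimal sketch in Python of how per-clip agreement could be computed. It is not the game's actual code: the function name `agreement_rates`, the clip IDs, and the sample answers are all hypothetical, and the same calculation is assumed to work for both known clips (against the expert label) and suspected clips (against the suspected label).

```python
from collections import defaultdict

def agreement_rates(judgments, reference_labels):
    """Return, for each clip, the fraction of guesses that match its reference label.

    judgments: iterable of (clip_id, guessed_language) pairs (hypothetical format).
    reference_labels: dict mapping clip_id to the expert or suspected language.
    """
    matches = defaultdict(int)
    totals = defaultdict(int)
    for clip_id, guess in judgments:
        if clip_id not in reference_labels:
            continue  # skip clips that have no reference label
        totals[clip_id] += 1
        if guess == reference_labels[clip_id]:
            matches[clip_id] += 1
    return {clip_id: matches[clip_id] / totals[clip_id] for clip_id in totals}

# Made-up example data, for illustration only.
reference_labels = {"clip_001": "Hindi", "clip_002": "Tagalog"}
judgments = [
    ("clip_001", "Hindi"), ("clip_001", "Urdu"),
    ("clip_001", "Hindi"), ("clip_001", "Hindi"),
    ("clip_002", "Tagalog"), ("clip_002", "Tagalog"),
]

rates = agreement_rates(judgments, reference_labels)
for clip_id, rate in sorted(rates.items()):
    print(f"{clip_id}: {rate:.0%} of guesses match the reference label")
```

A high rate on a suspected clip would count as evidence for the suspected language, while a low rate would flag the clip for a closer look.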
After listening to an audio clip, the participant picks an answer from a list of five or six languages. Tracking people’s choices in this multiple-choice format can show which languages people confuse with one another. For example, if the language spoken in an audio clip is Hindi, but people often guess that it is Urdu, we can infer that listeners find these two languages highly similar and easy to confuse. Language confusability is a difficult problem for language identification technology to tackle, so it is important to collect accurate and reliable language data. In the future, NameThatLanguage plans to add more audio clips from different languages.
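As a rough illustration of how such confusions could be tallied, the sketch below counts how often each language is mistaken for another. The function name `confusion_counts` and the sample responses are hypothetical, and the approach (simple pair counting) is an assumption rather than a description of how NameThatLanguage processes its data.

```python
from collections import Counter

def confusion_counts(responses):
    """Count (actual, guessed) language pairs where the guess was wrong.

    responses: iterable of (actual_language, guessed_language) pairs.
    """
    return Counter(
        (actual, guess) for actual, guess in responses if actual != guess
    )

# Made-up responses, for illustration only.
responses = [
    ("Hindi", "Urdu"), ("Hindi", "Hindi"), ("Hindi", "Urdu"),
    ("Urdu", "Hindi"), ("Tagalog", "Tagalog"), ("Hindi", "Bengali"),
]

confusions = confusion_counts(responses)
most_common_pair, count = confusions.most_common(1)[0]
print(f"Most frequent confusion: {most_common_pair[0]} heard as "
      f"{most_common_pair[1]} ({count} times)")
```

Pairs with high counts in both directions, such as Hindi heard as Urdu and Urdu heard as Hindi, would be the strongest candidates for confusable languages.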
So far, 83,991 unique userIDs (a player can have more than one userID) have participated in more than 862,608 NameThatLanguage tasks. Of these, 85% have yielded judgments that can be used for language identification. From this data, the NTL Language Recognition Corpus, containing audio files and participants’ judgments on those audio files, will be released. In the future, this corpus can be used to develop language recognition technology for applications such as automatic translation, search engines, and AI.
Fiumara, J., Cieri, C., Liberman, M., Callison-Burch, C., Wright, J., & Parker, R. (2022).
The NIEUW Project: Developing Language Resources Through Novel Incentives.
Linguistic Data Consortium.