Mathematics aids language between cultures

A new method compares texts with greater precision than conventional methods. The innovation can improve the quality of machine translation, record matching in databases, the speed of information retrieval in web searches, and voice recognition.

Soft cardinality consists of counting the non-repeated elements of a set, which helps uncover semantic similarity in sets or large volumes of data. It is the result of a research project carried out by Sergio Gonzalo Jiménez Vargas, an IT Ph.D. from Universidad Nacional de Colombia (UNal).

Natural language processing is a set of techniques that allow machines to “understand” and process human language. Its purpose is to facilitate communication between humans and machines, something unheard of many years ago. In fact, there are evident signs of progress in automatic translation and in voice recognition, as seen in the personal assistant apps available for cell phones.

“In some way, information technology is not available to everybody due to interface issues. When displays, keyboards or mice are no longer required, it will be accessible to everybody, including the elderly and people with a visual or auditory impairment. Being able to really speak with a machine will take IT to the next level,” said Jiménez.

A problem, a solution

“My research project grew out of a problem at the Universidad Nacional when the ‘Address’ field in the registration database was opened. The difficulty for databases is that the same word may be written in many ways. For instance, there are 100 ways to write the word Bogotá: accompanied by Santa Fe, with or without an accent mark (tilde, in Spanish), using Distrito Capital, etc. A rigid comparison demands an exact match, while a soft comparison is more flexible. The cardinality method worked to solve this issue.”

Faced with the problem of repeated data, the researchers compared texts using cardinality; in other words, they counted how many elements the sets have in common, even when those elements are not exact matches.
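The idea of counting elements that are similar but not identical can be sketched in a few lines of Python. This is only an illustration, not the researchers' implementation: it assumes the commonly cited formulation in which each element contributes 1 divided by its total similarity to every element of the set, and it uses the standard library's `difflib.SequenceMatcher` as a stand-in similarity function. The function names, the exponent `p` and the sample addresses are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1]; identical strings score 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def soft_cardinality(items, sim=similarity, p=1.0):
    """Assumed formulation: sum over elements of 1 / (sum of similarities).

    Exact duplicates are removed first, so the result lies between 1 and
    the classical cardinality for any non-empty collection.
    """
    elements = sorted(set(items))
    return sum(1.0 / sum(sim(a, b) ** p for b in elements) for a in elements)


def soft_overlap(items_a, items_b, sim=similarity, p=1.0):
    """Soft count of shared elements: |A| + |B| - |A union B|."""
    union = list(items_a) + list(items_b)
    return (soft_cardinality(items_a, sim, p)
            + soft_cardinality(items_b, sim, p)
            - soft_cardinality(union, sim, p))


# Hypothetical spellings of the same city plus one different city.
addresses = ["Bogotá", "Bogota", "Bogotá D.C.", "Medellín"]
print(len(set(addresses)))          # rigid count: 4 distinct strings
print(soft_cardinality(addresses))  # soft count: lower, near-duplicates merge

# A rigid intersection finds nothing in common; the soft one does.
print(set(["Bogotá"]) & set(["Bogota"]))     # set()
print(soft_overlap(["Bogotá"], ["Bogota"]))  # greater than 0
```

With a rigid comparison the four spellings count as four different values, while the soft count falls below four because the variants of Bogotá partly overlap.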

Along with engineers and research directors Alexander Gelbukh (Mexico) and Fabio González (Colombia), Jiménez observed that the conventional comparison of sets does not take similar elements into account. Serendipity (discovery by chance) played a part here, because they observed that the soft cardinality method did not satisfy the monotonicity property; in other words, what they had discovered was a measure of diversity, not only a method of counting.
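A small numerical sketch may help show what failing monotonicity means. Monotonicity would require that adding an element never lowers the count. Under the assumed formulation (each element contributes 1 divided by its total similarity to the set), adding an element that resembles two otherwise unrelated ones can lower the total. The similarity scores below are invented purely for illustration.

```python
def soft_cardinality(elements, sim, p=1.0):
    """Assumed formulation: sum over elements of 1 / (sum of similarities)."""
    return sum(1.0 / sum(sim(a, b) ** p for b in elements) for a in elements)


# Hypothetical pairwise similarities: x and y are unrelated,
# while z is half-similar to each of them.
SCORES = {
    frozenset(("x", "y")): 0.0,
    frozenset(("x", "z")): 0.5,
    frozenset(("y", "z")): 0.5,
}


def table_sim(a, b):
    """Look up a similarity score; an element is fully similar to itself."""
    return 1.0 if a == b else SCORES[frozenset((a, b))]


before = soft_cardinality(["x", "y"], table_sim)      # two unrelated elements
after = soft_cardinality(["x", "y", "z"], table_sim)  # add the in-between one

print(before)  # 2.0
print(after)   # smaller than 2.0, although the set grew
```

The larger set scores lower because the new element makes the collection look less diverse, which is the behavior of a diversity measure instead of a plain count.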

According to Jiménez, while they were researching cardinality he came across an article in Ecology (2014) in which a group of researchers from the University of Glasgow proposed a model similar to soft cardinality, applied to similarities between species in order to compare them across ecosystems.

“At that moment I panicked, because if you are in the middle of your doctoral program you have to propose something new. I thought that my method wasn’t new, but I found out that the article had been written after the first articles on soft cardinality. This turned into a benefit for my project, which received a merit award,” said Jiménez.

Professor Jiménez has worked with linguists, who provided valuable support in improving his cardinality method. Along with other students, Jiménez took part in text-similarity challenges and worked with the Swadesh list, a basic vocabulary of common words in different languages; for instance hand, mano (Spanish), main (French). The original list, proposed by Morris Swadesh, an American-Mexican linguist, included 200 terms, which he compiled in the 1940s and 1950s for the purpose of comparing languages.

Finally, the researcher realized that there is “an identical that is more identical than another identical.” One way to explain this is the famous Charles Chaplin imitator contest in which Chaplin himself took part, but the judges considered that another contestant was a more identical Chaplin than the original.

Read the complete project (in Spanish) here

Listen to the researcher in a UN Radio program (in Spanish)

