Information about nearly any topic can be found on Wikipedia, the popular user-edited online encyclopedia. But the extent and completeness of each entry depend on which of Wikipedia’s 272 language-specific sites is used for the search.
Because its articles can be altered by anyone with an Internet connection, critics accuse Wikipedia of systemic bias and inconsistencies.
These inconsistencies and ever-evolving entries are exactly what Elena Filatova, Ph.D., is using as fodder for new research. An assistant professor of computer science, Filatova is investigating how Wikipedia’s multilinguality can be used to detect opinions, contradictions and new information.
Filatova is an expert in natural language processing (NLP), which combines computer science and linguistics techniques to study how computers handle human languages. For 50 years, NLP researchers have been testing computers’ ability to translate languages by using parallel documents, such as European Parliament proceedings, in which a text in one language is translated into several other languages with the goal of preserving the information in the source document.
Recently, researchers have begun to study parallel texts in which each version is not a direct translation, such as the different language-based Wikipedia sites.
“These entries are not identical; they can vary greatly in terms of description length and information choice,” Filatova said. “Keeping these peculiarities in mind is necessary while using multilingual Wikipedia for training and testing NLP applications.”
By using the translation feature on the Google search engine, Filatova compared entries of 50 well-known people on the English version of Wikipedia with entries written in other languages.
“The assumption was whatever was repeated in many languages was important and whatever was only mentioned in one language would be less important,” Filatova said. “That isn’t quite accurate.”
Take Babe Ruth. Wikipedia’s English site features a long article with plenty of photos and information about Ruth’s early years; his joining the Baltimore Orioles as a teenager; his Major League career, including his time with the Boston Red Sox; his emergence as a hitter; his sale to the New York Yankees; and his record with the Yankees from 1920 to 1925. There is no dearth of information on “the Bambino.”
The Italian entry contains a moderate amount of information about Ruth, including his nickname: “George Herman Ruth detto ‘Babe’ (Baltimora, 6 febbraio 1895 – New York, 16 agosto 1948) conosciuto anche con il soprannome ‘Il bambino.’” The Finnish version, however, consists of only a few two-sentence paragraphs.
“Basically, it shows that the Finnish-language community is not so interested in Ruth, and that Wikipedia articles are not direct translations,” Filatova said. “But if they are not translations, do the volunteers who collaboratively edit the articles add something else?”
She found that Wikipedia contributors do indeed bring something to the articles—bias.
According to the English Wikipedia, the Battle of Borodino was fought on September 7, 1812, and was the largest and bloodiest single-day action of the French invasion of Russia, involving more than 250,000 troops and resulting in at least 70,000 casualties. The English entry indicates that the Russian army retreated and names the French as victors.
Not according to the Italian version, which names Russia as the victor. Meanwhile, the Russian Wikipedia fails to indicate which army prevailed.
“Those who contribute to Wikipedia generally tend to be neutral, but by using certain words, they inevitably inflict opinion,” Filatova said. “In the Italian entry for this battle, they include a different set of participants, such as the Kingdom of Naples. For some reason, it was important for Italian contributors to include that.”
Filatova also found that even though the English Wikipedia is by far the largest—with 3.2 million entries—it may not be the most comprehensive.
“For example, the International Center of Photography in New York has an exhibition devoted to the work of the reclusive and mysterious Czech photographer Miroslav Tichy,” Filatova said. “After visiting this exhibition, I wanted to find more information about this artist and, interestingly, English Wikipedia does not have an article about him.”
Yet the Czech, German, Polish and French Wikipedia sites do.
“This could be because the English-speaking community does not have enough interest in Tichy, or perhaps people from the English-speaking community who are interested in Tichy are not Wikipedia contributors,” Filatova said.
“Overall, different communities speaking different languages have different interests, opinions and world perception. In my research so far, I’ve dealt with identifying similarities in the information reported in Wikipedia by communities speaking different languages,” Filatova said.
“In the future, I plan to work on identifying differences in the information reported by communities speaking different languages.”