The “digital divide” between those who can afford an Internet connection and those who can’t is sprouting an evil twin: a “language divide.”
The roots of the Internet lie in U.S. military and university research projects, conducted in English. That language is still preferred online for international commerce and science.
But the scene is shifting rapidly. Tens of millions of new Internet users do not speak or read English and seek content in their own languages. China alone has 400 million people online, more than the entire U.S. population, and the vast majority read only Chinese.
“There’s a Chinese Internet that we don’t interact with very much. There’s an Arabic Internet that we don’t interact with very much,” says Ethan Zuckerman, co-founder of Global Voices, a community of more than 300 bloggers and translators around the world who seek out voices that are not ordinarily heard in the mainstream news media.
Those who can read only in their native language are “missing this extraordinary opportunity to get much, much better at understanding what people around the world are thinking and saying and feeling,” he says.
In May, the organization that regulates Internet domain names made history when it permitted Egypt, Saudi Arabia, and the United Arab Emirates to display their web addresses (those ending in .eg for Egypt, .sa for Saudi Arabia, and .ae for the UAE) in native Arabic script rather than Latin characters. Many other countries, including China, are expected to follow suit.
We’re “way beyond” English as the language of the Internet, contends Mr. Zuckerman, whose Global Voices website is translated into 15 languages by more than 200 volunteers. “The Internet is for everybody these days.”
The web giant Google sees a commercial opportunity in making more of the Web readable, whether it’s helping an English speaker read a web page in Urdu or a Basque speaker read an English-only website. Google Translate (translate.google.com) now offers quick, computer-generated translations between 57 languages, including Urdu (spoken by 60 million to 90 million people in parts of India and Pakistan) and Basque (with more than 600,000 speakers in Spain and France).
Google’s Web browser, Chrome, sports a toolbar that offers to translate any Web page into a user’s own language.
“The last few years there’s been a blossoming of languages on Google Translate,” says company spokesman Nate Tyler. “Our goal is to make it as good as it can be. At this point, it is not as good as a human translator. It’s hard to know when it can ever be.”
Traditionally, efforts to undertake computerized translation centered on devising rules that the computer would follow, such as “If you see this word or phrase, it means this word or phrase in the other language.”
But all sorts of problems creep in, many of which demand customized solutions. If a headline says “Clemson Tigers beat Georgia Bulldogs,” for example, does it mean one sports team “actually physically beat” the other? asks Prem Natarajan, vice president of speech and language technology at Raytheon BBN Technologies in Cambridge, Mass. (For that matter, are we talking about real “tigers” and “bulldogs” or human athletes?)
What’s more, casual conversations can quickly leap from topic to topic.
“Rule-based systems don’t have a chance,” says Mr. Natarajan, whose company monitors and translates written and verbal communications for the U.S. military, among other clients. When a rules-based system breaks, he says, “It breaks spectacularly.”
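To make the failure concrete, here is a toy sketch of a rule-based translator, written in Python with Spanish as the target language. The rule table and the function name are invented for illustration and do not come from any system mentioned in this article. Each English word gets exactly one fixed translation, so “beat” always comes out as the physical sense of the word, and words with no rule pass through untouched.

```python
# Hypothetical hand-written rules: one fixed Spanish word per English word.
RULES = {
    "the": "los",
    "tigers": "tigres",
    "beat": "golpearon",  # the physical sense of "beat" (to strike)
    "bulldogs": "bulldogs",
}


def rule_based_translate(sentence: str) -> str:
    """Translate word by word; words with no rule pass through unchanged."""
    words = sentence.split()
    return " ".join(RULES.get(word.lower(), word) for word in words)


headline = "Clemson Tigers beat Georgia Bulldogs"
print(rule_based_translate(headline))
```

The rule has no way to know that a sports headline calls for the “defeated” sense of the word (“vencieron” in Spanish). Covering that case means writing another rule, and then another for the next exception.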
Google has bypassed the rules concept by concentrating on comparing huge numbers of translated documents that already exist online, such as those created by the United Nations or European Parliament.
Instead of creating rules, Google uses statistical analysis to track where words or groups of words appear in each translation and how they relate to one another. The program doesn’t need to understand the meaning of the words to produce an often highly accurate translation.
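The statistical idea can be illustrated with a very small sketch. This is not Google’s actual method, which the article does not detail. It is a simplified stand-in that counts how often an English word and a French word turn up in the same pair of already-translated sentences, then scores each pairing with a standard overlap measure (the Dice coefficient). The four sentence pairs and all function names are made up for the example.

```python
from collections import Counter
from itertools import product

# Hypothetical parallel corpus: (English sentence, existing French translation).
PARALLEL = [
    ("the house", "la maison"),
    ("the car", "la voiture"),
    ("a house", "une maison"),
    ("a car", "une voiture"),
]


def train(pairs):
    """Count how often each word, and each cross-language word pair, appears."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word "co-occurs" with every target word in the pair.
        joint.update(product(src_words, tgt_words))
    return src_count, tgt_count, joint


def best_match(word, model):
    """Return the target word with the highest Dice score, or None if unseen."""
    src_count, tgt_count, joint = model
    scores = {
        tgt: 2 * count / (src_count[src] + tgt_count[tgt])
        for (src, tgt), count in joint.items()
        if src == word
    }
    return max(scores, key=scores.get) if scores else None


model = train(PARALLEL)
for english in ("house", "car", "the", "a"):
    print(english, "->", best_match(english, model))
```

Nothing in the program knows what a house is. It simply notices that “house” and “maison” keep appearing together, which is also why the approach weakens when few translated texts exist for a language.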
“Their great strength is that they have enormous quantities of data” to analyze, Zuckerman says. Google works best with major European languages, he says. “I can read a lot of French newspapers with Google Translate and have them read quite comfortably.”
On the other hand, languages that have fewer texts online for Google’s algorithm to chew on can be problematic. For Asian and Middle Eastern translations, “what you’re really getting is guesswork, at best,” Zuckerman says.
He often researches Vietnamese websites. “It appears that [Google’s] entire [Vietnamese] dictionary comes from translating menus,” he says. When he’s trying to read an article about Vietnamese politics, the English translation ends up talking about “spring rolls” and “fried rice,” he says.
Raytheon BBN and others are trying to combine the best of rules-based and statistical translation by “converting these rules into statistical information and incorporating them,” Natarajan says.
Despite its limitations, computer translation has already begun migrating to cellphones.
Google Goggles, a free mobile application for phones with the Android operating system, can analyze photos taken by cellphone cameras. Goggles can read a bar code, for example, to help you compare a price online. It can identify famous landmarks, more than 100,000 works of art, and books by their covers.
The program can also perform simple language translations, at least on short texts, says Hartmut Neven, who heads Goggles. Snap a photo of a road sign in Chinese or a menu item in French, and Goggles first figures out the words being represented, using optical recognition software. Then it uses Google Translate to send back a translation.
“You get a very good gist [of the translation], and sometimes it’s right on,” Mr. Neven says.
Raytheon BBN is working on hand-held translators, too, with military and humanitarian uses in mind, at least at first.
It’s now testing mobile phones with voice recognition and translation programming installed. The first two languages are Pashto and Dari, both spoken in Afghanistan.
A soldier or relief worker would hand a cellphone to an Afghan. The American would speak into the phone in English, and the Afghan would hear a translation in Pashto or Dari. The Afghan could then reply the same way.
Raytheon BBN’s approach differs from Google’s in that the voice recognition and translation take place within the phones themselves and don’t require an Internet connection. Networking with the Internet “is good to use if it is available, but not something to rely on,” especially in remote areas, Natarajan says.
Speech translation packages might appear on consumer phones within two or three years, Natarajan says. But he cautions that they will have limited and specific uses, such as asking for directions.
The advent of a mobile “universal translator,” so often depicted in science fiction stories, is still a long way off, these experts agree.
“You’re asking two of the hardest questions we have in computer science” — turning voice into text and then translating that text into another language, Zuckerman says. “You’re combining two really hard things to do.”