*** START OF THE PROJECT GUTENBERG EBOOK 30422 ***
THE INTERNET AND LANGUAGES
[around the year 2000]
MARIE LEBERT
NEF, University of Toronto, 2009
Copyright © 2009 Marie Lebert. All rights reserved.
TABLE
Introduction
"Language nations" online
Towards a "linguistic democracy"
Encoding: from ASCII to Unicode
First multilingual projects
Online language dictionaries
Learning languages online
Minority languages on the web
Multilingual encyclopedias
Localization and internationalization
Machine translation
Chronology
Websites
INTRODUCTION
It is true that the internet transcends the limitations of time,
distances and borders, but what about languages? Non-English-speaking
internet users reached 50% in July 2000.
# "Language Nations"
"Because the internet has no national boundaries, the organization of
users is bounded by other criteria driven by the medium itself. In
terms of multilingualism, you have virtual communities, for example, of
what I call 'Language Nations'... all those people on the internet
wherever they may be, for whom a given language is their native
language. Thus, the Spanish Language nation includes not only Spanish
and Latin American users, but millions of Hispanic users in the U.S.,
as well as odd places like Spanish-speaking Morocco." (Randy Hobler,
consultant in internet marketing for translation products and services,
September 1998)
# "Linguistic Democracy"
"Whereas 'mother-tongue education' was deemed a human right for every
child in the world by a UNESCO report in the early 1950s, 'mother-
tongue surfing' may very well be the Information Age equivalent. If the
internet is to truly become the Global Network that it is promoted as
being, then all users, regardless of language background, should have
access to it. To keep the internet as the preserve of those who, by
historical accident, practical necessity, or political privilege,
happen to know English, is unfair to those who don't." (Brian King,
director of the WorldWide Language Institute, September 1998)
# A medium for the world
"It is very important to be able to communicate in various languages. I
would even say this is mandatory, because the information given on the
internet is meant for the whole world, so why wouldn't we get this
information in our language or in the language we wish? Worldwide
information, but no broad choice for languages, this would be quite a
contradiction, wouldn't it?" (Maria Victoria Marinetti, teacher in
Spanish and translator, August 1999)
# Good software
"When software gets good enough for people to chat or talk on the web
in real time in different languages, then we will see a whole new world
appear before us. Scientists, political activists, businesses and many
more groups will be able to communicate immediately without having to
go through mediators or translators." (Tim McKenna, writer and
philosopher, October 2000)
***
Unless specified otherwise, quotations are excerpts from NEF
interviews. Many thanks to all those who are quoted in this book, and
who kindly answered questions about multilingualism over the years.
Most interviews are available online. This book is also available in
French, with a different text. Both versions are available online. The
author,
whose mother tongue is French, is responsible for any remaining
mistakes in English.
Marie Lebert is a researcher and editor specializing in technology for
books, other media, and languages. Her books are published by NEF (Net
des études françaises / Net of French Studies), University of Toronto,
Canada, and are freely available online.
"LANGUAGE NATIONS" ONLINE
= [Quote]
Randy Hobler, a consultant in internet marketing for Globalink, a
company specializing in language translation software and services,
wrote in September 1998: "Because the internet has no national
boundaries, the organization of users is bounded by other criteria
driven by the medium itself. In terms of multilingualism, you have
virtual communities, for example, of what I call 'Language Nations'...
all those people on the internet wherever they may be, for whom a given
language is their native language. Thus, the Spanish Language nation
includes not only Spanish and Latin American users, but millions of
Hispanic users in the U.S., as well as odd places like Spanish-speaking
Morocco."
= [Text]
At first, the internet was nearly 100% English. A network was set up by
the Pentagon in 1969, before spreading to U.S. governmental agencies
and universities from 1974 onwards, after Vinton Cerf and Bob Kahn
invented TCP/IP (transmission control protocol / internet protocol).
After the creation of the World Wide Web in 1989-90 by Tim Berners-Lee
at the European Laboratory for Particle Physics (CERN) in Geneva,
Switzerland, and the distribution of the first browser Mosaic, the
ancestor of Netscape, from November 1993 onwards, the internet really
took off, first in the U.S. and Canada, then worldwide.
Why did the internet spread in North America first? The U.S. and Canada
were leading the way in computer science and communication technology,
and a connection to the internet, mainly through a phone line at the
time, was much cheaper than in most countries. In Europe, avid internet
users needed to navigate the web at night, when phone rates by the
minute were cheaper, to cut their expenses. In 1998, some French,
Italian and German users were so fed up with the high rates that they
launched a movement to boycott the internet one day per week, to press
internet providers and phone companies into setting up a special
monthly rate for them. This paid off, and providers began to offer
monthly "internet rates".
In the 1990s, the percentage of English decreased from nearly 100% to
80%. People from all over the world began to have access to the
internet, and to post more and more webpages in their own languages.
The first major study about language distribution on the web was run by
Babel, a joint initiative from Alis Technologies, a company
specializing in language translation services, and the Internet
Society. The results were published in June 1997 on a webpage named
"Web Languages Hit Parade". The main languages were English with 82.3%,
German with 4.0%, Japanese with 1.6%, French with 1.5%, Spanish with
1.1%, Swedish with 1.1%, and Italian with 1.0%.
In "Web Embraces Language Translation", an article published in ZDNN
(ZDNetwork News) on 21 July 1998, Martha L. Stone explained: "This
year, the number of new non-English websites is expected to outpace the
growth of new sites in English, as the cyber world truly becomes a
'World Wide Web'."
According to Global Reach, a branch of Euro-Marketing Associates, an
international marketing consultancy, there were 56 million non-English-
speaking users in July 1998, with 22.4% Spanish-speaking users, 12.3%
Japanese-speaking users, 14% German-speaking users, and 10% French-
speaking users. But 80% of all webpages were still in English, whereas
only 6% of the world population was speaking English as a native
language, while 16% was speaking Spanish as a native language. 15% of
Europe's half a billion population spoke English as a first language,
28% didn't speak English at all, and 32% were using the web in English.
Jean-Pierre Cloutier was the editor of "Chroniques de Cybérie", a
weekly French-language online report of internet news. He wrote in
August 1999: "We passed a milestone this summer. Now more than half the
users of the internet live outside the United States. Next year, more
than half of all users will be non English-speaking, compared with only
5% five years ago. Isn't that great? (...) The web is going to grow in
non-English-speaking regions. So we have to take into account the
technical aspects of the medium if we want to reach these 'new' users.
I think it is a pity there are so few translations of important
documents and essays published on the web - from English into other
languages and vice versa. (...) In the same way, the recent spreading
of the internet in new regions raises questions which would be good to
read about. When will Spanish-speaking communication theorists and
those speaking other languages be translated?"
Will the web hold as many languages as the ones spoken on our planet?
This will be quite a challenge, with the 6,700 languages listed in "The
Ethnologue: Languages of the World", an authoritative catalog published
by SIL International (SIL: Summer Institute of Linguistics) and freely
available on the web since the mid-1990s.
The year 2000 was a turning point for a multilingual internet as far as
its users were concerned. Non-English-speaking users reached 50% in
summer 2000. According to Global Reach, they accounted for 52.5% in
summer 2001, 57% in
December 2001, 59.8% in April 2002, 64.4% in September 2003 (including
34.9% non-English-speaking Europeans and 29.4% Asians), and 64.2% in
March 2004 (including 37.9% non-English-speaking Europeans and 33%
Asians).
Despite the so-called English-language hegemony some non-English-
speaking intellectuals were complaining about, without doing much to
promote their own language, the internet was also a good medium for
minority languages, as stated by Caoimhín Ó Donnaíle. Caoimhín has
taught computing at the Institute Sabhal Mór Ostaig, on the Isle of
Skye (Scotland). He has also created and maintained the college
website, the main site worldwide for information on Scottish Gaelic,
which includes a bilingual (English, Gaelic) list of European minority
languages. He wrote in May 2001: "Students do everything by computer,
use Gaelic spell-checking, a Gaelic online terminology database. There
are more hits on our website. There is more use of sound. Gaelic radio
(both Scottish and Irish) is now available continuously worldwide via
the internet. A major project has been the translation of the Opera
web-browser into Gaelic - the first software of this size available in
Gaelic."
TOWARDS A "LINGUISTIC DEMOCRACY"
= [Quote]
Brian King, director of the WorldWide Language Institute (WWLI),
brought up the concept of "linguistic democracy" in September 1998:
"Whereas 'mother-tongue education' was deemed a human right for every
child in the world by a UNESCO report in the early 1950s, 'mother-
tongue surfing' may very well be the Information Age equivalent. If the
internet is to truly become the Global Network that it is promoted as
being, then all users, regardless of language background, should have
access to it. To keep the internet as the preserve of those who, by
historical accident, practical necessity, or political privilege,
happen to know English, is unfair to those who don't."
= [Text]
In December 1995, Yoshi Mikami, a computer scientist at Asia Info
Network in Fujisawa (Japan), launched the website "The Languages of the
World by Computers and the Internet", also known as the Logos Home Page
or Kotoba Home Page. (The website was updated until September 2001.)
Yoshi was also the co-author (with Kenji Sekine and Nobutoshi Kohara)
of "The Multilingual Web Guide" (Japanese edition), a print book
published by O'Reilly Japan in August 1997, and translated in 1998 into
English, French and German.
Yoshi Mikami explained in December 1998: "My native tongue is Japanese.
Because I had my graduate education in the U.S. and worked in the
computer business, I became bilingual in Japanese and American English.
I was always interested in languages and different cultures, so I
learned some Russian, French and Chinese along the way. In late 1995, I
created on the web 'The Languages of the World by Computers and the
Internet' and tried to summarize there the brief history, linguistic
and phonetic features, writing system and computer processing aspects
for each of the six major languages of the world, in English and
Japanese. As I gained more experience, I invited my two associates to
help me write a book on viewing, understanding and creating
multilingual webpages, which was published in August 1997 as 'The
Multilingual Web Guide', in a Japanese edition, the world's first book
on such a subject."
Yoshi added in the same email interview: "Thousands of years ago, in
Egypt, China and elsewhere, people were more concerned about
communicating their laws and thoughts not in just one language, but in
several. In our modern world, most nation states have each adopted one
language for their own use. I predict greater use of different
languages and multilingual pages on the internet, not a simple
gravitation to American English, and also more creative use of
multilingual computer translation. 99% of the websites created in Japan
are written in Japanese."
Robert Ware launched his website OneLook Dictionaries in April 1996 as
a "fast finder" in hundreds of online dictionaries. On September 2,
1998, the fast finder could "browse" 2,058,544 words in 425
dictionaries covering various topics: business, computer/internet,
medical, miscellaneous, religion, science, sports, technology, general,
and slang. OneLook Dictionaries was provided as a free service by the
company Study Technologies, in Englewood, Colorado.
Robert Ware explained in September 1998: "On the personal side, I was
almost entirely in contact with people who spoke one language and did
not have much incentive to expand language abilities. Being in contact
with the entire world has a way of changing that. And changing it for
the better! (...) I have been slow to start including non-English
dictionaries (partly because I am monolingual). But you will now find a
few included."
In the same email interview, Robert wrote about a personal experience
showing the internet could promote both a common language and
multilingualism: "In 1994, I was working for a college and trying to
install a software package on a particular type of computer. I located
a person who was working on the same problem and we began exchanging
email. Suddenly, it hit me... the software was written only 30 miles
away but I was getting help from a person half way around the world.
Distance and geography no longer mattered! OK, this is great! But what
is it leading to? I am only able to communicate in English but,
fortunately, the other person could use English as well as German which
was his mother tongue. The internet has removed one barrier (distance)
but with that comes the barrier of language. It seems that the internet
is moving people in two quite different directions at the same time.
The internet (initially based on English) is connecting people all
around the world. This is further promoting a common language for
people to use for communication. But it is also creating contact
between people of different languages and creates a greater interest in
multilingualism. A common language is great but in no way replaces this
need. So the internet promotes both a common language *and*
multilingualism. The good news is that it helps provide solutions. The
increased interest and need is creating incentives for people around
the world to create improved language courses and other assistance, and
the internet is providing fast and inexpensive opportunities to make
them available."
The internet could also be a tool to develop a "cultural identity".
During the Symposium on Multimedia Convergence organized by the
International Labor Office (ILO) in January 1997, Shinji Matsumoto,
general secretary of the Musicians' Union of Japan (MUJ), explained:
"Japan is quite receptive to foreign culture and foreign technology.
(...) Foreign culture is pouring into Japan and, in fact, the domestic
market is being dominated by foreign products. Despite this, when it
comes to preserving and further developing Japanese culture, there has
been insufficient support from the government. (...) With the
development of information networks, the earth is getting smaller and
it is wonderful to be able to make cultural exchanges across vast
distances and to deepen mutual understanding among people. We have to
remember to respect national cultures and social systems."
December 1997 was a turning point for a plurilingual web. AltaVista, a
leading search engine, was the first website to launch free
translation software, called Babel Fish (or AltaVista Translation),
which could translate up to three pages from English into French,
German, Italian, Portuguese or Spanish, and vice versa. Non-English-
speaking users were thrilled. The software was developed by Systran, a
pioneer company specializing in machine translation. Later on, other
translation software was developed by Alis Technologies, Globalink,
Lernout & Hauspie, Softissimo, Wordfast and Trados, with free and/or
paid versions available on the web.
Brian King, director of the WorldWide Language Institute (WWLI),
brought up the concept of "linguistic democracy" in September 1998:
"Whereas 'mother-tongue education' was deemed a human right for every
child in the world by a UNESCO report in the early 1950s, 'mother-
tongue surfing' may very well be the Information Age equivalent. If the
internet is to truly become the Global Network that it is promoted as
being, then all users, regardless of language background, should have
access to it. To keep the internet as the preserve of those who, by
historical accident, practical necessity, or political privilege,
happen to know English, is unfair to those who don't."
Geoffrey Kingscott was the managing director of Praetorius, a language
consultancy in applied languages. He wrote in September 1998: "Because
the salient characteristics of the web are the multiplicity of site
generators and the cheapness of message generation, as the web matures
it will in fact promote multilingualism. The fact that the web
originated in the USA means that it is still predominantly in English
but this is only a temporary phenomenon. If I may explain this further,
when we relied on the print and audiovisual (film, television, radio,
video, cassettes) media, we had to depend on the information or
entertainment we wanted to receive being brought to us by agents
(publishers, television and radio stations, cassette and video
producers) who have to subsist in a commercial world or -- as in the
case of public service broadcasting -- under severe budgetary
restraints. That means that the size of the customer-base is all-
important, and determines the degree to which languages other than the
ubiquitous English can be accommodated. These constraints disappear
with the web. To give only a minor example from our own experience, we
publish the print version of Language Today [a magazine for linguists,
published by Praetorius] only in English, the common denominator of our
readers. When we use an article which was originally in a language
other than English, or report an interview which was conducted in a
language other than English, we translate into English and publish only
the English version. This is because the number of pages we can print
is constrained, governed by our customer-base (advertisers and
subscribers). But for our web edition we also give the original
version."
Founder of Euro-Marketing Associates and its virtual branch Global
Reach, Bill Dunlap was championing the assets of e-commerce in Europe
among his fellow compatriots in the U.S. Bill wrote in December 1998:
"There are so few people in the U.S. interested in communicating in
many languages -- most Americans are still under the delusion that the
rest of the world speaks English. However, here in Europe (I'm writing
from France), the countries are small enough so that an international
perspective has been necessary for centuries."
As the internet quickly spread worldwide, more and more people in the
U.S. realized that, although English may stay the main international
language for exchanges of all kinds, people did prefer to read
information in their own language. To reach as large an audience as
possible, companies and organizations needed to offer bilingual,
trilingual, even multilingual websites, while adapting their content to
a given audience. Hence the need for both localization and
internationalization, which became a major trend in the following
years, not only in the U.S. but in many countries, with companies
setting up bilingual websites, in their language and in English, to
reach a wider audience, and get more clients.
Brian King, director of the WorldWide Language Institute (WWLI),
explained in September 1998: "As well as the appropriate technology
being available so that the non-English speaker can go, there is the
impact of 'electronic commerce' as a major force that may make
multilingualism the most natural path for cyberspace. A pull from non-
English-speaking computer users and a push from technology companies
competing for global markets has made localization a fast growing area
in software and hardware development."
In 1998, the European Network in Language and Speech (ELSNET) was a
network of more than 100 European academic and industrial institutions.
ELSNET members intended to build multilingual speech and natural
language systems with coverage of both spoken and written language.
Steven Krauwer, coordinator of ELSNET, explained in September 1998: "As
a European citizen I think that multilingualism on the web is
absolutely essential, as in the long run I don't think that it is a
healthy situation when only those who have a reasonable command of
English can fully exploit the benefits of the web. As a researcher
(specialized in machine translation) I see multilingualism as a major
challenge: how can we ensure that all information on the web is
accessible to everybody, irrespective of language differences."
Steven added in August 1999: "I've become more and more convinced we
should be careful not to address the multilinguality problem in
isolation. I've just returned from a wonderful summer vacation in
France, and even if my knowledge of French is modest (to put it
mildly), it's surprising to see that I still manage to communicate
successfully by combining my poor French with gestures, facial
expressions, visual clues and diagrams. I think the web (as opposed to
old-fashioned text-only email) offers excellent opportunities to
exploit the fact that transmission of information via different
channels (or modalities) can still work, even if the process is only
partially successful for each of the channels in isolation."
What practical solutions would he suggest for a truly multilingual web?
"At the author end: better education of web authors to use combinations
of modalities to make communication more effective across language
barriers (and not just for cosmetic reasons). At the server end: more
translation facilities à la AltaVista (quality not impressive, but
always better than nothing). At the browser end: more integrated
translation facilities (especially for the smaller languages), and more
quick integrated dictionary lookup facilities."
Linguistic pluralism and diversity are everybody's business, as
explained in a petition launched by the European Committee for the
Respect of Cultures and Languages in Europe (ECRCLE) "for a humanist
and multilingual Europe, rich of its cultural diversity": "Linguistic
pluralism and diversity are not obstacles to the free circulation of
men, ideas, goods and services, as would like to suggest some objective
allies, consciously or not, of the dominant language and culture.
Indeed, standardization and hegemony are the obstacles to the free
blossoming of individuals, societies and the information economy, the
main source of tomorrow's jobs. On the contrary, the respect for
languages is the last hope for Europe to get closer to the citizens, an
objective always claimed and almost never put into practice. The Union
must therefore give up privileging the language of one group." The full
text of the petition was available in the eleven official languages of
the European Union. Among other things, the petition asked the revisers
of the Treaty of the European Union to include respect for national
cultures and languages in the text of the treaty, and the national
governments to "teach the youth at least two, and preferably three
foreign European languages; encourage the national audiovisual and
musical industries; and favour the diffusion of European works."
Henk Slettenhaar is a professor in communication technology at Webster
University in Geneva, Switzerland. Henk is a trilingual European. He is
Dutch, he teaches computer science in English, and he is fluent in
French as a resident in neighboring France. He has regularly insisted
on the need for bilingual websites, in the original language and in
English. He wrote in December 1998: "I see multilingualism as a very
important issue. Local communities which are on the web should use the
local language first and foremost for their information. If they want
to be able to present their information to the world community as well,
their information should be in English as well. I see a real need for
bilingual websites. (...) As far as languages are concerned, I am
delighted that there are so many offerings in the original languages
now. I much prefer to read the original with difficulty than to get a
bad translation."
Henk added in August 1999: "There are two main categories of websites
in my opinion. The first one is the global outreach for business and
information. Here the language is definitely English first, with local
versions where appropriate. The second one is local information of all
kinds in the most remote places. If the information is meant for people
of an ethnic and/or language group, it should be in that language
first, with perhaps a summary in English. We have seen lately how
important these local websites are -- in Kosovo and Turkey, to mention
just the most recent ones. People were able to get information about
their relatives through these sites."
Marcel Grangier was the head of the French Section of the Swiss Federal
Government's Central Linguistic Services, which means he was in charge
of organizing translations into French for the Swiss government. He
wrote in January 1999: "We can see multilingualism on the internet as a
happy and irreversible inevitability. So we have to laugh at the
doomsayers who only complain about the supremacy of English. Such
supremacy is not wrong in itself, because it is mainly based on
statistics (more PCs per inhabitant, more people speaking English,
etc.). The answer is not to 'fight' English, much less whine about it,
but to build more sites in other languages. As a translation service,
we also recommend that websites be multilingual. The increasing number
of languages on the internet is inevitable and can only boost
multicultural exchanges. For this to happen in the best possible
circumstances, we still need to develop tools to improve compatibility.
Fully coping with accents and other characters is only one example of
what can be done."
Alain Bron, a consultant in information systems and a writer, wrote in
January 1999: "Different languages will still be used for a long time
to come and this is healthy for the right to be different. The risk is
of course an invasion of one language to the detriment of others, and
with it the risk of cultural standardization. I think online services
will gradually emerge to get around this problem. First, translators
will be able to translate and comment on texts by request, but mainly
sites with a large audience will provide different language versions,
just as the audiovisual industry does now."
Guy Antoine, founder of Windows on Haiti, a reference website about
Haitian culture, wrote in November 1999: "It is true that for all
intents and purposes English will continue to dominate the web. This is
not so bad in my view, in spite of regional sentiments to the contrary,
because we do need a common language to foster communications between
people the world over. That being said, I do not adopt the doomsday
view that other languages will just roll over in submission. Quite the
contrary. The internet can serve, first of all, as a repository of
useful information on minority languages that might otherwise vanish
without leaving a trace. Beyond that, I believe that it provides an
incentive for people to learn languages associated with the cultures
about which they are attempting to gather information. One soon
realizes that the language of a people is an essential and inextricable
part of its culture. (...)
From this standpoint, I have much less faith in mechanized tools of
language translation, which render words and phrases but do a poor job
of conveying the soul of a people. Who are the Haitian people, for
instance, without "Kreyòl" (Creole for the non-initiated), the language
that has evolved and bound various African tribes transplanted in Haiti
during the slavery period? It is the most palpable exponent of
commonality that defines us as a people. However, it is primarily a
spoken language, not a widely written one. I see the web changing this
situation more so than any traditional means of language dissemination.
In Windows on Haiti, the primary language of the site is English, but
one will equally find a center of lively discussion conducted in
"Kreyòl". In addition, one will find documents related to Haiti in
French, in the old colonial creole, and I am open to publishing others
in Spanish and other languages. I do not offer any sort of translation,
but multilingualism is alive and well at the site, and I predict that
this will increasingly become the norm throughout the web."
ENCODING: FROM ASCII TO UNICODE
= [Quote]
Brian King, director of the WorldWide Language Institute (WWLI),
explained in September 1998: "The first step was for ASCII to become
Extended ASCII. This meant that computers could begin to start
recognizing the accents and symbols used in variants of the English
alphabet -- mostly used by European languages. But only one language
could be displayed on a page at a time. (...) The most recent
development is Unicode. Although still evolving and only just being
incorporated into the latest software, this new coding system
translates each character into 16 bytes [i.e. 16 bits]. Whereas 8-byte
[i.e. 8-bit] extended ASCII
could only handle a maximum of 256 characters, Unicode can handle over
65,000 unique characters and therefore potentially accommodate all of
the world's writing systems on the computer. So now the tools are more
or less in place. They are still not perfect, but at last we can at
least surf the web in Chinese, Japanese, Korean, and numerous other
languages that don't use the Western alphabet. As the internet spreads
to parts of the world where English is rarely used - such as China, for
example, it is natural that Chinese, and not English, will be the
preferred choice for interacting with it. For the majority of the users
in China, their mother tongue will be the only choice."
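
The figures in this quotation (256 characters, then over 65,000) match
code sizes of 8 and 16 bits per character. The arithmetic can be
checked with a short sketch. The function name code_points is only an
illustration, and the 16-bit size reflects Unicode as described in
1998; the Unicode code space has since been extended beyond 16 bits.

```python
def code_points(bits: int) -> int:
    """Number of distinct values that fit in the given number of bits."""
    return 2 ** bits


# Sizes discussed in the text (the labels are descriptive, not official names).
sizes = {
    "7-bit plain ASCII": code_points(7),
    "8-bit extended ASCII": code_points(8),
    "16-bit Unicode (as of 1998)": code_points(16),
}

for label, size in sizes.items():
    print(f"{label}: {size} possible characters")
```

With 16 bits, 65,536 values are available, which is the "over 65,000
unique characters" mentioned above.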
= Encoding in Project Gutenberg
Used since the beginning of computing, ASCII (American Standard Code
for Information Interchange) is a 7-bit coded character set for
information interchange in English. It was published in 1968 by ANSI
(American National Standards Institute), with updates in 1977 and
1986. The 7-bit plain ASCII, also called Plain Vanilla ASCII, is a set
of 128 characters with 95 printable unaccented characters (A-Z, a-z,
numbers, punctuation and basic symbols), i.e. the ones that are
available on the English/American keyboard. To support other European
languages, 8-bit extensions of ASCII (the ISO-8859 or ISO-Latin family)
were created as sets of 256 characters that add accented characters
such as those found in French, Spanish and German, for example ISO
8859-1 (ISO-Latin-1) for French.
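
As a rough illustration of these character counts, the following
Python sketch counts the printable characters of 7-bit ASCII and checks
whether a text containing an accented letter fits in ASCII or in ISO
8859-1. The helper names and the sample title are assumptions chosen
for the example; only the codec names "ascii" and "iso-8859-1" come
from Python itself.

```python
# Plain ASCII is a 7-bit code: values 0 to 127.
ASCII_SIZE = 2 ** 7


def printable_ascii() -> str:
    """Return the printable ASCII characters (codes 32 to 126, space included)."""
    return "".join(chr(code) for code in range(32, 127))


def fits_in(text: str, encoding: str) -> bool:
    """Report whether every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


sample = "Les Misérables"  # hypothetical sample title with one accented letter

print(ASCII_SIZE)                      # size of the 7-bit set
print(len(printable_ascii()))          # printable characters in that set
print(fits_in(sample, "ascii"))        # the accented letter is not in ASCII
print(fits_in(sample, "iso-8859-1"))   # but it is in ISO-Latin-1
```

The accented "é" is what prevents the sample from being stored as 7-bit
ASCII, while ISO 8859-1 stores it in a single byte.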
Created by Michael Hart in July 1971, Project Gutenberg was the first
information provider on the internet. Michael's purpose was to digitize
as many literary texts as possible, and to offer them for free in a
digital library open to anyone. Michael explained in August 1998: "We
consider etext to be a new medium, with no real relationship to paper,
other than presenting the same material, but I don't see how paper can
possibly compete once people each find their own comfortable way to
etexts, especially in schools."
Whether digitized years ago or now, all Project Gutenberg books are
created in 7-bit plain ASCII, called Plain Vanilla ASCII. When 8-bit
ASCII is used for books with accented characters like French or German,
Project Gutenberg also produces a 7-bit ASCII version with the accents
stripped. (This doesn't apply to languages that are not "convertible"
to ASCII, like Chinese, which is encoded in Big-5.)
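One possible way to produce such an accent-stripped 7-bit version is
sketched below in Python. This is an assumption for illustration only:
the text does not say which tools Project Gutenberg volunteers actually
use, and the function name strip_accents is hypothetical. The sketch
decomposes each accented letter into a base letter plus a combining
mark, then drops everything outside ASCII.

```python
import unicodedata


def strip_accents(text):
    """Return a 7-bit ASCII approximation of text with accents removed."""
    # NFKD splits "é" into "e" followed by a combining acute accent.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with errors="ignore" silently drops the non-ASCII marks.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("Les Misérables"))
print(strip_accents("Français, déjà, Übermensch"))
```

Letters with no decomposition, such as the German "ß" or the French
"œ", are simply dropped by this approach, so a real conversion would
need an explicit substitution table for them.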
Project Gutenberg sees Plain Vanilla ASCII as the best format by far,
and calls it "the lowest common denominator". It can be read, written,
copied and printed by any simple text editor or word processor on any
electronic device. It is the only format compatible with 99% of
hardware and software. It can be used as it is or to create versions in
many other formats. It will still be in use when other formats have
become obsolete, as is already the case for the formats of a few
short-lived reading devices launched since 1999. It is the assurance
that collections will never become obsolete, and will survive future
technological changes.
The goal is to preserve the texts not only over decades but over
centuries.
Project Gutenberg also publishes ebooks in well-known formats like
HTML, XML or RTF. There are Unicode files too. Any other format
provided by volunteers (PDF, LIT, TeX and many others) is usually
accepted, as long as they also supply an ASCII version where possible.
Initially, the books were mostly in English. As the original Project
Gutenberg is based in the United States, its first focus was the
English-speaking community in the country and worldwide. In October
1997, Michael Hart expressed his intention to digitize ebooks in other
languages. In early 1998, the catalog had a few titles in French (10
titles), German, Italian, Spanish and Latin. In July 1999, Michael
wrote: "I am publishing in one new language per month right now, and
will continue as long as possible."
In the 2000s, multilingualism became a priority for Project Gutenberg,
like internationalization, with Project Gutenberg Australia (created in
August 2001), Project Gutenberg Europe (created in January 2004),
Project Gutenberg Canada (created in July 2007), and others to come.
The launching of Project Gutenberg Europe and Distributed Proofreaders
Europe (DP Europe) by Project Rastko was an important step. Founded in
1997, Project Rastko is a non-governmental cultural and educational
project. One of its goals is the online publishing of Serbian culture.
It is part of the Balkans Cultural Network Initiative, a regional
cultural network for the Balkan peninsula in south-eastern Europe.
DP Europe has used the software of the original Distributed
Proofreaders, launched in 2000 to share proofreading among a number of
volunteers. Since the beginning, DP Europe has been a multilingual
website, with its main pages translated into several European languages
by volunteer translators. In April 2004, DP Europe was available in 12
languages. The long-term goal was 60 languages and 60 linguistic teams
in the main European languages. DP Europe supports Unicode instead of
ASCII, to be able to proofread ebooks in numerous languages.
First published in January 1991, Unicode "provides a unique number for
every character, no matter what the platform, no matter what the
program, no matter what the language" (excerpt from the website). This
platform-independent encoding, originally designed as a double-byte
(16-bit) code, provides a basis for the processing, storage and
interchange of text data in any language, in any modern software and
information technology protocols. Unicode is
maintained by the Unicode Consortium, and is a component of the W3C
(World Wide Web Consortium) specifications. In 2008, 50% of available
documents on the internet were encoded in Unicode, with the other 50%
encoded in ASCII.
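The idea of "a unique number for every character" can be sketched in
Python as follows (an illustration added here, not taken from the
Unicode website; the function names are hypothetical). Each character
has one code point, and characters within the first 65,536 code points
take two bytes in the UTF-16 encoding form. Later versions of Unicode
extended beyond that range, and such characters take four bytes in
UTF-16.

```python
def code_point_label(ch):
    """Return the Unicode code point of one character, e.g. 'U+0041'."""
    return f"U+{ord(ch):04X}"


def utf16_length(ch):
    """Return the number of bytes the character takes in UTF-16."""
    return len(ch.encode("utf-16-be"))


# Number of values a 16-bit (double-byte) code can distinguish.
BMP_CAPACITY = 2 ** 16

# Latin capital A, e with acute accent, a Chinese character, and a
# musical symbol that lies outside the first 65,536 code points.
for ch in ["A", "\u00e9", "\u4e2d", "\U0001D11E"]:
    print(code_point_label(ch), utf16_length(ch), "bytes in UTF-16")
```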
In the original Project Gutenberg in the U.S., there were ebooks in 25
languages in February 2004, in 42 languages in July 2005, including
Sanskrit and the Mayan languages, and in 50 languages in December 2006.
The ten top languages were English, French, German, Finnish, Dutch,
Spanish, Chinese, Italian, Portuguese and Tagalog.
[Many thanks to Russon Wooldridge and Mike Cook for revising previous
versions of this section.]
FIRST MULTILINGUAL PROJECTS
= [Quote]
Tyler Chambers, who created the Human-Languages Page and the Internet
Dictionary Project, wrote in September 1998: "Online, my work has been
with making language information available to more people through a
couple of my web-based projects. While I'm not multilingual, nor even
bilingual, myself, I see an importance to language and multilingualism
that I see in very few other areas. The internet has allowed me to
reach millions of people and help them find what they're looking for,
something I'm glad to do. (...) Overall, I think that the web has been
great for language awareness and cultural issues -- where else can you
randomly browse for 20 minutes and run across three or more different
languages with information you might potentially want to know?"
= Travlang
Travlang is a website dedicated to both travel and languages, created
in 1994 by Michael C. Martin on his university's website when he was a
student in physics. Travlang included one section called Foreign
Languages for Travelers, with links to online tools to learn 60
languages. Another section, Translating Dictionaries, gave access to
free dictionaries in a number of languages (Afrikaans, Czech, Danish,
Dutch, Esperanto, Finnish, French, Frisian, German, Hungarian, Italian,
Latin, Norwegian, Portuguese, Spanish). Other sections offered links to
language dictionaries, translation services, language schools, and
multilingual bookstores. In 1998, Travlang was still maintained by its
founder, who had become a researcher in experimental physics at the
Lawrence Berkeley National Laboratory, California.
Michael C. Martin wrote in August 1998: "I think the web is an ideal
place to bring different cultures and people together, and that
includes being multilingual. Our Travlang site is so popular because of
this, and people desire to feel in touch with other parts of the world.
(...) The internet is really a great tool for communicating with people
you wouldn't have the opportunity to interact with otherwise. I truly
enjoy the global collaboration that has made our Foreign Languages for
Travelers pages possible." Regarding the internet and languages in
general, "I think computerized full-text translations will become more
common, enabling a lot of basic communications with even more people.
This will also help bring the internet more completely to the non-
English speaking world."
= The Human-Languages Page
Created by Tyler Chambers in May 1994, the Human-Languages Page (H-LP)
was a comprehensive catalog of 1,800 language-related internet
resources in 100 languages. In September 1998, there were six subject
listings and two category listings. The six subject listings were:
languages and literature, schools and institutions, linguistics
resources, products and services, organizations, jobs and internships.
The two category listings were: dictionaries, and language lessons.
Tyler Chambers' other language-related project was the Internet
Dictionary Project (IDP), launched in 1995. As explained on the
project's website in September 1998: "The Internet Dictionary Project's
goal is to create royalty-free translating dictionaries through the
help of the internet's citizens. This site allows individuals from all
over the world to visit and assist in the translation of English words
into other languages. The resulting lists of English words and their
translated counterparts are then made available through this site to
anyone, with no restrictions on their use. (...) The Internet
Dictionary Project began in 1995 in an effort to provide a noticeably
lacking resource to the internet community and to computing in general
-- free translating dictionaries. Not only is it helpful to the online
community to have access to dictionary searches at their fingertips via
the World Wide Web, it also sponsors the growth of computer software
which can benefit from such dictionaries -- from translating programs
to spelling-checkers to language-education guides and more. By
facilitating the creation of these dictionaries online by thousands of
anonymous volunteers all over the internet, and by providing the
results free-of-charge to anyone, the Internet Dictionary Project hopes
to leave its mark on the internet and to inspire others to create
projects which will benefit more than a corporation's gross income."
Tyler wrote in an email interview in September 1998: "Multilingualism
on the web was inevitable even before the medium 'took off', so to
speak. 1994 was the year I was really introduced to the web, which was
a little while after its christening but long before it was mainstream.
That was also the year I began my first multilingual web project, and
there was already a significant number of language-related resources
online. This was back before Netscape even existed -- Mosaic was almost
the only web browser, and webpages were little more than hyperlinked
text documents. As browsers and users mature, I don't think there will
be any currently spoken language that won't have a niche on the web,
from Native American languages to Middle Eastern dialects, as well as a
plethora of 'dead' languages that will have a chance to find a new
audience with scholars and others alike online. To my knowledge, there
are very few language types which are not currently online: browsers
currently have the capability to display Roman characters, Asian
languages, the Cyrillic alphabet, Greek, Turkish, and more. Accent
Software has a product called 'Internet with an Accent' which claims to
be able to display over 30 different language encodings. If there are
currently any barriers to any particular language being on the web,
they won't last long. (...)
Online, my work has been with making language information available to
more people through a couple of my web-based projects. While I'm not
multilingual, nor even bilingual, myself, I see an importance to
language and multilingualism that I see in very few other areas. The
internet has allowed me to reach millions of people and help them find
what they're looking for, something I'm glad to do. It has also made me
somewhat of a celebrity, or at least a familiar name in certain circles
-- I just found out that one of my web projects had a short mention in
Time Magazine's Asia and International issues. Overall, I think that
the web has been great for language awareness and cultural issues --
where else can you randomly browse for 20 minutes and run across three
or more different languages with information you might potentially want
to know? Communications mediums make the world smaller by bringing
people closer together; I think that the web is the first (of mail,
telegraph, telephone, radio, TV) to really cross national and cultural
borders for the average person. Israel isn't thousands of miles away
anymore, it's a few clicks away -- our world may now be small enough to
fit inside a computer screen."
How about the future? "I think that the future of the internet is even
more multilingualism and cross-cultural exploration and understanding
than we've already seen. But the internet will only be the medium by
which this information is carried; like the paper on which a book is
written, the internet itself adds very little to the content of
information, but adds tremendously to its value in its ability to
communicate that information. To say that the internet is spurring
multilingualism is a bit of a misconception, in my opinion -- it is
communication that is spurring multilingualism and cross-cultural
exchange, the internet is only the latest mode of communication which
has made its way down to the (more-or-less) common person. The internet
has a long way to go before being ubiquitous around the world, but it,
or some related progeny, likely will. Language will become even more
important than it already is when the entire planet can communicate
with everyone else (via the web, chat, games, e-mail, and whatever
future applications haven't even been invented yet), but I don't know
if this will lead to stronger language ties, or a consolidation of
languages until only a few, or even just one remain. One thing I think
is certain is that the internet will forever be a record of our
diversity, including language diversity, even if that diversity fades
away. And that's one of the things I love about the internet -- it's a
global model of the saying 'it's not really gone as long as someone
remembers it'. And people do remember."
In spring 2001, the Human-Languages Page merged with the Languages
Catalog, a section of the WWW Virtual Library, to become
iLoveLanguages. In September 2003, iLoveLanguages provided an index of
2,000 linguistic resources in 100 languages. As for the Internet
Dictionary Project, Tyler ran out of time to manage this project, and
removed the ability to update the dictionaries in January 2007. People
can still search the available dictionaries or download the archived
files.
= NetGlos
Launched in 1995 by the WorldWide Language Institute (WWLI), an
institute providing language instruction via the web, NetGlos (which
stands for: Multilingual Glossary of Internet Terminology) has been
compiled as a voluntary, collaborative project by a number of
translators and other language professionals. In September 1998,
NetGlos was available in the following languages: Chinese, Croatian,
English, Dutch/Flemish, French, German, Greek, Hebrew, Italian, Maori,
Norwegian, Portuguese, and Spanish.
Brian King, director of the WorldWide Language Institute, wrote in
September 1998 in an email interview: "Although English is still the
most important language used on the web, and the internet in general, I
believe that multilingualism is an inevitable part of the future
direction of cyberspace. Here are some of the important developments
that I see as making a multilingual web become a reality:
1. Computer technology has traditionally been the sole domain of a
'techie' elite, fluent in both
complex programming languages and in English -- the universal language
of science and technology. Computers were never designed to handle
writing systems that couldn't be translated into ASCII. There wasn't
much room for anything other than the 26 letters of the English
alphabet in a coding system that originally couldn't even recognize
acute accents and umlauts -- not to mention non-alphabetic systems like
Chinese. But tradition has been turned upside down. Technology has been
popularized. GUIs (graphical user interfaces) like Windows and
Macintosh have hastened the process (and indeed it's no secret that it
was Microsoft's marketing strategy to use their operating system to
make computers easy to use for the average person). These days this
ease of use has spread beyond the PC to the virtual, networked space of
the internet, so that now non-programmers can even insert Java applets
into their webpages without understanding a single line of code.
2. An extension of (local) popularization is the export of
information technology around the world. Popularization has now
occurred on a global scale and English is no longer necessarily the
lingua franca of the user. Perhaps there is no true lingua franca, but
only the individual languages of the users. One thing is certain -- it
is no longer necessary to understand English to use a computer, nor is
it necessary to have a degree in computer science. A pull from
non-English-speaking computer users and a push from technology companies
competing for global markets has made localization a fast growing area
in software and hardware development. This development has not been as
fast as it could have been. The first step was for ASCII to become
Extended ASCII. This meant that computers could begin to start
recognizing the accents and symbols used in variants of the English
alphabet -- mostly used by European languages. But only one language
could be displayed on a page at a time.
3. The most recent development is Unicode. Although still evolving and
only just being incorporated into the latest software, this new coding
system translates each character into 16 bytes [bits]. Whereas 8-byte
[8-bit] Extended ASCII could only handle a
maximum of 256 characters, Unicode can handle over 65,000 unique
characters and therefore potentially accommodate all of the world's
writing systems on the computer. So now the tools are more or less in
place. They are still not perfect, but at last we can at least surf the
web in Chinese, Japanese, Korean, and numerous other languages that
don't use the Western alphabet. As the internet spreads to parts of the
world where English is rarely used -- such as China, for example, it is
natural that Chinese, and not English, will be the preferred choice for
interacting with it. For the majority of the users in China, their
mother tongue will be the only choice. There is a change-over period,
of course. Much of the technical terminology on the web is still not
translated into other languages. And as we found with our Multilingual
Glossary of Internet Terminology -- known as NetGlos -- the translation
of these terms is not always a simple process. Before a new term
becomes accepted as the 'correct' one, there is a period of instability
where a number of competing candidates are used. Often an English loan
word becomes the starting point -- and in many cases the endpoint. But
eventually a winner emerges that becomes codified into published
technical dictionaries as well as the everyday interactions of the
nontechnical user. The latest version of NetGlos is the Russian one and
it should be available in a couple of weeks or so [at the end of
September 1998]. It will no doubt be an excellent example of the
ongoing, dynamic process of 'russification' of web terminology.
4. Whereas 'mother-tongue education' was deemed a human right for
every child in the world by a UNESCO report in the
early '50s, 'mother-tongue surfing' may very well be the Information
Age equivalent. If the internet is to truly become the Global Network
that it is promoted as being, then all users, regardless of language
background, should have access to it. To keep the internet as the
preserve of those who, by historical accident, practical necessity, or
political privilege, happen to know English, is unfair to those who
don't.
5. Although a multilingual web may be desirable on moral and ethical
grounds, such high ideals are not enough to make
it other than a reality on a small-scale. As well as the appropriate
technology being available so that the non-English speaker can go,
there is the impact of 'electronic commerce' as a major force that may
make multilingualism the most natural path for cyberspace. Sellers of
products and services in the virtual global marketplace into which the
internet is developing must be prepared to deal with a virtual world
that is just as multilingual as the physical world. If they want to be
successful, they had better make sure they are speaking the languages
of their customers!"
How about the future of the WorldWide Language Institute? "As a company
that derives its very existence from the importance attached to
languages, I believe the future will be an exciting and challenging
one. But it will be impossible to be complacent about our successes and
accomplishments. Technology is already changing at a frenetic pace.
Lifelong learning is a strategy that we all must use if we are to stay
ahead and be competitive. This is a difficult enough task in an
English-speaking environment. If we add in the complexities of
interacting in a multilingual/multicultural cyberspace, then the task
becomes even more demanding. As well as competition, there is also the
necessity for cooperation -- perhaps more so than ever before. The
seeds of cooperation across the internet have certainly already been
sown. Our NetGlos Project has depended on the goodwill of volunteer
translators from Canada, U.S., Austria, Norway, Belgium, Israel,
Portugal, Russia, Greece, Brazil, New Zealand and other countries. I
think the hundreds of visitors we get coming to the NetGlos pages
every day is an excellent testimony to the success of these types of
working relationships. I see the future depending even more on
cooperative relationships -- although not necessarily on a volunteer
basis."
= Logos
Logos is a global translation company with headquarters in Modena,
Italy. In 1997, Logos had 200 in-house translators in Modena and 2,500
free-lance translators worldwide, who processed around 200 texts per
day. The company made a bold move, and decided to put on the web the
linguistic tools used by its translators, for the internet community to
freely use them as well. The linguistic tools were the Logos
Dictionary, a multilingual dictionary with 7 billion words (in fall
1998); the Logos Wordtheque, a multilingual library with 300 billion
words extracted from translated novels, technical manuals and other
texts; the Logos Linguistic Resources, a database of 500 glossaries;
and the Logos Universal Conjugator, a database for verbs in 17
languages.
When interviewed by Annie Kahn in December 1997 for the French daily Le
Monde, Rodrigo Vergara, head of Logos, explained: "We wanted all our
translators to have access to the same translation tools. So we made
them available on the internet, and while we were at it we decided to
make the site open to the public. This made us extremely popular, and
also gave us a lot of exposure. This move has in fact attracted many
customers, and also allowed us to widen our network of translators,
thanks to contacts made in the wake of the initiative."
In the same article, "Les mots pour le dire" (The Words to Tell it),
Annie Kahn wrote: "The Logos site is much more than a mere dictionary
or a collection of links to other online dictionaries. The cornerstone
is the document search program, which processes a corpus of literary
texts available free of charge on the web. If you search for the
definition or the translation of a word ('didactique' [didactic], for
example), you get not only the answer sought, but also a quote from one
of the literary works containing the word (in our case, an essay by
Voltaire). All it takes is a click on the mouse to access the whole
text or even to order the book, including in foreign translations,
thanks to a partnership agreement with the famous online bookstore
Amazon.com. However, if no text containing the required word is found,
the program acts as a search engine, sending the user to other web
sources containing this word. In the case of certain words, you can
even hear the pronunciation. If there is no translation currently
available, the system calls on the public to contribute. Everyone can
make suggestions, after which Logos translators check the suggested
translations they receive."
ONLINE LANGUAGE DICTIONARIES
= [Quote]
WordReference.com was created in 1999 by Michael Kellogg, who wrote on
his project's website: "I started this site in 1999 in an effort to
provide free online bilingual dictionaries and tools to the world for
free on the internet. The site has grown gradually ever since to
become one of the most-used online dictionaries, and the top online
dictionary for its language pairs of English-Spanish, English-French,
English-Italian, Spanish-French, and Spanish-Portuguese. Today, I am
happy to continue working on improving the dictionaries, its tools and
the language forums. I really do enjoy creating new features to make
the site more and more useful."
= From print versions
The first online language dictionaries stemmed from print versions,
with websites launched in the mid-1990s.
On the website "Merriam-Webster Online: The Language Center", Merriam-
Webster, a main publisher of English-language dictionaries, gave free
access to online resources stemming from its print publications. The
online resources were: Webster Dictionary, Webster Thesaurus, Webster's
Third (a lexical landmark), Guide to International Business
Communications, Vocabulary Builder (with interactive vocabulary
quizzes), and the Barnhart Dictionary Companion (hot new words). The
goal was also to help track down definitions, spellings,
pronunciations, synonyms, vocabulary exercises, and other key facts
about words and language.
The "Dictionnaire Francophone en Ligne" was the web version of the
"Dictionnaire Universel Francophone", published by Hachette, a major
French publisher, and the University Agency for Francophony (AUF:
Agence Universitaire de la Francophonie, also known as AUPELF-UREF).
The dictionary included not only standard French but also the French-
language words and expressions used worldwide. French is the official
language of 49 states, with a number of them in Africa, and is spoken
by 500 million people worldwide. The Agency of French-speaking
Countries (Agence de la Francophonie), which has included the AUF, was
founded in 1970 as an instrument of multilateral cooperation at the
international level. As a side remark, English and French are the only
official and/or cultural languages that are widely spread on five
continents.
= Directories of dictionaries
Directories of dictionaries have been useful too, such as
"Dictionnaires Électroniques" (Electronic Dictionaries), an online
catalog of electronic dictionaries maintained by the French Section of
the Swiss Federal Administration's Central Linguistic Services (SLC-f:
Section Française des Services Linguistiques Centraux). The catalog
included five main sections: abbreviations and acronyms, monolingual
dictionaries, bilingual dictionaries, multilingual dictionaries, and
geographical information. The catalog could also be searched by
keywords.
Marcel Grangier was the head of the French Section of Central
Linguistic Services, which means he was in charge of organizing
translation matters into French for the linguistic services of the
Swiss government. He wrote in January 1999: "Our website was first
conceived as an intranet service for translators in Switzerland, who
often deal with the same kind of material as the Federal government's
translators. Some parts of it are useful to any translators, wherever
they are. The section "Dictionnaires Électroniques" is only one section
of the website. Other sections deal with administration, law, the
French language, and general information. The site also hosts the pages
of the Conference of Translation Services of European States (COTSOES).
(...) To work without the internet is simply impossible now. Apart from
all the tools used (email, the electronic press, services for
translators), the internet is for us a vital and endless source of
information in what I'd call the 'non-structured sector' of the web.
For example, when the answer to a translation problem can't be found on
websites presenting information in an organized way, in most cases
search engines allow us to find the missing link somewhere on the
network."
How about the future? "We can see multilingualism on the internet as a
happy and irreversible inevitability. So we have to laugh at the
doomsayers who only complain about the supremacy of English. Such
supremacy isn't wrong in itself, because it is mainly based on
statistics (more PCs per inhabitant, more people speaking English,
etc.). The answer isn't to 'fight English', much less whine about it,
but to build more sites in other languages. As a translation service,
we also recommend that websites be multilingual. (...) The increasing
number of languages on the internet is inevitable and can only boost
multicultural exchanges. For this to happen in the best possible
circumstances, we still need to develop tools to improve compatibility.
Fully coping with accents and other characters is only one example of
what can be done."
The section "Dictionnaires Électroniques" was later transferred to the
website of the Conference of Translation Services of European States
(COTSOES), when COTSOES launched its own website.
= The yourDictionary.com portal
Robert Beard, a language teacher at Bucknell University, in Lewisburg,
Pennsylvania, created the website "A Web of Online Dictionaries" (WOD)
in 1995. In September 1998, the website provided an index of 800 online
dictionaries in 150 languages, as well as specific sections:
multilingual dictionaries, specialized English dictionaries, thesauri
and other vocabulary aids, language identifiers and guessers, an index
of dictionary indices, the Web of Online Grammars, and the Web of
Linguistic Fun (i.e. linguistics for non-specialists).
Robert Beard wrote in September 1998: "There was an initial fear that
the web posed a threat to multilingualism on the web, since HTML and
other programming languages are based on English and since there are
simply more websites in English than any other language. However, my
websites indicate that multilingualism is very much alive and the web
may, in fact, serve as a vehicle for preserving many endangered
languages. I now have links to dictionaries in 150 languages and
grammars of 65 languages. Moreover, the new attention paid by browser
developers to the different languages of the world will encourage even
more websites in different languages."
A few months later, Robert Beard co-founded a larger project,
yourDictionary.com, that included his previous website and was launched
in February 2000. He wrote in January 2000: "The new website is an
index of 1,200+ dictionaries in more than 200 languages. Besides the
WOD, the new website includes a word-of-the-day-feature, word games, a
language chat room, the old 'Web of Online Grammars' (now expanded to
include additional language resources), the 'Web of Linguistic Fun',
multilingual dictionaries; specialized English dictionaries; thesauri
and other vocabulary aids; language identifiers and guessers, and other
features; dictionary indices. yourDictionary.com will hopefully be the
premiere language portal and the largest language resource site on the
web. It is now actively acquiring dictionaries and grammars of all
languages with a particular focus on endangered languages. It is
overseen by a blue ribbon panel of linguistic experts from all over the
world. (...) Indeed, yourDictionary.com has lots of new ideas. We plan
to work with the Endangered Language Fund in the U.S. and Britain to
raise money for the Foundation's work and publish the results on our
site. We will have language chatrooms and bulletin boards. There will
be language games designed to entertain and teach fundamentals of
linguistics. The Linguistic Fun page will become an online journal for
short, interesting, yes, even entertaining, pieces on language that are
based on sound linguistics by experts from all over the world."
How about the future of the web? "The web will be an encyclopedia of
the world by the world for the world. There will be no information or
knowledge that anyone needs that will not be available. The major
hindrance to international and interpersonal understanding, personal
and institutional enhancement, will be removed. It would take a wilder
imagination than mine to predict the effect of this development on the
nature of humankind."
= Terminological databases
Some terminological databases are run by international organizations in
their own field of expertise, with free online versions, for example
ILOTERM maintained by the International Labor Organization (ILO),
TERMITE (ITU Telecommunication Terminology Database) maintained by the
International Telecommunication Union (ITU), WHOTERM (WHO Terminology
Information System) maintained by the World Health Organization (WHO),
and Eurodicautom maintained by the European Commission.
ILOTERM is a quadrilingual (English, French, German, Spanish)
terminology database maintained by the Terminology and Reference Unit
of the Official Documentation Branch (OFFDOC) at the International
Labor Office (ILO) in Geneva, Switzerland. As explained on its website,
ILOTERM's primary purpose is to provide solutions, reflecting current
usage, to terminological problems in the social and labor fields. Terms
are entered in English with their French, Spanish and German
equivalents. The database also includes records for the ILO structure
and programs, official names of international institutions, national
bodies and employers' and workers' organizations, and titles of
international meetings.
TERMITE (Telecommunication Terminology Database) is
maintained by the Terminology, References and Computer Aids to
Translation Section of the Conference Department at the International
Telecommunication Union (ITU) in Geneva, Switzerland. It is a
quadrilingual (English, French, Spanish, Russian) terminological
database built on the content of all ITU printed glossaries since 1980,
and updated with recent entries.
WHOTERM (WHO Terminology Information System) is
maintained by the World Health Organization (WHO) in Geneva,
Switzerland. It has included: (a) the WHO General Dictionary Index (in
English, with the French and Spanish equivalents); (b) three glossaries
in English: Health for All, Programme Development and Management, and
Health Promotion; (c) the WHO TermWatch, an awareness service from the
Technical Terminology, reflecting the current WHO usage, but not
necessarily terms officially approved by WHO, and links to health-
related terminology.
Eurodicautom, a multilingual terminological database maintained by the
Translation Service of the European Commission, was initially developed
to assist in-house translators. The free online version was used by
European Union officials and by language professionals throughout the
world. Its contents were available in the eleven official languages of
the European Union (Danish, Dutch, English, Finnish, French, German,
Greek, Italian, Portuguese, Spanish, Swedish), plus Latin. Eurodicautom
covered "a broad spectrum of human knowledge", mainly relating to
economy, science, technology and legislation in the European Union. In
late 2003, the website announced the inclusion of the existing database
into a larger terminological database that would also include databases
from other official European institutions. The new terminological
database would be available in more than 20 languages instead of the
original eleven, because a number of Eastern European countries were
expected to join the European Union in the near future. The European
Union went from 15 member countries to 25 in May 2004, and to 27 in
January 2007.
The website of IATE (Inter-Active Terminology for Europe) was launched
in March 2007 as an eagerly awaited free service on the web, with 1.4
million entries in 24 languages.
= Wikipedia
Wikipedia was launched in January 2001 by Jimmy Wales and Larry Sanger
(Larry resigned later on). It has quickly grown into the largest
reference website on the internet, financed by donations, with no
advertising. Its multilingual content is free and written
collaboratively by people worldwide, who contribute under a pseudonym.
Its website is a wiki, which means that anyone can edit, correct and
improve information throughout the encyclopedia. The articles stay the
property of their authors, and can be freely used according to the GFDL
(GNU Free Documentation License).
Wikipedia had 1.3 million articles (by 13,000 contributors) in 100
languages in December 2004, 6 million articles in 250 languages in
December 2006, and 7 million articles in 192 languages in May 2007,
including 1.8 million articles in English, 589,000 articles in German,
500,000 articles in French, 260,000 articles in Portuguese, and 236,000
articles in Spanish. In August 2009, Wikipedia was among the top five
websites in the world, with a total of 330 million visitors a month.
Wikipedia is hosted by the Wikimedia Foundation, founded in June 2003,
which has run a number of other projects, beginning with Wiktionary
(launched in December 2002) and Wikibooks (launched in June 2003),
followed by Wikiquote, Wikisource (texts from public domain), Wikimedia
Commons (multimedia), Wikispecies (animals and plants), Wikinews,
Wikiversity (textbooks), and Wiki Search (search engine).
LEARNING LANGUAGES ONLINE
= [Quote]
Robert Beard, a language teacher at Bucknell University, in Lewisburg,
Pennsylvania, wrote in September 1998: "As a language teacher, the web
represents a plethora of new resources produced by the target culture,
new tools for delivering lessons (interactive Java and Shockwave
exercises) and testing, which are available to students any time they
have the time or interest -- 24 hours a day, 7 days a week. It is also
an almost limitless publication outlet for my colleagues and I, not to
mention my institution. (...) Ultimately all course materials,
including lecture notes, exercises, moot and credit testing, grading,
and interactive exercises will be far more effective in conveying
concepts that we have not even dreamed of yet."
= CTI Centre for Modern Languages
Since its inception in 1989, the CTI (Computers in Teaching Initiative)
Centre for Modern Languages, based in the Language Institute at the
University of Hull, United Kingdom, has aimed to promote and encourage
the use of computers in language learning and teaching. The CTI Centre
provides information on how computer-assisted language learning (CALL)
can be effectively integrated into existing courses. It offers support
to language lecturers who are using computers in their teaching, or who
wish to use them.
June Thompson, manager of the CTI Centre, wrote in December 1998: "The
internet has the potential to increase the use of foreign languages,
and our organization certainly opposed any trend towards the dominance
of English as the language of the internet. The use of the internet has
brought an enormous new dimension to our work of supporting language
teachers in their use of technology in teaching."
How about the future? "I suspect that for some time to come, the use of
internet-related activities for languages will continue to develop
alongside other technology-related activities (e.g. use of CD-ROMs --
not all institutions have enough networked hardware). In the future I
can envisage use of internet playing a much larger part, but only if
such activities are pedagogy-driven. Our organization is closely
associated with the WELL project which devotes itself to these issues."
The WELL (Web Enhanced Language Learning) project was an initiative of
EUROCALL (European Association for Computer-Assisted Language
Learning). It ran from 1997 to 2000 in the United Kingdom to provide
access to high-quality web resources in 12 languages. The resources
were selected and described by subject experts, with information and
examples on how to use them for teaching and learning.
More generally, EUROCALL's goal is to promote the use of foreign
languages within Europe, to provide a European focus for all aspects of
the use of technology for language learning, and to enhance the
quality, dissemination and efficiency of CALL materials. Another
project of EUROCALL is CAPITAL (Computer-Assisted Pronunciation
Investigation Teaching and Learning), run by a group of researchers and
practitioners interested in using computers for pronunciation teaching
and learning.
= LINGUIST List
The LINGUIST List was founded by Anthony Rodrigues Aristar in 1990 at
the University of Western Australia, with 60 subscribers, and moved to
Texas A&M University in 1991. In 1997, emails
sent to the distribution list were also available on the list's own
website, in the following sections: the profession (conferences,
linguistic associations, programs), research and research support
(papers, dissertation abstracts, projects, bibliographies, topics,
texts), publications, pedagogy, language resources (languages, language
families, dictionaries, regional information), and computer support
(fonts and software). The LINGUIST List is a component of the WWW
Virtual Library for linguistics.
Helen Dry, moderator of the LINGUIST List, wrote in August 1998: "The
LINGUIST List, which I moderate, has a policy of posting in any
language, since it's a list for linguists. However, we discourage
posting the same message in several languages, simply because of the
burden extra messages put on our editorial staff. (We are not a bounce-
back list, but a moderated one. So each message is organized into an
issue with like messages by our student editors before it is posted.)
Our experience has been that almost everyone chooses to post in
English. But we do link to a translation facility that will present our
pages in any of five languages; so a subscriber need not read LINGUIST
in English unless s/he wishes to. We also try to have at least one
student editor who is genuinely multilingual, so that readers can
correspond with us in languages other than English."
She added in July 1999: "We are beginning to collect some primary data.
For example, we have searchable databases of dissertation abstracts
relevant to linguistics, of information on graduate and undergraduate
linguistics programs, and of professional information about individual
linguists. The dissertation abstracts collection is, to my knowledge,
the only freely available electronic compilation in existence."
MINORITY LANGUAGES ON THE WEB
= [Quote]
Caoimhín Ó Donnaíle has taught computing -- through the Gaelic language
-- at the Institute Sabhal Mór Ostaig, on the Island of Skye, in
Scotland. He has also maintained the bilingual (English, Gaelic)
college website, which is the main site worldwide with information on
Scottish Gaelic. He wrote in May 2001: "Students do everything by
computer, use Gaelic spell-checking, and a Gaelic online terminology
database. There are more hits on our website. There is more use of
sound. Gaelic radio (both Scottish and Irish) is now available
continuously worldwide via the internet. A major project has been the
translation of the Opera web browser into Gaelic -- the first software
of this size available in Gaelic."
= The Ethnologue
Published by SIL International (SIL was initially known as the Summer
Institute of Linguistics), "The Ethnologue: Languages of the World" is
an encyclopedic reference work cataloging all of the world's 6,909
known living languages. The 16th edition was published in 2009, in
print and on the web. The Ethnologue has been an active research
project for more than fifty years. Thousands of linguists have
contributed to the Ethnologue worldwide. A new edition is published
approximately every four years.
The Ethnologue was founded in 1951 by Richard Pittman, who was
motivated by the desire to share information on language development
needs around the world with his colleagues at SIL International as well
as with other language researchers. Richard Pittman was the editor of
the 1st to 7th editions (1951-1969).
Barbara Grimes was the editor of the 8th to 14th editions (1971-2000).
She wrote in January 2000: "It is a catalog of the languages of the
world, with information about where they are spoken, an estimate of the
number of speakers, what language family they are in, alternate names,
names of dialects, other socio-linguistic and demographic information,
dates of published Bibles, a name index, a language family index, and
language maps." In 1971, information was expanded from primarily
minority languages to encompass all known languages of the world.
Between 1967 and 1973, she completed an in-depth revision of the
information on Africa, the Americas, the Pacific, and a few countries
of Asia. During her years as editor, the number of identified languages
grew from 4,493 to 6,809. The information recorded on each language
expanded so that the published work more than tripled in size.
In 2000, Raymond Gordon Jr. became the third editor of the Ethnologue
and produced the 15th edition (2005). Shortly after the publication of
the 15th edition, Paul Lewis became the editor, responsible for general
oversight and research policy. He installed Conrad Hurd as managing
editor, responsible for operations and database management, and Raymond
Gordon as senior research editor, leading a team of regional and
language-family focused research editors.
In the Introduction of its latest edition (16th edition, 2009), the
Ethnologue defines a language as follows: "How one chooses to define a
language depends on the purposes one has in identifying that language
as distinct from another. Some base their definition on purely
linguistic grounds. Others recognize that social, cultural, or
political factors must also be taken into account. In addition,
speakers themselves often have their own perspectives on what makes a
particular language uniquely theirs. Those are frequently related to
issues of heritage and identity much more than to the linguistic
features of the language(s) in question."
As explained in the Introduction, one feature of the database since its
inception has been a system of three-letter language identifiers, which
have appeared in the publication itself from the 10th edition (1984)
onwards. "In 1998, the International Organization for Standardization
(ISO) adopted ISO 639-2, a standard for three-letter language
identifiers. The standard is based on a convergence of ISO 639-1 (an
earlier standard for two-letter language identifiers adopted in 1988)
and of ANSI Z39.53 (also known as the MARC language codes, a set of
three-letter identifiers developed within the library community and
adopted as an American National Standard in 1987). The ISO 639-2
standard was insufficient for many purposes since it has identifiers
for fewer than 400 individual languages. Thus in 2002, ISO TC37/SC2
formally invited SIL International to prepare a new standard that would
reconcile the complete set of codes used in the Ethnologue with the
codes already in use in the earlier ISO standard. In addition, codes
developed by Linguist List to handle ancient and constructed languages
were to be incorporated. The result, which was officially approved by
the subscribing national standards bodies in 2006 and published in
2007, is a standard named ISO 639-3 that provides standardized three-
letter codes for identifying nearly 7,500 languages (ISO 2007). SIL
International was named as the registration authority for the ISO 639-3
standard inventory of language identifiers and administers the annual
cycle for changes and updates. This edition of Ethnologue is the second
to use the ISO 639-3 language identifiers. In the fifteenth edition
they had the status of Draft International Standard. In this edition
they are based on the standard as originally adopted plus the 2006
series of adopted change requests (released August 2007) and the 2007
series of adopted change requests (released January 2008). Information
about the ISO 639-3 standard and procedures for requesting additions,
deletions, and other modifications to the ISO 639-3 inventory of
identified languages can be found at the ISO 639-3 website:
http://www.sil.org/iso639-3."
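The relation between the older two-letter identifiers and the
three-letter identifiers described above can be illustrated with a
short sketch. It is only an illustration under stated assumptions: the
table below is a small hand-picked sample and not the official ISO
639-3 inventory, and the names SAMPLE_CODES, is_well_formed and
to_three_letter are hypothetical, chosen for this example.

```python
import re
from typing import Optional

# Hand-picked sample, NOT the official registry:
# ISO 639-1 two-letter code -> (ISO 639-3 three-letter code, English name)
SAMPLE_CODES = {
    "en": ("eng", "English"),
    "fr": ("fra", "French"),
    "es": ("spa", "Spanish"),
    "gd": ("gla", "Scottish Gaelic"),
    "ht": ("hat", "Haitian Creole"),
}

# A three-letter identifier is written as three lowercase ASCII letters.
THREE_LETTER = re.compile(r"[a-z]{3}")


def is_well_formed(code: str) -> bool:
    """Check only the shape of an identifier, not whether it is assigned."""
    return THREE_LETTER.fullmatch(code) is not None


def to_three_letter(two_letter: str) -> Optional[str]:
    """Look up the three-letter code for a two-letter code in the sample."""
    entry = SAMPLE_CODES.get(two_letter.lower())
    return entry[0] if entry else None


if __name__ == "__main__":
    for short, (long_code, name) in SAMPLE_CODES.items():
        print(short, long_code, name)
```

With this sample, to_three_letter("gd") gives "gla" for Scottish Gaelic.
A real application would load the full code tables published by the
registration authority rather than a hard-coded dictionary.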
= Experiences
Caoimhín Ó Donnaíle has taught computing -- through the Gaelic language
-- at the Institute Sabhal Mór Ostaig, on the Island of Skye, in
Scotland. He has also maintained the bilingual (English, Gaelic)
college website, which is the main site worldwide with information on
Scottish Gaelic, as well as the bilingual webpage European Minority
Languages, a list of minority languages by alphabetic order and by
language family. He wrote in May 2001: "There has been a great
expansion in the use of information technology in our college. Far more
computers, more computing staff, flat screens. Students do everything
by computer, use Gaelic spell-checking, and a Gaelic online terminology
database. There are more hits on our website. There is more use of
sound. Gaelic radio (both Scottish and Irish) is now available
continuously worldwide via the internet. A major project has been the
translation of the Opera web browser into Gaelic -- the first software
of this size available in Gaelic."
How about the internet and endangered languages? "I would emphasize the
point that as regards the future of endangered languages, the internet
speeds everything up. If people don't care about preserving languages,
the internet and accompanying globalization will greatly speed their
demise. If people do care about preserving them, the internet will be a
tremendous help."
Guy Antoine is the founder of Windows on Haiti, a reference website
about Haitian culture. He wrote in November 1999: "In Windows on Haiti,
the primary language of the site is English, but one will equally find
a center of lively discussion conducted in 'Kreyòl'. In addition, one
will find documents related to Haiti in French, in the old colonial
Creole, and I am open to publishing others in Spanish and other
languages. I do not offer any sort of translation, but multilingualism
is alive and well at the site, and I predict that this will
increasingly become the norm throughout the web."
Guy added in June 2001: "Kreyòl is the only national language of Haiti,
and one of its two official languages, the other being French. It is
hardly a minority language in the Caribbean context, since it is spoken
by eight to ten million people. (...) I have taken the promotion of
Kreyòl as a personal cause, since that language is the strongest of
bonds uniting all Haitians, in spite of a small but disproportionately
influential Haitian elite's disdainful attitude to adopting standards
for the writing of Kreyòl and supporting the publication of books and
official communications in that language. For instance, there was
recently a two-week book event in Haiti's Capital and it was promoted
as 'Livres en Folie' ('A mad feast for books'). Some 500 books from
Haitian authors were on display, among which one could find perhaps 20
written in Kreyòl. This is within the context of France's major push to
celebrate Francophony among its former colonies. This plays rather well
in Haiti, but directly at the expense of Creolophony. What I have
created in response to those attitudes are two discussion forums on my
website, Windows on Haiti, held exclusively in Kreyòl. One is for
general discussions on just about everything but obviously more focused
on Haiti's current socio-political problems. The other is reserved only
to debates of writing standards for Kreyòl. Those debates have been
quite spirited and have met with the participation of a number of
linguistic experts. The uniqueness of these forums is their non-
academic nature."
Robert Beard, co-founder of the yourDictionary.com portal, wrote in
January 2000: "While English still dominates the web, the growth of
monolingual non-English websites is gaining strength with the various
solutions to the font problems. Languages that are endangered are
primarily languages without writing systems at all (only 1/3 of the
world's 6,000+ languages have writing systems). I still do not see the
web contributing to the loss of language identity and still suspect it
may, in the long run, contribute to strengthening it. More and more
Native Americans, for example, are contacting linguists, asking them to
write grammars of their language and help them put up dictionaries. For
these people, the web is an affordable boon for cultural expression."
LOCALIZATION AND INTERNATIONALIZATION
= [Quote]
Peter Raggett, deputy-head (and then head) of the Central Library at
the OECD (Organization for Economic Cooperation and Development), wrote
in August 1999: "I think it is incumbent on European organizations and
businesses to try and offer websites in three or four languages if
resources permit. In this age of globalization and electronic commerce,
businesses are finding that they are doing business across many
countries. Allowing French, German, Japanese speakers to easily read
one's website as well as English speakers will give a business a
competitive edge in the domain of electronic trading."
= [Text]
In 1999, the subtitle of Babel's website was: "Towards communicating on
the internet in any language..." Babel was a joint project of Alis
Technologies and the Internet Society to contribute to the
internationalization of the internet. Babel offered a multilingual
website (English, French, German, Italian, Portuguese, Spanish and
Swedish), with information about the world's languages, and a
typographical and linguistic glossary. "The Internet and
Multilingualism" section gave information on how to develop a
multilingual website, and how to code the "world's writing".
Bill Dunlap, founder of Euro-Marketing Associates, a company based in
San Francisco and Paris, launched the international marketing
consultancy Global Reach to help U.S. companies expand their internet
presence into an international framework. This included translating a
website into other languages, actively promoting it, and using local
online banner advertising to increase local website traffic.
Bill Dunlap explained in December 1998: "Promoting your website is at
least as important as creating it, if not more important. You should be
prepared to spend at least as much time and money in promoting your
website as you did in creating it in the first place. With the Global
Reach program, you can have it promoted in countries where English is
not spoken, and achieve a wider audience... and more sales. There are
many good reasons for taking the online international market seriously.
Global Reach is a means for you to extend your website to many
countries, speak to online visitors in their own language and reach
online markets there. (...)
Since 1981, when my professional life started, I've been involved with
bringing American companies in Europe. This is very much an issue of
language, since the products and their marketing have to be in the
languages of Europe in order for them to be visible here. Since the web
became popular in 1995 or so, I've turned these activities to their
online dimension, and have come to champion European e-commerce among
my fellow American compatriots. Most lately at Internet World in New
York, I spoke about European e-commerce and how to use a website to
address the various markets in Europe."
Bill added in July 1999: "After a website's home page is available in
several languages, the next step is the development of content in each
language. A webmaster will notice which languages draw more visitors
(and sales) than others, and these are the places to start in a
multilingual web promotion campaign. At the same time, it is always
good to increase the number of languages available on a website: just a
home page translated into other languages would do for a start, before
it becomes obvious that more should be done to develop a certain
language branch on a website."
The World Wide Web Consortium (W3C) was founded in October 1994 to
develop interoperable technologies (specifications, guidelines,
software, and tools) for the web, for example specifications for markup
languages (HTML, XML, and others), and to act as a forum for
information, commerce, communication and collective understanding. In
1998, the section Internationalization/Localization gave a definition
of protocols used for internationalization/localization: HTML, base
character set, new tags and attributes, HTTP, language negotiation,
URLs & other identifiers including non-ASCII characters, etc. It also
offered some help with creating a multilingual website.
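Language negotiation, one of the items listed above, can be sketched as
follows. In HTTP, a browser may send an Accept-Language header listing
the languages its user prefers, each with an optional weight "q", and
the server picks the best version it has. The code below is a
simplified sketch and not a complete implementation of the standard
matching rules; the function names parse_accept_language and
choose_language, and the default of English, are assumptions made for
this example.

```python
from typing import Iterable, List, Tuple


def parse_accept_language(header: str) -> List[Tuple[str, float]]:
    """Split an Accept-Language value into (tag, weight) pairs,
    ordered from most preferred to least preferred."""
    prefs = []
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        tag, _, params = part.partition(";")
        weight = 1.0  # a tag without "q=" counts as fully preferred
        params = params.strip()
        if params.startswith("q="):
            try:
                weight = float(params[2:])
            except ValueError:
                weight = 0.0  # treat an unreadable weight as "not wanted"
        prefs.append((tag.strip().lower(), weight))
    # The sort is stable, so equal weights keep the order of the header.
    prefs.sort(key=lambda pair: pair[1], reverse=True)
    return prefs


def choose_language(header: str, available: Iterable[str],
                    default: str = "en") -> str:
    """Return the available language that best matches the header."""
    offered = {lang.lower() for lang in available}
    for tag, weight in parse_accept_language(header):
        if weight <= 0:
            continue
        if tag in offered:
            return tag
        primary = tag.split("-")[0]  # "fr-ca" falls back to "fr"
        if primary in offered:
            return primary
    return default


if __name__ == "__main__":
    site_languages = ["en", "fr", "de"]
    print(choose_language("fr-CA,fr;q=0.8,en;q=0.5", site_languages))
```

For a site offered in English, French and German, a visitor whose
browser sends "fr-CA,fr;q=0.8,en;q=0.5" would be served the French
version, since no specifically Canadian French version exists.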
The Localisation Industry Standards Association (LISA) was created in
the mid-1990s as a forum for "software publishers, hardware
manufacturers, localization service vendors, and an increasing number
of companies from related IT sectors." LISA has defined its mission as
"promoting the localization and internationalization industry and
providing a mechanism and services to enable companies to exchange and
share information on the development of processes, tools, technologies
and business models connected with localization, internationalization
and related topics". Its website was first housed and maintained by the
University of Geneva, Switzerland.
Launched in January 1999 by the European Commission, the website
HLTCentral (HLT: Human Language Technologies) gave a short definition
of language engineering: "Through language engineering we can find ways
of living comfortably with technology. Our knowledge of language can be
used to develop systems that recognize speech and writing, understand
text well enough to select information, translate between different
languages, and generate speech as well as the printed word. By
applying such technologies we have the ability to extend the current
limits of our use of language. Language enabled products will become an
essential and integral part of everyday life."
MACHINE TRANSLATION
= [Quote]
Tim McKenna is an author who thinks and writes about the complexity of
truth in a world of flux. He wrote in October 2000: "When software gets
good enough for people to chat or talk on the web in real time in
different languages, then we will see a whole new world appear before
us. Scientists, political activists, businesses and many more groups
will be able to communicate immediately without having to go through
mediators or translators."
= A definition
Machine translation (MT) can be defined as the automated process of
translating a text from one language to another language. MT analyzes
the text in the source language and automatically generates the
corresponding text in the target language. Because it involves no human
intervention during the translation process, MT differs from
computer-assisted translation (CAT), which involves some interaction
between the translator and the computer.
As explained on the website of SYSTRAN, a company specializing in
translation software, "machine translation software translates one
natural language into another natural language. MT takes into account
the grammatical structure of each language and uses rules to transfer
the grammatical structure of the source language (text to be
translated) into the target language (translated text). MT cannot
replace a human translator, nor is it intended to."
The website of the European Association for Machine Translation (EAMT)
gives the following definition: "Machine translation (MT) is the
application of computers to the task of translating texts from one
natural language to another. One of the very earliest pursuits in
computer science, MT has proved to be an elusive goal, but today a
number of systems are available which produce output which, if not
perfect, is of sufficient quality to be useful for certain specific
applications, usually in the domain of technical documentation. In
addition, translation software packages which are designed primarily to
assist the human translator in the production of translations are
enjoying increasing popularity within professional translation
organizations."
Machine translation is the earliest type of natural language
processing, as stated on the website of Globalink, a company offering
language translation software and services: "From the very beginning,
machine translation (MT) and natural language processing (NLP) have
gone hand-in-hand with the evolution of modern computational
technology. The development of the first general-purpose programmable
computers during World War II was driven and accelerated by Allied
cryptographic efforts to crack the German Enigma machine and other
wartime codes. Following the war, the translation and analysis of
natural language text provided a testbed for the newly emerging field
of Information Theory.
During the 1950s, research on Automatic Translation (known today as
Machine Translation, or 'MT') took form in the sense of literal
translation, more commonly known as word-for-word translations, without
the use of any linguistic rules. The Russian project initiated at
Georgetown University in the early 1950s represented the first
systematic attempt to create a demonstrable machine translation system.
Throughout the decade and into the 1960s, a number of similar
university and government-funded research efforts took place in the
United States and Europe. At the same time, rapid developments in the
field of Theoretical Linguistics, culminating in the publication of
Noam Chomsky's 'Aspects of the Theory of Syntax' (1965), revolutionized
the framework for the discussion and understanding of the phonology,
morphology, syntax and semantics of human language.
In 1966, the U.S. government-issued ALPAC (Automatic Language
Processing Advisory Committee) report offered a prematurely negative
assessment of the value and prospects of practical machine translation
systems, effectively putting an end to funding and experimentation in
the field for the next decade. It was not until the late 1970s, with
the growth of computing and language technology, that serious efforts
began once again. This period of renewed interest also saw the
development of the Transfer model of machine translation and the
emergence of the first commercial MT systems. While commercial ventures
such as SYSTRAN and METAL began to demonstrate the viability, utility
and demand for machine translation, these mainframe-bound systems also
illustrated many of the problems in bringing MT products and services
to market. High development cost, labor-intensive lexicography and
linguistic implementation, slow progress in developing new language
pairs, inaccessibility to the average user, and inability to scale
easily to new platforms are all characteristics of these second-
generation systems."
As explained in August 1998 by Eduard Hovy, head of the Natural
Language Group at USC/ISI (University of Southern
California/Information Sciences Institute), machine translation implies
"language-related applications/functionalities that are not
translation, such as information retrieval (IR) and automated text
summarization (SUM). You would not be able to find anything on the Web
without IR! -- all the search engines (AltaVista, Yahoo!, etc.) are
built upon IR technology. Similarly, though much newer, it is likely
that many people will soon be using automated summarizers to condense
(or at least, to extract the major contents of) single (long) documents
or lots of (any length) ones together."
= Experiences
In December 1997, AltaVista, a leading search engine, was the first to
launch free translation software with Babel Fish, also called
AltaVista Translation, which could translate webpages (up to three
pages at the same time) from English into French, German, Italian,
Portuguese or Spanish, and vice versa. The software was developed by
SYSTRAN (an acronym for System Translation), a company specializing in
machine translation software. SYSTRAN's headquarters are located in
Soisy-sous-Montmorency, near Paris, France. Sales, marketing, and
research and development are based in its subsidiary in La Jolla,
California.
This initiative was followed by other translation software developed by
Alis Technologies, Globalink, Lernout & Hauspie, and Softissimo, with
free and/or paid versions on the web.
Based in Montreal, Quebec, Alis Technologies has specialized in
development and marketing of language handling solutions and services,
particularly language implementation in the information technology
industry. Alis Translation Solutions (ATS) has offered applications in
a number of languages, and tools and services to improve the quality of
translations. Language Technology Solutions (LTS) has marketed advanced
tools and services for language engineering and information technology
(90 languages covered).
Based in Ieper, Belgium, and Burlington, Massachusetts, Lernout &
Hauspie (L&H) was a leader in advanced speech technology for commercial
applications and products, with four core technologies: automatic
speech recognition (ASR), text-to-speech (TTS), text-to-text (TTT), and
digital speech compression (DSC). Its ASR, TTS and DSC technologies
were licensed to companies in telecommunications, computers and
multimedia, consumer electronics and automotive electronics. Its TTT
translation services were provided to IT companies, and vertical and
automation markets. The Machine Translation Group created by Lernout &
Hauspie included L&H Language Technology, AppTek, AILogic, NeocorTech,
and Globalink. Lernout & Hauspie was later bought by Nuance
Communications.
Globalink, a company created in 1990 in the U.S., focused on language
translation software and services, i.e. customized translation
solutions built around software products, online options, and
professional translation services. The software products were available
in Spanish, French, Portuguese, German, Italian and English, for
individuals, small businesses, multinational corporations and
governments, from a stand-alone product giving a fast draft translation
to a full system managing professional translations.
As explained on the company website in 1998, "with Globalink's
translation applications, the computer uses three sets of data: the
input text, the translation program and permanent knowledge sources
(containing a dictionary of words and phrases of the source language),
and information about the concepts evoked by the dictionary and rules
for sentence development. These rules are in the form of linguistic
rules for syntax and grammar, and some are algorithms governing verb
conjugation, syntax adjustment, gender and number agreement and word
re-ordering. Once the user has selected the text and set the machine
translation process in motion, the program begins to match words of the
input text with those stored in its dictionary. Once a match is found,
the application brings up a complete record that includes information
on possible meanings of the word and its contextual relationship to
other words that occur in the same sentence. The time required for the
translation depends on the length of the text. A three-page, 750-word
document takes about three minutes to render a first draft
translation."
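The lookup-and-rules process described in this quote can be sketched roughly as follows. This is only a toy illustration in Python, not Globalink's actual software: the four-word lexicon, the record fields ("pos", "fr") and the single adjective/noun re-ordering rule are all assumptions invented for the example.

```python
# Toy sketch of dictionary-driven translation (English to French).
# LEXICON, its record fields, and the re-ordering rule are hypothetical.

LEXICON = {
    "the": {"pos": "det", "fr": "le"},
    "black": {"pos": "adj", "fr": "noir"},
    "cat": {"pos": "noun", "fr": "chat"},
    "sleeps": {"pos": "verb", "fr": "dort"},
}


def translate(sentence):
    """Return a rough word-by-word draft with adjective/noun re-ordering."""
    # Step 1: match each input word against the dictionary.
    records = []
    for word in sentence.lower().split():
        record = LEXICON.get(word)
        if record is None:
            # Unknown words are passed through unchanged.
            record = {"pos": "unknown", "fr": word}
        records.append(record)

    # Step 2: apply one re-ordering rule: English adjective + noun
    # becomes French noun + adjective.
    i = 0
    while i < len(records) - 1:
        if records[i]["pos"] == "adj" and records[i + 1]["pos"] == "noun":
            records[i], records[i + 1] = records[i + 1], records[i]
            i += 2
        else:
            i += 1

    # Step 3: output the target-language words as a first draft.
    return " ".join(r["fr"] for r in records)


print(translate("The black cat sleeps"))
```

A real system of this kind would need far larger dictionaries and many more rules (conjugation, gender and number agreement), which is why the quote speaks of a "first draft" rather than a finished translation.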
At the headquarters of the World Health Organization (WHO) in Geneva,
Switzerland, the Computer-assisted Translation and Terminology Unit
(CTT) has been a pioneer since 1997 in assessing technical options for
using computer-assisted translation (CAT) systems based on translation
memory (TM). With such systems, translators can access previous
translations from portions of the text; accept, reject or modify them;
and add the new translation to the memory, thus enriching it for future
reference. By archiving the daily output, the translator helps in
building an extensive translation memory and in solving a number of
translation issues. Several projects have been under way at the CTT for
electronic document archiving and retrieval, bilingual/multilingual
text alignment, computer-assisted translation, translation memory and
terminology database management, and speech recognition.
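The accept/reject/modify cycle of a translation memory can be sketched as below. This is a minimal illustration, not the CTT's system: the class name, the 0.7 similarity threshold and the sample segments are assumptions, and the matching simply relies on Python's standard difflib module.

```python
import difflib


class TranslationMemory:
    """Hypothetical in-memory store of (source segment, translation) pairs."""

    def __init__(self):
        self.segments = {}

    def add(self, source, target):
        # Archiving a validated translation enriches the memory.
        self.segments[source] = target

    def lookup(self, source, threshold=0.7):
        """Return (stored source, stored target, score) or None."""
        best = None
        best_score = 0.0
        for stored_source, stored_target in self.segments.items():
            score = difflib.SequenceMatcher(
                None, source.lower(), stored_source.lower()
            ).ratio()
            if score > best_score:
                best, best_score = (stored_source, stored_target), score
        if best is not None and best_score >= threshold:
            return best[0], best[1], best_score
        return None


tm = TranslationMemory()
tm.add("Wash your hands before eating.",
       "Lavez-vous les mains avant de manger.")

# A near match is proposed to the translator, who can accept,
# reject or modify it, then store the final version.
proposal = tm.lookup("Wash your hands after eating.")
if proposal is not None:
    print("Suggested:", proposal[1])
    tm.add("Wash your hands after eating.",
           "Lavez-vous les mains après avoir mangé.")
```

The threshold is the main design choice here: set too low, the translator is flooded with irrelevant suggestions; set too high, useful near matches are missed.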
The Pan American Health Organization (PAHO) in Washington, D.C. has
developed its own machine translation software, a joint effort by its
computational linguists, translators, and system programmers. The PAHO
Translation Unit has used SPANAM (Spanish to English) since 1980 and
ENGSPAN (English to Spanish) since 1985, processing over 25 million
words between 1980 and 1998. Staff translators and free-lance
translators post-edit the raw output to produce high-quality
translations with a 30-50% gain in productivity. The software is
available on the LAN (Local Area Network) of PAHO Headquarters, and is
regularly used by the staff of technical and administrative units. The
software is also available in a number of PAHO field offices, and has
been licensed to public and non-profit institutions in the U.S., Latin
America, and Spain. The software was later renamed PAHOMTS, and has
included new language pairs with Portuguese.
= Comments
# Comments from ZDNN
In "Web Embraces Language Translation", an article published in ZDNN
(ZDNetwork News) on 21 July 1998, Martha Stone explained: "Among the
new products in the $10 billion language translation business are
instant translators for websites, chat rooms, email and corporate
intranets. The leading translation firms are mobilizing to seize the
opportunities. Such as:
*SYSTRAN has partnered with AltaVista and reports between 500,000 and
600,000 visitors a day on babelfish.altavista.digital.com, and about 1
million translations per day -- ranging from recipes to complete
webpages. About 15,000 sites link to babelfish, which can translate to
and from French, Italian, German, Spanish and Portuguese. The site
plans to add Japanese soon. 'The popularity is simple. With the
internet, now there is a way to use U.S. content. All of these
contribute to this increasing demand,' said Dimitros Sabatakakis, group
CEO of SYSTRAN, speaking from his Paris home.
*Alis technology powers the Los Angeles Times' soon-to-be launched
language translation feature on its site. Translations will be
available in Spanish and French, and eventually, Japanese. At the click
of a mouse, an entire webpage can be translated into the desired
language.
*Globalink offers a variety of software and web translation
possibilities, including a free email service and software to enable
text in chat rooms to be translated.
But while these so-called 'machine' translations are gaining worldwide
popularity, company execs admit they're not for every situation.
Representatives from Globalink, Alis and SYSTRAN use such phrases as
'not perfect' and 'approximate' when describing the quality of
translations, with the caveat that sentences submitted for translation
should be simple, grammatically accurate and idiom-free. 'The progress
on machine translation is moving at Moore's Law -- every 18 months it's
twice as good,' said Vin Crosbie, a web industry analyst in Greenwich,
Conn. 'It's not perfect, but some [non-English speaking] people don't
realize I'm using translation software.'
With these translations, syntax and word usage suffer, because
dictionary-driven databases can't decipher between homonyms -- for
example, 'light' (as in the sun or light bulb) and 'light' (the
opposite of heavy). Still, human translation would cost between $50 and
$60 per webpage, or about 20 cents per word, SYSTRAN's Sabatakakis
said. While this may be appropriate for static 'corporate information'
pages, the machine translations are free on the web, and often less
than $100 for software, depending on the number of translated languages
and special features."
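The homonym problem raised in this article ('light' as illumination versus 'light' as the opposite of heavy) can be illustrated with a small sketch. It chooses a sense by counting context words shared with a hand-made list of cue words, a much simplified form of dictionary-overlap disambiguation. The sense labels and cue lists are assumptions invented for the example, not data from any of the products mentioned.

```python
import re

# Hypothetical cue words for two senses of the English word "light".
SENSES = {
    "illumination": {"sun", "bulb", "lamp", "bright", "dark", "switch"},
    "not heavy": {"weight", "heavy", "carry", "suitcase", "feather"},
}


def disambiguate(sentence):
    """Pick the sense whose cue words overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    scores = {sense: len(words & cues) for sense, cues in SENSES.items()}
    best = max(scores, key=scores.get)
    # With no cue word in the context, the sketch declines to guess.
    return best if scores[best] > 0 else None


print(disambiguate("Please switch on the light bulb"))
print(disambiguate("This suitcase is light to carry"))
```

A purely dictionary-driven system with no such context check has to pick one translation of 'light' for every sentence, which is the weakness the article describes.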
# Comments from RALI
Although the imminent arrival of a universal translation machine was
announced at the end of the 1940s, machine translation has not yet
produced good translations. Pierre Isabelle and Patrick Andries, two
scientists from the RALI Laboratory (Laboratory for Applied Research in
Computational Linguistics - Laboratoire de Recherche Appliquée en
Linguistique Informatique) in Montreal, Quebec, explain the reasons for
this failure in "La Traduction Automatique, 50 Ans Après" (Machine
Translation, 50 Years Later), an article published in 1998 by
Multimédium, a French-language online magazine: "The ultimate goal of
building a machine capable of competing with a human translator remains
elusive due to slow progress in research. (...) Recent research, based
on large collections of texts called corpora -- using either
statistical or analogical methods -- has promised to reduce the
quantity of manual work required to build a machine translation (MT)
system, but can't promise for sure a significant improvement in the
quality of machine translation. (...) The use of MT will be more or
less restricted to tasks of information assimilation or tasks of text
distribution in restricted sub-languages."
Drawing on ideas that Yehoshua Bar-Hillel expressed in "The State of
Machine Translation", an article published in 1951, Pierre Isabelle and
Patrick Andries define three implementation strategies for machine
translation: (a) a tool for information assimilation that scans
multilingual data and supplies rough translations; (b) "restricted
language" situations, such as the METEO system which, since 1977, has
translated the weather forecasts of the Canadian Ministry of the
Environment; (c) human/machine coupling before, during and after the
machine translation process, which may not save money compared with
traditional translation.
Pierre Isabelle and Patrick Andries favor "a workstation for the human
translator" more than a "robot translator": "Recent research on
probabilistic methods showed it was possible to model efficiently
some simple aspects of the translation relationship between two
texts. For example, methods were set up to calculate the correct
alignment between the text sentences and their translation, that is, to
identify the sentence(s) of the source text corresponding to each
sentence of the translation. Applied on a large scale, these techniques
can use the archives of a translation service to build a translation
memory for recycling fragments from previous translations. Such systems
are already available on the translation market (IBM Translation
Manager II, Trados Translator's Workbench by Trados, RALI TransSearch,
etc.) The latest research focuses on models that can automatically set
up correspondences at a finer level than the sentence level, i.e.
syntagms and words. The results give hope for a range of new tools for
the human translator, including for the study of terminology, for
dictation and translation typing, and for detectors of translation
errors."
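The sentence alignment step mentioned in this quote can be sketched as follows. The sketch does not reproduce RALI's probabilistic models: it swaps in a much cruder length-based heuristic, which assumes that a sentence and its translation have roughly similar character lengths, and uses dynamic programming to choose between one-to-one, one-to-two and two-to-one groupings. The fixed merge penalty of 5 and the sample sentences are assumptions made for the example.

```python
def align(source, target):
    """Align two sentence lists; return [(source indices, target indices)]."""
    inf = float("inf")
    n, m = len(source), len(target)
    # Allowed groupings: (sentences taken from source, from target).
    beads = [(1, 1), (1, 2), (2, 1)]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj in beads:
                if i + di > n or j + dj > m:
                    continue
                src_len = sum(len(s) for s in source[i:i + di])
                tgt_len = sum(len(t) for t in target[j:j + dj])
                # Cost: length mismatch plus a small charge for merging.
                step = abs(src_len - tgt_len)
                if (di, dj) != (1, 1):
                    step += 5
                if cost[i][j] + step < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = cost[i][j] + step
                    back[i + di][j + dj] = (i, j)

    if cost[n][m] == inf:
        raise ValueError("no alignment with the allowed groupings")

    # Follow the back-pointers from the end to recover the groupings.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    pairs.reverse()
    return pairs


english = [
    "The weather is fine today.",
    "Tomorrow it will rain heavily in the north and the wind will be strong.",
]
french = [
    "Il fait beau aujourd'hui.",
    "Demain il pleuvra fortement dans le nord.",
    "Le vent sera fort.",
]
print(align(english, french))
```

Once sentences are paired in this way, each pair can be stored as an entry in a translation memory or searched as a bilingual concordance.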
# Comments from Randy Hobler
In September 1998, Randy Hobler was a consultant in internet marketing
at Globalink, after working for IBM, Johnson & Johnson, Burroughs
Wellcome, Pepsi, and Heublein. He wrote in an email interview: "We are
rapidly reaching the point where highly accurate machine translation of
text and speech will be so common as to be embedded in computer
platforms, and even in chips in various ways. At that point, and as the
growth of the web slows, the accuracy of language translation hits 98%
plus, and the saturation of language pairs has covered the vast
majority of the market, language transparency (any-language-to-any-
language communication) will be too limiting a vision for those selling
this technology. The next development will be 'transcultural,
transnational transparency', in which other aspects of human
communication, commerce and transactions beyond language alone will
come into play. For example, gesture has meaning, facial movement has
meaning and this varies among societies. The thumb-index finger circle
means 'OK' in the United States. In Argentina, it is an obscene
gesture.
When the inevitable growth of multimedia, multilingual
videoconferencing comes about, it will be necessary to 'visually edit'
gestures on the fly. The MIT (Massachusetts Institute of Technology)
Media Lab, Microsoft and many others are working on computer
recognition of facial expressions, biometric access identification via
the face, etc. It won't be any good for a U.S. business person to be
making a great point in a web-based multilingual video conference to an
Argentinian, having his words translated into perfect Argentinian
Spanish if he makes the 'O' gesture at the same time. Computers can
intercept this kind of thing and edit them on the fly.
There are thousands of ways in which cultures and countries differ, and
most of these are computerizable to change as one goes from one culture
to the other. They include laws, customs, business practices, ethics,
currency conversions, clothing size differences, metric versus English
system differences, etc. Enterprising companies will be capturing and
programming these differences and selling products and services to help
the peoples of the world communicate better. Once this kind of thing is
widespread, it will truly contribute to international understanding."
= Machine translation R&D
Here is an overview of the work of four research centers, in Quebec
(RALI Laboratory), California (Natural Language Group), Switzerland
(ISSCO) and Japan (UNDL Foundation).
# RALI Laboratory
In Montreal, Quebec, the RALI Laboratory (Laboratory of Applied
Research in Computational Linguistics - Laboratoire de Recherche
Appliquée en Linguistique Informatique) has worked in automatic text
alignment, automatic text generation, automatic reaccentuation,
language identification, and finite state transducers. RALI produces
the "TransX family" of what it calls "a new generation" of translation
support tools (TransType, TransTalk, TransCheck, and TransSearch),
which are based on probabilistic translation models that automatically
calculate correspondences between the text produced by a translator and
the original text from the source language.
As explained on RALI's website in 1998: "(a) TransType speeds up the
keying-in of a translation by anticipating a translator's choices and
criticizing them when appropriate. In proposing its suggestions,
TransType takes into account both the source text and the partial
translation that the translator has already produced. (b) TransTalk is
an automatic dictation system that makes use of a probabilistic
translation model in order to improve the performance of its voice
recognition model. (c) TransCheck automatically detects certain types
of translation errors by verifying that the correspondences between the
segments of a draft and the segments of the source text respect well-
known properties of a good translation. (d) TransSearch allows
translators to search databases of pre-existing translations in order
to find ready-made solutions to all sorts of translation problems. In
order to produce the required databases, the translations and the
source language texts must first be aligned."
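The idea behind TransType, proposing completions from both the source text and what the translator has already typed, can be sketched in a few lines. This is not RALI's implementation, which relies on probabilistic translation models: the sketch below uses a tiny hand-written bilingual lexicon, and the names LEXICON and suggest are hypothetical.

```python
# Hypothetical bilingual lexicon: English word -> candidate French words.
LEXICON = {
    "government": ["gouvernement"],
    "policy": ["politique", "police"],
    "new": ["nouveau", "nouvelle", "neuf"],
}


def suggest(source_sentence, typed_prefix):
    """Suggest target words that start with the typed prefix and
    translate some word of the source sentence."""
    prefix = typed_prefix.lower()
    candidates = []
    for word in source_sentence.lower().split():
        for translation in LEXICON.get(word, []):
            if translation.startswith(prefix) and translation not in candidates:
                candidates.append(translation)
    return candidates


print(suggest("The new government policy", "po"))
```

Restricting suggestions to translations of words in the source sentence is what distinguishes this from ordinary word completion, which would know nothing about the text being translated.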
# Natural Language Group
The Natural Language Group (NLG) at the Information Sciences Institute
(ISI) of the University of Southern California (USC) has been involved
in various aspects of computational/natural language processing:
machine translation, automated text summarization, multilingual verb
access and text management, development of large concept taxonomies
(ontologies), discourse and text generation, construction of large
lexicons for various languages, and multimedia communication.
Eduard Hovy, head of the Natural Language Group, explained in August
1998: "People will write in their own language for several reasons --
convenience, secrecy, and local applicability -- but that does not mean
that other people are not interested in reading what they have to say!
This is especially true for companies involved in technology watch
(say, a computer company that wants to know, daily, all the Japanese
newspaper and other articles that pertain to what they make) or some
Government Intelligence agencies (the people who provide the most up-
to-date information for use by your government officials in making
policy, etc.). One of the main problems faced by these kinds of people
is the flood of information, so they tend to hire 'weak' bilinguals who
can rapidly scan incoming text and throw out what is not relevant,
giving the relevant stuff to professional translators. Obviously, a
combination of SUM (automated text summarization) and MT (machine
translation) will help here; since MT is slow, it helps if you can do
SUM in the foreign language, and then just do a quick and dirty MT on
the result, allowing either a human or an automated IR-based text
classifier to decide whether to keep or reject the article. For these
kinds of reasons, the U.S. Government has over the past five years been
funding research in MT, SUM, and IR (information retrieval), and is
interested in starting a new program of research in Multilingual IR.
This way you will be able to one day open Netscape or Explorer or the
like, type in your query in (say) English, and have the engine return
texts in *all* the languages of the world. You will have them clustered
by subarea, summarized by cluster, and the foreign summaries
translated, all the kinds of things that you would like to have."
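The triage pipeline Eduard Hovy outlines (summarize first, translate only the summary, then decide whether to keep the article) can be sketched as below. The summarizer is a simple word-frequency extractor, the stopword list is an assumption, and the translate parameter is a placeholder: by default it returns the text unchanged, where a real pipeline would call a machine translation engine.

```python
import re
from collections import Counter

# Small hypothetical stopword list for the example.
STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are",
             "for", "on", "it"}


def content_words(text):
    return [w for w in re.findall(r"[a-z']+", text.lower())
            if w not in STOPWORDS]


def summarize(text, max_sentences=1):
    """Keep the sentences whose words are most frequent in the text."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text)
                 if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        return sum(freq[w] for w in tokens) / max(len(tokens), 1)

    ranked = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Return the selected sentences in their original order.
    return [s for s in sentences if s in ranked]


def triage(article, keywords, translate=lambda s: s):
    """Summarize, roughly translate the summary, then filter by keyword."""
    summary = summarize(article, max_sentences=1)
    draft = " ".join(translate(s) for s in summary).lower()
    return any(k.lower() in draft for k in keywords)


article = ("Chip makers announced new chips today. The weather was mild. "
           "New chips use less power than older chips.")
print(summarize(article))
print(triage(article, ["chips"]))
```

Summarizing before translating means the slow translation step only runs on a sentence or two per article, which is the saving the quote points to.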
Eduard Hovy added in August 1999: "Over the past 12 months I have been
contacted by a surprising number of new information technology (IT)
companies and startups. Most of them plan to offer some variant of
electronic commerce (online shopping, bartering, information gathering,
etc.). Given the rather poor performance of current non-research level
natural language processing technology (when is the last time you
actually easily and accurately found a correct answer to a question on
the web, without having to spend too much time sifting through
irrelevant information?), this is a bit surprising. But I think
everyone feels that the new developments in automated text
summarization, question analysis, and so on, are going to make a
significant difference. I hope so!--but the level of performance is not
available yet.
It seems to me that we will not get a big breakthrough, but we will get
a somewhat acceptable level of performance, and then see slow but sure
incremental improvement. The reason is that it is very hard to make
your computer really 'understand' what you mean -- this requires us to
build into the computer a network of 'concepts' and their
interrelationships that (at some level) mirror those in your own mind,
at least in the subject areas of interest. The surface (word) level is
not adequate -- when you type in 'capital of Switzerland', current
systems have no way of knowing whether you mean 'capital city' or
'financial capital'. Yet the vast majority of people would choose the
former reading, based on phrasing and on knowledge about what kinds of
things one is likely to ask the web, and in what way. Several projects
are now building, or proposing to build, such large 'concept' networks.
This is not something one can do in two years, and not something that
has a correct result. We have to develop both the network and the
techniques for building it semi-automatically and self-adaptively. This
is a big challenge."
Eduard Hovy added in September 2000: "I see a continued increase in
small companies using language technology in one way or another: either
to provide search, or translation, or reports, or some other
communication function. The number of niches in which language
technology can be applied continues to surprise me: from stock reports
and updates to business-to-business communications to marketing...
With regard to research, the main breakthrough I see was led by a
colleague at ISI (I am proud to say), Kevin Knight. A team of
scientists and students last summer at Johns Hopkins University in
Maryland developed a faster and otherwise improved version of a method
originally developed (and kept proprietary) by IBM about 12 years ago.
This method allows one to create a machine translation (MT) system
automatically, as long as one gives it enough bilingual text.
Essentially the method finds all correspondences in words and word
positions across the two languages and then builds up large tables of
rules for what gets translated to what, and how it is phrased.
Although the output quality is still low -- no-one would consider this
a final product, and no-one would use the translated output as is --
the team built a (low-quality) Chinese-to-English MT system in 24
hours. That is a phenomenal feat -- this has never been done before.
(Of course, say the critics: you need something like 3 million sentence
pairs, which you can only get from the parliaments of Canada, Hong
Kong, or other bilingual countries; and of course, they say, the
quality is low. But the fact is that more bilingual and semi-equivalent
text is becoming available online every day, and the quality will keep
improving to at least the current levels of MT engines built by hand.
Of that I am certain.)
Other developments are less spectacular. There's a steady improvement
in the performance of systems that can decide whether an ambiguous word
such as 'bat' means 'flying mammal' or 'sports tool' or 'to hit'; there
is solid work on cross-language information retrieval (which you will
soon see in being able to find Chinese and French documents on the web
even though you type in English-only queries), and there is some rather
rapid development of systems that answer simple questions automatically
(rather like the popular web system AskJeeves, but this time done by
computers, not humans). These systems refer to a large collection of
text to find 'factoids' (not opinions or causes or chains of events) in
response to questions such as 'what is the capital of Uganda?' or 'how
old is President Clinton?' or 'who invented the xerox process?', and
they do so rather better than I had expected."
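The statistical approach Eduard Hovy describes, learning word correspondences automatically from bilingual text, can be illustrated with a very small sketch. It assumes that the IBM method he refers to belongs to the family of statistical word-alignment models, and implements only the simplest of them, commonly called IBM Model 1, trained with expectation-maximization. It ignores word positions and phrasing, and it is in no way the system built at Johns Hopkins. The three sentence pairs are toy data.

```python
def train_model1(pairs, iterations=10):
    """Estimate t[(f, e)] = P(foreign word f | English word e) by EM.

    pairs is a list of (english_tokens, foreign_tokens).
    """
    foreign_vocab = {f for _, fs in pairs for f in fs}
    uniform = 1.0 / len(foreign_vocab)

    # Start with uniform probabilities for all co-occurring word pairs.
    t = {}
    for es, fs in pairs:
        for e in es:
            for f in fs:
                t[(f, e)] = uniform

    for _ in range(iterations):
        count = {}
        total = {}
        # E-step: share each foreign word among the English words
        # of its sentence, in proportion to the current probabilities.
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] = count.get((f, e), 0.0) + frac
                    total[e] = total.get(e, 0.0) + frac
        # M-step: re-estimate the probabilities from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


pairs = [
    ("the house".split(), "das haus".split()),
    ("the book".split(), "das buch".split()),
    ("a book".split(), "ein buch".split()),
]
table = train_model1(pairs)
for english_word in ["the", "house", "book", "a"]:
    best = max(["das", "haus", "buch", "ein"],
               key=lambda f: table.get((f, english_word), 0.0))
    print(english_word, "->", best)
```

No dictionary is supplied: the correspondences emerge only from which words keep appearing together across sentence pairs, which is why the amount of bilingual text matters so much.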
# ISSCO
In Geneva, Switzerland, ISSCO (Dalle Molle Institute for Semantic and
Cognitive Studies - Institut Dalle Molle pour les Études Sémantiques et
Cognitives) is a research laboratory conducting basic and applied
research in computational linguistics (CL) and artificial intelligence
(AI), for a number of Swiss and European research projects. The
University of Geneva has provided administrative support and
infrastructure. Research is funded with grants and contracts with
public and private bodies.
Created by the Foundation Dalle Molle in 1972 to conduct research in
cognition and semantics, ISSCO has come to specialize in natural
language processing, including multilingual language processing, in a
number of areas: machine translation, linguistic environments,
multilingual generation, discourse processing, data collection, etc.
ISSCO is multi-disciplinary and multi-national. As explained on its
website in 1998, "its staff and its visitors [are drawn] from the
disciplines of computer science, linguistics, mathematics, psychology
and philosophy. The long-term staff of the Institute is relatively
small in number; with a much larger number of visitors coming for stays
ranging from a month to two years. This ensures a continual exchange of
ideas and encourages flexibility of approach amongst those associated
with the Institute."
# UNDL Foundation
The UNL (universal networking language) project was launched in the
mid-1990s as a major digital metalanguage project by the Institute of
Advanced Studies (IAS) of the United Nations University (UNU) in Tokyo,
Japan. As explained on the bilingual (English, Japanese) website in
1998: "UNL is a language that -- with its companion 'enconverter' and
'deconverter' software -- enables communication among peoples of
differing native languages. It will reside, as a plug-in for popular
web browsers, on the internet, and will be compatible with standard
network servers. The technology will be shared among the member states
of the United Nations. Any person with access to the internet will be
able to 'enconvert' text from any native language of a member state
into UNL. Just as easily, any UNL text can be 'deconverted' from UNL
into native languages. United Nations University's UNL Center will work
with its partners to create and promote the UNL software, which will be
compatible with popular network servers and computing platforms."
In 2000, 120 researchers worldwide were working on a multilingual
project in 16 languages (Arabic, Brazilian Portuguese, Chinese,
English, French, German, Hindi, Indonesian, Italian, Japanese, Latvian,
Mongolian, Russian, Spanish, Swahili, and Thai). The UNDL Foundation (UNDL:
Universal Networking Digital Language) was founded in January 2001 to
develop and promote the UNL project.
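The 'enconvert' and 'deconvert' steps amount to translating through a pivot representation. The sketch below illustrates the principle only: the concept identifiers such as C_WATER are invented for the example and are not real UNL syntax, and the three small word tables are assumptions. It also counts how many converters a pivot needs compared with direct translators for every ordered language pair.

```python
# Hypothetical concept identifiers stand in for the pivot language;
# they are not real UNL expressions.
TO_PIVOT = {
    "en": {"water": "C_WATER", "book": "C_BOOK"},
    "fr": {"eau": "C_WATER", "livre": "C_BOOK"},
    "es": {"agua": "C_WATER", "libro": "C_BOOK"},
}

# Reverse tables: concept -> word, one per language.
FROM_PIVOT = {
    lang: {concept: word for word, concept in table.items()}
    for lang, table in TO_PIVOT.items()
}


def enconvert(word, lang):
    """Native language -> pivot."""
    return TO_PIVOT[lang][word]


def deconvert(concept, lang):
    """Pivot -> native language."""
    return FROM_PIVOT[lang][concept]


def translate(word, source, target):
    return deconvert(enconvert(word, source), target)


def converter_counts(n_languages):
    """Direct translators for all ordered pairs versus pivot converters."""
    direct = n_languages * (n_languages - 1)
    pivot = 2 * n_languages
    return direct, pivot


print(translate("agua", "es", "fr"))
print(converter_counts(16))
```

For 16 languages, a pivot needs 32 converters where direct translation between every ordered pair would need 240 separate systems.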
CHRONOLOGY
[Each line begins with the year or the year/month.]
1968: ASCII is the first character set encoding.
1971: Project Gutenberg is the first digital library.
1974: The internet takes off.
1990: The web is invented by Tim Berners-Lee.
1991/01: Unicode is a universal character set encoding for all languages.
1993/11: Mosaic is the first web browser.
1994/05: The Human-Languages Page is a catalog of language-related internet resources.
1994/10: The World Wide Web Consortium will deal with internationalization and localization.
1994: Travlang is dedicated to both travel and languages.
1995/12: The Kotoba Home Page deals with language issues using our keyboard.
1995: The Internet Dictionary Project works on creating free translating dictionaries.
1995: NetGlos is a multilingual glossary of internet terminology.
1995: Global Reach is a virtual consultancy stemming from Euro-Marketing Associates.
1995: LISA is the localization industry standards association.
1995: "The Ethnologue: Languages of the World" offers a free online version.
1996/04: OneLook Dictionaries is a fast finder in online dictionaries.
1997/01: UNL (universal networking language) is a digital metalanguage project.
1997/12: AltaVista launches AltaVista Translation, also called Babel Fish.
1997: The Logos Dictionary goes online for free.
1999/12: Britannica.com is the first major English-language online encyclopedia.
1999/12: WebEncyclo is the first major French-language online encyclopedia.
1999: WordReference.com offers free online bilingual translating dictionaries.
2000/02: yourDictionary.com is a major language portal.
2000/07: Non-English-speaking internet users reach 50%.
2001/01: Wikipedia is a major free multilingual cooperative encyclopedia.
2001/01: The UNDL Foundation develops UNL, a digital metalanguage project.
2001/04: The Human-Languages Page becomes the iLoveLanguages portal.
2004/01: Project Gutenberg Europe is launched as a multilingual project.
2007/03: IATE is the new terminological database of the European Union.
2009: "The Ethnologue" launches its 16th edition as an encyclopedic reference work.
WEBSITES
Alis Technologies: http://www.alis.com/
Aquarius.net: Directory of Localization Experts: http://www.aquarius.net/
ASCII Table: http://www.asciitable.com/
Asia-Pacific Association for Machine Translation (AAMT): http://www.aamt.info/
Association for Computational Linguistics (ACL): http://www.aclweb.org/
Association for Machine Translation in the Americas (AMTA): http://www.amtaweb.org/
CALL@Hull: http://www.fredriley.org.uk/call/
ELRA (European Language Resources Association): http://www.elra.info/
ELSNET (European Network of Excellence in Human Language Technologies): http://www.elsnet.org/
Encyclopaedia Britannica Online: http://www.britannica.com/
Encyclopaedia Universalis: http://www.universalis-edu.com/
Ethnologue: http://www.ethnologue.com/
Ethnologue: Endangered Languages: http://www.ethnologue.com/nearly_extinct.asp
EUROCALL (European Association for Computer-Assisted Language Learning): http://www.eurocall-languages.org/
European Association for Machine Translation (EAMT): http://www.eamt.org/
European Bureau for Lesser-Used Languages (EBLUL): http://www.eblul.org/
European Commission: Languages of Europe: http://ec.europa.eu/education/languages/languages-of-europe/
European Minority Languages (list of the Institute Sabhal Mòr Ostaig): http://www.smo.uhi.ac.uk/saoghal/mion-chanain/en/
Google Translate: http://translate.google.com/
Grand Dictionnaire Terminologique (GDT): http://www.granddictionnaire.com/
IATE: InterActive Terminology for Europe: http://iate.europa.eu/
ILOTERM (ILO: International Labor Organization): http://www.ilo.org/iloterm/
iLoveLanguages: http://www.ilovelanguages.com/
International Committee on Computational Linguistics (ICCL): http://nlp.shef.ac.uk/iccl/
Internet Dictionary Project (IDP): http://www.june29.com/IDP/
Internet Society (ISOC): http://www.isoc.org/
Laboratoire CLIPS (Communication Langagière et Interaction Personne-Système): http://www-clips.imag.fr/
Laboratoire CLIPS: GETA (Groupe d'Étude pour la Traduction Automatique): http://www-clips.imag.fr/geta/
LINGUIST List (The): http://linguistlist.org/
Localization Industry Standards Association (LISA): http://www.lisa.org/
Logos: Multilingual Translation Portal: http://www.logos.it/
MAITS (Multilingual Application Interface for Telematic Services): http://wwwold.dkuug.dk/maits/
Merriam-Webster Online: http://www.merriam-webster.com/
Natural Language Group (NLG) at USC/ISI: http://www.isi.edu/natural-language/
Nuance: http://www.nuance.com/
OneLook Dictionary Search: http://www.onelook.com/
Oxford English Dictionary (OED): http://www.oed.com/
Oxford Reference Online (ORO): http://www.oxfordreference.com/
PAHOMTS (PAHO: Pan American Health Organization): http://www.paho.org/english/am/gsp/tr/machine_trans.htm
Palo Alto Research Center (PARC): http://www.parc.com/
Palo Alto Research Center (PARC): Natural Language Processing: http://www.parc.com/work/focus-area/NLP/
RALI (Recherche Appliquée en Linguistique Informatique): http://www-rali.iro.umontreal.ca/
Reverso: Free Online Translator: http://www.reverso.net/
SDL: http://www.sdl.com/
SDL: FreeTranslation.com: http://www.freetranslation.com/
SDL Trados: http://www.trados.com/
Softissimo: http://www.softissimo.com/
SYSTRAN: http://www.systranlinks.com/
SYSTRANet: Free Online Translator: http://www.systranet.com/
TEI: Text Encoding Initiative: http://www.tei-c.org/index.xml
TERMITE (Terminology of Telecommunications): http://www.itu.int/terminology/index.html
*tmx Vokabeltrainer: http://www.tmx.de/
Transparent Language: http://www.transparent.com/
TransPerfect: http://www.transperfect.com/
Travlang: http://www.travlang.com/
Travlang's Translating Dictionaries: http://dictionaries.travlang.com/
UNDL (Universal Networking Digital Language) Foundation: http://www.undl.org/
Unicode: http://www.unicode.org/
Yahoo! Babel Fish: http://babelfish.yahoo.com/
YourDictionary.com: http://www.yourdictionary.com/
YourDictionary.com: Endangered Languages: http://www.yourdictionary.com/elr/index.html
W3C: World Wide Web Consortium: http://www.w3.org/
W3C Internationalization Activity: http://www.w3.org/International/
WELL (Web Enhanced Language Learning): http://www.well.ac.uk/
Wordfast: http://www.wordfast.org/
Xerox XRCE (Xerox Research Centre Europe): http://www.xrce.xerox.com/
Xerox XRCE: Cross-Language Technologies: http://www.xrce.xerox.com/competencies/cross-language/
End of Project Gutenberg's The Internet and Languages, by Marie Lebert
*** END OF THE PROJECT GUTENBERG EBOOK 30422 ***