An ancient language has defied translation for 100 years. Can AI crack the code?
Machine learning can translate between two known languages, but could it ever decipher those that remain a mystery to us?
Jiaming Luo grew up in mainland China thinking about neglected languages. When he was younger, he wondered why the different languages his mother and father spoke were often lumped together as Chinese “dialects.”
When he became a computer science doctoral student at MIT in 2015, his interest collided with his advisor’s long-standing fascination with ancient scripts. After all, what could be more neglected — or, to use Luo’s more academic term, “lower resourced” — than a long-lost language, left to us as enigmatic symbols on scattered fragments? “I think of these languages as mysteries,” Luo told Rest of World over Zoom. “That’s definitely what attracts me to them.”
In 2019, Luo made headlines when, working with a team of fellow MIT researchers, he brought his machine-learning expertise to the decipherment of ancient scripts. He and his colleagues developed an algorithm informed by patterns in how languages change over time. They fed their algorithm words in a lost language and in a known related language; its job was to align words from the lost language with their counterparts in the known language. Crucially, the same algorithm could be applied to different language pairs.
Luo and his colleagues tested their model on two ancient scripts that had already been deciphered: Ugaritic, which is related to Hebrew, and Linear B, which was first discovered among Bronze Age–era ruins on the Greek island of Crete. It took professional and amateur epigraphists (people who study ancient written matter) nearly six decades of mental wrangling to decode Linear B. Officially, 30-year-old British architect Michael Ventris is primarily credited with its decipherment, although the private efforts of classicist Alice Kober laid the groundwork for his breakthrough. Sitting night after night at her dining table in Brooklyn, New York, Kober compiled a makeshift database of Linear B symbols, comprising 180,000 paper slips filed in cigarette boxes, and used those to draw important conclusions about the nature of the script. She died in 1950, two years before Ventris cracked the code. Linear B is now recognized as the earliest known written form of Greek.
Luo and his team wanted to see if their machine-learning model could get to the same answer, but faster. The algorithm yielded what was called “remarkable accuracy”: it was able to correctly translate 67.3% of Linear B’s words into their modern-day Greek equivalents. According to Luo, it took between two and three hours to run the algorithm once it had been built, cutting out the days or weeks — or months or years — that it might take to manually test out a theory by translating symbols one by one. The results for Ugaritic showed an improvement on previous attempts at automatic decipherment.
The work raised an intriguing proposition. Could machine learning assist researchers in their quests to crack other, as-yet undeciphered scripts — ones that have so far resisted all attempts at translation? What historical secrets might be unlocked as a result?
British India, 1872-1873. Alexander Cunningham, an English army engineer turned archeological surveyor, clomped about the ruins of a town in Punjab province that locals called Harappa. On the face of it, there wasn’t much to survey: about two decades earlier, engineers working to link the cities of Lahore and Multan had stumbled across the site and used many of the bricks they found — perfectly preserved, fire kilned — as ballast for nearly 100 miles of railway track, blithely unaware they were remnants of one of the world’s oldest civilizations.

Jiaming Luo, a PhD student at the Massachusetts Institute of Technology. Tim Dunk for Rest of World
Cunningham didn’t know this either — the Indus Valley civilization wouldn’t be formally “discovered” until the 1920s — but he knew the site had some historical value. Burrowing through the ruins, he and his team chanced upon stone implements they surmised were used for scraping wood or leather. They gathered shards of ancient pottery and what appeared to be a clay ladle. The most striking discovery, though, was a tiny stone tablet, roughly 1.5 inch by 1.5 inch. “On it is engraved very deeply a bull, without a hump, looking to the right, with two stars under the neck,” Cunningham wrote in his report. “Above the bull there is an inscription in six characters, which are quite unknown to me. They are certainly not Indian letters; and as the bull which accompanies them is without a hump, I conclude that the seal is foreign to India.”
I have a cheap replica of that first seal, bought years ago from a museum gift shop at one of the Indus Valley sites: the animal on it has a thick neck, a lumpen torso, and a single swooping horn. Some people insist it is a unicorn. The inscription scrawled above it resembles a string of hieroglyphics; one character looks like a fish. In the century and a half since the discovery of the first seal, thousands more have been unearthed: 90% of them along the Indus River in modern-day Pakistan, the remainder in India or as far afield as modern-day Iraq.
We know now that these tablets, described by one excavator as “little masterpieces of controlled realism,” are indigenous to the Indian subcontinent; researchers believe they were probably used to close documents and mark packages of goods, which is why they are referred to as seals. In part because of how the symbols in the inscriptions jostle each other at one end, almost as if the inscriber had run out of space, researchers have concluded that the inscriptions are meant to be read right to left. But we still don’t know what they actually say.

A stone stamp-seal found at Harappa in the Indus Valley, modern-day Pakistan’s Punjab and Sindh provinces. The Trustees of the British Museum
This isn’t from a lack of trying. The Indus script is the name given to the collection of some 4,000 excavated inscriptions, comprising between 400 and roughly 700 unique symbols. Scholars often point out that it might be one of the most deciphered scripts in history: more than a hundred attempts have been published since the 1920s. One theory links it to the Rongorongo script of Easter Island, also still undeciphered; another, offered by a German tantric guru claiming to have achieved his solution through meditation, links it to the cuneiform script used to write the Sumerian language.
For some groups in South Asia, the quest to decode the Indus script is almost existential. India and Pakistan, increasingly riven by their respective strains of religious nationalism, have markedly different relationships to their shared ancient past. The Pakistani state, deeply wedded to the idea of itself as a Muslim homeland, largely ignores its pre-Islamic heritage; its Indian counterpart, on the other hand, has taken to scouring history to find justification for the claim that India has always been a Hindu nation.
Up until the discovery of Harappa, the earliest Indians were believed to be people who lived between 1500 and 500 B.C. and composed the Vedas, the Sanskrit texts that form the basis of modern-day Hinduism. The discovery of a civilization of people who lived before the Vedic people upended the story of India. Given that it undermines their claims of indigeneity, proponents of Hindutva — the most mainstream strain of Hindu nationalism — balk at the theory of a pre-Vedic civilization, even as evidence for it accumulates across disciplines, including archaeology, genetics, and linguistics.
The smallest of advances in Indus Valley research, therefore, tends to reverberate far beyond the confines of academia. Attempts to prove that the Indus people worshipped Hindu gods and spoke an earlier form of Sanskrit continue unabated. In 2000, one researcher even digitally distorted an image of an Indus seal to make the animal on it look like a horse, which figures prominently in Sanskrit literature.
Politics aside, it is remarkable how little we know about the original people of the Indus Valley, who at one point constituted nearly 10% of the world’s inhabitants. It is especially galling given how much more we know about their contemporaries, such as the people of the Egyptian and Mesopotamian civilizations. Part of the reason for this is the continued elusiveness of the Indus script.
Putting machines to work on the Indus script is trickier than using them to reverse-engineer Linear B. We don’t have a great deal of information about the Indus script: most crucially, we don’t know what other language it may be related to. As a result, a model like Luo’s wouldn’t work for the Indus script. That’s not to say technology can’t help, though. In some ways, computer modeling has already played a crucial role: by showing that the Indus script is a language at all.
For most of the 20th century, the Indus inscriptions were widely accepted as representations of an undeciphered language. Then, in 2004, a group of Harvard researchers — cultural neurobiologist and comparative historian Steve Farmer, computational theorist Richard Sproat, and philologist Michael Witzel — published a paper essentially rubbishing nearly all existing research on the matter. The Indus seals, they claimed, were nothing more than a collection of religious or political symbols — similar to, say, highway signs — and all attempts to decipher them as a language were a waste of time. To underscore their point, Farmer offered a $10,000 reward to anyone who could find an Indus inscription containing at least 50 symbols.
Most Indologists and other Indus script researchers dismissed these arguments. One group of mathematicians, however, turned to computers to investigate the claims. Ronojoy Adhikari, a professor of statistical physics at the University of Cambridge, was one of them.
Before Cambridge, Adhikari worked at the Institute of Mathematical Sciences, in Chennai. In 2009, he attended a talk by Iravatham Mahadevan, an Indian civil servant turned epigraphist. Mahadevan, who died in 2018, had already cracked Tamil-Brahmi, another script that had long gone undeciphered, before turning his attention to the Indus script.

Ronojoy Adhikari, a professor of statistical physics at the University of Cambridge. Tim Dunk for Rest of World
Adhikari remembers being fascinated. “I’m a person from the sciences; I don’t have a humanities background,” he said. “But what I found very attractive in Mahadevan’s way of looking at the problem was that he had a very quantitative, almost scientific, approach. He was asking, how many times does a particular symbol occur? What does it occur against? What is the context in which it is occurring? And it appeared to me that because it had already been so quantified, it would be easy to translate this into a formal mathematical analysis.”
A few other data scientists in attendance joined forces with Adhikari. They knew they couldn’t decipher the script. “So the question we asked was: Can we at least tell whether it’s conveying any sort of linguistic information?”
Led by computer scientist Rajesh Rao, the researchers devised a computer program to see if they could answer this question: Was the Indus script a language? “You can give me any sequence of symbols, I don’t care what they are — hieroglyphics, written language, sheet music, computer code — and I will look at them from the point of view of a mathematician,” explained Adhikari. “Meaning, I will simply count how many times one sign occurs next to another.”
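The counting step Adhikari describes can be illustrated with a minimal Python sketch. This is not the researchers' program; the function name `count_adjacent_pairs` and the sign labels are hypothetical stand-ins chosen for illustration, not real Indus signs.

```python
from collections import Counter


def count_adjacent_pairs(sequence):
    """Count how often each symbol is immediately followed by each other symbol."""
    # zip pairs every symbol with its right-hand neighbour.
    return Counter(zip(sequence, sequence[1:]))


# Hypothetical sign labels, invented for this example only.
signs = ["fish", "jar", "fish", "jar", "arrow", "fish", "jar"]
pair_counts = count_adjacent_pairs(signs)

for (first, second), n in pair_counts.most_common():
    print(f"{first} -> {second}: {n}")
```

The same function works on any sequence, whether letters, musical notes, or tokens of computer code, because it only looks at which symbol sits next to which.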
Their program drew on the work of Claude E. Shannon, a mid-century American mathematician, engineer, and decoder of wartime codes, who formulated the notion of information entropy — essentially a mathematical measure of disorder. In linguistic systems, symbols occur with somewhat fixed frequencies. “For instance, I just can’t pick up a letter from the alphabet, string it with another letter from the alphabet, and expect to get an English word,” explained Adhikari. In common English, for instance, the letter “q” is nearly always followed by “u.” This semiflexibility is a marker of all linguistic systems. Computer code, on the other hand, is completely rigid: the slightest deviation, and it falls apart.
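To make the idea of rigid, semiflexible, and random sequences concrete, here is a rough Python sketch that estimates conditional entropy, meaning the average uncertainty in bits about the next symbol given the current one. It is an illustration under stated assumptions, not the team's method: it uses raw pair counts with no smoothing, and the three toy sequences, the four-letter alphabet, and the 0.7 probability are all invented for the example.

```python
import math
import random
from collections import Counter


def conditional_entropy(sequence):
    """Estimate H(next | current) in bits from raw adjacent-pair counts."""
    pair_counts = Counter(zip(sequence, sequence[1:]))
    first_counts = Counter(sequence[:-1])
    total_pairs = len(sequence) - 1
    entropy = 0.0
    for (first, _second), n in pair_counts.items():
        p_pair = n / total_pairs                   # P(current, next)
        p_next_given_first = n / first_counts[first]  # P(next | current)
        entropy -= p_pair * math.log2(p_next_given_first)
    return entropy


rng = random.Random(0)  # fixed seed so the toy data is reproducible
alphabet = "abcd"

# Rigid: every symbol is always followed by the same successor.
rigid = list(alphabet * 500)

# Random: each symbol is drawn independently and uniformly.
shuffled = [rng.choice(alphabet) for _ in range(2000)]

# Semiflexible (assumed toy model): usually the preferred successor, sometimes anything.
successor = {"a": "b", "b": "c", "c": "d", "d": "a"}
flexible = ["a"]
while len(flexible) < 2000:
    if rng.random() < 0.7:
        flexible.append(successor[flexible[-1]])
    else:
        flexible.append(rng.choice(alphabet))

h_rigid = conditional_entropy(rigid)
h_flexible = conditional_entropy(flexible)
h_random = conditional_entropy(shuffled)
print(h_rigid, h_flexible, h_random)
```

With four symbols the maximum possible value is 2 bits. The rigid sequence scores exactly zero, the random one lands close to the maximum, and the semiflexible one falls in between, which is the kind of ordering the comparison relies on.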

A stamp-seal carved from grey steatite with a rhinoceros and an inscription in the Indus script, found at the Mohenjo-daro archaeological site in Sindh, Pakistan. The Trustees of the British Museum
The researchers fed their program the 4,000 inscriptions that form the entirety of the Indus script. For good measure, they also ran the program on other linguistic samples (English characters and words, Sanskrit, Tamil, Sumerian, and Tagalog) and some nonlinguistic sequences (DNA, protein, Beethoven’s Sonata no. 32, and code written in the programming language Fortran). The program took about 45 minutes to run.
“I remember the first time that plot was generated,” recalled Adhikari. On the graph, the curves depicting music, protein, and DNA sequences hover high, close to the maximum level of entropy, indicating a high level of randomness. Lower down, the known languages all sit in a tight cluster. Fortran appears further below.
As for the Indus script, it appears with the other languages, just under Sanskrit and mapping almost cleanly onto Tamil. “It felt fantastic. It really felt very good. It’s nice to have a hunch, but to be able to prove it — I remember thinking, Yes, we’ve really got something here.”