Researchers are developing an AI system that can learn the rules of human language

A team of researchers from Cornell University, MIT, and McGill University has developed an AI system that can learn the rules and patterns of human languages on its own. The study, “Synthesizing theories of human language with Bayesian program induction”, was published in Nature Communications.

Because the researchers were interested in AI-driven theory discovery, they chose human language as a test field. They focused on the linguist’s construction of language-specific theories and the synthesis of abstract cross-linguistic meta-theories, while proposing links to language acquisition in children. The cognitive sciences of language have indeed drawn an explicit analogy between the scientist constructing grammars of particular languages and the child learning those languages.

Kevin Ellis, assistant professor of computer science at Cornell University and lead author of the paper, explains:

“One of the motivations for this work was our desire to study systems that learn patterns from datasets that are represented in a way that humans can understand. Instead of learning weights, can the model learn expressions or rules? And we wanted to see if we could build this system so that it learns on a whole battery of interrelated datasets, so that the system learns a bit about how to better model each of them.”

The choice of human language

Natural language is an ideal field to study theoretical discovery for several reasons:

  • Decades of work in linguistics, psycholinguistics, and other cognitive sciences of language provide diverse raw material for developing and testing models of automated theory discovery. Corpora, datasets, and grammars are available for a wide variety of typologically distinct languages, representing a rich and varied test bed for the comparative analysis of theory-induction algorithms;
  • Children acquire language from modest amounts of data compared to AI systems, and field linguists likewise develop grammars from very small amounts of data obtained in the field. These facts suggest that the child-as-linguist analogy is productive and that inducing language theories from sparse data is tractable given the right inductive biases;
  • Finally, theories of language representation and learning are formulated in computational terms, exposing a range of formalisms ready for deployment by AI researchers.

These three characteristics of human language (the availability of many diverse empirical targets, the interfaces with cognitive development, and the computational formalisms within linguistics) led the researchers to choose language as a target for research on automated theory induction.

A Bayesian program learning model

Linguistics aims to understand the general representations, processes, and mechanisms that allow people to learn and use language, not just to catalog and describe particular languages. To capture this aspect when framing the problem of theory induction, the researchers adopted the Bayesian program learning (BPL) paradigm. They built the model using Sketch, a program synthesizer developed at MIT by Armando Solar-Lezama.

They focused on theories of natural language morphophonology, the domain of language governing the interplay of word formation and sound structure.

The team evaluated the BPL model on 70 datasets covering the morphophonology of 58 languages. The datasets came from phonology textbooks: although linguistically diverse, they are much simpler than full language learning, containing at most about a hundred words each and isolating only a handful of grammatical phenomena. Given words and examples of how those words change to express different grammatical functions (such as tense, case, or gender), the model comes up with rules that explain why the forms of those words change.
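As an illustration only (this is not the authors’ actual system, which uses the Sketch synthesizer and a much richer rule language), the core Bayesian idea can be sketched in a few lines of Python: score each candidate rule by a simplicity prior and by how well it explains the observed word pairs, then keep the highest-scoring rule. The candidate rules and the toy English-like plural data below are hypothetical stand-ins.

```python
import math

# Toy stem -> plural pairs (hypothetical English-like data).
data = [("dog", "dogs"), ("cat", "cats"), ("bus", "buses"), ("fox", "foxes")]

# Candidate "programs": each is a (description, rewrite function) pair.
# The description length stands in for program complexity.
candidates = [
    ("add -s", lambda stem: stem + "s"),
    ("add -es", lambda stem: stem + "es"),
    ("add -es after sibilant, else -s",
     lambda stem: stem + ("es" if stem[-1] in "sxz" else "s")),
]

def log_posterior(desc, rule, pairs, noise=1e-6):
    # Simplicity prior: longer rule descriptions get lower prior probability
    # (a crude stand-in for a probability distribution over programs).
    log_prior = -len(desc)
    # Likelihood: a pair is explained if the rule reproduces the surface
    # form; each unexplained pair is charged a heavy noise penalty.
    log_lik = sum(0.0 if rule(s) == w else math.log(noise) for s, w in pairs)
    return log_prior + log_lik

# Pick the rule with the highest posterior score.
best = max(candidates, key=lambda c: log_posterior(c[0], c[1], data))
print(best[0])  # the context-sensitive rule wins despite being longer
```

The trade-off mirrors the article’s description: the simple “add -s” rule is cheap under the prior but fails on “buses” and “foxes”, so the more complex context-sensitive rule ends up with the highest posterior once it explains all the data.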

The conclusions of the study

The model came up with a correct set of rules to describe these form changes for 60% of the problems. It could be used to investigate linguistic hypotheses and to study similarities in how various languages transform words.

According to the researchers, humans deploy their theories more richly than their model does: they propose new experiments to test theoretical predictions, design new tools based on a theory’s conclusions, and distill higher-level knowledge that goes far beyond what the model’s Fragment Grammar approximation can do. Continuing to push theory induction along these many dimensions remains a prime target for future research.

Sources of the article:

“Synthesizing theories of human language with Bayesian program induction”


  • Kevin Ellis, Assistant Professor of Computer Science, Cornell University;
  • Adam Albright, Professor of Linguistics, MIT;
  • Armando Solar-Lezama, Professor and Associate Director of the Computer Science and Artificial Intelligence Laboratory (CSAIL), MIT;
  • Joshua B. Tenenbaum, Paul E. Newton Career Development Professor of Cognitive Science and Computation in the Department of Brain and Cognitive Sciences and CSAIL, MIT;
  • Timothy J. O’Donnell, Assistant Professor in the Department of Linguistics, McGill University, and Canada CIFAR AI Chair at Mila – Quebec Artificial Intelligence Institute.
