Training Language Models: Danqi Chen on Academia’s Role in AI Research

In Columbia Engineering’s latest Lecture Series in AI, the celebrated computer scientist discussed how academic efforts can advance large-scale language models, from evaluating capabilities to understanding pre-training data challenges.

December 20, 2024
Mariam Lobjanidze

Training a large language model (LLM) is so expensive that only a handful of companies can afford to do it.

Where does that leave academic researchers?

That’s the question computer scientist Danqi Chen, an assistant professor at Princeton University, posed in her Dec. 6 talk in Davis Auditorium at Columbia University. Chen, who serves as associate director of Princeton Language and Intelligence and co-leads the Princeton NLP Group, came to campus as part of Columbia Engineering’s Lecture Series in AI.

Chen’s work focuses on training, adapting, and understanding large language models, with a strong emphasis on making advancements accessible to academic researchers. Her accolades include a Sloan Fellowship, an NSF CAREER Award, the Samsung AI Researcher of the Year Award, and Outstanding Paper Awards from both ACL and EMNLP.

Academia's vital role

Photo Caption: Danqi Chen speaks to students and guests after the keynote.

Chen began her lecture by asking how academic research can contribute meaningfully to the large-scale language model ecosystem. One answer, she argued, lies not in training models at the scale of industry but rather in evaluating those models.

“We can develop benchmarks and understand the progress of larger models, and what they can do and what they cannot do,” Chen said. “This can be done for both API and open-source models.”

Chen also pointed to the value of training smaller models, which enables researchers to answer quantitative questions about how these models work and behave.

“I deeply believe that only if we can actually engage with the model training, we are actually able to come up with better and new solutions to make these models better,” Chen said.

Chen highlighted Princeton’s investment in GPU clusters of 300 and 200 GPUs to support foundational research on large-scale AI models. Similar investments are being made by Amazon and other institutions.

“While this GPU cluster is in no way comparable to the industry scale, at least it allows us to engage meaningfully in the model training,” Chen said.

Tackling pre-training challenges with smaller-scale models

Photo Caption: Danqi Chen speaks with students at the Lecture Series in AI reception.

Chen walked through the two stages of model training, pre-training and post-training, to highlight other useful roles for academia. In pre-training, vast amounts of unstructured, noisy data, such as internet text, are fed into a language model to learn its parameters. Post-training focuses on refining the model to follow instructions, align with human preferences, and develop specialized capabilities like coding and reasoning.

“Pre-training is particularly challenging because the data is messy,” Chen said. She highlighted two critical research areas for this stage: studying small-scale models with two to three billion parameters and developing methods to evaluate pre-training data quality.

The pre-training data, she explained, comes from seven primary domains: Common Crawl, C4, GitHub, Books, ArXiv, Wikipedia, and Stack Exchange. The success of pre-training hinges on selecting high-quality data, but measuring and sampling data based on quality remains a significant challenge.

“Most approaches rely on rule-based filters, like the C4 dataset’s filtering rules, or attempt to match distributions of high-quality data sources, such as Wikipedia,” Chen said.
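
As a rough illustration of what such rule-based filtering looks like, the sketch below applies a few C4-style heuristics (minimum line length, sentence-ending punctuation, simple blocklists) to a raw web document. The rules and thresholds are illustrative assumptions, not the exact C4 pipeline or anything Chen presented.

```python
# Illustrative sketch of rule-based pre-training data filtering,
# loosely inspired by C4-style heuristics. Thresholds and rules are
# placeholder assumptions, not the actual C4 pipeline.

MIN_WORDS_PER_LINE = 5      # drop very short lines (menus, nav bars, etc.)
MIN_LINES_PER_DOC = 3       # drop documents that are mostly boilerplate
TERMINAL_PUNCTUATION = (".", "!", "?", '"')
BLOCKLIST = ("lorem ipsum", "{", "}")  # markers of placeholder or code-like noise


def clean_document(text: str) -> str | None:
    """Return a filtered version of a raw web document, or None if it is rejected."""
    kept_lines = []
    for line in text.splitlines():
        line = line.strip()
        # Keep only reasonably long lines that end like natural sentences.
        if len(line.split()) >= MIN_WORDS_PER_LINE and line.endswith(TERMINAL_PUNCTUATION):
            kept_lines.append(line)

    cleaned = "\n".join(kept_lines)
    if len(kept_lines) < MIN_LINES_PER_DOC:
        return None
    if any(marker in cleaned.lower() for marker in BLOCKLIST):
        return None
    return cleaned


if __name__ == "__main__":
    raw = "Home | About | Contact\nThis is a full sentence that might survive filtering.\n..."
    print(clean_document(raw))  # None here: too few sentence-like lines remain
```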

Chen emphasized the need for systematic approaches to understanding pre-training data, including quality signals, filtering mechanisms, and optimal domain mixtures.
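
To make the idea of a domain mixture concrete, here is a small sketch in which each document in a pre-training batch is drawn from one of the source domains according to sampling weights. The weights are placeholder values chosen for illustration, not proportions from Chen’s talk or from any particular model.

```python
# Sketch of a "domain mixture": each pre-training document is sampled from a
# source domain according to mixture weights. The weights below are made-up
# placeholders for illustration only.
import random

DOMAIN_WEIGHTS = {
    "CommonCrawl": 0.60,
    "C4": 0.15,
    "GitHub": 0.05,
    "Books": 0.05,
    "ArXiv": 0.03,
    "Wikipedia": 0.05,
    "StackExchange": 0.07,
}


def sample_domains(num_documents: int, seed: int = 0) -> list[str]:
    """Pick a source domain for each document in a batch, proportional to the mixture weights."""
    rng = random.Random(seed)
    domains = list(DOMAIN_WEIGHTS)
    weights = list(DOMAIN_WEIGHTS.values())
    return rng.choices(domains, weights=weights, k=num_documents)


if __name__ == "__main__":
    batch = sample_domains(num_documents=10)
    print(batch)  # e.g. ['CommonCrawl', 'Wikipedia', 'CommonCrawl', ...]
```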

Post-training advances: fine-tuning models for specialized skills

For the post-training stage, Chen explained that the goal is to fine-tune models to follow instructions, align with human preferences, and gain specialized skills. This involves two processes: instruction tuning and preference learning.

“Post-training is usually much more affordable and accessible for academic researchers compared to pre-training,” she said.

Instruction tuning, in particular, enables the model to learn from prompt-response pairs and predict outputs conditioned on specific inputs.
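
A minimal sketch of that setup, assuming a Hugging Face-style causal language model: the prompt and response are concatenated, and the loss is masked so it is computed only on the response tokens, which is how the model learns to predict outputs conditioned on the prompt. The model name and the example pair below are placeholders, not choices from Chen’s talk.

```python
# Minimal instruction-tuning sketch on a single prompt-response pair.
# Assumes a Hugging Face-style causal LM; "gpt2" is a stand-in for any small open model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Instruction: Summarize the sentence 'The cat sat on the mat.'\nResponse: "
response = "A cat sat on a mat."

prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
response_ids = tokenizer(response, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, response_ids], dim=1)

# The crucial detail: the loss is computed only on response tokens.
# Labels set to -100 are ignored by the loss, so the model is trained to
# predict the output *conditioned on* the prompt rather than to predict the prompt itself.
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

outputs = model(input_ids=input_ids, labels=labels)
outputs.loss.backward()  # an optimizer step would follow in a real training loop
print(f"instruction-tuning loss: {outputs.loss.item():.3f}")
```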

Chen concluded her lecture by reiterating the unique contributions academia can make in advancing large language models. By focusing on evaluation, understanding, and systematic data approaches, academic research can complement industry efforts and push the boundaries of what’s possible in AI.


Lead Photo Caption: Danqi Chen, assistant professor of computer science at Princeton, delivered the AI Lecture at Columbia on Dec. 6.
