Research from MIE examines ways to make large language models (LLMs) more resource-efficient to run by replacing their high-precision parameters with low-precision ones.

The work is described in two conference papers published in the past few months. Together, they outline how quantization approaches based on partial retraining and distribution alignment can compress LLMs while maintaining performance.
“From both economic and environmental perspectives, as LLMs grow in size to achieve higher performance, they are becoming too expensive to run,” says Aref, who is an Assistant Professor, Teaching Stream in the Department of Mechanical & Industrial Engineering.
Aref says that the remarkable capabilities of LLMs come at high economic and environmental costs, driven by computations over billions of parameters. But it doesn’t have to be this way: compressing LLMs while keeping their performance nearly intact makes these computations far more resource-efficient and environmentally friendly. Compression also allows large models to run on smaller devices without jeopardizing output quality.
“Smaller LLMs address a critical need for more resource-efficient models, potentially compressed by quantizing parameters into lower-precision values that collectively play the same role, or a ‘close-enough’ role to the original full-precision model,” says Aref.
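To make the quantization idea concrete, here is a minimal, generic sketch in Python (NumPy). It is illustrative only, not the method from either paper: each weight is snapped to one of 2^bits evenly spaced levels and then mapped back to a float.

```python
import numpy as np

def quantize_uniform(weights: np.ndarray, bits: int) -> np.ndarray:
    """Snap a float weight tensor to a uniform low-precision grid.

    Generic illustration of quantization, not the papers' method:
    weights are mapped to 2**bits evenly spaced levels, then dequantized.
    """
    levels = 2 ** bits - 1
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / levels
    q = np.round((weights - w_min) / scale)  # integer grid index per weight
    return q * scale + w_min                 # back to float for comparison

# Example: squeeze full-precision weights onto a 3-bit grid (8 levels).
rng = np.random.default_rng(0)
w = rng.normal(size=1000).astype(np.float32)
w3 = quantize_uniform(w, bits=3)
print("mean absolute rounding error:", np.abs(w - w3).mean())
```

Each quantized weight now needs only 3 bits of storage (plus a shared scale and offset), which is where the resource savings come from.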
The first paper was presented at Modelling Decisions for Artificial Intelligence, a conference held in Valencia, Spain, in the summer of 2025. Aref and Deyu Cao, an undergraduate exchange student from the University of Tokyo, examined the effectiveness of compressing LLMs with 7 to 13 billion parameters through partial retraining, aiming to gain resource-usage efficiencies while preserving output quality. Unlike complete retraining of an LLM, which can take weeks or months, partial retraining can be done in a few hours.
After analyzing several key factors for effective quantization, they proposed a partial-retraining method that shrinks LLMs by quantizing their parameters from 16-bit precision down to 3-bit and 2-bit precision.
Their proposed regularization term prioritizes the parameters that most influence the model’s output. This regularization preserves some of the accuracy that would otherwise be lost under alternative quantization approaches.
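The papers’ exact regularizer is not reproduced here, but a hypothetical sketch can show the general shape of such a term: a penalty on quantization error weighted by a per-parameter sensitivity score (squared gradients are a common proxy for influence), so the most influential weights stay closest to their full-precision values during partial retraining. The function name and the gradient-based weighting below are assumptions for illustration.

```python
import numpy as np

def sensitivity_weighted_penalty(w_fp: np.ndarray,
                                 w_q: np.ndarray,
                                 grad: np.ndarray,
                                 lam: float = 1e-2) -> float:
    """Hypothetical regularization term (illustrative, not the papers' formula).

    Penalizes quantization error more heavily for parameters whose squared
    gradient is large, keeping influential weights near their original values.
    """
    sensitivity = grad ** 2                 # per-parameter importance proxy
    return lam * float(np.sum(sensitivity * (w_fp - w_q) ** 2))

# During partial retraining, this penalty would be added to the task loss:
#   total_loss = task_loss + sensitivity_weighted_penalty(w_fp, w_q, grad)
w_fp = np.array([0.50, -1.20, 0.03])        # full-precision weights
w_q = np.array([0.50, -1.00, 0.00])         # their quantized counterparts
grad = np.array([2.0, 0.1, 0.01])           # larger gradient = more influential
print(sensitivity_weighted_penalty(w_fp, w_q, grad))
```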
The second paper was presented at the 18th International Conference on Agents and Artificial Intelligence, held in Marbella, Spain, earlier this year, where it received the best paper award. In this project, UTSC computer science undergraduate Yixin Yin joined Aref and Cao. Yin was supported by the Data Sciences Institute’s Summer Undergraduate Data Science (SUDS) program in 2025, where she became interested in improving the efficiency of LLMs.
Together, the three investigated the potential gains from using distribution alignment to improve the quantization of LLMs. Their proposed method relies on a sliced Wasserstein loss function and recovers up to 20.37% of the performance lost to quantization, in relative terms.
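For readers unfamiliar with the term: a sliced Wasserstein loss compares two distributions by projecting their samples onto many random one-dimensional directions, where the Wasserstein distance reduces to matching sorted values, which is cheap to compute. Below is a minimal NumPy sketch of the standard Monte Carlo estimator; the paper’s exact loss and how it is wired into the quantization pipeline are not reproduced here.

```python
import numpy as np

def sliced_wasserstein(x: np.ndarray, y: np.ndarray,
                       n_projections: int = 64, seed: int = 0) -> float:
    """Monte Carlo estimate of the (squared) sliced 2-Wasserstein distance.

    Standard textbook formulation for illustration, not the paper's code.
    x, y: (n_samples, dim) arrays, e.g. full-precision vs. quantized outputs.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=(n_projections, x.shape[1]))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit directions
    xp = np.sort(x @ theta.T, axis=0)   # 1-D projections, sorted
    yp = np.sort(y @ theta.T, axis=0)   # sorting solves 1-D optimal transport
    return float(np.mean((xp - yp) ** 2))

# Example: a quantized model's outputs drift slightly from the original's;
# the loss measures (and, when minimized, shrinks) that distributional gap.
rng = np.random.default_rng(1)
full = rng.normal(size=(512, 32))
quantized = full + rng.normal(scale=0.1, size=full.shape)
print(sliced_wasserstein(full, quantized))
```

Minimizing such a loss aligns the distribution of the quantized model’s quantities with that of the full-precision model, which is the intuition behind distribution alignment.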
“Our main focus was on finding better trade-offs between compression and accuracy,” says Aref. “Our two methods produce compressed LLMs that use far fewer resources and are more accurate than LLMs compressed by alternative methods.”
– This story was originally published on the Department of Mechanical & Industrial Engineering’s site on March 30, 2026, by Kendra Hunter.