Sarcouncil Journal of Engineering and Computer Sciences
An open-access, peer-reviewed international journal
Publication Frequency: Monthly
Publisher Name: SARC Publisher
ISSN (Online): 2945-3585
Country of Origin: Philippines
Impact Factor: 3.7
Language: English
Keywords
- Engineering and technology fields, including Civil Engineering, Construction Engineering, Structural Engineering, Electrical Engineering, Mechanical Engineering, Computer Engineering, Software Engineering, Electromechanical Engineering, Telecommunication Engineering, Communication Engineering, and Chemical Engineering
Editors

Dr Hazim Abdul-Rahman
Associate Editor
Sarcouncil Journal of Applied Sciences

Entessar Al Jbawi
Associate Editor
Sarcouncil Journal of Multidisciplinary

Rishabh Rajesh Shanbhag
Associate Editor
Sarcouncil Journal of Engineering and Computer Sciences

Dr Md. Rezowan ur Rahman
Associate Editor
Sarcouncil Journal of Biomedical Sciences

Dr Ifeoma Christy
Associate Editor
Sarcouncil Journal of Entrepreneurship And Business Management
Post-Training Optimization Techniques for AI Models: A Comprehensive Framework
Keywords: Post-Training Optimization, Model Quantization, Parameter Efficiency, Deployment Frameworks, Inference Acceleration.
Abstract: Post-training optimization techniques play a crucial role in transforming trained AI models into practical systems that can be deployed in production. The article introduces a layered framework that combines model-level, compiler-level, and system-level strategies to help AI practitioners build high-performance systems under tight resource constraints. Model-level methods such as post-training quantization (PTQ), sparsity pruning, low-rank adaptation (LoRA), and knowledge distillation improve parameter efficiency. Compiler-level optimizations, including operator fusion, memory-layout restructuring, constant folding, kernel auto-tuning, and compiler re-architecture, improve computational efficiency. System-level strategies such as dynamic batching, KV-cache reuse, paged attention, request coalescing, and model routing tailor deployments to a particular serving environment, enabling efficient resource utilization and a responsive user experience. The article also presents a systematic approach to analyzing trade-offs among quality, latency, throughput, memory, energy consumption, and cost, enabling practitioners to make informed optimization decisions across diverse applications and serving infrastructures.
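As a minimal illustration of the post-training quantization the abstract mentions (a sketch under common assumptions, not the paper's specific method), the snippet below applies symmetric per-tensor int8 quantization to a weight matrix with NumPy: a single scale factor maps floats into the int8 range, with no retraining required.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor int8 PTQ: map floats to [-127, 127] with one scale."""
    scale = np.max(np.abs(weights)) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 codes."""
    return q.astype(np.float32) * scale

# Example: quantize a small random weight matrix and bound the error.
w = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale = quantize_int8(w)
err = np.max(np.abs(w - dequantize(q, scale)))  # worst case ~0.5 * scale
```

Production PTQ pipelines typically add per-channel scales and calibration data to pick clipping ranges, but the round-and-rescale core is the same.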
Author
- Reeshav Kumar
- Independent Researcher, USA