Taming Computational Challenges with Quantization, DDP and FSDP


🚀 Passionate Data Enthusiast and Problem Solver 🤖

🎓 Education: Bachelor's in Engineering (Information Technology), Vidyalankar Institute of Technology, Mumbai (2021)

👨‍💻 Professional Experience:

  • Over 2 years in startups and MNCs, honing skills in Data Science, Data Engineering, and problem-solving.
  • Worked with cutting-edge technologies and libraries: Keras, PyTorch, scikit-learn, DVC, MLflow, OpenAI, Hugging Face, TensorFlow.
  • Proficient in SQL and NoSQL databases: MySQL, Postgres, Cassandra.

📈 Skills Highlights:

  • Data Science: Statistics, Machine Learning, Deep Learning, NLP, Generative AI, Data Analysis, MLOps.
  • Tools & Technologies: Python (modular coding), Git & GitHub, Data Pipelining & Analysis, AWS (Lambda, SQS, SageMaker, CodePipeline, EC2, ECR, API Gateway), Apache Airflow, and the Flask, Django, and Streamlit web frameworks for Python.
  • Soft Skills: Critical Thinking, Analytical Problem-solving, Communication, English Proficiency.

💡 Initiatives:

  • Passionate about community engagement; sharing knowledge through accessible technical blogs and LinkedIn posts.
  • Successfully completed Data Scientist internships at WebEmps, iNeuron Intelligence Pvt Ltd, and Ungray Pvt Ltd.

🌏 Next Chapter:

  • Pursuing a career in Data Science, with a keen interest in broadening horizons through international opportunities.
  • Currently relocating to Australia, eligible for relevant work visas & residence, working with a licensed immigration adviser and actively exploring new opportunities & interviews.

🔗 Let's Connect!

  • Open to collaborations, discussions, and the exciting challenges that data-driven opportunities bring.
  • Reach out for a conversation on Data Science, technology, or potential collaborations!
  • Email: naiksaurabhd@gmail.com

Introduction:

Large Language Models (LLMs) are at the forefront of natural language processing, but they come with a significant computational burden. Simply storing an LLM is memory-intensive: at 32-bit precision each parameter takes 4 bytes, so a model with one billion parameters occupies around 4 GB of memory. Training adds more on top of the weights, including Adam optimizer states, gradients, activations, and temporary variables, which can push the total to roughly 24 GB for the same model. This memory challenge is a serious roadblock in the world of LLMs.
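The arithmetic behind those two figures can be sketched in a few lines. This is a rough estimate only: the per-parameter byte counts for the training overhead are illustrative assumptions chosen to match the 24 GB figure, and the real activation cost depends heavily on batch size and sequence length. The function names are hypothetical.

```python
# Rough memory estimate at 32-bit precision (4 bytes per parameter).
BYTES_PER_PARAM_FP32 = 4

# Assumed extra bytes per parameter during training (illustrative only).
TRAINING_OVERHEAD_BYTES = {
    "adam_states": 8,           # two FP32 moment estimates per parameter
    "gradients": 4,             # one FP32 gradient per parameter
    "activations_and_temp": 8,  # assumption; varies with batch and sequence size
}


def storage_gb(num_params: int) -> float:
    """Memory needed just to hold the weights, in GB."""
    return num_params * BYTES_PER_PARAM_FP32 / 1e9


def training_gb(num_params: int) -> float:
    """Weights plus the assumed training overhead, in GB."""
    per_param = BYTES_PER_PARAM_FP32 + sum(TRAINING_OVERHEAD_BYTES.values())
    return num_params * per_param / 1e9


one_billion = 1_000_000_000
print(f"storage:  {storage_gb(one_billion):.0f} GB")
print(f"training: {training_gb(one_billion):.0f} GB")
```

Under these assumptions, training needs about six times the memory of the weights alone, which is why a model that fits comfortably for inference can still fail to fit for training.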

Quantization:

To optimize memory usage, one powerful technique is quantization. Quantization reduces the precision of the LLM's parameters by storing them in a lower-precision data type, for example moving from 32-bit floating point to 16-bit floating point or 8-bit integers. This can significantly reduce memory usage, at the cost of some loss of precision, which is often an acceptable trade-off.
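To make the trade-off concrete, here is a minimal sketch of one common scheme, symmetric (absmax) quantization to 8-bit integers, written in plain Python. It is only an illustration of the idea, not how any particular library implements it; the helper names and the sample weights are made up for the example.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.12, -0.5, 0.33, 1.27, -1.0]  # hypothetical FP32 weights
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The rounding error per weight is at most half of one quantization step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("quantized:", q)
print("max round-trip error:", max_error)
```

Each weight now needs 1 byte instead of 4, plus one shared scale value. The round-trip error is the "loss" that quantization trades for that saving.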

Distributed Data-Parallel (DDP):

Quantization is effective, but if you want to speed up training, leveraging multiple GPUs can be a game-changer. Distributed Data-Parallel (DDP) is a multi-GPU strategy that can help you achieve that. In DDP, you place a full copy of your LLM on each GPU, split each batch of data, and pass each split to a dedicated GPU. Each GPU runs the forward and backward pass on its own split, and the resulting gradients are then synchronized across GPUs so that every copy of the model applies the same weight update.
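The data flow can be simulated without any GPUs. The sketch below is a toy stand-in, not PyTorch's DDP API: each "GPU" is just a list entry holding its own copy of a one-weight model, and `all_reduce_mean` is a hypothetical helper that mimics the gradient synchronization step.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for y = w * x on one data shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for an all-reduce: every replica receives the average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)


def ddp_step(replica_weights, shards, lr):
    # Each replica computes gradients on its own shard of the batch.
    grads = [local_gradient(w, s) for w, s in zip(replica_weights, shards)]
    # Gradients are synchronized so every replica sees the same value.
    synced = all_reduce_mean(grads)
    # Every replica applies the identical update to its own full copy.
    return [w - lr * g for w, g in zip(replica_weights, synced)]


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # samples of y = 2x
num_gpus = 2
shards = [data[i::num_gpus] for i in range(num_gpus)]  # equal-sized splits
replicas = [0.0] * num_gpus  # every replica starts from the same weight

for _ in range(200):
    replicas = ddp_step(replicas, shards, lr=0.01)

print(replicas)
```

Because the shards are equal in size, averaging the per-shard gradients gives the same result as computing the gradient over the whole batch on one device, so the replicas never drift apart.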

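However, DDP has a drawback: every GPU must hold the full model, so if your LLM and its training state exceed a single GPU's memory, DDP alone will not work. It is also redundant, since each GPU stores an identical copy of the parameters, gradients, and optimizer states. To overcome these issues, you can turn to Fully Sharded Data Parallel (FSDP), which builds on the ZeRO (Zero Redundancy Optimizer) memory optimization technique.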

Fully Sharded Data Parallel (FSDP):

ZeRO operates in three stages, each progressively optimizing memory usage. In the first stage, only optimizer states are sharded across all GPUs. The second stage shards both optimizer states and gradients. The final stage goes all-in, sharding optimizer states, gradients, and model parameters.
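A back-of-the-envelope calculation shows how much each stage saves per GPU. The sketch assumes 32-bit values throughout (4 bytes for each parameter, 4 for each gradient, 8 for the two Adam states) and ignores activations; these byte counts and the function name are assumptions for illustration, and mixed-precision setups use different numbers.

```python
# Assumed FP32 byte counts per parameter (illustrative; activations ignored).
PARAM_BYTES = 4
GRAD_BYTES = 4
OPTIMIZER_BYTES = 8  # two Adam moment estimates


def zero_memory_gb(num_params, num_gpus, stage):
    """Approximate per-GPU memory for model states; stage 0 means plain DDP."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    # A sharded component costs 1/num_gpus of its full size on each GPU.
    optimizer = OPTIMIZER_BYTES / num_gpus if stage >= 1 else OPTIMIZER_BYTES
    grads = GRAD_BYTES / num_gpus if stage >= 2 else GRAD_BYTES
    params = PARAM_BYTES / num_gpus if stage >= 3 else PARAM_BYTES
    return num_params * (params + grads + optimizer) / 1e9


for stage in range(4):
    gb = zero_memory_gb(1_000_000_000, num_gpus=8, stage=stage)
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

With these assumptions, only stage 3 scales all of the model-state memory down with the number of GPUs; stages 1 and 2 still keep a full copy of the parameters on every device.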

Combining ZeRO's sharding with the DDP concept results in FSDP. Instead of storing the entire LLM on each GPU, a sharded LLM is stored, with each GPU holding only a slice of the model states, and split data is passed to each GPU as in DDP. Before the forward and backward passes, each GPU collects the parameter shards it needs from the other GPUs, uses them for the computation, and then releases them. The final step involves gradient synchronization, after which each GPU updates the weights in its own shard.
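The same kind of toy simulation used for data parallelism can illustrate this flow. This is a simplified sketch, not the PyTorch FSDP API: the helpers `all_gather` and `reduce_scatter_mean` are hypothetical stand-ins for the collective operations, the model is a two-weight linear function, and the whole model is gathered at once, whereas real implementations typically gather and release parameters one unit at a time.

```python
def all_gather(param_shards):
    """Every GPU temporarily reconstructs the full parameter list."""
    return [p for shard in param_shards for p in shard]


def local_gradients(params, batch):
    """MSE gradients for y = sum(w_i * x_i) on one GPU's data shard."""
    grads = [0.0] * len(params)
    for xs, y in batch:
        err = sum(w * x for w, x in zip(params, xs)) - y
        for i, x in enumerate(xs):
            grads[i] += 2 * err * x / len(batch)
    return grads


def reduce_scatter_mean(per_gpu_grads, shard_size):
    """Average gradients across GPUs, then give each GPU only its own slice."""
    n = len(per_gpu_grads)
    width = len(per_gpu_grads[0])
    avg = [sum(g[i] for g in per_gpu_grads) / n for i in range(width)]
    return [avg[r * shard_size:(r + 1) * shard_size] for r in range(n)]


def fsdp_step(param_shards, data_shards, lr):
    full = all_gather(param_shards)  # needed for forward and backward
    per_gpu = [local_gradients(full, batch) for batch in data_shards]
    grad_shards = reduce_scatter_mean(per_gpu, len(param_shards[0]))
    # Each GPU updates only the parameters it owns.
    return [[p - lr * g for p, g in zip(ps, gs)]
            for ps, gs in zip(param_shards, grad_shards)]


# Samples of y = 3*x0 - 1*x1, split across two simulated GPUs.
data = [((1.0, 0.0), 3.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 2.0), ((2.0, 1.0), 5.0)]
data_shards = [data[0::2], data[1::2]]
param_shards = [[0.0], [0.0]]  # GPU 0 owns w0, GPU 1 owns w1

for _ in range(500):
    param_shards = fsdp_step(param_shards, data_shards, lr=0.1)

print(param_shards)
```

The key difference from the data-parallel case is that no device keeps a persistent full copy of the weights: the full list exists only inside a step, and each device stores and updates just its own shard.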

With ZeRO and FSDP in your toolbox, you can efficiently address the memory and compute challenges posed by LLMs that do not fit on a single GPU. These techniques open up new possibilities in natural language processing, allowing you to tackle more complex tasks and push the boundaries of LLM applications.
