Model Card for oopere/pruned40-llama-3.2-3b

This model is a pruned version of meta-llama/Llama-3.2-3B in which 40% of the neurons in the MLP layers have been removed. The pruning aims to improve computational efficiency while keeping performance acceptable on specific tasks. The model is not intended to be used directly; rather, it is meant to be fine-tuned for specific tasks, where it can match or exceed the performance obtained by fine-tuning the base model on the same task.
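
Because the checkpoint is meant as a starting point for fine-tuning rather than direct use, a minimal loading sketch with the standard transformers API (FP16, matching the stored tensor type) looks like this; any training setup on top of it is up to the downstream task:

```python
# Minimal sketch: load the pruned checkpoint as a base for fine-tuning.
# Standard Hugging Face transformers API; nothing here is specific to
# how this model was produced.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "oopere/pruned40-llama-3.2-3b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
```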

Model Details

  • Model Type: Pruned version of Llama-3.2 using structured pruning
  • Original Model: meta-llama/Llama-3.2-3B
  • Pruning Method: Structured pruning of MLP layers using importance scores based on absolute maximum weights
  • Size Reduction: 26.2% (from 3.21B to 2.37B parameters)
  • Tensor Type: FP16
  • Architecture: Same as the original Llama architecture, but with a smaller MLP intermediate size
  • Language(s): Same as original model
  • License: Same as original model
  • Developed by: Pere Martra

Performance on Standard Benchmarks

| Benchmark         | Original Model | Pruned Model | Relative Change |
|-------------------|----------------|--------------|-----------------|
| ARC-Easy          | 65.19%         | 47.01%       | -27.9%          |
| BoolQ             | 64.16%         | 42.57%       | -33.6%          |
| LAMBADA-OpenAI    | 62.20%         | 34.54%       | -44.5%          |
| LAMBADA-Standard  | 53.46%         | 28.27%       | -47.1%          |
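
All four tasks exist under these names in EleutherAI's lm-evaluation-harness; the card does not say which harness produced the numbers above, so the snippet below is only a plausible way to re-measure them, assuming a recent lm-eval (0.4+) is installed:

```python
# Sketch: re-running the four benchmarks with lm-evaluation-harness.
# This is an assumed reproduction recipe, not necessarily how the
# reported numbers were obtained.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=oopere/pruned40-llama-3.2-3b,dtype=float16",
    tasks=["arc_easy", "boolq", "lambada_openai", "lambada_standard"],
)
print(results["results"])
```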

Key Findings

  • Performance Drop: Removing 40% of the MLP neurons causes significant degradation across all benchmarks, particularly on tasks requiring nuanced reasoning and long-range comprehension.
  • ARC-Easy: Retains moderate accuracy, showing the model is still usable for simpler reasoning tasks despite reduced performance.
  • LAMBADA: Both OpenAI and Standard versions show steep declines, indicating the model struggles with language completion tasks.
  • BoolQ: Accuracy falls to 42.57%, below the 50% chance level for yes/no questions, highlighting how severely pruning affects binary classification tasks.

Limitations

  • Severe Impact on Long-Range Dependencies: Performance on tasks like LAMBADA indicates the model struggles with understanding and predicting longer sequences.
  • Lower Usability for High-Accuracy Scenarios: Without task-specific fine-tuning, the accuracy losses make the model a poor fit for applications that demand high accuracy.

Implementation Details

Pruning Method

  • Technique: Structured pruning targeting MLP layers
  • Pruning Ratio: 40% of neurons removed from MLP layers
  • Selection Criteria: Importance scoring based on absolute maximum weights
  • Architecture Specifics: Maintained the GLU structure (paired gate and up projections feeding a down projection) during pruning, so neurons are removed jointly across all three matrices; a sketch of this procedure follows the list.
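
The card describes the method only at a high level, so the helper below is a hypothetical sketch (the name prune_glu_mlp and the exact way gate and up scores are combined are assumptions): it scores each intermediate neuron by the absolute maximum of its incoming weights, removes the lowest-scoring 40%, and keeps the gate, up, and down projections aligned as the GLU structure requires.

```python
# Hypothetical sketch of structured MLP pruning with max-|weight| scores.
# Assumes a LLaMA-style GLU MLP: bias-free gate_proj/up_proj ([inter, hidden])
# and down_proj ([hidden, inter]) linear layers.
import torch
from torch import nn

def prune_glu_mlp(mlp: nn.Module, keep_fraction: float = 0.6) -> None:
    # Per-neuron importance: absolute maximum weight in the rows of the
    # gate and up projections that produce that intermediate neuron.
    gate_score = mlp.gate_proj.weight.abs().max(dim=1).values
    up_score = mlp.up_proj.weight.abs().max(dim=1).values
    importance = torch.maximum(gate_score, up_score)

    n_keep = int(importance.numel() * keep_fraction)
    keep_idx = torch.topk(importance, n_keep).indices.sort().values

    # Remove the same neurons everywhere: rows of gate/up (their outputs)
    # and columns of down (its inputs), preserving the GLU pairing.
    for proj, dim in ((mlp.gate_proj, 0), (mlp.up_proj, 0), (mlp.down_proj, 1)):
        proj.weight = nn.Parameter(proj.weight.data.index_select(dim, keep_idx))
        if dim == 0:
            proj.out_features = n_keep
        else:
            proj.in_features = n_keep

# Usage on a loaded LLaMA model (40% pruning = keep 60% of neurons):
# for layer in model.model.layers:
#     prune_glu_mlp(layer.mlp, keep_fraction=0.6)
```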

Hardware Requirements

  • Reduced memory footprint compared to original model
  • Can run on hardware with roughly 26% less memory than the original, in line with the parameter reduction; a back-of-the-envelope estimate follows.
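
As a rough guide, assuming FP16 weights at 2 bytes per parameter (weights only, ignoring activations and KV cache), the footprints work out as follows:

```python
# Back-of-the-envelope weight memory in FP16 (2 bytes per parameter),
# using the parameter counts stated in this card.
for name, params in (("pruned", 2.37e9), ("original", 3.21e9)):
    print(f"{name}: {params * 2 / 1e9:.1f} GB")  # pruned ~4.7 GB, original ~6.4 GB
```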
