Model Card: Custom Language Model
Overview
This model was trained on the WikiText-103 dataset to generate text from input prompts.
Dataset
Dataset Used: WikiText-103
Source: Hugging Face Datasets
Dataset Details: The WikiText-103 dataset is a collection of over 100 million tokens extracted from the set of verified "Good" and "Featured" articles on Wikipedia. It is designed for language modeling and other text generation tasks.
Data Cleaning
To ensure high-quality input for training, the dataset underwent the following cleaning steps:
- Removal of non-standard characters and punctuation.
- Tokenization using BERT's tokenizer.
- Lowercasing all text.
- Filtering out any overly short or long sequences to maintain a consistent input size.
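The cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the pipeline actually used for this model: the length thresholds, the set of characters treated as "non-standard", and the function names are all assumptions, and a plain whitespace split stands in for BERT's tokenizer so that the example runs with the standard library alone. Lowercasing is applied first here purely for convenience.

```python
import re
import unicodedata

# Assumed thresholds; the model card does not state the actual limits.
MIN_TOKENS = 5
MAX_TOKENS = 128

# Assumption: only lowercase letters, digits, and whitespace are kept.
_DISALLOWED = re.compile(r"[^a-z0-9\s]")


def clean_line(text: str) -> str:
    """Lowercase, strip accents, and drop punctuation and other symbols."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = _DISALLOWED.sub(" ", text)
    return " ".join(text.split())  # collapse runs of whitespace


def keep_sequence(tokens: list[str]) -> bool:
    """Keep only sequences whose length falls inside the assumed bounds."""
    return MIN_TOKENS <= len(tokens) <= MAX_TOKENS


def clean_corpus(lines: list[str]) -> list[list[str]]:
    """Clean each line, split it into tokens, and filter by length."""
    cleaned = []
    for line in lines:
        tokens = clean_line(line).split()  # stand-in for BERT's tokenizer
        if keep_sequence(tokens):
            cleaned.append(tokens)
    return cleaned
```

With these assumed bounds, a short heading line such as " = Title = " is dropped, while an ordinary sentence is kept as a list of lowercase tokens.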
Neural Network Definition
The neural network used for this model is based on a transformer architecture with the following specifications:
- Model Type: BERT-based transformer
- Number of Layers: 5
- Dropout: Applied at each layer to prevent overfitting
- Optimizer: AdamW with a learning rate of 5e-5
- Loss Function: Cross-entropy loss for language modeling
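To make the loss function listed above concrete, here is a small standard-library sketch of token-level cross-entropy and the perplexity derived from it. It only illustrates the formula; the actual training code presumably relies on a deep learning framework's built-in loss, and the function names here are hypothetical.

```python
import math


def cross_entropy(logits: list[list[float]], targets: list[int]) -> float:
    """Mean negative log-likelihood of the target tokens.

    logits[i] holds one unnormalized score per vocabulary entry for
    position i; targets[i] is the index of the correct token there.
    """
    total = 0.0
    for scores, target in zip(logits, targets):
        # log-sum-exp with the max subtracted for numerical stability
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
    return total / len(targets)


def perplexity(loss: float) -> float:
    """Perplexity is the exponential of the mean cross-entropy."""
    return math.exp(loss)
```

A model that assigns equal scores to every entry of a vocabulary of size V has a loss of log(V) and a perplexity of V, which gives a useful baseline when reading a training loss curve.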
Training Details
The model was trained on an NVIDIA L4 GPU with the following resources:
- CPU Cores: 16
- System RAM: 62.8 GB
- GPU RAM: 22.5 GB
- Disk: 201.2 GB
Training Configuration:
- Batch Size: Dynamic, adjusted based on GPU RAM availability
- Epochs: 50
- Initial Learning Rate: 5e-5
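The card does not say how the batch size was adjusted to the available GPU RAM. One common approach, sketched below under that caveat, is to start from a large batch size and halve it until a single training step fits in memory. The `find_batch_size` function and the `run_step` callable are hypothetical names, and `MemoryError` stands in for whatever out-of-memory error the training framework raises.

```python
from typing import Callable


def find_batch_size(
    run_step: Callable[[int], None],
    start: int = 64,
    minimum: int = 1,
) -> int:
    """Halve the batch size until one training step fits in memory.

    run_step is a hypothetical callable that runs a single step at the
    given batch size and raises MemoryError when memory runs out.
    """
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(batch_size)
            return batch_size
        except MemoryError:
            batch_size //= 2  # back off and retry with a smaller batch
    raise RuntimeError("no batch size fits in the available memory")
```

In a real training loop, `run_step` would run a forward and backward pass on a batch of the given size and translate the framework's GPU out-of-memory error into the exception caught here.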
Training Results
Training involved several experiments with different batch sizes and numbers of epochs. The training loss was plotted to visualize how the model converged.
Usage
To use this model, you can load it from Hugging Face and generate text as follows:
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("RicardoPoleo/DL_LLM_from_scratch_2")
model = AutoModelForCausalLM.from_pretrained("RicardoPoleo/DL_LLM_from_scratch_2")

# Tokenize the prompt into PyTorch tensors
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate a continuation and decode it back to text
outputs = model.generate(**inputs)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))