
Model Card: Custom Language Model

Overview

This model was trained on the WikiText-103 dataset and generates text from input prompts.

Dataset

Dataset Used: WikiText-103

Source: Hugging Face Datasets

Dataset Details: The WikiText-103 dataset is a collection of over 100 million tokens extracted from the set of verified "Good" and "Featured" articles on Wikipedia. It is designed for language modeling and other text generation tasks.
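For reference, the dataset can be loaded directly from Hugging Face Datasets. The configuration name below ("wikitext-103-raw-v1") is an assumption, since the card does not state which WikiText-103 variant was used:

from datasets import load_dataset

# Load WikiText-103 (the "wikitext-103-raw-v1" config is an assumption;
# the card does not specify which variant was used)
dataset = load_dataset("wikitext", "wikitext-103-raw-v1")
print(dataset["train"][0]["text"])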

Data Cleaning

To ensure high-quality input for training, the dataset underwent the following cleaning steps (a minimal sketch follows the list):

  1. Removal of non-standard characters and punctuation.
  2. Tokenization using BERT's tokenizer.
  3. Lowercasing all text.
  4. Filtering out any overly short or long sequences to maintain a consistent input size.
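The sketch below illustrates these steps. The regular expression, the bert-base-uncased tokenizer, and the length bounds (8 to 512 tokens) are assumptions, as the card does not specify the exact values used:

import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # assumed BERT tokenizer

def clean_text(text, min_len=8, max_len=512):  # length bounds are assumptions
    # Remove non-standard characters and punctuation
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Lowercase all text
    text = text.lower()
    # Tokenize with BERT's tokenizer
    tokens = tokenizer.tokenize(text)
    # Filter out overly short or long sequences
    if not (min_len <= len(tokens) <= max_len):
        return None
    return tokens

print(clean_text("The WikiText-103 dataset contains over 100 million tokens!"))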

Neural Network Definition

The neural network used for this model is based on a transformer architecture with the following specifications (a sketch follows the list):

  • Model Type: BERT-based transformer
  • Number of Layers: 5
  • Dropout: Applied at each layer to prevent overfitting
  • Optimizer: AdamW with a learning rate of 5e-5
  • Loss Function: Cross-entropy loss for language modeling
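One way such a model could be defined with the transformers library is sketched below. The hidden size, number of attention heads, and dropout probability are assumptions; the card only specifies the layer count, optimizer, and loss:

from transformers import BertConfig, BertLMHeadModel

# 5-layer BERT-style decoder for language modeling
# (hidden size, attention heads, and dropout values are assumptions)
config = BertConfig(
    num_hidden_layers=5,
    hidden_size=768,
    num_attention_heads=12,
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
    is_decoder=True,  # needed for causal language modeling
)
model = BertLMHeadModel(config)  # computes cross-entropy loss when labels are passed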

Training Details

The model was trained on an NVIDIA L4 GPU with the following resources:

  • CPU Cores: 16
  • System RAM: 62.8 GB
  • GPU RAM: 22.5 GB
  • Disk: 201.2 GB

Training Configuration:

  • Batch Size: Dynamic, adjusted based on GPU RAM availability
  • Epochs: 50
  • Initial Learning Rate: 5e-5
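A minimal training-loop sketch under these settings is shown below. The model is assumed to be defined as in the architecture sketch above, and train_dataloader is a placeholder for batches of token IDs; the card only specifies the optimizer, learning rate, and epoch count:

import torch
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

model.train()
for epoch in range(50):
    # train_dataloader is a placeholder; in practice its batch size was
    # adjusted dynamically based on available GPU memory
    for input_ids in train_dataloader:
        input_ids = input_ids.to(device)
        outputs = model(input_ids=input_ids, labels=input_ids)  # cross-entropy LM loss
        outputs.loss.backward()
        optimizer.step()
        optimizer.zero_grad()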

Training Results

Training involved several experiments with different batch sizes and epoch counts. The final training loss was plotted to visualize the model's performance.
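A plot like that can be produced with a snippet along these lines; epoch_losses is a placeholder for the per-epoch losses recorded during training:

import matplotlib.pyplot as plt

# epoch_losses: placeholder list of average training loss per epoch
plt.plot(range(1, len(epoch_losses) + 1), epoch_losses)
plt.xlabel("Epoch")
plt.ylabel("Training loss")
plt.title("Training loss per epoch")
plt.show()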

Usage

To use this model, load it from the Hugging Face Hub and generate text as follows:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer and model from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("RicardoPoleo/DL_LLM_from_scratch_2")
model = AutoModelForCausalLM.from_pretrained("RicardoPoleo/DL_LLM_from_scratch_2")

# Encode a prompt and generate a continuation
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)  # adjust length as needed

print(tokenizer.decode(outputs[0], skip_special_tokens=True))