Using the training code from github.com/karpathy/build-nanogpt, the model was trained on 10B tokens of the HuggingFaceFW/fineweb-edu dataset, which took 112 hours on my 4070 Ti GPU. The training context is 512 tokens; the final validation loss is 2.91 and the HellaSwag eval accuracy is 0.3465.
With n_layer=20, n_head=14, and n_embd=896, the model's architecture is very similar to GPT-2, except that I changed the MLP expansion ratio from 2x to 5x.
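For reference, this corresponds to a configuration along these lines (a sketch: GPTConfig follows the dataclass in build-nanogpt's model.py, but mlp_ratio is a hypothetical name for whatever this repo's model.py calls the MLP expansion factor):

from dataclasses import dataclass

@dataclass
class GPTConfig:
    block_size: int = 512     # training context length
    vocab_size: int = 50257   # GPT-2 BPE vocabulary
    n_layer: int = 20         # transformer blocks
    n_head: int = 14          # attention heads per block
    n_embd: int = 896         # embedding dimension
    mlp_ratio: int = 5        # MLP hidden size = mlp_ratio * n_embd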
Usage:
import torch
from transformers import GPT2Tokenizer
from model import GPT, GPTConfig

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = GPT.from_pretrained('model.pt')  # load the released checkpoint
model = model.to(device)
model.eval()

prompt = "Hello, I'm a language model,"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
generated_ids = model.generate(input_ids.to(device))
print(tokenizer.decode(generated_ids[0]))  # index [0] assumes generate returns a (batch, seq) tensor
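If your copy of model.py does not define a generate method, a minimal top-k sampling loop in the style of build-nanogpt works as a fallback (a sketch: it assumes the forward pass returns a (logits, loss) pair as in the upstream repo, and the sample name is hypothetical):

import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, input_ids, max_new_tokens=64, top_k=50):
    # Repeatedly sample the next token from the top-k of the model's
    # predicted distribution and append it to the running sequence.
    x = input_ids
    for _ in range(max_new_tokens):
        logits, _ = model(x[:, -512:])               # crop to the 512-token training context
        probs = F.softmax(logits[:, -1, :], dim=-1)  # distribution over the next token
        topk_probs, topk_idx = torch.topk(probs, top_k, dim=-1)
        next_id = torch.gather(topk_idx, -1, torch.multinomial(topk_probs, 1))
        x = torch.cat([x, next_id], dim=1)
    return x

# usage: generated_ids = sample(model, input_ids.to(device))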