ViT size mismatch of mlp (ffn) tensors

#2
by cmp-nct - opened

First of all: congratulations on the llava-1.6 launch. You've just shown how a simple solution can be on par with much more complex architectures.

I'm running into a problem with your ViT:

        size mismatch for vision_model.encoder.layers.23.mlp.fc1.bias: copying a param with shape torch.Size([4096]) from checkpoint, the shape in current model is torch.Size([13824]).
        size mismatch for vision_model.encoder.layers.23.mlp.fc2.weight: copying a param with shape torch.Size([1024, 4096]) from checkpoint, the shape in current model is torch.Size([1024, 13824]).

Those errors come from your embedded ViT model; it appears to use a larger MLP shape than normal.
It also differs from the foundation model (the 336-patch one).
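A quick way to pin down where the mismatch originates is to compare the shape stored in the checkpoint against the `intermediate_size` the config declares. This is only a minimal sketch assuming the weights ship as a safetensors file next to a standard `config.json`; the file path and directory here are placeholders, not the actual ones from this repo.

    from safetensors import safe_open
    from transformers import AutoConfig

    ckpt_path = "model.safetensors"  # hypothetical local checkpoint file
    key = "vision_model.encoder.layers.23.mlp.fc1.bias"

    # Shape actually stored in the checkpoint (should be [4096] for CLIP ViT-L).
    with safe_open(ckpt_path, framework="pt") as f:
        print("checkpoint shape:", f.get_slice(key).get_shape())

    # Shape the instantiated model will expect, derived from the config.
    config = AutoConfig.from_pretrained(".")  # hypothetical: config.json in the same dir
    vision_config = getattr(config, "vision_config", config)
    print("config intermediate_size:", vision_config.intermediate_size)

If the checkpoint reports 4096 while the config reports 13824, the weights themselves are the standard CLIP ViT-L MLP size and it is the vision config that is inflated.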

cmp-nct changed discussion status to closed
