Fails to run in transformers.js v3 on webgpu
I'm currently trying to run this model on WebGPU inside a service worker in a Chrome extension (Canary 124, WebGPU for service workers enabled).
Distilled code:
import { pipeline, env } from '@xenova/transformers';

env.allowLocalModels = false;
env.backends.onnx.wasm.wasmPaths = 'https://cdn.jsdelivr.net/npm/[email protected]/dist/';
env.backends.onnx.wasm.numThreads = 1;

// Async wrapper so we can await pipeline() (top-level await isn't allowed in service workers).
async function run() {
  // Load the text-generation pipeline on the WebGPU backend.
  const model = await pipeline(
    'text-generation',
    'Xenova/stablelm-2-zephyr-1_6b',
    {
      quantized: true,
      progress_callback: console.log,
      device: 'webgpu',
    },
  );

  const messages = [
    { "role": "system", "content": "You are a fun fact generator." },
    { "role": "user", "content": "tell me a fun fact about cats" },
  ];

  // Build the prompt string from the chat template, then generate.
  const inputs = model.tokenizer.apply_chat_template(messages, {
    tokenize: false,
    add_generation_prompt: true,
  });

  const output = await model(inputs, {
    max_length: 4096,
    do_sample: true,
    top_p: 0.95,
    temperature: 0.2,
  });

  console.log(output);
}

run();
Running in wasm:
Works (but slowly, since we're limited to 1 thread).
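For what it's worth, a minimal sketch of how I'd lift the 1-thread limit where possible: multi-threaded wasm needs SharedArrayBuffer (i.e. a cross-origin-isolated context), which the extension service worker may not provide, hence numThreads = 1 above.

// Sketch: use more wasm threads only when SharedArrayBuffer is available
// (needs a cross-origin-isolated context); otherwise stay single-threaded.
env.backends.onnx.wasm.numThreads =
  typeof SharedArrayBuffer !== 'undefined'
    ? (navigator.hardwareConcurrency ?? 4)
    : 1;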
Running equivalent code in Node also works.
Running in webgpu:
Produces the following output:
2024-03-14 10:55:47.070600 [W:onnxruntime:, session_state.cc:1166 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
ort-wasm-simd.jsep.wasm:0xe98a92 2024-03-14 10:55:47.071900 [W:onnxruntime:, session_state.cc:1168 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
ort-wasm-simd.jsep.wasm:0xe98a92 2024-03-14 10:55:49.013700 [E:onnxruntime:, sequential_executor.cc:514 ExecuteKernel] Non-zero status code returned while running Concat node. Name:'/model/layers.0/self_attn/Concat_7' Status Message: Failed to run JSEP kernel
An error occurred during model execution: "Error: [WebGPU] Kernel "[Concat] /model/layers.0/self_attn/Concat_7" failed. Error: no GPU data for input: 0".
I am unsure whether the runtime (onnxruntime-web 1.17.1) is at fault here or how the model was exported. This also manifests for Xenova/stablelm-2-1_6b.
Ah yes, this was a zero-sized tensor issue that we fixed. The fix is going to be in the upcoming ort 1.17.3 release, or you can point to a dev build (https://www.npmjs.com/package/onnxruntime-web/v/1.18.0-dev.20240311-5479124834) that already has the fix.
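In case it helps before 1.17.3 ships: one way to pull in that dev build is a sketch like the following, assuming you install via npm >= 8.3 so the package.json "overrides" field is honored (yarn's equivalent is "resolutions"), which pins the nested onnxruntime-web dependency that @xenova/transformers resolves:

{
  "overrides": {
    "onnxruntime-web": "1.18.0-dev.20240311-5479124834"
  }
}

If you also fetch the .wasm binaries from a CDN via env.backends.onnx.wasm.wasmPaths as in the snippet above, point that at the matching dev version's dist/ so the binaries line up with the bundled JS.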
I'm mostly using tinyllama fp16 and mixed fp16/int4 for testing, and that was sufficient.
stablelm-2-zephyr-1_6b: the fp16 model had fp16 overflows in some layers in a Pow -> ReduceMean sequence, so I kept those in fp32.
phi2: also had fp16 overflows, and I had to keep the last 3 layers in fp32.
Mixed fp16/int4 works fine with main, the dev build, and 1.17.3.
We are still working on performance; right now it is ~10 tokens/sec.
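In case anyone wants to compare numbers, here is a sketch of one rough way to measure tokens/sec on the transformers.js side (the max_new_tokens value, greedy decoding, and re-tokenizing the output to count tokens are assumptions here; model and inputs are the ones from the distilled snippet above):

const start = performance.now();
const output = await model(inputs, { max_new_tokens: 128, do_sample: false });
const seconds = (performance.now() - start) / 1000;

// generated_text includes the prompt, so subtract the prompt's token count.
const totalTokens = model.tokenizer(output[0].generated_text).input_ids.data.length;
const promptTokens = model.tokenizer(inputs).input_ids.data.length;
console.log(`${((totalTokens - promptTokens) / seconds).toFixed(1)} tokens/sec`);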