[FEEDBACK] Inference Playground

#1
by victor HF staff - opened
Hugging Face org
•
edited Sep 23, 2024

Inference Playground

This discussion is dedicated to providing feedback on the Inference Playground and Serverless Inference API.

About the Inference Playground:

The Inference Playground is a user interface designed to simplify testing our serverless inference API with chat models. It lists the available models and lets you adjust each model's settings, test it in the UI, and copy code snippets.

If you need more usage, you can subscribe to PRO.

| User Tier | Rate Limit |
|---|---|
| Unregistered Users | 1 request per hour |
| Signed-up Users | 50 requests per hour |
| PRO and Enterprise Users | 500 requests per hour |
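
To make the limits above concrete, here is a minimal Python sketch that converts each tier's hourly quota into the average spacing a client would need between requests to stay under it. The dictionary keys and the helper name are hypothetical; only the numbers come from the limits listed above.

```python
# Hourly request limits copied from the tiers listed above.
# The keys and the function name are illustrative, not part of any API.
RATE_LIMITS_PER_HOUR = {
    "unregistered": 1,
    "signed_up": 50,
    "pro_enterprise": 500,
}


def min_seconds_between_requests(tier: str) -> float:
    """Average spacing needed to spread a tier's hourly quota evenly."""
    return 3600 / RATE_LIMITS_PER_HOUR[tier]


for tier in RATE_LIMITS_PER_HOUR:
    print(f"{tier}: one request every {min_seconds_between_requests(tier):.1f} s")
```

This assumes requests are spread evenly over the hour; how the service actually counts requests within the window is not stated here.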

Upcoming Features:

  • Continuous UI improvements
  • A dedicated UI for function calling
  • Support for vision language models
  • A feature to easily compare models
victor pinned discussion

Add place to change API keys in Playground.

Hugging Face org
•
edited Sep 26, 2024

Add place to change API keys in Playground.

Yes I'll try to add this today. edit: I added it.

support tool use 🛠️

How to use this 🤗 InferenceClient with "Langchain" or "Llama-Index"?

Hugging Face org

How to use this 🤗 InferenceClient with "Langchain" or "Llama-Index"?

What do you mean? This is just a UI for easier testing and getting the code to do Inference on HF models.

Being able to use text completion like in Open Web UI would be great.

Testing function calling output would also be very appreciated. Actually calling the functions doesn't make any sense in this case, but generating the json for it would be very useful.

I am aware that this probably takes some work to accomplish, as the templates need to be evaluated to implement this, but it would be great, even if just for popular models (like the recent llama 3b)
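
As a rough illustration of what "generating the JSON without calling the function" could look like, here is a minimal Python sketch. It assumes a JSON-schema style tool description and a model reply of the form `{"name": ..., "arguments": ...}`; the tool, its parameters, and the sample reply are all made up for the example and are not taken from the Playground.

```python
import json

# A tool description in the widely used JSON-schema style.
# The tool name and its parameters are made up for illustration.
GET_WEATHER_TOOL = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}


def parse_tool_call(raw: str, tool: dict) -> dict:
    """Parse a model's JSON tool call and check it against the tool description.

    Nothing is executed; this only verifies that the JSON is usable.
    """
    call = json.loads(raw)
    if call.get("name") != tool["name"]:
        raise ValueError(f"unexpected tool: {call.get('name')!r}")
    arguments = call.get("arguments", {})
    missing = [key for key in tool["parameters"]["required"] if key not in arguments]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return arguments


# Stand-in for text a model might generate; not real model output.
sample_output = '{"name": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}'
print(parse_tool_call(sample_output, GET_WEATHER_TOOL))
```

A UI could show the parsed arguments, or the validation error, next to the raw generation.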

Hugging Face org
•
edited Oct 9, 2024

Being able to use text completion like in Open Web UI would be great.

What do you mean? I think this is HuggingChat no?

Testing function calling output would also be very appreciated. Actually calling the functions doesn't make any sense in this case, but generating the json for it would be very useful.

Yes, we'll add that.

Hugging Face org

This is awesome! One small nit: on an iPhone, compare mode doesn't show both models well.

Maybe a carousel-type component (swipe) to show the different models could work well there.

Hugging Face org

It should be a bit better on mobile now @cfahlgren1

@victor Maybe I am missing something, but as far as I know, HuggingChat does not have a text-completion feature.
What I am referring to is a feature where you provide some text and the model completes it, like base models tend to do. Like this:

User input

The following article will discuss the differences between lemon and carrot cake:
# Lemon cake vs Carrot cake!
**Lemon cake**
A delicious cake made of flour, some sugar and some other stuff too
**Carrot cake**

AI output

Another cake also made from flour, but with a carroty twist!
[...]

In Open WebUI, one can simply pick any text model and use it to extend the given input text.
I hope this clears up what I mean.

EDIT: Got a question: does this inference playground "subtract" from the amount of compute we can spend on Spaces, or does it draw on another seemingly endless supply of compute, like HuggingChat does? (As far as I can tell, there is no limit to how many generations one can do in HuggingChat. I use it daily and haven't encountered any limit yet.)

Hugging Face org

Hi @Smorty100 ,

  1. The playground is only available for instruct models at the moment (I don't know if we will add support for base models).
  2. Yes, when you use the playground you're calling the endpoint with your HF tokens, so it's subtracting from your quotas. We'll be clarifying the limits soon.

Hi @victor ,

Instruction-tuned models can also complete text, just like base models. I just tested this with the granite3-moe-1B-instruct model with ollama (in the Godot game engine).

Here is the test:

The model is prompted completely without any template, it simply continues what has already been said in a somewhat logical way.

The model I picked here is not the best, but I used it for the speed for the demo.

Do you have any plans to add pay-as-you-go pricing per token?

And, would it be possible to support Qwen2.5-7B or 14B?

Currently, when accessing the playground from HuggingChat, it sometimes gives an Error 500 code. This happens when a model is not available on the playground but still has the button on HuggingChat.

Here is an example link that leads from HuggingChat's Llama 3.2 vision model straight to the playground.

Maybe put a check in place on the playground so that it defaults to a certain model when the one in the URL isn't available and shows a message like "This model isn't available anymore."
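
The suggested check could look roughly like the following Python sketch. The function name, the notice text, and the retired model ID are hypothetical, and the Playground's real implementation may work differently; this only illustrates the fallback idea.

```python
from typing import Iterable, Optional, Tuple


def resolve_model(
    requested: str, available: Iterable[str], default: str
) -> Tuple[str, Optional[str]]:
    """Return the model to load and an optional notice for the user."""
    if requested in set(available):
        return requested, None
    # Fall back instead of failing, and tell the user why the model changed.
    notice = f"Model {requested!r} isn't available anymore; showing {default!r} instead."
    return default, notice


# Hypothetical list of models; the retired ID below is made up.
AVAILABLE = ["Qwen/Qwen2.5-1.5B-Instruct"]
model, notice = resolve_model("example-org/retired-model", AVAILABLE, default=AVAILABLE[0])
print(model)
print(notice)
```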

[FEATURE REQUEST]

Please give us the option to write a prefix for the model response by typing something into its message field and then letting the LLM complete it.
This would give us the ability to steer the models even better.

Here is a short post by me about why prefixes for LLMs can be really useful.
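
To illustrate the idea, here is a minimal Python sketch of how a response prefix could be placed into the prompt. It assumes a ChatML-style template (`<|im_start|>` / `<|im_end|>`); other models use different templates, and the function name is hypothetical.

```python
from typing import Dict, List


def render_with_response_prefix(messages: List[Dict[str, str]], response_prefix: str) -> str:
    """Render finished turns in ChatML style, then leave the assistant turn open.

    Because the last turn is not closed with <|im_end|>, the model continues
    the text right after `response_prefix`.
    """
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    parts.append(f"<|im_start|>assistant\n{response_prefix}")
    return "".join(parts)


messages = [{"role": "user", "content": "List three fruits as JSON."}]
prompt = render_with_response_prefix(messages, response_prefix='{"fruits": [')
print(prompt)
```

The final answer shown to the user would then be the prefix followed by whatever the model generates.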

Currently, when accessing the playground from HuggingChat, it sometimes gives an Error 500 code.

This may be a variant of the problem.
https://huggingface.co./posts/Tonic/169924015276572

We have the qwen2.5-1.5B base model (not instruction tuned) on the playground but we don't have any kind of text completion interface yet.
So as it stands right now, the LLM is probably being fed with all sorts of chat tokens like <|im_start|> and <|im_end|>, which it doesn't know what to do with.
grafik.png
I tried to give it some start of a sentence, but it just ended up token-dumping on me.
So either we need a text completion interface like I requested previously, or we need to ban base models from the playground.
I would prefer a text completion interface ~
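
The difference between the two kinds of input can be sketched in a few lines of Python. This assumes a ChatML-style wrapper built from the tokens mentioned above; the helper name is hypothetical and the exact template the playground applies is not known here.

```python
# Text a user might want a base model to simply continue.
RAW_PROMPT = "# Lemon cake vs Carrot cake!\n**Lemon cake**\n"


def wrap_as_chat_turn(user_text: str) -> str:
    """Wrap text in a ChatML-style user turn followed by an open assistant turn."""
    return (
        f"<|im_start|>user\n{user_text}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )


completion_input = RAW_PROMPT                 # what a text-completion interface would send
chat_input = wrap_as_chat_turn(RAW_PROMPT)    # what a chat interface would send

print(repr(completion_input))
print(repr(chat_input))
```

A base model that never saw these special tokens during training receives the second string in a chat UI, which would be consistent with the token-dumping described above.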

Aaand here I am again with another problem.
When using Zephyr beta for chat, it usually responds with <user> and then writes a prompt a user might ask. It is surprising that Zephyr can even do that. I always thought we only train the LLM on the response side and never on the input side.
grafik.png
It seems like the chat templates in general are broken for many models. Some models simply don't have a chat template and give an error, some kinda work but don't really, like here with Zephyr beta, and sometimes even base models get a chat template, even though they haven't been instruction tuned and thus don't know what to do with these chat tokens.

Is the playground pulling the templates for these from somewhere? Maybe some of the repos don't include the correct template?
But especially with Zephyr by HuggingFace I would expect the template to be correct...

Hugging Face org

Thank you for your feedback and sorry for the late reply.

We have the qwen2.5-1.5B base model (not instruction tuned) on the playground but we don't have any kind of text completion interface yet.

Mhhh I don't think we have the base model, only the instruct one (Qwen/Qwen2.5-1.5B-Instruct), on hf.co/playground. And it seems to work for me. Note that non-instruct models are not supported in the Playground.

image.png

But especially with Zephyr by HuggingFace I would expect the template to be correct...

Mhhh same for me with <|user|>. I think it works on the model page widget, but I don't remember if we did something special for Zephyr there. cc @mishig maybe.

Hugging Face org

Small nit:

If you switch models, it deletes your system prompt. It would be awesome if it could preserve that across model switches.

This one goes hand-in-hand with the feature request for function calling, but I'd like to see a JSON mode, like many other playgrounds have as well, such as Cohere's playground.

I use this playground a lot for iterating and improving my prompting on medium-sized LLMs (mostly llama3.2 3B).
Sadly, the iteration process takes a while, because every time I enter something else as the message, I first have to remove the previous generation before I can generate a new one via mouse click. Otherwise I get this error, complaining about a generation already being in place:
image.png
This is very annoying. I would like to just press Ctrl+Enter a bunch of times while having the user message text box focused to check for good consistency with the prompts.
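
One way the requested behaviour could work is to drop a trailing assistant reply automatically before each new generation. The Python sketch below is only an illustration of that idea; the function name and message format are assumptions, not the Playground's actual code.

```python
from typing import Dict, List

Message = Dict[str, str]


def prepare_regeneration(messages: List[Message]) -> List[Message]:
    """Return a copy of the conversation without a trailing assistant reply.

    The original list is left untouched, so the old generation can still be
    shown until the new one arrives.
    """
    if messages and messages[-1]["role"] == "assistant":
        return messages[:-1]
    return list(messages)


conversation = [
    {"role": "user", "content": "Name a fruit."},
    {"role": "assistant", "content": "Apple."},
]
print(prepare_regeneration(conversation))
```

With this in place, pressing Ctrl+Enter repeatedly would simply replace the last generation each time.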

Also, and this is a very smol nit, but having a built-in markdown UI for these discussion sections would be very nice. I can live with it, but other users might not even realize this supports markdown. I only figured that out by uploading an image and recognizing the formatting.
