Sep 22, 2023

Feel free to add your feedback about the Inference API for PRO users as well as other features.

osanseviero pinned discussion Sep 22, 2023

Oct 7, 2023

I am getting a truncated response. I am using HuggingFaceHub like this:
repo_id = "meta-llama/Llama-2-70b-chat-hf"
args = {
'temperature': 1,
"max_length":1024
}
HuggingFaceService.llm = HuggingFaceHub(repo_id=repo_id, model_kwargs=args)

prompt is : You are a assistant who can generate the response based on the prompt
Use the following pieces of context to answer the question at the end.
If you don't find the answer, just say Sorry I didn't understand, can you rephrase please.
[Document(page_content='Types of workflow in the DigitalChameleon platform There are two types of workflows that can be created in the platform which include: 1.\tConversation: A series of nodes with questions or text displayed to the customer in a sequence one by one, to capture the response of Customer, is referred to as Conversation workflow. The nodes of a workflow of conversation type are loaded on the webpage to the customer one at a time. The flow can be modified to return to a previous flow or allow customer to resume work at a later point in time. Workflow will go to the next node only when the customer performs the desired action in the previous node as configured in the workflow. 2.\tForm: A one time loading of the nodes/questions/messages to the end customer all at once in the UI of a form. The form will be created in the similar manner as we create for conversation in the CMS except for the workflow type in the journey properties should be selected as Form while creating/copying the workflow.
Question: explain the Types of workflow in the DigitalChameleon platform

result : "result": ". \n ')]] Sure, I'd be happy to explain the types of"

I am using langchain to get the answers based on a text file.
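
One thing visible in the prompt above is that the context appears as the raw `[Document(page_content=...` list representation, and the returned text starts with `')]]`. Whether this is related to the short response is an assumption, not a confirmed diagnosis. A minimal sketch of passing only the document text into the prompt is shown here; the `Document` dataclass, `PROMPT_TEMPLATE`, and `build_prompt` are hypothetical stand-ins for illustration and are not langchain's own API:

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for the retrieved document objects in the post."""
    page_content: str


# Prompt wording taken from the post; {context} and {question} are placeholders.
PROMPT_TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n"
    "If you don't find the answer, just say Sorry I didn't understand, "
    "can you rephrase please.\n"
    "{context}\n"
    "Question: {question}"
)


def build_prompt(docs, question):
    # Join only the text of each document. Formatting the list itself
    # would embed "[Document(page_content=..." in the prompt.
    context = "\n\n".join(doc.page_content for doc in docs)
    return PROMPT_TEMPLATE.format(context=context, question=question)


docs = [
    Document(
        page_content="There are two types of workflows that can be created in the platform"
    )
]
prompt = build_prompt(
    docs, "explain the Types of workflow in the DigitalChameleon platform"
)
print(prompt)
```

The resulting string contains the document text and the question, without the surrounding Python object syntax.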

julien-c

Hugging Face org Oct 9, 2023

@it-chameleoncx can you format your post with codeblocks (```) thanks

pzaback

Oct 26, 2023

Have any models been added for PRO users beyond the four listed in the announcement blog post, such as the current top Code Llama derivative from Phind? Or, if I wanted to use that with llm-vscode, would I still need to pay for my own inference endpoint?

FPVG

Nov 20, 2023

Are you planning to add more models to the PRO interfaces, for example teknium/OpenHermes-2.5-Mistral-7B?

FPVG

Dec 18, 2023

Hi,
Please add a PRO interface for mistralai/Mixtral-8x7B-Instruct-v0.1. It would also be nice to have interfaces for other models that are available through HuggingChat but not to PRO subscribers.
Thank you 🙂

Failing9617

Dec 18, 2023 • edited Dec 18, 2023

Hi, can you please provide a link to a privacy policy that applies to the PRO Inference API?

momegas

May 10, 2024

Hello, can you please add https://huggingface.co./aaditya/Llama3-OpenBioLLM-70B to the PRO subscription?

Silver13th

Jun 19, 2024

Sorry, I didn't understand: are there any limitations on requests for PRO / Free accounts, such as a token limit?

EmanueleLaMalfa

Sep 21, 2024

I am trying to access meta-llama/Llama-2-70b-chat-hf, which was previously available to PRO subscribers, but the model does not seem to respond.
Can you please reactivate it?

Mihaiii

Oct 15, 2024

I can't apply spaces.GPU to async functions, and I can't apply it to wrapped functions either. It would be nice if both were possible.

cdcvd

Nov 11, 2024

Hello Hugging Face Support Team,

I’m interested in using the models available through the PRO subscription and have reviewed the details on Inference for PRO in the following link: https://huggingface.co./blog/inference-pro.

Specifically, I would like to use the following model:
https://api-inference.huggingface.co/models/openai/whisper-large-v3-turbo

I’d like to know what the monthly usage limits are for this model under the PRO subscription. Specifically, how many requests can I make in a month, and what other limitations might apply?

Could you please provide detailed information regarding rate limits, monthly request quotas, response times, and any other restrictions associated with the PRO plan?

Thank you for your assistance.

Nymbo

Feb 1 • edited Feb 1

Hey folks, I'm not sure where the best place to put this is, but I'd like some clarity on the models that have increased inference usage for PROs ~

Current Inference Docs

In the recently published serverless inference docs it mentions these models as having higher rate limits:

| Model | Size | Supported Context Length | Use |
| --- | --- | --- | --- |
| Meta Llama 3.1 Instruct | 8B, 70B | 70B: 32k tokens / 8B: 8k tokens | High quality multilingual chat model with large context length |
| Meta Llama 3 Instruct | 8B, 70B | 8k tokens | One of the best chat models |
| Meta Llama Guard 3 | 8B | 4k tokens | |
| Llama 2 Chat | 7B, 13B, 70B | 4k tokens | One of the best conversational models |
| DeepSeek Coder v2 | 236B | 16k tokens | A model with coding capabilities. |
| Bark | 0.9B | - | Text to audio generation |

Old Blog Article

But there's also this old blog post that introduces the feature with these models and it hasn't been updated:

| Model | Size | Context Length | Use |
| --- | --- | --- | --- |
| Meta Llama 3 Instruct | 8B, 70B | 8k tokens | One of the best chat models |
| Mixtral 8x7B Instruct | 45B MOE | 32k tokens | Performance comparable to top proprietary models |
| Nous Hermes 2 Mixtral 8x7B DPO | 45B MOE | 32k tokens | Further trained over Mixtral 8x7B MoE |
| Zephyr 7B β | 7B | 4k tokens | One of the best chat models at the 7B weight |
| Llama 2 Chat | 7B, 13B | 4k tokens | One of the best conversational models |
| Mistral 7B Instruct v0.2 | 7B | 4k tokens | One of the best chat models at the 7B weight |
| Code Llama Base | 7B and 13B | 4k tokens | Autocomplete and infill code |
| Code Llama Instruct | 34B | 16k tokens | Conversational code assistant |
| Stable Diffusion XL | 3B UNet | - | Generate images |
| Bark | 0.9B | - | Text to audio generation |

I assume that the new inference docs have the correct supported models list but could be updated to avoid confusion.
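
To make the mismatch between the two lists concrete, here is a small Python sketch that compares them by model name only. The names are copied from the two tables above; matching on the name alone is a simplifying assumption, so it ignores differences in listed sizes (for example, Llama 2 Chat 70B appears only in the docs table).

```python
# Model names copied from the serverless inference docs table.
docs_models = {
    "Meta Llama 3.1 Instruct",
    "Meta Llama 3 Instruct",
    "Meta Llama Guard 3",
    "Llama 2 Chat",
    "DeepSeek Coder v2",
    "Bark",
}

# Model names copied from the old blog post table.
blog_models = {
    "Meta Llama 3 Instruct",
    "Mixtral 8x7B Instruct",
    "Nous Hermes 2 Mixtral 8x7B DPO",
    "Zephyr 7B β",
    "Llama 2 Chat",
    "Mistral 7B Instruct v0.2",
    "Code Llama Base",
    "Code Llama Instruct",
    "Stable Diffusion XL",
    "Bark",
}

# Set operations give the overlap and the entries unique to each list.
in_both = sorted(docs_models & blog_models)
only_in_docs = sorted(docs_models - blog_models)
only_in_blog = sorted(blog_models - docs_models)

print("In both lists:", in_both)
print("Only in the docs:", only_in_docs)
print("Only in the blog post:", only_in_blog)
```

By name, only three models appear in both lists (Bark, Llama 2 Chat, and Meta Llama 3 Instruct), while seven blog entries are missing from the docs.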

My Suggestions

If the inference docs are correct, I think the list could use some updating!

- Llama-3-70B could be swapped out for Llama-3.3-70B-Instruct, while keeping Llama-3.1-8B-Instruct.
- We probably don't need two large Llama 3.x models, so I'd suggest replacing Llama-3.1-70B with Qwen2.5-72B-Instruct.
- It's time to retire Llama-2... In 2025 we have plenty of great reasoning models to prioritize, like QwQ-32B-Preview or DeepSeek-R1-Distill-Qwen-32B.
- Although I do like the novel nature of suno/bark, it is starting to show its age. I'd suggest replacing it with hexgrad/Kokoro-82M for its small size, exceptional quality, and long inputs.
- DeepSeek-Coder-V2 is a very large model that is matched or outperformed by Qwen2.5-Coder-32B-Instruct. If there are any concerns about the size of models or potential load, I think aiming to replace DeepSeek-Coder-V2 would be a wise use of resources.

Notable mentions and other thoughts

I tried to keep my suggestions limited to the current paradigm of serverless inference so that each model is a drop-in replacement for existing ones, while being realistic about size. However, it would be awesome to have a text-to-image model available on this list. The best and most agreeable image gen model is either FLUX.1-schnell or stabilityai/stable-diffusion-3.5-medium. Both models are relatively smol, and all above models are commercially permissive or already available on HuggingChat.

Thanks for reading :)


[FEEDBACK and SHOWCASE] PRO subscription
