Inference Endpoints Changelog πŸš€

Community Article Published October 11, 2024

Week 04, Jan 20 - Jan 26

This week the whale got introduced to our inference catalog πŸ‹

  • Both deepseek-r1-distill-llama-70 and deepseek-r1-distill-qwen-32b are available as one-click deploys with optmized configurations
  • You can also now share the catalog models and get smooth previews on social media platforms, like so: image/png
  • We also made it more clear in the UI if you're about to deploy a model that has an optimized configuration, a.k.a a catalog model: image/png
  • And finally we've started remaking the settings page, let us know what you think! πŸ‘€ image/png
  • And ofc we fixed some small bugs πŸ€“

Week 03, Jan 13 - Jan 19

This update comes with super nice features to the catalog and managing endpoints πŸ”₯

  • managing your endpoints now looks like this πŸš€ you can:
    • filter
    • sort by different columns
    • bulk delete

image/png

  • the /new page can be accessed without being logged in, try this link in in incognito mode πŸ‘€
  • we also squashed bugs that were causing issues with deploying text embedding models and tweaded the instance recommendation for gemma models!

Week 02, Jan 06 - Jan 12

We're getting back to a good momentum after Christmas and New Years with some cool updates! 😎

  • Download pattern πŸ“¦ here the core idea is that you should only download the files that are relevant for your endpoint. Using only .safetensor files? No need to download the .pt files. Especially if you want to optimize the startup time for your endpoint. We now have an explict setting to tune this! image/png

  • When creating an endpoint, you don't need to open the full modal to see what you've chosen, previews are visible like this πŸ”₯ image/png

  • We made the dashboard more straight to point! By default you'll now arrive on the page for managing your endpoints, no intermediary views image/png

Week Christmas and New Years, Dec 15 - Jan 05

I'll do one larger with several weeks of updates combined with the Holidays we had. Since last time we have some nice improvements on:

  • added llama.cpp supported models to the catalog πŸ”₯ check them out here
  • fixes bugs on the /new page
  • bug fix related to updating passwords in container
  • a lot of nitty gritty work in the background
  • and recharged for the new year πŸ’ͺ

Week 50, Dec 09 - Dec 15

The big update from this week is getting TGI v3 out πŸ”₯ You can read all about the update here but a short tl;dr is:

  • zero configuration
  • increased performance

Check the benchmarks: image/png

We also:

  • improved the messaging in the UI when you reach your quota
  • did minor bug fixes

Week 49, Dec 02 - Dec 08

This week we have a lot of nice updates πŸš€

  • New and improved UI for the /new page πŸ™Œ our aim was to make the configuration cleaner and remove outdated fields, there are more updates coming but we think this is already a nice improvement.

image/png

  • You can now configure the hardware utilization threshold for autoscaling.
  • A bunch of models are now supported on the inf2 accelerator.
  • Mixtral-8x7B is now supported on TPUs.

Week 48, Nov 25 - Dec 01

This week we finally got back to shipping after the off-site and flu πŸ”₯

Updates:

  • If you autoscale based on pending requests, you can manually set the threshold to meet your specific requirements image/png
  • You can now view logs further back in history. Up to the last 50 replicas for a particular deployment.
  • New models added to the catalogue, like Qwen2-VL-7B-Instruct and Qwen2.5-Coder-32B-Instruct.
  • Updated default TGI version to 2.4.1
  • Added CPU as an alternative for the llama.cpp container type (shoutout to @ngxson)
  • Fixed an issue with the revision link and default hardware configurations for catalog models.
  • The default scale-to-zero timeout is now 15min. Previously it was never scale to zero.

Week 47, Nov 18 - Nov 24

Unfortunately, a wave of flu has hit our team, and we needed some time to recover πŸ€’ No updates this week, but stay tuned for next weekβ€”we have a lot of exciting things coming up! πŸ”₯

Week 46, Nov 11 - Nov 17

No changes this week as the team was on an off-site in Martinique! But a lot of ideas and energy cooked up for the coming week πŸ™Œ

image/jpeg

Week 45, Nov 04 - Nov 10

This week, we have some awesome updates that are finally out πŸ™Œ

  • Scaling replicas based on pending requests is now in beta πŸ”₯ Since it's in beta, things might change, but you can try it out and read more about it here image/png
  • Improved analytics with a graph of the replica history image/png
  • Updates to the widgets
    • Fixed bug in streaming
    • Conversations can now be cleared
    • Submit message with cmd+enter

Week 44, Oct 28 - Nov 03

Probably the biggest update this week was a revamp to the Inference Catalogue πŸ”₯ You can now with a one-click-deploy find a model based on:

  • license
  • price range
  • inference server
  • accelerator
  • and the previously existing task and search filters image/png

Additionally:

  • we fixed the config for MoritzLaurer/deberta-v3-large-zeroshot-v2.0 so that you can run it on CPU as well
  • and also thanks to @ngxson for fixing a bug in the llama.cpp snippet

Week 43, Oct 21-27

This week you'll get a sneak peak of the upcoming autoscaling, in the form of analytics πŸ‘€

We have:

  • Added pending http requests to the analytics
  • Support for Image-Text-To-Text, aka language vision models πŸ”₯ (llama vision has some good jokes πŸ˜…) image/png
  • Improved the log pagination and added some nice visual touches
  • Fixed a bug related to total request count in the analytics

Week 42, Oct 14-20

This week was unfortunately slower on the user-facing updates.

Behind the scenes, we:

  • fixed several recommendation values for LLaMA and Qwen 2,
  • improved our internal analytics,
  • debugged issues related to weights downloading and getting 429s,
  • and hopefully squashed the last bugs so we can soon release the new autoscaling πŸ”₯

Week 41, Oct 7-13

This week we had a lot of nice UI/UX improvements:

  • clearer error on models that are too large for any instance type, like for llama 405B πŸ˜… image/png

  • better logs loading message if the endpoint isn't ready image/png

Additionally:

  • deprecated the "text2text-generation" tasks, it's been deprecated on the Hub and in the Inference API as well
  • you can now pass the "seed" parameter in the widget for diffuser models
  • small bug fixes on llama.cpp containers
  • you can directly play in the widget with openAI API parameters
  • Shoutout to Alvaro for making the NVLM-D-72B model compatible on endpoints πŸ™Œ

On the backend we're also making improvements to the autoscaling. This might not immediately have noticeable impact for user but soon it'll ripple to the front end as well. Stay tuned πŸ‘€

Community

Sign up or log in to comment