Clelia (Astra) Bertelli
as-cle-bert's activity
Hi!
I generally use LangChain + PyPDF; here is a code snippet:
from langchain_community.document_loaders import PyPDFLoader

def preprocess(pdf: str) -> list:
    """
    Uses LangChain's PyPDFLoader to extract text.
    """
    loader = PyPDFLoader(pdf)
    # One Document per page of the PDF
    documents = loader.load()
    for doc in documents:
        print(doc.page_content)
    # Return the pages so the declared return type is honored
    return documents
This should give a more solid result :)
PS: LangChain is distributed under the MIT license, see their GitHub (https://github.com/langchain-ai/langchain)
Convert (almost) everything to PDF with PdfItDown, now on Spaces: as-cle-bert/pdfitdown
You can also install it locally:
python3 -m pip install pdfitdown
Don't forget to star it on GitHub if you find it useful: https://www.github.com/AstraBert/PdfItDown
qdurllm (https://github.com/AstraBert/qdurllm/tree/january-2025)
Qdurllm (Qdrant, URLs, Large Language Models) is a local Gradio application that lets you upload your web content to a local Qdrant database and search through it or chat with it.
The new pre-release (https://github.com/AstraBert/qdurllm/releases/tag/v1.0.0-rc.0) implements sparse search (with prithivida/Splade_PP_en_v1) + reranking (with nomic-ai/modernbert-embed-base by Hugging Face + Nomic AI) and semantic caching (based on Qdrant), and switches from google/gemma-2-2b-it to Qwen/Qwen2.5-1.5B-Instruct to conform to the SOTA landscape and to finally make the application based only on truly open models.
The pre-release is available for testing, and I would be really happy if you gave it a try and left your feedback on the discussion thread on GitHub (https://github.com/AstraBert/qdurllm/discussions/8) or here on the Hugging Face forum via comments under this post✨.
Find all the information to install and launch it here: https://astrabert.github.io/qdurllm/#2-installation
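For anyone wondering what the semantic caching mentioned above means in practice, here is a minimal sketch of the idea, not qdurllm's actual implementation. It assumes a toy bag-of-words "embedding" and an in-memory list instead of a real embedding model and a Qdrant collection; `SemanticCache`, `embed` and the 0.8 threshold are hypothetical names and values chosen for illustration.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding' (a stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Returns a stored answer when a new query is similar enough to an old one."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # hypothetical cutoff, tune it for your data
        self.entries = []  # list of (query embedding, answer) pairs

    def get(self, query: str):
        """Return the answer of the most similar cached query, or None."""
        q = embed(query)
        best_answer, best_score = None, 0.0
        for vec, answer in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_answer, best_score = answer, score
        return best_answer if best_score >= self.threshold else None

    def put(self, query: str, answer: str) -> None:
        """Store a query/answer pair for later reuse."""
        self.entries.append((embed(query), answer))


cache = SemanticCache(threshold=0.8)
cache.put("what is qdrant used for", "Qdrant is a vector database.")
hit = cache.get("what is qdrant used for ?")   # near-duplicate query
miss = cache.get("how do I install gradio")    # unrelated query
print(hit, miss)
```

The point of the cache is that a near-duplicate question can be answered without calling the language model again, while an unrelated question falls through to the normal retrieval and generation path.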
Thank you so much for letting me know! This is indeed a very interesting role :)
I recently released PrAIvateSearch v2.0-beta.0 (https://github.com/AstraBert/PrAIvateSearch), my privacy-first, AI-powered, user-centered and data-safe application aimed at providing a local and open-source alternative to big AI search engines such as SearchGPT or Perplexity AI.
We have several key changes:
- New chat UI built with NextJS
- DuckDuckGo API used for web search instead of Google
- Qwen/Qwen2.5-1.5B-Instruct as a language model served on API (by FastAPI)
- Crawl4AI crawler used for web scraping
- Optimizations in the data workflow inside the application
Read more in my blog post: https://huggingface.co./blog/as-cle-bert/search-the-web-with-ai
Have fun and feel free to leave feedback about how to improve the application!✨
If the answer is yes, then this post might be for you!
I recently created obsidian-digest, a Google Gemini-powered application that gives you feedback on the style and contents of the documents you have been working on.
Repo: https://github.com/AstraBert/obsidian-digest
PyPI Package: https://pypi.org/project/obsidian-digest/
The app is available as:
- Command-line tool: install it as a Python package with pip, and execute it from the terminal anytime!
- Discord bot built from source code: clone the GitHub repo, install the needed dependencies through conda, and run the bot: you will get hourly messages with suggestions and considerations about your activity on Obsidian in the previous hour.
- Discord bot deployed locally with Docker Compose: clone the GitHub repo and launch docker compose up. Docker builds an image on the fly with all the needed dependencies and scripts, and runs them. You'll have the same functionalities as the ones from source code, but with a much easier deployment process.
Go check out the GitHub repo for more info: https://github.com/AstraBert/obsidian-digest
Have fun!✨
Hi and thanks a lot for the specification!🥰
Just as a note from my side, in the article I specify that there is a difference between "open weights" and "open source" models, and I link this blog post: https://www.agora.software/en/llm-open-source-open-weight-or-proprietary/ for a deeper explanation of the difference. I never claimed (and never would claim) that Llama is open source, let alone free software (see the introduction in this article of mine on privacy and data "stealing" risks: https://huggingface.co./blog/as-cle-bert/build-an-ai-powered-search-engine-from-scratch).
And I would gladly have used DeepSeek too, if it had been available on HuggingChat! :)
I nevertheless highly appreciate your comment and I'll for sure be more cautious in using the words "open/open source" in the future. Thanks!✨
Both PdfItDown and SenTrEv only work with text for now: in future releases, support for images will be added :)
For text extraction, I use PyPDF + LangChain.
Hi HuggingFacers🤗, I decided to ship early this year, and here's what I came up with:
PdfItDown (https://github.com/AstraBert/PdfItDown) - If you're like me, and you have your whole RAG pipeline optimized for PDFs but not for other data formats, here is your solution! With PdfItDown, you can convert Word documents, presentations, HTML pages, markdown sheets and (why not?) CSVs and XMLs to PDF format, for seamless integration with your RAG pipelines. Built upon MarkItDown by Microsoft.
GitHub Repo: https://github.com/AstraBert/PdfItDown
PyPI Package: https://pypi.org/project/pdfitdown/
SenTrEv v1.0.0 (https://github.com/AstraBert/SenTrEv/tree/v1.0.0) - If you need to evaluate the retrieval performance of your text embedding models, I have good news for you🥳🥳
The new release of SenTrEv now supports dense and sparse retrieval (thanks to FastEmbed by Qdrant) with text-based file formats (.docx, .pptx, .csv, .html, .xml, .md, .pdf) and new relevance metrics!
GitHub repo: https://github.com/AstraBert/SenTrEv
Release Notes: https://github.com/AstraBert/SenTrEv/releases/tag/v1.0.0
PyPI Package: https://pypi.org/project/sentrev/
Happy New Year and have fun!
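To give an idea of what evaluating retrieval performance can involve, here is a small sketch of two common relevance metrics, success rate at k and mean reciprocal rank (MRR). The post does not list which metrics SenTrEv computes, so these two are an assumption picked for illustration, and `reciprocal_rank`, `evaluate` and the sample runs are hypothetical.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """1 / position of the relevant document, or 0.0 if it was not retrieved."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / position
    return 0.0


def evaluate(runs, k=3):
    """Compute success rate at k and MRR.

    runs: list of (ranked document ids, id of the single relevant document).
    """
    if not runs:
        raise ValueError("at least one query result is required")
    n = len(runs)
    successes = sum(1 for ranked, relevant in runs if relevant in ranked[:k])
    rr_total = sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs)
    return {"success_rate": successes / n, "mrr": rr_total / n}


# Hypothetical retrieval results for four queries
runs = [
    (["d1", "d2", "d3"], "d1"),        # relevant document at rank 1
    (["d4", "d5", "d6"], "d5"),        # relevant document at rank 2
    (["d7", "d8", "d9"], "d0"),        # relevant document never retrieved
    (["d2", "d3", "d4", "d1"], "d1"),  # relevant document at rank 4, outside k=3
]
metrics = evaluate(runs, k=3)
print(metrics)
```

Success rate only asks whether the right chunk shows up in the top k results, while MRR also rewards ranking it higher, so looking at both gives a fuller picture of an embedder.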
As my last 2024 contribution, I decided to write an article about a Competitive Debate Championship simulation I ran with 5 LLMs as competitors and 2 as judges:
https://huggingface.co./blog/as-cle-bert/debate-championship-for-llms
The article covers code, analyses and results, and you can find everything to reproduce this tournament in the GitHub repo: https://github.com/AstraBert/DebateLLM-Championship
I also released a dataset with the data (motions, arguments, topics, winners...) collected during the tournament: as-cle-bert/DebateLLMs
Happy reading and happy new yeAIr!
Get yours here on HuggingFace: as-cle-bert/what-a-git-year
GitHub repo with the code to reproduce it: https://github.com/AstraBert/what-a-git-year
Hope that everybody had a Git year!
As my last 2024 project, I've dropped a Discord bot that knows a lot about Pokémon.
GitHub: https://github.com/AstraBert/Pokemon-Bot
Demo Space: as-cle-bert/pokemon-bot
The bot integrates:
- Chat features (Cohere's Command-R) with RAG functionalities (hybrid search and reranking with Qdrant) and chat memory (managed through PostgreSQL) to produce information about Pokémon
- Image-based search to identify Pokémon from their images (via Qdrant)
- Card package random extraction and description
HuggingFace🤗, as usual, plays the most important role in the application stack, with the following models:
- sentence-transformers/LaBSE
- prithivida/Splade_PP_en_v1
- facebook/dinov2-large
And datasets:
- Karbo31881/Pokemon_images
- wanghaofan/pokemon-wiki-captions
- TheFusion21/PokemonCards
Have fun!
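As a side note on the hybrid search mentioned above: one common way to merge a dense ranking and a sparse ranking is reciprocal rank fusion (RRF). The sketch below is a generic, standard-library illustration of RRF, not the code used in the bot (the post does not say how the two result lists are combined); the document ids and the constant k=60 are assumptions for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in;
    ties are broken alphabetically so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense retriever and a sparse retriever
dense_hits = ["pikachu", "raichu", "eevee"]
sparse_hits = ["raichu", "pichu", "pikachu"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Documents that appear near the top of both lists end up first, which is why fusion usually comes before a reranking step on the merged candidates.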
I just published a blog article on building PrAIvateSearch (https://github.com/AstraBert/PrAIvateSearch), a user-owned, local and open-source AI-powered search engine:
https://huggingface.co./blog/as-cle-bert/build-an-ai-powered-search-engine-from-scratch
"Own your AI, search the web with it"
Feel free to try it out and contribute to it on GitHub: let's make OSS AI grow and thrive!
December is here and the time has come, for most of us, to wrap up our code projects and take stock of our 2024 contributions.
In order to do this, I made a small Gradio application, what-a-git-year (as-cle-bert/what-a-git-year), that scrapes information from your GitHub profile and summarizes it, also producing nice plots.
Find the GitHub repo here too: https://github.com/AstraBert/what-a-git-year ⭐
Hope that everyone had a Git year!
I just deployed a Streamlit-based Space on HF that fetches your Home Feed on BlueSky and summarizes it with Cohere's Command-R via LangChain🧪
Find it here:
as-cle-bert/bsky-feedllama-demo
I'm also working on a local Gradio implementation with Llama3.2. For now it only runs from source code and has no docs, but it will soon be supported by Docker🐳 and have a nice README:
https://github.com/AstraBert/bluesky-feedllama
Contributions and feedback are always welcome!
I'm thrilled to introduce my latest project: SenTrEv (Sentence Transformers Evaluator), a Python package that offers simple, customizable evaluation of the text retrieval accuracy and time performance of Sentence Transformers-compatible text embedders on PDF data!
Learn more in my LinkedIn post: https://www.linkedin.com/posts/astra-clelia-bertelli-583904297_python-embedders-semanticsearch-activity-7266754133557190656-j1e3
And on the GitHub repo: https://github.com/AstraBert/SenTrEv
Have fun!
If you're into biomedical sciences, you will know how painful searching PubMed can sometimes be.
For this purpose, I built a bot that scrapes PubMed for you, starting from the exact title of a publication or a keyword search, all beautifully rendered through Gradio.
Find it here: as-cle-bert/BioMedicalPapersBot
And here's the GitHub repository: https://github.com/AstraBert/BioMedicalPapersBot
It's also available as a Docker image!🐳
docker pull ghcr.io/astrabert/biomedicalpapersbot:main
Best of luck with your research!
PS: in the very near future some AI summarization features will be included!
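If you prefer to query PubMed programmatically rather than through the bot, here is a minimal sketch that builds a search URL for NCBI's public E-utilities ESearch endpoint, covering both the exact-title and the keyword case. This is not how BioMedicalPapersBot works internally (the post only says it scrapes PubMed); `build_pubmed_query` and its parameters are hypothetical, and the sketch only builds the URL without sending a request.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(text, exact_title=False, max_results=5):
    """Build an ESearch URL for a keyword search or an exact-title search."""
    # The [Title] field tag restricts the match to publication titles
    term = f'"{text}"[Title]' if exact_title else text
    params = {
        "db": "pubmed",
        "term": term,
        "retmax": max_results,
        "retmode": "json",
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


keyword_url = build_pubmed_query("crispr gene editing")
title_url = build_pubmed_query("A hypothetical paper title", exact_title=True)
print(keyword_url)
print(title_url)
```

Fetching one of these URLs (for example with `urllib.request.urlopen`) returns a JSON document with the matching PubMed ids, which can then be used to retrieve the records themselves.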
Are you working with Streamlit on Spaces and struggling with authentication and user management?
Well, you can check out my latest community article (https://huggingface.co./blog/as-cle-bert/streamlit-supabase-auth-ui) on a new Python package I've been working on that connects Supabase to the Streamlit UI, in order to create seamless authentication for your Streamlit apps!
You can find a demo of it on Spaces: as-cle-bert/streamlit-supabase-auth-ui
Have fun!
As you have probably heard, in the past weeks three tech giants (Microsoft, Amazon and Google) announced that they would bet on nuclear reactors to feed the surging energy demand of data centers, driven by increasing AI data and computational flows.
I try to explain the state of AI energy consumption, its environmental impact and the key points of "turning AI nuclear" in my latest article on the HF community blog: https://huggingface.co./blog/as-cle-bert/ai-is-turning-nuclear-a-review
Enjoy the reading!