sigridjineth/colbert-small-korean-20241212

sigridjineth/colbert-small-korean-20241212 is a Korean multi-vector (late-interaction) reranker, fine-tuned from answerai-colbert-small-v1 using a recipe inspired by JaColBERTv2.5. It is intended to rerank Korean-language passages inside a retrieval pipeline, where it delivers strong ranking quality.

Compared to the other ColBERT-based models tested (colbert-ir/colbertv2.0 and answerai/answerai-colbert-small-v1), sigridjineth/colbert-small-korean-20241212 shows particularly strong results at top_k=3, surpassing both baselines in Mean Reciprocal Rank (MRR) and Normalized Discounted Cumulative Gain (NDCG).

Model Comparison

The AutoRAG Benchmark serves as both the evaluation dataset and the toolkit for reporting these metrics.

| Model | top_k | F1 | MRR | NDCG |
|-------|------:|-----:|-----:|-----:|
| colbert-ir/colbertv2.0 | 1 | 0.2456 | 0.2456 | 0.2456 |
| colbert-ir/colbertv2.0 | 3 | 0.3596 | 0.4459 | 0.5158 |
| colbert-ir/colbertv2.0 | 5 | 0.3596 | 0.4459 | 0.5158 |
| answerai/answerai-colbert-small-v1 | 1 | 0.2193 | 0.2193 | 0.2193 |
| answerai/answerai-colbert-small-v1 | 3 | 0.3596 | 0.4240 | 0.4992 |
| answerai/answerai-colbert-small-v1 | 5 | 0.3596 | 0.4240 | 0.4992 |
| sigridjineth/colbert-small-korean-20241212 | 1 | 0.3772 | 0.3772 | 0.3772 |
| sigridjineth/colbert-small-korean-20241212 | 3 | 0.3596 | 0.5278 | 0.5769 |
| sigridjineth/colbert-small-korean-20241212 | 5 | 0.3596 | 0.5278 | 0.5769 |

Usage

Installation

This model works with recent ColBERT implementations and related RAG libraries; install whichever stack you plan to use:

pip install --upgrade ragatouille
pip install --upgrade colbert-ai
pip install --upgrade rerankers[transformers]

Using rerankers

from rerankers import Reranker

ranker = Reranker("sigridjineth/colbert-small-korean-20241212", model_type='colbert')
docs = ['이 μ˜ν™”λŠ” λ―Έμ•Όμžν‚€ ν•˜μ•Όμ˜€κ°€ κ°λ…ν•˜μ˜€μŠ΅λ‹ˆλ‹€...', 'μ›”νŠΈ λ””μ¦ˆλ‹ˆλŠ” 미ꡭ의 κ°λ…μ΄μž ...']
query = 'μ„Όκ³Ό 치히둜의 ν–‰λ°©λΆˆλͺ…을 λˆ„κ°€ κ°λ…ν–ˆλ‚˜μš”?'
ranked_docs = ranker.rank(query=query, docs=docs)
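The returned RankedResults object can then be inspected directly. A minimal sketch, assuming the top_k helper and result fields exposed by the rerankers API:

# Print the best match; top_k(n) returns the n highest-scoring results.
for result in ranked_docs.top_k(1):
    print(result.document.text, result.score)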

Using RAGatouille

from ragatouille import RAGPretrainedModel

RAG = RAGPretrainedModel.from_pretrained("sigridjineth/colbert-small-korean-20241212")
docs = ['이 μ˜ν™”λŠ” λ―Έμ•Όμžν‚€ ν•˜μ•Όμ˜€κ°€ κ°λ…ν•˜μ˜€μŠ΅λ‹ˆλ‹€...', 'μ›”νŠΈ λ””μ¦ˆλ‹ˆλŠ” 미ꡭ의 κ°λ…μ΄μž ...']

RAG.index(docs, index_name="korean_cinema")

query = 'μ„Όκ³Ό 치히둜의 ν–‰λ°©λΆˆλͺ…을 λˆ„κ°€ κ°λ…ν–ˆλ‚˜μš”?'
results = RAG.search(query)
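Because this model is a reranker, RAGatouille can also score candidate documents directly, without building an index. A minimal sketch, assuming RAGatouille's rerank method:

# Rerank candidates fetched by an upstream retriever (e.g. BM25),
# returning them sorted by ColBERT relevance score.
reranked = RAG.rerank(query=query, documents=docs, k=2)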

Using Stanford ColBERT

Indexing:

from colbert import Indexer
from colbert.infra import ColBERTConfig

INDEX_NAME = "KO_MOVIES_INDEX"
config = ColBERTConfig(doc_maxlen=512, nbits=2)

indexer = Indexer(
    checkpoint="sigridjineth/colbert-small-korean-20241212",
    config=config
)

docs = ['이 μ˜ν™”λŠ” λ―Έμ•Όμžν‚€ ν•˜μ•Όμ˜€κ°€ κ°λ…ν•˜μ˜€μŠ΅λ‹ˆλ‹€...', 'μ›”νŠΈ λ””μ¦ˆλ‹ˆλŠ” 미ꡭ의 κ°λ…μ΄μž ...']
indexer.index(name=INDEX_NAME, collection=docs)

Querying:

from colbert import Searcher
from colbert.infra import ColBERTConfig

config = ColBERTConfig(query_maxlen=32)
searcher = Searcher(index=INDEX_NAME, config=config)

query = 'μ„Όκ³Ό 치히둜의 ν–‰λ°©λΆˆλͺ…을 λˆ„κ°€ κ°λ…ν–ˆλ‚˜μš”?'
results = searcher.search(query, k=10)
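search returns parallel lists of passage ids, ranks, and scores; the passage text can be looked up in the indexed collection, following the upstream Stanford ColBERT examples:

# Each hit is a (passage_id, rank, score) triple.
for passage_id, passage_rank, passage_score in zip(*results):
    print(f"[{passage_rank}] {passage_score:.2f} {searcher.collection[passage_id]}")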

Extracting Vectors:

from colbert.modeling.checkpoint import Checkpoint
from colbert.infra import ColBERTConfig

ckpt = Checkpoint("sigridjineth/colbert-small-korean-20241212", colbert_config=ColBERTConfig())
embedded_query = ckpt.queryFromText(["ν•˜μšΈμ˜ μ›€μ§μ΄λŠ” μ„± μ˜μ–΄ 더빙에 μ°Έμ—¬ν•œ μ„±μš°λŠ” λˆ„κ΅¬μΈκ°€?"], bsize=16)  # "Who voiced the English dub of Howl's Moving Castle?"
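Document embeddings can be extracted the same way with docFromText, and the ColBERT late-interaction (MaxSim) score can then be computed from the two embedding tensors. A minimal sketch; the einsum-based scoring below is an illustrative reimplementation of MaxSim, not the library's own scorer:

import torch

docs = ['이 μ˜ν™”λŠ” λ―Έμ•Όμžν‚€ ν•˜μ•Όμ˜€κ°€ κ°λ…ν•˜μ˜€μŠ΅λ‹ˆλ‹€...', 'μ›”νŠΈ λ””μ¦ˆλ‹ˆλŠ” 미ꡭ의 κ°λ…μ΄μž ...']
embedded_docs = ckpt.docFromText(docs, bsize=16)  # (num_docs, doc_maxlen, dim), zero-padded

# MaxSim: each query token keeps its highest similarity across all document
# tokens, and those per-token maxima are summed into one score per document.
# (Padding vectors are all-zero, so they contribute a similarity of 0.)
scores = torch.einsum("qnd,cmd->qcnm", embedded_query, embedded_docs).max(-1).values.sum(-1)
print(scores)  # shape: (num_queries, num_docs)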

Referencing

If you use this model or other JaColBERTv2.5-based models, please cite:

@article{clavie2024jacolbertv2,
  title={JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources},
  author={Clavi{\'e}, Benjamin},
  journal={arXiv preprint arXiv:2407.20750},
  year={2024}
}