KGR10 word2vec Polish word embeddings

Distributional language models for Polish trained on the KGR10 corpus.

Models

This repository contains two models selected after evaluation (see the table below). The model that performed best is the default model/config (see default_config.json).

method    dimension  hs     mwe
cbow      300        false  true   <-- default
skipgram  300        true   true

(hs = hierarchical softmax, mwe = multi-word expressions)
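The two rows appear to map onto the configuration object used in the Customisable way section below. A hedged sketch of the default (cbow) row, assuming that the method and hs parameters shown there are the only ones that need to be set:

from embeddings.embedding.static.word2vec import KGR10Word2VecConfig

# Assumption: the default table row (cbow, hs = false) corresponds to these
# parameters; only method and hs, the arguments used later in this card, are set.
default_config = KGR10Word2VecConfig(method='cbow', hs=False)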

Usage

To use these embedding models, install the clarinpl-embeddings package:

pip install clarinpl-embeddings

Utilising the default model (the easiest way)

Word embedding:

from embeddings.embedding.auto_flair import AutoFlairWordEmbedding
from flair.data import Sentence

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

embedding = AutoFlairWordEmbedding.from_hub("clarin-pl/word2vec-kgr10")
embedding.embed([sentence])

for token in sentence:
    print(token)
    print(token.embedding)
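Each token.embedding is a plain torch tensor, so it can be inspected or compared directly. A minimal sketch continuing from the snippet above (the similarity computation uses torch only and is not part of the embeddings API):

import torch

# The vector size should match the dimension column in the table above (300).
print(sentence.tokens[0].embedding.shape)

# Cosine similarity between the vectors of the first two tokens.
similarity = torch.nn.functional.cosine_similarity(
    sentence.tokens[0].embedding, sentence.tokens[1].embedding, dim=0
)
print(similarity.item())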

Document embedding (averaged over words):

from embeddings.embedding.auto_flair import AutoFlairDocumentEmbedding
from flair.data import Sentence

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")

embedding = AutoFlairDocumentEmbedding.from_hub("clarin-pl/word2vec-kgr10")
embedding.embed([sentence])

print(sentence.embedding)
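The document embedding is a single averaged vector per Sentence, so two documents can be compared directly. A minimal sketch continuing from the snippet above (the second text is an arbitrary example and the cosine similarity uses torch only):

import torch

other = Sentence("Język giętki powiedział wszystko, co pomyśli głowa.")
embedding.embed([other])

# Cosine similarity between the two averaged document vectors.
score = torch.nn.functional.cosine_similarity(sentence.embedding, other.embedding, dim=0)
print(score.item())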

Customisable way

Word embedding:

from embeddings.embedding.static.embedding import AutoStaticWordEmbedding
from embeddings.embedding.static.word2vec import KGR10Word2VecConfig
from flair.data import Sentence

config = KGR10Word2VecConfig(method='skipgram', hs=False)
embedding = AutoStaticWordEmbedding.from_config(config)

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])

for token in sentence:
    print(token)
    print(token.embedding)
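embed already receives a list in the examples above, so several sentences can be embedded in one call. A short sketch continuing from the snippet above (the texts are arbitrary examples):

sentences = [
    Sentence("Litwo! Ojczyzno moja!"),
    Sentence("Ty jesteś jak zdrowie."),
]
embedding.embed(sentences)

for s in sentences:
    for token in s:
        # Each token again carries a 300-dimensional vector.
        print(token.text, token.embedding.shape)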

Document embedding (averaged over words):

from embeddings.embedding.static.embedding import AutoStaticDocumentEmbedding
from embeddings.embedding.static.word2vec import KGR10Word2VecConfig
from flair.data import Sentence

config = KGR10Word2VecConfig(method='skipgram', hs=False)
embedding = AutoStaticDocumentEmbedding.from_config(config)

sentence = Sentence("Myśl z duszy leci bystro, Nim się w słowach złamie.")
embedding.embed([sentence])

print(sentence.embedding)
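Because every document reduces to one vector, the static document embedding can also be used for simple retrieval, for example ranking candidate texts against a query by cosine similarity. A minimal sketch continuing from the snippet above (the example texts are arbitrary and the ranking logic is not part of the library):

import torch

query = Sentence("Myśl z duszy leci bystro.")
candidates = [
    Sentence("Nim się w słowach złamie."),
    Sentence("Litwo! Ojczyzno moja! Ty jesteś jak zdrowie."),
]
embedding.embed([query] + candidates)

# Rank candidates by cosine similarity of their averaged vectors to the query.
scores = [
    torch.nn.functional.cosine_similarity(query.embedding, c.embedding, dim=0).item()
    for c in candidates
]
for candidate, score in sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True):
    print(round(score, 3), " ".join(token.text for token in candidate))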

Citation

Piasecki, Maciej; Janz, Arkadiusz; Kaszewski, Dominik; et al., 2017,  Word Embeddings for Polish, CLARIN-PL digital repository, http://hdl.handle.net/11321/442.

or

@misc{11321/442,	
 title = {Word Embeddings for Polish},	
 author = {Piasecki, Maciej and Janz, Arkadiusz and Kaszewski, Dominik and Czachor, Gabriela},	
 url = {http://hdl.handle.net/11321/442},	
 note = {{CLARIN}-{PL} digital repository},	
 copyright = {{GNU} {GPL3}},	
 year = {2017}	
}