LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language ⚖️

LegalNLP, a Natural Language Processing library for the Brazilian legal language, was born from a partnership between Brazilian researchers and the legal tech company Tikal Tech, based in São Paulo, Brazil. Besides containing pre-trained language models for the Brazilian legal language, LegalNLP provides functions that facilitate the manipulation of legal texts in Portuguese, along with demonstrations/tutorials to help people in their own work.

You can access our paper on arXiv (arXiv:2110.15709).

If you use our library in your academic work, please cite us in the following way:

@article{polo2021legalnlp,
  title={LegalNLP--Natural Language Processing methods for the Brazilian Legal Language},
  author={Polo, Felipe Maia and Mendon{\c{c}}a, Gabriel Caiaffa Floriano and Parreira, Kau{\^e} Capellato J and Gianvechio, Lucka and Cordeiro, Peterson and Ferreira, Jonathan Batista and de Lima, Leticia Maria Paz and Maia, Ant{\^o}nio Carlos do Amaral and Vicente, Renato},
  journal={arXiv preprint arXiv:2110.15709},
  year={2021}
}

Summary

Accessing the Language Models
Introduction / Installing package
Language Models (Details / How to use)
1. Word2Vec/Doc2Vec
Demonstrations / Tutorials
References

0. Accessing the Language Models

All our models can be found here.

Please contact [email protected] if you have any problem accessing the language models.

1. Introduction / Installing package

LegalNLP helps address the scarcity of Natural Language Processing resources focused on the Brazilian legal language. The library was made for Python, one of the best-known programming languages for machine learning.

First, install the huggingface_hub library by running the following command in a terminal:

$ pip install huggingface_hub

Import hf_hub_download:

from huggingface_hub import hf_hub_download

Then you can download our Word2Vec(SG)/Doc2Vec(DBOW) and Word2Vec(CBOW)/Doc2Vec(DM) models with the following commands:

w2v_sg_d2v_dbow = hf_hub_download(repo_id="Projeto/LegalNLP", filename="w2v_d2v_dbow_size_100_window_15_epochs_20")
w2v_cbow_d2v_dm = hf_hub_download(repo_id="Projeto/LegalNLP", filename="w2v_d2v_dm_size_100_window_15_epochs_20")

2. Language Models (Details / How to use)

2.1. Word2Vec/Doc2Vec

Our first models for generating vector representations of tokens and texts (embeddings) are variations of the Word2Vec [1, 2] and Doc2Vec [3] methods. In short, the Word2Vec methods generate embeddings for tokens that capture the meaning of the various textual elements, based on the contexts in which these elements appear. Doc2Vec methods are extensions/modifications of Word2Vec for generating representations of whole texts.

Before feeding text to these models, apply at least minimal preprocessing by converting all letters to lowercase. Please check our paper or the Gensim documentation for more details. Preferably use Gensim version 3.8.3.
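As a minimal sketch of that preprocessing step, the helper below lowercases a text and splits it on whitespace, the same tokenization used in the Doc2Vec example further down. The function name `preprocess` is hypothetical and not part of LegalNLP; the paper describes the full cleaning applied to the training data, which this sketch does not reproduce.

```python
def preprocess(text):
    """Lowercase a text and split it into whitespace-delimited tokens.

    Hypothetical helper for illustration; not part of LegalNLP.
    """
    return text.lower().split()


tokens = preprocess("Direito do Consumidor Origem : Bangu Regional XXIX Juizado Especial Civel")
print(tokens)
```

The resulting token list can be passed to the models' lookup and inference functions.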

Below we have a summary table with some important information about the trained models:

Filenames	Doc2Vec	Word2Vec	Size	Window
`w2v_d2v_dm*`	Distributed Memory (DM)	Continuous Bag-of-Words (CBOW)	100, 200, 300	15
`w2v_d2v_dbow*`	Distributed Bag-of-Words (DBOW)	Skip-Gram (SG)	100, 200, 300	15

Here we make available both models with vector size 100 and window size 15.

Using Word2Vec

Installing Gensim

!pip install gensim=='3.8.3'

Loading W2V:

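from gensim.models import KeyedVectors

# Load the saved model (the file holds a full Doc2Vec model)
# and keep only its word vectors, exposed through the .wv attribute
w2v = KeyedVectors.load(w2v_cbow_d2v_dm)
w2v = w2v.wv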

Viewing the first 10 entries of the 'juiz' vector

w2v['juiz'][:10]

array([ 6.570131  , -1.262787  ,  5.156106  , -8.943866  , -5.884408  ,
       -7.717058  ,  1.8819941 , -8.02803   , -0.66901577,  6.7223144 ],
      dtype=float32)

Viewing closest tokens to 'juiz'

w2v.most_similar('juiz')

[('juíza', 0.8210258483886719),
 ('juiza', 0.7306275367736816),
 ('juíz', 0.691645085811615),
 ('juízo', 0.6605231165885925),
 ('magistrado', 0.6213295459747314),
 ('mmª_juíza', 0.5510469675064087),
 ('juizo', 0.5494943261146545),
 ('desembargador', 0.5313084721565247),
 ('mmjuiz', 0.5277603268623352),
 ('fabíola_melo_feijão_juíza', 0.5043971538543701)]
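The scores returned by `most_similar` are cosine similarities between embedding vectors. The sketch below computes that quantity in plain Python for toy vectors; the vectors and the function name `cosine_similarity` are made up for illustration and are not taken from the models or from Gensim.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a, so similarity is close to 1
c = [-3.0, 0.0, 1.0]  # orthogonal to a, so similarity is 0

print(cosine_similarity(a, b))
print(cosine_similarity(a, c))
```

With the loaded model, `w2v.similarity('juiz', 'juíza')` should give the same kind of score for a pair of tokens.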

Using Doc2Vec

Installing Gensim

!pip install gensim=='3.8.3'

Loading D2V

from gensim.models import Doc2Vec

# Loading a D2V model
d2v = Doc2Vec.load(w2v_cbow_d2v_dm)

Inferring vector for a text

txt = 'direito do consumidor origem : bangu regional xxix juizado especial civel ação : [processo] - - recte : fundo de investimento em direitos creditórios'
tokens = txt.split()

txt_vec = d2v.infer_vector(tokens, epochs=20)
txt_vec[:10]

array([ 0.02626514, -0.3876521 , -0.24873355, -0.0318402 ,  0.3343679 ,
       -0.21307918,  0.07193747,  0.02030687,  0.407305  ,  0.20065512],
      dtype=float32)
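To use these vectors as features for a classifier, each document can be embedded in turn. The following is a minimal sketch, assuming a hypothetical helper `embed_texts` (not part of LegalNLP or Gensim) that accepts any object exposing `infer_vector(tokens, epochs=...)`, such as the `d2v` model loaded above. A stand-in model is used here so the example runs without the model file.

```python
def embed_texts(model, texts, epochs=20):
    """Return one inferred vector per text.

    Hypothetical helper: lowercases and whitespace-tokenizes each text,
    then calls model.infer_vector on the tokens.
    """
    return [model.infer_vector(text.lower().split(), epochs=epochs) for text in texts]


class DummyModel:
    """Stand-in for a loaded Doc2Vec model (illustration only).

    Returns the token count and the number of epochs instead of an embedding.
    """

    def infer_vector(self, tokens, epochs=None):
        return [float(len(tokens)), float(epochs)]


vectors = embed_texts(DummyModel(), ["Direito do Consumidor", "Juizado Especial Civel Bangu"])
print(vectors)
```

With the real model, `embed_texts(d2v, texts)` would return one 100-dimensional vector per text. Because inference is iterative and randomized, repeated calls on the same text may produce slightly different vectors.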

3. Demonstrations / Tutorials

For a better understanding of the application of these models, below are the links to notebooks where we apply them to a legal dataset using various classification models such as Logistic Regression and CatBoost:

BERT notebook :
Word2Vec notebook :
Doc2Vec notebook :

4. References

[1] Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. (2013b). Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.

[2] Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013a). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[3] Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196. PMLR.

[4] Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

[5] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

[6] Souza, F., Nogueira, R., and Lotufo, R. (2020). BERTimbau: pretrained BERT models for Brazilian Portuguese. In 9th Brazilian Conference on Intelligent Systems, BRACIS, Rio Grande do Sul, Brazil, October 20-23