---
license: cc-by-nc-4.0
---
<!--- # ECAPA2 Speaker Embedding and Hierarchical Feature Extractor -->
# ECAPA2 Speaker Embedding Extractor
Link to paper: [ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings](https://arxiv.org/abs/2401.08342).
ECAPA2 is a hybrid neural network architecture and training strategy for generating robust speaker embeddings.
The provided pre-trained model comes with an easy-to-use API for extracting speaker embeddings and other hierarchical features. More details can be found in the original ECAPA2 paper linked above.
<!---
The speaker embeddings are recommended for tasks that rely directly on the identity of the speaker (e.g. speaker verification and speaker diarization).
The hierarchical features are most useful for tasks capturing intra-speaker variance (e.g. emotion recognition and speaker profiling) and, in our experience, are complementary to the speaker embedding. See our speaker profiling paper for an example usage of the hierarchical features.
-->
<p align="center">
<img src="https://cdn-uploads.huggingface.co/production/uploads/620f6a7d110b521c673c1914/cmd1Nvk6_WXInUKrjgXPj.png" width="300"/>
</p>
<!---
<img src="https://cdn-uploads.huggingface.co/production/uploads/620f6a7d110b521c673c1914/BORgtl2G6XUlWaZeMLGPc.png" width="300"/>
-->
<!---
<img src="https://cdn-uploads.huggingface.co/production/uploads/620f6a7d110b521c673c1914/ejHsEUnsWTehsIpOu7Rm_.png" width="700"/>
-->
## Usage Guide
### Download model
You need to install the `huggingface_hub` package to download the ECAPA2 model:
```bash
pip install --upgrade huggingface_hub
```
Or with Conda:
```bash
conda install -c conda-forge huggingface_hub
```
Download model:
```python
from huggingface_hub import hf_hub_download
# automatically checks for cached file, optionally set `cache_dir` location
model_file = hf_hub_download(repo_id='Jenthe/ECAPA2', filename='ecapa2.pt', cache_dir=None)
```
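`hf_hub_download` returns the local path of the cached file. For later offline runs, the cached copy can be resolved without network access via the standard `local_files_only` option of `huggingface_hub` (shown as a sketch, not specific to ECAPA2):
```python
# resolve the previously cached copy without contacting the Hub;
# raises an error if the file was never downloaded
model_file = hf_hub_download(repo_id='Jenthe/ECAPA2', filename='ecapa2.pt',
                             local_files_only=True)
```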
### Speaker Embedding Extraction
Extracting speaker embeddings is easy and only requires a few lines of code:
```python
import torch
import torchaudio
ecapa2 = torch.jit.load(model_file, map_location='cpu')
audio, sr = torchaudio.load('sample.wav') # sample rate of 16 kHz expected
embedding = ecapa2(audio)
```
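The model expects 16 kHz input. If your audio has a different sample rate, resample it first; below is a minimal sketch using `torchaudio.functional.resample` (only the 16 kHz target comes from this model card, the rest is illustrative):
```python
import torchaudio.functional as AF

# resample to the 16 kHz input rate the model expects
if sr != 16000:
    audio = AF.resample(audio, orig_freq=sr, new_freq=16000)
embedding = ecapa2(audio)
```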
For faster, 16-bit half-precision CUDA inference (recommended):
```python
import torch
import torchaudio
ecapa2 = torch.jit.load(model_file, map_location='cuda')
ecapa2.half() # optional, but results in faster inference
audio, sr = torchaudio.load('sample.wav') # sample rate of 16 kHz expected
embedding = ecapa2(audio.to('cuda')) # input must be on the same device as the model
```
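Embeddings are typically compared with cosine similarity for speaker verification. A minimal sketch, assuming the half-precision CUDA setup above (the file names and threshold are illustrative, not part of this model's API):
```python
import torch.nn.functional as F

audio_a, _ = torchaudio.load('speaker_a.wav') # hypothetical 16 kHz recordings
audio_b, _ = torchaudio.load('speaker_b.wav')

emb_a = ecapa2(audio_a.to('cuda'))
emb_b = ecapa2(audio_b.to('cuda'))

# higher score = more likely the same speaker; tune the threshold on your own data
score = F.cosine_similarity(emb_a, emb_b).item()
same_speaker = score > 0.5 # illustrative threshold
```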
The initial calls to the JIT model can in some cases take a long time while the compiler attempts to optimize the graph. If this causes issues, the JIT optimizer can be disabled as follows:
```python
with torch.jit.optimized_execution(False):
embedding = ecapa2(audio)
```
There is no need to call `ecapa2.eval()` or wrap inference in `torch.no_grad()`; both are handled automatically.
<!--
### Hierarchical Feature Extraction
For the extraction of other hierarchical features, the `label` argument can be used. It accepts a string containing the feature ids separated by '|':
```python
# default, only extract the embedding
feature = ecapa2(audio, label='embedding')
# concatenates the gfe_1, pool and embedding features
feature = ecapa2(audio, label='gfe_1|pool|embedding')
# returns the same output as previous example, concatenation always follows the order of the network
feature = ecapa2(audio, label='embedding|gfe_1|pool')
```
The following table describes the available features. All features consist of the mean and variance of the frame-level encodings at the indicated layer, except for the speaker embedding.
| Feature ID | Dimension | Description |
| ----------- | ----------- | ----------- |
| gfe_1 | 2048 | Mean and variance of frame-level features as indicated in the figure, extracted before the ReLU and BatchNorm layer. |
| gfe_2 | 2048 | Mean and variance of frame-level features as indicated in the figure, extracted before the ReLU and BatchNorm layer. |
| pool | 3072 | Pooled statistics before the bottleneck speaker embedding layer, extracted before the ReLU layer. |
| attention | 3072 | Same as the pooled statistics but with the attention weights applied. |
| embedding | 192 | The standard ECAPA2 speaker embedding. |
The following table summarizes the available feature types:
| Feature Type | Description | Usage | Labels |
| ----------- | ----------- | ----------- | ----------- |
| Local Feature | Non-uniform effective receptive field in the frequency dimension of each frame-level feature. | Abstract features, likely useful for tasks less related to speaker characteristics. | lfe1, lfe2, lfe3, lfe4 |
| Global Feature | Uniform effective receptive field of each frame-level feature in the frequency dimension. | Generally capture intra-speaker variance better than speaker embeddings, e.g. speaker profiling, emotion recognition. | gfe1, gfe2, gfe3, pool |
| Speaker Embedding | Uniform effective receptive field of each frame-level feature in the frequency dimension. | Best for tasks directly depending on speaker identity (as opposed to speaker characteristics), e.g. speaker verification, speaker diarization. | embedding |
-->
## Citation
<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
**BibTeX:**
```
@INPROCEEDINGS{ecapa2,
author={Jenthe Thienpondt and Kris Demuynck},
booktitle={2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
title={ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings},
year={2023},
}
```
**APA:**
```
Thienpondt, J., & Demuynck, K. (2023). ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).
```
## Contact
Name: Jenthe Thienpondt\
E-mail: [email protected]