codebert committed on
Commit
5f0dc25
·
verified ·
1 Parent(s): 1e11483

Update README.md

  1. README.md +219 -3
README.md CHANGED
@@ -1,3 +1,219 @@
1
- ---
2
- license: apache-2.0
3
- ---
1
+ ---
2
+ language:
3
+ - en
4
+ license: apache-2.0
5
+ ---
6
+
7
+ # Model Card for UniXcoder-base
8
+
9
+
10
+
11
+ # Model Details
12
+
13
+ ## Model Description
14
+ UniXcoder is a unified cross-modal pre-trained model that leverages multimodal data (i.e., code comments and abstract syntax trees) to pre-train code representations.
15
+
16
+ - **Developed by:** Microsoft Team
17
+ - **Shared by [Optional]:** Hugging Face
18
+ - **Model type:** Feature Extraction
19
+ - **Language(s) (NLP):** en
20
+ - **License:** Apache-2.0
21
+ - **Related Models:**
22
+ - **Parent Model:** RoBERTa
23
+ - **Resources for more information:**
24
+ - [Associated Paper](https://arxiv.org/abs/2203.03850)
25
+
26
+ # Uses
27
+
28
+ ## 1. Dependency
29
+
30
+ - pip install torch
31
+ - pip install transformers
32
+
33
+ ## 2. Quick Tour
34
+ We provide a wrapper class, `UniXcoder`, for loading and running the model; follow the code below to build it.
35
+ Download the class with:
36
+ ```shell
37
+ wget https://raw.githubusercontent.com/microsoft/CodeBERT/master/UniXcoder/unixcoder.py
38
+ ```
39
+
40
+ ```python
41
+ import torch
42
+ from unixcoder import UniXcoder
43
+
44
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
45
+ model = UniXcoder("microsoft/unixcoder-base")
46
+ model.to(device)
47
+ ```
48
+
49
+ In the following, we give zero-shot examples for several tasks in different modes, including **code search (encoder-only)**, **code completion (decoder-only)**, **function name prediction (encoder-decoder)**, **API recommendation (encoder-decoder)**, and **code summarization (encoder-decoder)**.
50
+
51
+ ## 3. Encoder-only Mode
52
+
53
+ For encoder-only mode, we give an example of **code search**.
54
+
55
+ ### 1) Code and NL Embeddings
56
+
57
+ Here, we give an example of obtaining code fragment embeddings from UniXcoder.
58
+
59
+ ```python
60
+ # Encode maximum function
61
+ func = "def f(a,b): if a>b: return a else return b"
62
+ tokens_ids = model.tokenize([func],max_length=512,mode="<encoder-only>")
63
+ source_ids = torch.tensor(tokens_ids).to(device)
64
+ tokens_embeddings,max_func_embedding = model(source_ids)
65
+
66
+ # Encode minimum function
67
+ func = "def f(a,b): if a<b: return a else return b"
68
+ tokens_ids = model.tokenize([func],max_length=512,mode="<encoder-only>")
69
+ source_ids = torch.tensor(tokens_ids).to(device)
70
+ tokens_embeddings,min_func_embedding = model(source_ids)
71
+
72
+ # Encode NL
73
+ nl = "return maximum value"
74
+ tokens_ids = model.tokenize([nl],max_length=512,mode="<encoder-only>")
75
+ source_ids = torch.tensor(tokens_ids).to(device)
76
+ tokens_embeddings,nl_embedding = model(source_ids)
77
+
78
+ print(max_func_embedding.shape)
79
+ print(max_func_embedding)
80
+ ```
81
+
82
+ ```python
83
+ torch.Size([1, 768])
84
+ tensor([[ 8.6533e-01, -1.9796e+00, -8.6849e-01, 4.2652e-01, -5.3696e-01,
85
+ -1.5521e-01, 5.3770e-01, 3.4199e-01, 3.6305e-01, -3.9391e-01,
86
+ -1.1816e+00, 2.6010e+00, -7.7133e-01, 1.8441e+00, 2.3645e+00,
87
+ ...,
88
+ -2.9188e+00, 1.2555e+00, -1.9953e+00, -1.9795e+00, 1.7279e+00,
89
+ 6.4590e-01, -5.2769e-02, 2.4965e-01, 2.3962e-02, 5.9996e-02,
90
+ 2.5659e+00, 3.6533e+00, 2.0301e+00]], device='cuda:0',
91
+ grad_fn=<DivBackward0>)
92
+ ```
93
+
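The `grad_fn=<DivBackward0>` in the output above shows that autograd tracked the forward pass. When embeddings are only needed for inference, one option is to put the model in evaluation mode and run it inside `torch.no_grad()`. The sketch below illustrates the effect with a small `nn.Linear` as a hypothetical stand-in for the UniXcoder model, so it runs without downloading any weights; the names `encoder`, `tracked`, and `untracked` are illustrative only.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the UniXcoder model so the snippet runs offline.
encoder = nn.Linear(4, 3)
encoder.eval()  # switch off training-only behaviour such as dropout

inputs = torch.ones(1, 4)

# Default behaviour: autograd records the forward pass.
tracked = encoder(inputs)

# Inside no_grad, no computation graph is built.
with torch.no_grad():
    untracked = encoder(inputs)

print(tracked.requires_grad, untracked.requires_grad)
```

Applied to the examples in this section, the same pattern would wrap the `model(source_ids)` calls; this saves memory but is only appropriate when no gradients are needed afterwards.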
94
+ ### 2) Similarity between code and NL
95
+
96
+ Now, we calculate the cosine similarity between the NL query and the two functions. Although the two functions differ by only one operator (```<``` vs. ```>```), UniXcoder can distinguish them: the maximum function scores higher against the query.
97
+
98
+ ```python
99
+ # Normalize embedding
100
+ norm_max_func_embedding = torch.nn.functional.normalize(max_func_embedding, p=2, dim=1)
101
+ norm_min_func_embedding = torch.nn.functional.normalize(min_func_embedding, p=2, dim=1)
102
+ norm_nl_embedding = torch.nn.functional.normalize(nl_embedding, p=2, dim=1)
103
+
104
+ max_func_nl_similarity = torch.einsum("ac,bc->ab",norm_max_func_embedding,norm_nl_embedding)
105
+ min_func_nl_similarity = torch.einsum("ac,bc->ab",norm_min_func_embedding,norm_nl_embedding)
106
+
107
+ print(max_func_nl_similarity)
108
+ print(min_func_nl_similarity)
109
+ ```
110
+
111
+ ```python
112
+ tensor([[0.3002]], device='cuda:0', grad_fn=<ViewBackward>)
113
+ tensor([[0.1881]], device='cuda:0', grad_fn=<ViewBackward>)
114
+ ```
115
+
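The same normalize-then-dot-product step extends from two functions to any number of candidates, which is the core of a code search ranking. The following is a minimal sketch, assuming the candidate embeddings have been stacked into one `(n, d)` tensor; `rank_by_cosine` is a hypothetical helper, and the small hand-written vectors stand in for real UniXcoder embeddings.

```python
import torch
import torch.nn.functional as F


def rank_by_cosine(query_embedding, candidate_embeddings):
    """Return candidate indices sorted from most to least similar, plus scores."""
    # L2-normalize so that the dot product equals cosine similarity.
    query = F.normalize(query_embedding, p=2, dim=1)            # shape (1, d)
    candidates = F.normalize(candidate_embeddings, p=2, dim=1)  # shape (n, d)
    scores = (query @ candidates.T).squeeze(0)                  # shape (n,)
    order = torch.argsort(scores, descending=True)
    return order.tolist(), scores.tolist()


# Stand-in vectors; in practice these would be the sentence-level
# embeddings returned by the model for the query and each code snippet.
query = torch.tensor([[1.0, 0.0, 0.0]])
candidates = torch.tensor([
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
])
order, scores = rank_by_cosine(query, candidates)
print(order)  # [1, 2, 0]
```

With real embeddings, the candidates could be built with `torch.cat` over the per-function embedding tensors before calling the helper.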
116
+ ## 4. Decoder-only Mode
117
+
118
+ For decoder-only mode, we give an example of **code completion**.
119
+
120
+ ```python
121
+ context = """
122
+ def f(data,file_path):
123
+     # write json data into file_path in python language
124
+ """
125
+ tokens_ids = model.tokenize([context],max_length=512,mode="<decoder-only>")
126
+ source_ids = torch.tensor(tokens_ids).to(device)
127
+ prediction_ids = model.generate(source_ids, decoder_only=True, beam_size=3, max_length=128)
128
+ predictions = model.decode(prediction_ids)
129
+ print(context+predictions[0][0])
130
+ ```
131
+
132
+ ```python
133
+ def f(data,file_path):
134
+     # write json data into file_path in python language
135
+     data = json.dumps(data)
136
+     with open(file_path, 'w') as f:
137
+         f.write(data)
138
+ ```
139
+
140
+ ## 5. Encoder-Decoder Mode
141
+
142
+ For encoder-decoder mode, we give three examples: **function name prediction**, **API recommendation**, and **code summarization**.
143
+
144
+ ### 1) Function Name Prediction
145
+
146
+ ```python
147
+ context = """
148
+ def <mask0>(data,file_path):
149
+     data = json.dumps(data)
150
+     with open(file_path, 'w') as f:
151
+         f.write(data)
152
+ """
153
+ tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
154
+ source_ids = torch.tensor(tokens_ids).to(device)
155
+ prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
156
+ predictions = model.decode(prediction_ids)
157
+ print([x.replace("<mask0>","").strip() for x in predictions[0]])
158
+ ```
159
+
160
+ ```python
161
+ ['write_json', 'write_file', 'to_json']
162
+ ```
163
+
164
+ ### 2) API Recommendation
165
+
166
+ ```python
167
+ context = """
168
+ def write_json(data,file_path):
169
+     data = <mask0>(data)
170
+     with open(file_path, 'w') as f:
171
+         f.write(data)
172
+ """
173
+ tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
174
+ source_ids = torch.tensor(tokens_ids).to(device)
175
+ prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
176
+ predictions = model.decode(prediction_ids)
177
+ print([x.replace("<mask0>","").strip() for x in predictions[0]])
178
+ ```
179
+
180
+ ```python
181
+ ['json.dumps', 'json.loads', 'str']
182
+ ```
183
+
184
+ ### 3) Code Summarization
185
+
186
+ ```python
187
+ context = """
188
+ # <mask0>
189
+ def write_json(data,file_path):
190
+     data = json.dumps(data)
191
+     with open(file_path, 'w') as f:
192
+         f.write(data)
193
+ """
194
+ tokens_ids = model.tokenize([context],max_length=512,mode="<encoder-decoder>")
195
+ source_ids = torch.tensor(tokens_ids).to(device)
196
+ prediction_ids = model.generate(source_ids, decoder_only=False, beam_size=3, max_length=128)
197
+ predictions = model.decode(prediction_ids)
198
+ print([x.replace("<mask0>","").strip() for x in predictions[0]])
199
+ ```
200
+
201
+ ```python
202
+ ['Write JSON to file', 'Write json to file', 'Write a json file']
203
+ ```
204
+
205
+
206
+
207
+
208
+ # Reference
209
+ If you use this code or UniXcoder, please consider citing us.
210
+
211
+ <pre><code>@article{guo2022unixcoder,
212
+ title={UniXcoder: Unified Cross-Modal Pre-training for Code Representation},
213
+ author={Guo, Daya and Lu, Shuai and Duan, Nan and Wang, Yanlin and Zhou, Ming and Yin, Jian},
214
+ journal={arXiv preprint arXiv:2203.03850},
215
+ year={2022}
216
+ }</code></pre>
217
+
218
+
219
+