MLM Prediction Tasks from the Perspective of Pre-trained Language Models

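Prompt learning is an important topic in NLP today and has already been discussed in many articles.
In essence, prompt learning can be understood as a way of redefining downstream tasks: almost every downstream task is recast as the pre-training task of the language model, which avoids the gap between the pre-trained model and the downstream task.
As a result, almost all downstream NLP tasks can be handled in the same way. They can be run without any training data (zero-shot), and on small datasets (few-shot) prompt-based methods can even outperform fine-tuning, so usage becomes more consistent across tasks. A purely literal understanding of the idea is not enough, however, so this article walks through it in a simple and concrete way.
To that end, this article presents code for five topics: MLM prediction from the perspective of pre-trained language models, MLM prediction with a prompt_template, prompt-MLM prediction with a verbalizer class mapping, zero-shot prompt sentiment classification in practice, and zero-shot PromptNER entity recognition in practice. It is offered as material for readers to think through.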
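I. MLM prediction from the perspective of pre-trained language models
MLM and NSP are the two pre-training tasks of BERT-style pre-trained language models. MLM asks the model to predict a masked (center) token from the surrounding tokens. The model structure is very simple, as shown below:

import torch.nn as nn
from transformers import BertForMaskedLM

class bert_model(nn.Module):
    def __init__(self, bert_path, config_file):
        super(bert_model, self).__init__()
        # Load the pre-trained model weights
        self.bert = BertForMaskedLM.from_pretrained(bert_path, config=config_file)

    def forward(self, input_ids, attention_mask, token_type_ids):
        outputs = self.bert(input_ids=input_ids,
                            attention_mask=attention_mask,
                            token_type_ids=token_type_ids)
        # The masked-LM head returns one score (logit) per vocabulary entry
        # at every position, including the [MASK] position
        logit = outputs[0]  # shape [bs, seq_len, vocab_size]
        return logit

The following snippet uses bert-base-uncased from Hugging Face to predict a missing word. It returns the most probable words at the given [MASK] position (the words come from the vocabulary of the pre-trained language model).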
For example, given the sentence "natural language processing is a [MASK] technology.", the model is asked to predict the word at [MASK]:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> unmasker("natural language processing is a [MASK] technology.")
[{'score': 0.18927036225795746, 'token': 3274, 'token_str': 'computer', 'sequence': 'natural language processing is a computer technology.'},
 {'score': 0.14354903995990753, 'token': 4807, 'token_str': 'communication', 'sequence': 'natural language processing is a communication technology.'},
 {'score': 0.09429361671209335, 'token': 2047, 'token_str': 'new', 'sequence': 'natural language processing is a new technology.'},
 {'score': 0.05184786394238472, 'token': 2653, 'token_str': 'language', 'sequence': 'natural language processing is a language technology.'},
 {'score': 0.04084266722202301, 'token': 15078, 'token_str': 'computational', 'sequence': 'natural language processing is a computational technology.'}]

Sorted by decreasing probability, the candidates for [MASK] are computer, communication, new, language and computational. This suggests that the pre-trained language model captures the fact that NLP is a computer, communication and language technology.
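To make the ranking step concrete, here is a minimal sketch of what happens at a single [MASK] position: the scores over the vocabulary are turned into probabilities with a softmax and the top-k tokens are returned. The five-word vocabulary and the logits are made-up values for illustration, and `softmax` and `top_k_fill` are hypothetical helper names, not part of the transformers library.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_fill(vocab, mask_logits, k=3):
    """Return the k most probable (token, probability) pairs for one [MASK] slot."""
    probs = softmax(mask_logits)
    ranked = sorted(zip(vocab, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical vocabulary and made-up logits for the [MASK] position.
toy_vocab = ["computer", "communication", "new", "banana", "river"]
toy_logits = [2.0, 1.7, 1.3, -1.0, -2.0]

for token, prob in top_k_fill(toy_vocab, toy_logits):
    print(f"{token}: {prob:.3f}")
```

A real model does the same thing over a vocabulary of tens of thousands of entries, which is why the top scores in the output above are well below 1.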
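II. MLM prediction with a prompt_template
Since MLM predicts the masked word reasonably well, the model must encode important contextual knowledge, i.e. contextual features. Can we go one step further and have it perform text classification? That is, can we use [MASK] prediction to predict a word associated with the class, and then map that word to a concrete class?
This is exactly the idea behind prompting: align the downstream task with the pre-training tasks of the language model, such as NSP and MLM. Two concepts are needed to do the alignment. The first is the prompt_template, which tells the model to generate a word relevant to the task; the second, the word-to-class mapping, is covered in the next section. Concatenating the task text with the prompt_template yields an input of the same form as the pre-training task.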
For example:

>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='bert-base-uncased')
>>> text = "i really like the film a lot."
>>> prompt_template = "because it was [MASK]."
>>> pred1 = unmasker(text + prompt_template)
>>> pred1
[{'score': 0.14730973541736603, 'token': 2307, 'token_str': 'great', 'sequence': 'i really like the film a lot. because it was great.'},
 {'score': 0.10884211212396622, 'token': 6429, 'token_str': 'amazing', 'sequence': 'i really like the film a lot. because it was amazing.'},
 {'score': 0.09781625121831894, 'token': 2204, 'token_str': 'good', 'sequence': 'i really like the film a lot. because it was good.'},
 {'score': 0.04627735912799835, 'token': 4569, 'token_str': 'fun', 'sequence': 'i really like the film a lot. because it was fun.'},
 {'score': 0.043138038367033005, 'token': 10392, 'token_str': 'fantastic', 'sequence': 'i really like the film a lot. because it was fantastic.'}]
>>> text = "this movie makes me very disgusting. "
>>> prompt_template = "because it was [MASK]."
>>> pred2 = unmasker(text + prompt_template)
>>> pred2
[{'score': 0.05464331805706024, 'token': 9643, 'token_str': 'awful', 'sequence': 'this movie makes me very disgusting. because it was awful.'},
 {'score': 0.050322480499744415, 'token': 2204, 'token_str': 'good', 'sequence': 'this movie makes me very disgusting. because it was good.'},
 {'score': 0.04008950665593147, 'token': 9202, 'token_str': 'horrible', 'sequence': 'this movie makes me very disgusting. because it was horrible.'},
 {'score': 0.03569378703832626, 'token': 3308, 'token_str': 'wrong', 'sequence': 'this movie makes me very disgusting. because it was wrong.'},
 {'score': 0.033358603715896606, 'token': 2613, 'token_str': 'real', 'sequence': 'this movie makes me very disgusting. because it was real.'}]

Here we used one positive and one negative sentence. In both cases the top-ranked prediction is a word tied to the sentiment of the sentence, although the second list also contains the off-class word good and the neutral word real. This supports the feasibility of the approach.
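The concatenation step used above can be wrapped in a small helper. This is only a sketch: `build_prompt` is a hypothetical name, and the check that the template contains exactly one mask token is an assumption added for safety, not something the original code does.

```python
MASK_TOKEN = "[MASK]"  # mask token used by bert-base-uncased

def build_prompt(text, template, mask_token=MASK_TOKEN):
    """Join the task text and a prompt template into one MLM input string."""
    if template.count(mask_token) != 1:
        raise ValueError("template must contain exactly one mask token")
    # Normalize whitespace so the two parts are separated by a single space.
    return text.strip() + " " + template.strip()

print(build_prompt("i really like the film a lot.", "because it was [MASK]."))
```

The resulting string can be passed to a fill-mask pipeline in place of `text + prompt_template`.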
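III. Prompt-MLM prediction with a verbalizer class mapping
Besides constructing the prompt template, the other key component is the verbalizer, which maps words to classes. The words predicted by the MLM model are quite uncertain, so they have to be aligned with concrete classes. For example, great, amazing, good, fun, fantastic and better can be aligned to positive; whenever the model predicts one of these words, the class of the whole input is set to positive.
Likewise, when awful, horrible, bad, wrong and ugly are mapped to "negative", a prediction of any of these words sets the class of the whole input to negative.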
>>> verblize_dict = {"pos": ["great", "amazing", "good", "fun", "fantastic", "better"],
...                  "neg": ["awful", "horrible", "bad", "wrong", "ugly"]
... }
>>> hash_dict = dict()
>>> for k, v in verblize_dict.items():
...     for v_ in v:
...         hash_dict[v_] = k
...
>>> hash_dict
{'great': 'pos', 'amazing': 'pos', 'good': 'pos', 'fun': 'pos', 'fantastic': 'pos', 'better': 'pos', 'awful': 'neg', 'horrible': 'neg', 'bad': 'neg', 'wrong': 'neg', 'ugly': 'neg'}

We can therefore apply this mapping directly to the predictions above, which gives the following results:
>>> [{"label": hash_dict[i["token_str"]], "score": i["score"]} for i in pred1]
[{'label': 'pos', 'score': 0.14730973541736603},
 {'label': 'pos', 'score': 0.10884211212396622},
 {'label': 'pos', 'score': 0.09781625121831894},
 {'label': 'pos', 'score': 0.04627735912799835},
 {'label': 'pos', 'score': 0.043138038367033005}]
>>> [{"label": hash_dict.get(i["token_str"], i["token_str"]), "score": i["score"]} for i in pred2]
[{'label': 'neg', 'score': 0.05464331805706024},
 {'label': 'pos', 'score': 0.050322480499744415},
 {'label': 'neg', 'score': 0.04008950665593147},
 {'label': 'neg', 'score': 0.03569378703832626},
 {'label': 'real', 'score': 0.033358603715896606}]

Taking the top-1 prediction gives the class directly. Alternatively, several predictions can be combined, for example by computing the share of each class among the top 10, to obtain the final result:

{
  text: i really like the film a lot., label: pos
  text: this movie makes me very disgusting., label: neg
}

At this point we have a rough picture of the core of prompting in the zero-shot setting. The natural next questions are how to continue training when labeled data is available, how to design better prompt templates, and how to build a good word-to-class vocabulary. These are the follow-up research questions of prompt learning.
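The idea of combining several predictions instead of taking only the top-1 can be sketched as follows. The function name `classify` is hypothetical, the scores are the pred2 values from above rounded to four decimals, and ignoring words that are missing from the mapping (such as real) is an assumption of this sketch.

```python
from collections import defaultdict

def classify(predictions, token_to_label):
    """Sum fill-mask scores per label and return (best_label, normalized shares).

    Words that are not covered by the verbalizer mapping are ignored.
    """
    totals = defaultdict(float)
    for pred in predictions:
        label = token_to_label.get(pred["token_str"])
        if label is not None:
            totals[label] += pred["score"]
    if not totals:
        return None, {}
    norm = sum(totals.values())
    shares = {label: score / norm for label, score in totals.items()}
    best = max(shares, key=shares.get)
    return best, shares

# Subset of the word-to-class mapping built above.
hash_dict = {"great": "pos", "good": "pos", "awful": "neg",
             "horrible": "neg", "bad": "neg", "wrong": "neg"}

# pred2 from the example above, scores rounded to four decimals.
pred2 = [
    {"token_str": "awful", "score": 0.0546},
    {"token_str": "good", "score": 0.0503},
    {"token_str": "horrible", "score": 0.0401},
    {"token_str": "wrong", "score": 0.0357},
    {"token_str": "real", "score": 0.0334},
]

label, shares = classify(pred2, hash_dict)
print(label, shares)
```

Compared with top-1, this is less sensitive to a single off-class word (here good) ranking high, at the cost of depending more heavily on how complete the word lists are.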
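We can therefore go further and build a complete prompt classification model that can also make use of training data. A sample implementation follows, from which the algorithmic idea should be reasonably clear. We name the file prompt.py.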
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch


class prompting(object):

  def __init__(self, **kwargs):
    model_path = kwargs['model']
    tokenizer_path = kwargs['model']
    if "tokenizer" in kwargs.keys():
      tokenizer_path = kwargs['tokenizer']
    self.model = AutoModelForMaskedLM.from_pretrained(model_path)
    self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)

  def prompt_pred(self, text):
    """
    Take a sequence containing [MASK] and return the tokens of the LM
    vocabulary with their scores (logits) at the mask position, sorted
    from highest to lowest.
    """
    indexed_tokens = self.tokenizer(text, return_tensors="pt").input_ids
    tokenized_text = self.tokenizer.convert_ids_to_tokens(indexed_tokens[0])
    mask_pos = tokenized_text.index(self.tokenizer.mask_token)
    self.model.eval()
    with torch.no_grad():
      outputs = self.model(indexed_tokens)
      predictions = outputs[0]
    values, indices = torch.sort(predictions[0, mask_pos], descending=True)
    result = list(zip(self.tokenizer.convert_ids_to_tokens(indices), values))
    self.scores_dict = {a: b for a, b in result}
    return result

  def compute_tokens_prob(self, text, token_list1, token_list2):
    """
    Take two word lists: token_list1 holds words for the positive class
    (e.g. good, great) and token_list2 holds words for the negative class
    (e.g. bad, terrible).
    The scores of the words in each list are summed, and a softmax over
    the two sums gives the final class probabilities. Words that are not
    in the vocabulary contribute 0.
    """
    _ = self.prompt_pred(text)
    score1 = [self.scores_dict[token1] if token1 in self.scores_dict.keys() else 0
              for token1 in token_list1]
    score1 = sum(score1)
    score2 = [self.scores_dict[token2] if token2 in self.scores_dict.keys() else 0
              for token2 in token_list2]
    score2 = sum(score2)
    softmax_rt = torch.nn.functional.softmax(torch.tensor([score1, score2]), dim=0)
    return softmax_rt

  def fine_tune(self, sentences, labels, prompt=" since it was [MASK].",
                goodtoken="good", badtoken="bad"):
    """
    Fine-tune on existing labeled data (label 0 = goodtoken, 1 = badtoken).
    """
    good = self.tokenizer.convert_tokens_to_ids(goodtoken)
    bad = self.tokenizer.convert_tokens_to_ids(badtoken)
    optimizer = torch.optim.AdamW(self.model.parameters(), lr=1e-3)
    lossfunc = torch.nn.CrossEntropyLoss()
    self.model.train()
    for sen, label in zip(sentences, labels):
      # Use the tokenizer call so that [CLS] and [SEP] are added, as in prompt_pred
      indexed_tokens = self.tokenizer(sen + prompt, return_tensors="pt").input_ids
      tokenized_text = self.tokenizer.convert_ids_to_tokens(indexed_tokens[0])
      mask_pos = tokenized_text.index(self.tokenizer.mask_token)
      outputs = self.model(indexed_tokens)
      predictions = outputs[0]
      pred = predictions[0, mask_pos][[good, bad]]
      # CrossEntropyLoss applies log-softmax itself, so it takes the raw logits
      loss = lossfunc(pred.unsqueeze(0), torch.tensor([label]))
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()

IV. Zero-shot prompt sentiment classification in practice
Below we run zero-shot prompt classification on IMDB-style examples, so that the general logic can be followed step by step:
1. Load the model

>>> from transformers import AutoModelForMaskedLM, AutoTokenizer
>>> import torch
>>> model_path = "bert-base-uncased"
>>> tokenizer = AutoTokenizer.from_pretrained(model_path)
>>> from prompt import prompting
>>> prompting = prompting(model=model_path)

2. Use prompt_pred directly for sentiment prediction
>>> prompt = "because it was [MASK]."
>>> text = "i really like the film a lot."
>>> prompting.prompt_pred(text + prompt)[:10]
[('great', tensor(9.5558)), ('amazing', tensor(9.2532)), ('good', tensor(9.1464)), ('fun', tensor(8.3979)), ('fantastic', tensor(8.3277)),
 ('wonderful', tensor(8.2719)), ('beautiful', tensor(8.1584)), ('awesome', tensor(8.1071)), ('incredible', tensor(8.0140)), ('funny', tensor(7.8785))]
>>> text = "i did not like the film."
>>> prompting.prompt_pred(text + prompt)[:10]
[('bad', tensor(8.6784)), ('funny', tensor(8.1660)), ('good', tensor(7.9858)), ('awful', tensor(7.7454)), ('scary', tensor(7.3526)),
 ('boring', tensor(7.1553)), ('wrong', tensor(7.1402)), ('terrible', tensor(7.1296)), ('horrible', tensor(6.9923)), ('ridiculous', tensor(6.7731))]

3. Add neg/pos verbalizer words for sentiment prediction
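As a sketch of what compute_tokens_prob in prompt.py does with these scores, the pure-Python version below sums the scores of each word list and applies a two-way softmax. The function name `two_class_prob` is hypothetical; the two logits are copied from the "i did not like the film." output above, and words that are missing from the score dictionary count as 0, mirroring prompt.py.

```python
import math

def two_class_prob(scores, pos_tokens, neg_tokens):
    """Softmax over the summed scores of the positive and negative word lists."""
    pos = sum(scores.get(t, 0.0) for t in pos_tokens)
    neg = sum(scores.get(t, 0.0) for t in neg_tokens)
    m = max(pos, neg)  # subtract the max for numerical stability
    e_pos, e_neg = math.exp(pos - m), math.exp(neg - m)
    return e_pos / (e_pos + e_neg), e_neg / (e_pos + e_neg)

# Logits for "good" and "bad" taken from the second prompt_pred output above.
scores = {"bad": 8.6784, "good": 7.9858}
p_pos, p_neg = two_class_prob(scores, ["good"], ["bad"])
print(f"pos={p_pos:.4f} neg={p_neg:.4f}")
```

Because the summed values are raw logits rather than probabilities, adding more words to one list shifts the result toward that class, so the two lists should be kept comparable in size.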
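>>> text = "not worth watching"
>>> prompting.compute_tokens_prob(text + prompt, token_list1=["great", "amazin", "good"], token_list2=["bad", "awfull", "terrible"])
tensor([0.1496, 0.8504])
>>> text = "i strongly recommend that moview"
>>> prompting.compute_tokens_prob(text + prompt, token_list1=["great", "amazin", "good"], token_list2=["bad", "awfull", "terrible"])
tensor([0.9321, 0.0679])
>>> text = "i strongly recommend that moview"
>>> prompting.compute_tokens_prob(text + prompt, token_list1=["good"], token_list2=["bad"])
tensor([0.9223, 0.0777])

The inputs are reproduced exactly as they were run, including the misspellings amazin, awfull and moview. In compute_tokens_prob any word that is not found in the model vocabulary contributes a score of 0.

V. Zero-shot PromptNER entity recognition in practice
Going further: since classification can be handled this way, can the same method also be used for entity recognition?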
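It can. A brute-force approach is to enumerate candidate spans and then ask the model which type, out of a set of entity types, each span belongs to.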
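A minimal sketch of the brute-force step: enumerate every contiguous span up to a maximum length and build one prompt per span. The helper names, the whitespace tokenization and the `max_len` limit are assumptions of this sketch; the template "X is a type of [MASK]." is the one introduced in the next step.

```python
def candidate_spans(tokens, max_len=2):
    """Enumerate every contiguous span of up to max_len tokens."""
    spans = []
    for start in range(len(tokens)):
        last_end = min(start + max_len, len(tokens))
        for end in range(start + 1, last_end + 1):
            spans.append(" ".join(tokens[start:end]))
    return spans

def span_prompts(sentence, max_len=2,
                 template="{sentence} {span} is a type of [MASK]."):
    """Build one MLM prompt per candidate span of the sentence."""
    tokens = sentence.rstrip(".").split()  # naive whitespace tokenization
    return [template.format(sentence=sentence, span=span)
            for span in candidate_spans(tokens, max_len)]

for p in span_prompts("john went to paris.", max_len=1):
    print(p)
```

Each prompt would then be scored by the model. The number of prompts grows with sentence length times `max_len`, which is why the text calls this approach brute force.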
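1. Define the prompt template
As before, we can define a template. Take a person as an example: john is a very common name, so the model can tell directly that it refers to a person, without needing any context.

sentence. john is a type of [MASK]

2. Use prompt_pred directly for prediction
Running it directly shows the effect: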
>>> prompting.prompt_pred("john went to paris to visit the university. john is a type of [MASK].")[:5]
[('man', tensor(8.1382)), ('john', tensor(7.1325)), ('guy', tensor(6.9672)), ('writer', tensor(6.4336)), ('philosopher', tensor(6.3823))]
>>> prompting.prompt_pred("savaş went to paris to visit the university. savaş is a type of [MASK].")[:5]
[('philosopher', tensor(7.6558)), ('poet', tensor(7.5621)), ('saint', tensor(7.0104)), ('man', tensor(6.8890)), ('pigeon', tensor(6.6780))]

3. Add category-word verbalizer for entity type prediction
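Next we add category words to the prediction. Since the goal is to recognize persons (PERSON), we use words related to the person class as token_list1, e.g. ["person", "man"], and words for other types as token_list2, e.g. ["location", "city", "place"]. Other entity types can be handled in the same way by building a word-list dictionary per class.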
>>> prompting.compute_tokens_prob("it is a type of [MASK].",
...                               token_list1=["person", "man"], token_list2=["location", "city", "place"])
tensor([0.7603, 0.2397])
>>> # Baseline: without context the person probability is 0.76, so only spans scoring above 0.76 are judged to be a person
>>> prompting.compute_tokens_prob("savaş went to paris to visit the parliament. savaş is a type of [MASK].",
...                               token_list1=["person", "man"], token_list2=["location", "city", "place"])
tensor([9.9987e-01, 1.2744e-04])

These results show that framing zero-shot entity recognition as classification works directly: "savaş" is judged to be a person with probability 0.99.

>>> prompting.compute_tokens_prob("savaş went to laris to visit the parliament. laris is a type of [MASK].",
...                               token_list1=["person", "man"], token_list2=["location", "city", "place"])
tensor([0.3263, 0.6737])

In this example, the probability that the location "laris" is a person is only 0.3263, well below the 0.76 baseline, which again supports the approach.
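The decision rule implied by the baseline comment can be sketched as a simple threshold test. The function name `is_person` is hypothetical; the baseline 0.7603 and the two probabilities are the values reported above.

```python
def is_person(person_prob, baseline=0.7603):
    """Accept a span as PERSON only if it beats the context-free baseline."""
    return person_prob > baseline

# Person probabilities reported above for the two spans.
results = {"savaş": 0.99987, "laris": 0.3263}
decisions = {span: is_person(prob) for span, prob in results.items()}
print(decisions)
```

Comparing against the score of the context-free prompt "it is a type of [MASK]." rather than against 0.5 compensates for the model's prior preference for person words in this template.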
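Summary
This article presented code for five topics: MLM prediction from the perspective of pre-trained language models, MLM prediction with a prompt_template, prompt-MLM prediction with a verbalizer class mapping, zero-shot prompt sentiment classification in practice, and zero-shot PromptNER entity recognition in practice.
The core of prompt learning is to model downstream tasks uniformly as the training task of the pre-trained language model, so as to exploit the potential of the pre-trained model as fully as possible. The construction of the prompt template and of the corresponding word mapping is particularly interesting and worth following.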
