
Mofan NLP: GPT, a Unidirectional Language Model

Published: 2023/12/20

Video link: https://mofanpy.com/tutorials/machine-learning/nlp/gpt/

Why learn this:

  • GPT trains more efficiently than BERT.
  • In Mofan's code, the BERT class inherits from GPT, so learning GPT first makes BERT quicker to pick up.
  • In knowledge tracing, the previous question is used to predict the next one, and no information about later questions may leak into the prediction, so it too is a unidirectional model.

With that, let's get started.

The model: Generative Pre-Training (GPT)

The benefit of ever-larger models is obvious: more nonlinear capacity to handle more complex problems. But this brings another difficulty: they are hard to train. Every time we train a very large model, we consume more compute and more time.

GPT's main goal is still to be what a good pretrained model should be: train on unsupervised human-language data, then take the pretrained model and finetune it, and you can usually perform well on other tasks too. Since downstream finetuning tasks vary wildly, this tutorial focuses on the GPT model itself: what it looks like and what properties it has. The finetuning part is actually much easier than the model itself.

Some say GPT is the Transformer's decoder, but I think that is not quite accurate. It is more like a combination of the Transformer's decoder and encoder: it uses the decoder's future mask (look-ahead mask), but structurally it looks more like the encoder.

This design makes GPT easy to train: it predicts later text from earlier text, hence the future mask.

If we skipped the future mask while doing unsupervised learning on a large corpus, the model could see A's own information while predicting A, a kind of information leak from the future. Concretely: in the Transformer's multi-head attention, every head sees the whole text. If we predict later tokens from earlier ones without a future mask, the model can simply look at the very token it is supposed to predict, and the training is useless. The future mask keeps that "time-travelled" information out of sight: an invisible hand covering the model's x-ray eyes.
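The mechanics can be sketched in a few lines of numpy (a minimal analogue, not Mofan's code): positions strictly above the diagonal are "the future" and get a large negative score before softmax, so each row ends up attending only to itself and earlier positions.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

step = 4
scores = np.zeros((step, step))               # pretend attention logits (all equal)
future = np.triu(np.ones((step, step)), k=1)  # 1 strictly above the diagonal = future
attn = softmax(np.where(future == 1, -1e9, scores), axis=-1)
# row 0 attends only to position 0; row 1 splits attention over positions 0 and 1; ...
```

Since all logits are equal here, row t spreads its attention uniformly over the t+1 visible positions, and the future positions get essentially zero weight.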

Another difference from the Transformer decoder is that GPT has no encoder output to attend to. So GPT's stack has fewer sub-layers than the Transformer decoder: it is all self-attention, with no encoder-decoder (cross) attention. At first glance the final model does look like part of a Transformer, with just two differences:

  • the decoder blocks drop the layers that connect to an encoder;
  • attention uses only the future mask (look-ahead mask).

For background, the write-up 论文解读Attention is all you need is concise and mostly accurate.

Tasks: how to train the model


Of course there can be many more tasks; it depends on what your data supports. Training one model on several tasks together makes it generalize better across tasks.

Here the model is trained on two tasks: 1. unsupervised next-token prediction; 2. predicting whether the second sentence follows the first.

Correction: in this dataset (MRPC), string1 and string2 are not actually a context/continuation pair; they come

    "along with human annotations indicating whether each pair captures a paraphrase/semantic equivalence relationship. Last published: March 3, 2005."

Results analysis

  • 因?yàn)閒uture mask的原因,GPT是沒(méi)辦法很好的預(yù)測(cè)句子的前半段的, 因?yàn)榍鞍攵蔚男畔⑻倭恕K晕覀儾耪f(shuō)GPT是單向語(yǔ)言模型。
  • ELMo的前向lstm也是這個(gè)問(wèn)題,推薦系統(tǒng)、知識(shí)追蹤也是。是一種冷啟動(dòng)的問(wèn)題。需要想想怎么解決

  • 莫煩見(jiàn)解:很多頭都會(huì)用最開始的。 很有可能是這時(shí)候模型并不需要注意什么,為了不注意,他們就將注意力分配到不重要的信息上,也就是這里的。

  • 普通NLP玩家充當(dāng)一下吃瓜群眾就好了

Data processing

    utils.MRPCData():

    • seqs[:, :-1] is the sentence part of the X input: [[string1, string2]].
    • segs[:, :-1] is the segment information for X, marking whether a token belongs to the first or the second sentence. Since both sentences are fed to the model together in seqs, the model needs to know which sentence each token came from.
    • seqs[:, 1:] is the label Y for the unsupervised task: predict each following token from the preceding ones.
    • nsp_labels says whether the two input sentences are actually a consecutive pair.
      Same as in BERT.
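The shift between seqs[:, :-1] and seqs[:, 1:] can be sanity-checked with plain Python lists (the token ids below are made up for illustration, not taken from MRPCData):

```python
# Hypothetical ids: suppose <GO>=1, <SEP>=2, word ids 5..9, pad_id=0 (all made up).
seqs = [[1, 5, 6, 7, 2, 8, 9, 2, 0, 0]]  # <GO> s1... <SEP> s2... <SEP> pad pad

x = [row[:-1] for row in seqs]  # model input: everything but the last token
y = [row[1:]  for row in seqs]  # LM target: the same sequence shifted left by one

# at position t the model has seen x[0][:t+1] and must predict y[0][t],
# which is exactly the next token seqs[0][t+1]
```

So X and Y are the same sequence offset by one step, which is all the "labeling" a unidirectional language model needs.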

The GPT framework

For the architecture we reuse the Encoder code from the Transformer, since it is generic.

  • We only need to replace the Encoder's masking rule. The class GPT was already annotated in the BERT post.
  • Define the word embedding word_emb, segment embedding segment_emb, and position embedding position_emb; with these three representations the input side is done, and we plug straight into the Transformer encoder.
  • In call(), the forward pass runs X through all the embeddings, then straight through the Transformer Encoder to get the attended result. Finally two output heads, mlm (the unsupervised language model) and nsp (is-next-sentence), produce the predictions for the two tasks.
  • The effect of the future mask is shown in the figure below.
  • Code:

```python
class GPT(keras.Model):
    def __init__(self, model_dim, max_len, n_layer, n_head, n_vocab, lr,
                 max_seg=3, drop_rate=0.1, padding_idx=0):
        super().__init__()
        self.padding_idx = padding_idx  # pad_id = 0
        self.n_vocab = n_vocab          # len(self.v2i)
        self.max_len = max_len          # 72 - 1

        # I think task emb is not necessary for pretraining,
        # because the aim of all tasks is to train a universal sentence embedding.
        # The body encoder is the same across all tasks,
        # and a different output layer defines a different task, just like transfer learning.
        # Finetuning replaces the output layer and leaves the body encoder unchanged.
        # self.task_emb = keras.layers.Embedding(
        #     input_dim=n_task, output_dim=model_dim,  # [n_task, dim]
        #     embeddings_initializer=tf.initializers.RandomNormal(0., 0.01),
        # )

        self.word_emb = keras.layers.Embedding(
            input_dim=n_vocab, output_dim=model_dim,  # [n_vocab, dim]
            embeddings_initializer=tf.initializers.RandomNormal(0., 0.01),
        )  # word embedding
        self.segment_emb = keras.layers.Embedding(
            input_dim=max_seg, output_dim=model_dim,  # [max_seg, dim]
            embeddings_initializer=tf.initializers.RandomNormal(0., 0.01),
        )  # segment embedding; seg values: 0 = sentence 1, 1 = sentence 2, 2 = padding
        self.position_emb = self.add_weight(
            name="pos", shape=[1, max_len, model_dim], dtype=tf.float32,  # [1, step, dim]; the first dim broadcasts on addition
            initializer=keras.initializers.RandomNormal(0., 0.01),
        )  # position embedding, learned here; the paper uses a fixed sinusoidal formula
        self.encoder = Encoder(n_head, model_dim, drop_rate, n_layer)  # reused as-is from the Transformer
        self.task_mlm = keras.layers.Dense(n_vocab)  # task 1: predict the next word
        self.task_nsp = keras.layers.Dense(2)        # task 2: are the two sentences consecutive?

        self.cross_entropy = keras.losses.SparseCategoricalCrossentropy(
            from_logits=True, reduction="none")
        # with reduction="auto" the loss is averaged at the end; with
        # losses_utils.ReductionV2.NONE no averaging is done, so we can mask padding ourselves
        self.opt = keras.optimizers.Adam(lr)

    def call(self, seqs, segs, training=False):
        # `training` controls dropout; the mask matrix controls attention
        embed = self.input_emb(seqs, segs)  # [n, step, dim]
        z = self.encoder(embed, training=training, mask=self.mask(seqs))  # [n, step, dim]
        mlm_logits = self.task_mlm(z)  # [n, step, n_vocab]
        nsp_logits = self.task_nsp(tf.reshape(z, [z.shape[0], -1]))  # [n, n_cls]
        return mlm_logits, nsp_logits

    def step(self, seqs, segs, seqs_, nsp_labels):
        ...

    def input_emb(self, seqs, segs):
        return self.word_emb(seqs) + self.segment_emb(segs) + self.position_emb  # [n, step, dim]

    def mask(self, seqs):
        ...

    @property
    def attentions(self):
        attentions = {
            "encoder": [l.mh.attention.numpy() for l in self.encoder.ls],
        }
        return attentions


m = GPT(
    model_dim=MODEL_DIM, max_len=d.max_len - 1, n_layer=N_LAYER, n_head=4,
    n_vocab=d.num_word, lr=LEARNING_RATE, max_seg=d.num_seg,
    drop_rate=0.2, padding_idx=d.pad_id)
```

See the code comments; they were explained in the BERT post. BERT overrides GPT's step and mask functions. Let's see how they differ:

- The step function

tf.math.not_equal performs a broadcast with its arguments and then an element-wise inequality comparison, returning a Tensor of boolean values.
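A quick numpy analogue (with made-up numbers, not values from the actual training run) of how not_equal plus boolean masking keeps padding out of the loss average in the step function:

```python
import numpy as np

pad_id = 0
targets = np.array([[5, 6, 2, 0, 0]])             # right-padded labels
tok_loss = np.array([[0.4, 0.2, 0.1, 9.9, 9.9]])  # pretend per-token cross-entropy

pad_mask = targets != pad_id      # numpy analogue of tf.math.not_equal
loss = tok_loss[pad_mask].mean()  # analogue of tf.boolean_mask + tf.reduce_mean
# the 9.9s at padding positions are excluded, so loss == (0.4 + 0.2 + 0.1) / 3
```

Without the mask, the large garbage losses at padding positions would dominate the average; with it, only real tokens contribute.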

```python
def step(self, seqs, segs, seqs_, nsp_labels):
    with tf.GradientTape() as tape:
        mlm_logits, nsp_logits = self.call(seqs, segs, training=True)
        pad_mask = tf.math.not_equal(seqs_, self.padding_idx)  # True at non-padding positions
        # mlm_logits: [n, step, n_vocab]
        pred_loss = tf.reduce_mean(tf.boolean_mask(
            self.cross_entropy(seqs_, mlm_logits), pad_mask))  # cross-entropy over non-padding positions only
        # nsp_logits: [n, n_cls]
        nsp_loss = tf.reduce_mean(self.cross_entropy(nsp_labels, nsp_logits))
        loss = pred_loss + 0.2 * nsp_loss
    grads = tape.gradient(loss, self.trainable_variables)
    self.opt.apply_gradients(zip(grads, self.trainable_variables))
    return loss, mlm_logits
```

- The mask function

tf.linalg.band_part keeps only a band of a matrix and zeroes the rest. For example:

```python
tf.linalg.band_part(tf.ones((5, 5)), -1, 0)
# <tf.Tensor: shape=(5, 5), dtype=float32, numpy=
# array([[1., 0., 0., 0., 0.],
#        [1., 1., 0., 0., 0.],
#        [1., 1., 1., 0., 0.],
#        [1., 1., 1., 1., 0.],
#        [1., 1., 1., 1., 1.]], dtype=float32)>
```

Part (1) of the Transformer post walks through an example of how this masking is applied.

```python
def mask(self, seqs):
    """
        a b c d - -
    a   0 1 1 1 1 1
    b   0 0 1 1 1 1
    c   0 0 0 1 1 1
    d   0 0 0 0 1 1
    -   0 0 0 0 1 1
    -   0 0 0 0 1 1
    Force each head not to see what comes after, e.g. by multiplying masked scores by -inf.
    a is an embedding for a---
    b is an embedding for ab--
    c is an embedding for abc-
    Later, b's embedding (plus another embedding from the previous residual input) predicts c.
    """
    mask = 1 - tf.linalg.band_part(tf.ones((self.max_len, self.max_len)), -1, 0)
    pad = tf.math.equal(seqs, self.padding_idx)
    # e.g. 3 sentences, step = 5:
    # pad [3,5] -> [3,1,1,5]; look-ahead [5,5] -> [1,1,5,5]; broadcast -> [3,1,5,5]
    mask = tf.where(pad[:, tf.newaxis, tf.newaxis, :], 1, mask[tf.newaxis, tf.newaxis, :, :])
    return mask  # [n, 1, step, step]
```

The mask's shape is (batch, 1, step, step), while the attention weights have shape [batch, num_heads, q_step, step]; the size-1 axis broadcasts across the heads. Here q_step = step, because self-attention's score matrix is square.
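That broadcast can be checked shape-by-shape with a small numpy sketch (made-up sequences, pad_id assumed to be 0 as in the model above):

```python
import numpy as np

step = 5
look_ahead = np.triu(np.ones((step, step)), k=1)  # (step, step); 1 = masked future position
seqs = np.array([[1, 2, 3, 0, 0],
                 [1, 2, 0, 0, 0],
                 [1, 2, 3, 4, 5]])                # batch of 3, pad_id = 0
pad = seqs == 0                                   # (batch, step)

# pad -> (batch, 1, 1, step) and look_ahead -> (1, 1, step, step)
# broadcast together into (batch, 1, step, step)
mask = np.where(pad[:, None, None, :], 1.0, look_ahead[None, None, :, :])
# the size-1 second axis then broadcasts against num_heads in the attention scores
```

Every padding column is masked for all queries, and on top of that each query row masks its own future, matching the docstring matrix above.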

- The train function

GPT's labels are easy to build, just like in knowledge tracing (use the previous question to predict the next one).

```python
def train(model, data, step=10000, name="gpt"):
    t0 = time.time()
    for t in range(step):
        seqs, segs, xlen, nsp_labels = data.sample(16)
        loss, pred = model.step(seqs[:, :-1], segs[:, :-1], seqs[:, 1:], nsp_labels)
        if t % 100 == 0:
            pred = pred[0].numpy().argmax(axis=1)
            t1 = time.time()
            print(
                "\n\nstep: ", t,
                "| time: %.2f" % (t1 - t0),
                "| loss: %.3f" % loss.numpy(),
                # re-trim the length; +2 reaches the closing <SEP>
                "\n| tgt: ", " ".join([data.i2v[i] for i in seqs[0, 1:][:xlen[0].sum() + 2]]),
                "\n| prd: ", " ".join([data.i2v[i] for i in pred[:xlen[0].sum() + 2]]),
            )
            t0 = t1
    os.makedirs("./visual/models/%s" % name, exist_ok=True)
    model.save_weights("./visual/models/%s/model.ckpt" % name)
```

Running results

```
num word: 12880 max_len: 72

step: 100 | time: 13.26 | loss: 7.495
| tgt: the unions also staged a five-day strike in march that forced all but one of yale 's dining halls to close . <SEP> the unions also staged a five-day strike in march ; strikes have preceded eight of the last <NUM> contracts . <SEP>
| prd: the . the the the the the . . the . . the the the the . . the the the . . the the the . . . the . the the . . . . . the . the . . the

step: 4900 | time: 13.55 | loss: 1.047
| tgt: they were held under section <NUM> of the terrorism act <NUM> on suspicion of involvement in the commission , preparation or instigation of acts of terrorism . <SEP> badat was arrested under section <NUM> of the terrorism act a€? on suspicion of involvement in the commission , preparation or instigation of acts of terrorism , a€? scotland yard confirmed . <SEP>
| prd: the were not today section <NUM> of the terrorism act <NUM> for suspicion of terrorism in the commission , preparation or instigation of terrorism of terrorism 's <SEP> badat was arrested under section <NUM> of the terrorism act a€? on suspicion of acts in the commission , preparation or instigation of acts of terrorism , a€? scotland yard confirmed . <SEP>

step: 5000 | time: 13.63 | loss: 0.937
| tgt: michael mitchell , the chief public defender in baton rouge who is representing lee , did not answer his phone wednesday afternoon . <SEP> michael mitchell , the chief public defender in baton rouge who is representing lee , was not available for comment . <SEP>
| prd: the mitchell , the chief justice defender in baton rouge who is representing lee , did not attempt his lawyer the afternoon . <SEP> michael mitchell , the chief justice defender in baton rouge who is representing lee , was not available for comment . <SEP>

step: 9800 | time: 13.45 | loss: 0.211
| tgt: in <NUM> , president bush named kathy gregg to the student loan marketing association board of directors . <SEP> in <NUM> , president bush named her to the student loan marketing association , the largest u.s. lender for students . <SEP>
| prd: the the , for bush named kathy gregg to the student loan marketing association board of directors . <SEP> in <NUM> , president bush named her to the student loan marketing association , the largest president lender for students . <SEP>

step: 9900 | time: 13.28 | loss: 0.210
| tgt: the product also features an updated release of the apache web server , as well as apache tomcat and apache axis . <SEP> panther server also includes an updated release of apache , along with apache tomcat and apache axis for creating powerful web services . <SEP>
| prd: <quote> first also features an updated release of the apache web server , as well as apache tomcat and apache axis . <SEP> panther server also includes an updated release of apache , along with apache tomcat and apache axis for creating powerful web services . <SEP>

total time: 22 min 28 second
```
