
Loading BERT in TensorFlow: Text Classification with BERT Using NodeJS/TensorFlowJS

Published: 2025/3/19


Title image: "JavaScript Logo" by b0neskull is licensed under CC BY-NC-SA 2.0

Some background

TensorFlowJS can be thought of as having two versions, Browser and NodeJS: the former runs in the browser, the latter runs in a NodeJS environment

Model inference in the NodeJS version is faster than in Python, according to this official blog post: https://blog.tensorflow.org/2020/05/how-hugging-face-achieved-2x-performance-boost-question-answering.html

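In theory, the NodeJS version can also use a GPU

This article gives a text classification example based on NodeJS

As things currently stand, the NodeJS version is best suited to directly calling models that were already trained in Python, because both loading and inference are faster than in the Python version. If a GPU is not strictly required, small models also pair well with FaaS and similar tools for implementing cloud AI functions

More content and code can be found in this repo: https://github.com/qhduan/bert-model/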

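Limitations of TensorFlowJS/NodeJS

Some ops are not supported, for example the ops under tf.strings.* that the Python version has

Although the NodeJS version can load models in the TensorFlow 2.x SavedModel format, it cannot continue training them (Python can)

Training is still slower than in Python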

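Preparing the test environment

For data, we use LCQMC, a test dataset from the earlier ChineseGLUE https://github.com/ChineseGLUE/ChineseGLUE . It is a dataset for judging whether two questions are equivalent. For example, “喜歡打籃球的男生喜歡什么樣的女生” ("What kind of girls do boys who like playing basketball like?") and “愛打籃球的男生喜歡什么樣的女生” ("What kind of girls do boys who love playing basketball like?") are equivalent.

Note: the newer version of ChineseGLUE has become CLUEbenchmark https://github.com/CLUEbenchmark/CLUE , which no longer includes this dataset.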

$ curl --output train.json https://deepdialog.coding.net/p/dataset/d/dataset/git/raw/master/LCQMC/train.json
$ curl --output dev.json https://deepdialog.coding.net/p/dataset/d/dataset/git/raw/master/LCQMC/dev.json
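The downloaded files are in JSON Lines format, one JSON object per line. As a minimal sketch of reading such text, the helper below skips blank lines (for example a trailing newline) before parsing. The function name `parseJsonl` and the sample record are hypothetical; the field names `sentence1`, `sentence2` and `label` are the ones the training code in this article reads, and the string-valued label is an assumption.

```javascript
// Sketch: parse JSON Lines text into objects, skipping blank lines.
function parseJsonl(text) {
  return text
    .split(/\r?\n/)
    .filter(line => line.trim().length > 0)
    .map(line => JSON.parse(line))
}

// Hypothetical sample in the same shape as an LCQMC record,
// followed by an empty line like the one a trailing newline produces.
const sample = [
  '{"sentence1": "喜歡打籃球的男生喜歡什么樣的女生", "sentence2": "愛打籃球的男生喜歡什么樣的女生", "label": "1"}',
  '',
].join('\n')

const records = parseJsonl(sample)
console.log(records.length, Number.parseInt(records[0].label, 10))
```

Filtering blank lines matters because `JSON.parse('')` throws, so a file that ends with a newline would otherwise fail on its last, empty line.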

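Download the vocabulary for Chinese BERT. Almost all Chinese BERT models are based on the vocabulary originally released by Google, so there is no real difference between them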

$ curl --output vocab.txt https://deepdialog.coding.net/p/zh-roberta-wwm/d/zh-roberta-wwm/git/raw/master/vocab.txt
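A BERT vocabulary file lists one token per line, and a token's id is its zero-based line index. The sketch below builds a token-to-id lookup from such text to make that layout concrete. The helper `buildVocab` and the six-token vocabulary are hypothetical and only for illustration; in this article the file is simply handed to the tokenizer, which does its own loading.

```javascript
// Sketch: build a token -> id lookup from the contents of a BERT-style vocab file.
// The id of a token is the zero-based index of its line.
function buildVocab(text) {
  const vocab = new Map()
  text.split(/\r?\n/).forEach((token, index) => {
    if (token.length > 0) vocab.set(token, index)
  })
  return vocab
}

// Hypothetical tiny vocabulary; the real vocab.txt is far larger.
const tinyVocab = ['[PAD]', '[UNK]', '[CLS]', '[SEP]', '男', '生'].join('\n')
const vocab = buildVocab(tinyVocab)
console.log(vocab.get('[CLS]'), vocab.get('生')) // 2 5
```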

Download the model and extract it into the bert directory

$ mkdir -p bert
$ cd bert
$ curl --output bert.tar.gz https://deepdialog.coding.net/p/zh-roberta-wwm/d/zh-roberta-wwm/git/raw/master/zh-roberta-wwm-L12.tar.gz
$ tar xvzf bert.tar.gz
$ cd ..

Install the Node dependencies

$ npm install @tensorflow/tfjs-node tokenizers

Code

const fs = require('fs')
const tf = require('@tensorflow/tfjs-node')
// BERT tokenizer package from Hugging Face
const BertWordPieceTokenizer = require('tokenizers').BertWordPieceTokenizer

/**
 * Build the text classification model.
 * Input: the sequence_output sequence produced by BERT
 * Output: 2-class softmax
 */
function buildModel() {
    const input = tf.input({shape: [null, 768], dtype: 'float32'})
    // An RNN is trained on the BERT output sequence instead of only the [CLS] output,
    // because without fine-tuning BERT's parameters, using only [CLS] performs slightly worse
    const rnn = tf.layers.bidirectional({
        layer: tf.layers.lstm({units: 128, returnSequences: false})
    })
    // Masking matters: the model I packaged outputs 0.0 for padded positions
    // (possibly -0.0, which can still be masked)
    const mask = tf.layers.masking({maskValue: 0.0})
    const dense = tf.layers.dense({units: 2, activation: 'softmax'})
    const output = dense.apply(rnn.apply(mask.apply(input)))
    const model = tf.model({inputs: input, outputs: output})
    model.compile({
        optimizer: 'adam',
        loss: 'sparseCategoricalCrossentropy',
        metrics: ['acc'],
    })
    return model
}

(async () => {
    // Load the vocabulary/tokenizer and BERT
    const wordPieceTokenizer = await BertWordPieceTokenizer.fromOptions({ vocabFile: "./vocab.txt" })
    const bert = await tf.node.loadSavedModel('./bert')

    // Build the data pipeline:
    // text input goes through the tokenizer,
    // then BERT computes sequence_output.
    // BERT's parameters are not updated because NodeJS cannot yet train a loaded model
    function makeGenerator(objs, batchSize) {
        function* dataGenerator() {
            let xs = []
            let ys = []
            for (const obj of objs) {
                xs.push(obj['tokens'])
                ys.push(Number.parseInt(obj['label'], 10))
                if (xs.length == ys.length && xs.length == batchSize) {
                    // Pad every sequence in the batch to the same length, filling with empty strings
                    const maxLength = Math.max.apply(Math, xs.map(x => x.length))
                    xs = xs.map(x => {
                        while (x.length < maxLength) {
                            x = x.concat([''])
                        }
                        return x
                    })
                    xs = tf.tensor(xs)
                    // Get BERT's output for the batch.
                    // If the input is in dict format, the output is also a dict; see the tfjs source code.
                    // This step can also be used on its own, much like bert-as-a-service
                    xs = bert.predict({input_1: xs})['sequence_output']
                    ys = tf.tensor(ys)
                    // BERT's output is the input of the text classification model (xs),
                    // and the labels are its targets (ys)
                    yield {xs, ys}
                    xs = []
                    ys = []
                }
            }
        }
        return dataGenerator
    }

    // The dataset is in jsonl format, so it is read line by line (blank lines are skipped)
    console.log('Read dataset')
    const trainObjs = fs.readFileSync('train.json', {encoding: 'utf-8'})
        .split(/\n/).filter(line => line.trim()).map(line => JSON.parse(line))
    const devObjs = fs.readFileSync('dev.json', {encoding: 'utf-8'})
        .split(/\n/).filter(line => line.trim()).map(line => JSON.parse(line))

    // Tokenize up front, because tokenization is an async function
    // and the tensorflowjs generator does not support async yield
    console.log('Tokenize train dataset')
    for (const obj of trainObjs) {
        obj['tokens'] = (await wordPieceTokenizer.encode(obj['sentence1'], obj['sentence2'])).tokens
    }
    console.log('Tokenize dev dataset')
    for (const obj of devObjs) {
        obj['tokens'] = (await wordPieceTokenizer.encode(obj['sentence1'], obj['sentence2'])).tokens
    }

    console.log('Start training')
    const batchSize = 32
    const dsTrain = tf.data.generator(makeGenerator(trainObjs, batchSize)).repeat()
    const dsDev = tf.data.generator(makeGenerator(devObjs, batchSize)).repeat()
    const model = buildModel()
    // Both calls return promises, so wait for training to finish before evaluating
    await model.fitDataset(dsTrain, {
        batchesPerEpoch: Math.floor(trainObjs.length / batchSize),
        epochs: 1,
        validationData: dsDev,
        validationBatches: Math.floor(devObjs.length / batchSize),
    })
    await model.evaluateDataset(dsDev, {
        batches: Math.floor(devObjs.length / batchSize),
    })
})()
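The batching and padding step inside the data generator can be looked at in isolation, without TensorFlow. The following is a minimal sketch in plain JavaScript: `padBatch`, `batches` and the demo records are hypothetical names and data, not part of the original code. Like the generator above, it pads with empty strings and drops a trailing partial batch.

```javascript
// Pad every token sequence in a batch to the longest length, using empty strings.
function padBatch(batch, padToken = '') {
  const maxLength = Math.max(...batch.map(tokens => tokens.length))
  return batch.map(tokens =>
    tokens.concat(Array(maxLength - tokens.length).fill(padToken))
  )
}

// Yield only full batches; leftover records that do not fill a batch are dropped.
function* batches(objs, batchSize) {
  for (let i = 0; i + batchSize <= objs.length; i += batchSize) {
    const chunk = objs.slice(i, i + batchSize)
    yield {
      tokens: padBatch(chunk.map(obj => obj.tokens)),
      labels: chunk.map(obj => Number.parseInt(obj.label, 10)),
    }
  }
}

// Hypothetical records shaped like the tokenized dataset objects
const demo = [
  { tokens: ['[CLS]', 'a', '[SEP]'], label: '1' },
  { tokens: ['[CLS]', 'a', 'b', '[SEP]'], label: '0' },
  { tokens: ['[CLS]', '[SEP]'], label: '1' },
]
const result = [...batches(demo, 2)]
console.log(result.length) // 1
console.log(result[0].tokens[0]) // [ '[CLS]', 'a', '[SEP]', '' ]
```

Padding only to the longest sequence in each batch, rather than to a global maximum, keeps the tensors passed to BERT as small as possible; the masking layer in the classifier then ignores the padded positions.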
