Build a BERT-Based Semantic Search System… for “Star Trek”
If you read my previous article on Towards Data Science you’ll know I’m a bit of a Star Trek nerd. There’s only one thing I like more than Star Trek, and that’s building cool new stuff with AI. So I thought I’d combine the two yet again!
In this tutorial we’re going to build our own search engine to search all the lines from Star Trek: The Next Generation. We’ll be using Jina, a neural search framework which uses deep learning to power our NLP search, though we could easily use it for image, audio or video search if we wanted to.
We’ll cover:
- Basic setup
- Running a demo of our app (yes, even before we code it)
- Using cookiecutter to create project and boilerplate code
- Downloading our Star Trek dataset
- Loading, indexing, and searching our dataset
- A deeper look behind the scenes
- What to do if things go wrong
If you’re new to AI or search, don’t worry. As long as you have some knowledge of Python and the command line you’ll be fine. If it helps, think of yourself as Lieutenant Commander Data Science.
Try It Out
Before going through the trouble of downloading, configuring and testing your search engine, let’s get an idea of the finished product. In this case, it’s exactly the same as what we’re building, but with lines from South Park instead of Star Trek:
Jina has a pre-built Docker image with indexed data from South Park. You can run it with:
docker run -p 45678:45678 jinaai/hub.app.distilbert-southpark
After getting Docker up and running, you can start searching for those South Park lines.
Query with Jinabox
Jinabox is a simple web-based front-end for neural search. You can see it in the graphic at the top of this tutorial.
- Go to jinabox in your browser
- Ensure you have the server endpoint set to http://localhost:45678/api/search
Note: If it times out the first time, that’s because the query system is still warming up. Try again in a few seconds!
Query with curl
Alternatively, you can open your shell and check the results via the RESTful API. The matched results are stored in topkResults.
curl --request POST -d '{"top_k": 10, "mode": "search", "data": ["text:hey, dude"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:45678/api/search'
You’ll see the results output in JSON format.
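If you'd rather stay in Python than in the shell, the same request can be sketched with the standard library. This is only a minimal sketch, not something Jina ships: build_search_request is a hypothetical helper name, and the payload simply mirrors the fields used in the curl call.

```python
import json
import urllib.request


def build_search_request(query, top_k=10, port=45678, host="localhost"):
    """Build (but do not send) a POST request mirroring the curl call."""
    payload = {"top_k": top_k, "mode": "search", "data": [f"text:{query}"]}
    return urllib.request.Request(
        url=f"http://{host}:{port}/api/search",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_search_request("hey, dude")
print(request.full_url)

# To actually send it, the Docker container must be running:
# with urllib.request.urlopen(request, timeout=30) as response:
#     results = json.load(response)
```

Building the request separately from sending it makes it easy to inspect the URL and body before anything goes over the wire.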
Now go back to your terminal running Docker and hit Ctrl-C a few times to ensure you've stopped everything.
Engage!
Now that you know what we’re building, let’s get started!
You will need:
- A basic knowledge of Python
- Python 3.7 or higher installed, and pip
- A Mac or Linux computer (Jina doesn’t currently support Windows)
- 8 gigabytes or more of RAM
Clone the Repo
Let’s get the basic files we need to get moving:
git clone git@github.com:alexcg1/my-first-jina-app.git
cd my-first-jina-app
Run Cookiecutter
pip install -U cookiecutter
cookiecutter gh:jina-ai/cookiecutter-jina
We use cookiecutter to spin up a basic Jina app and save you having to do a lot of typing and setup.
For our Star Trek example, use the following settings:
- project_name: Star Trek
- project_slug: star_trek (default value)
- task_type: nlp
- index_type: strings
- public_port: 65481 (default value)
Just use the defaults for all other fields. After cookiecutter has finished, let’s have a look at the files it created:
cd star_trek
ls
You should see a bunch of files:
- app.py - The main Python script where you initialize and pass data into your Flow
- Dockerfile - Lets you spin up a Docker instance running your app
- flows/ - Folder to hold your Flows
- pods/ - Folder to hold your Pods
- README.md - An auto-generated README file
- requirements.txt - A list of required Python packages
In the flows/ folder we can see index.yml and query.yml - these define the indexing and querying Flows for your app.
In pods/ we see chunk.yml, craft.yml, doc.yml, and encode.yml - these Pods are called from the Flows to process data for indexing or querying.
More on Flows and Pods later!
Install Requirements
In your terminal run this command to download and install all the required Python packages:
pip install -r requirements.txt
Download Dataset
Our goal is to find out who said what in Star Trek episodes when a user queries a phrase. The Star Trek dataset from Kaggle contains all the scripts and individual character lines from Star Trek: The Original Series all the way through Star Trek: Enterprise.
We’re just using a subset in this example, containing the characters and lines from Star Trek: The Next Generation. This has also been converted from JSON to CSV format, which is more suitable for Jina to process.
Now let’s ensure we’re back in our base folder and download the dataset by running:
Once that’s finished downloading, let’s get back into the star_trek directory and make sure our dataset has everything we want:
cd star_trek
head data/startrek_tng.csv
You should see output consisting of characters (like MCCOY), a separator (!), and the lines spoken by the character (What about my age?):
BAILIFF!The prisoners will all stand.
BAILIFF!All present, stand and make respectful attention to honouredJudge.
BAILIFF!Before this gracious court now appear these prisoners to answer for the multiple and grievous savageries of their species. How plead you, criminal?
BAILIFF!Criminals keep silence!
BAILIFF!You will answer the charges, criminals.
BAILIFF!Criminal, you will read the charges to the court.
BAILIFF!All present, respectfully stand. Q
BAILIFF!This honourable court is adjourned. Stand respectfully. Q
MCCOY!Hold it right there, boy.
MCCOY!What about my age?
Note: Your character lines may be a little different. That’s okay!
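To make the file format concrete, here is a small sketch of how one record could be split into a character and a line of dialogue. parse_line is a hypothetical helper for illustration only; Jina does not need it to index the file. Splitting on the first "!" only matters because the dialogue itself often ends with an exclamation mark.

```python
def parse_line(raw_line):
    """Split one 'CHARACTER!spoken line' record into (character, spoken line)."""
    # partition() splits on the first "!" only, so later "!" stay in the dialogue
    character, _, spoken = raw_line.rstrip("\n").partition("!")
    return character.strip(), spoken.strip()


sample = [
    "BAILIFF!Criminals keep silence!",
    "MCCOY!What about my age?",
]
parsed = [parse_line(raw) for raw in sample]
for character, spoken in parsed:
    print(f"{character} said: {spoken}")
```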
Load Data
Now we need to pass startrek_tng.csv into app.py so we can index it. app.py is a little too simple out of the box, so let's make some changes:
Open app.py in your editor and check the index function. We currently have:
As you can see, this indexes just 3 strings. Let’s load up our Star Trek file instead with the filepath parameter. Just replace the last line of the function:
Index Fewer Documents
While we’re here, let’s reduce the number of documents we’re indexing, just to speed things up while we’re testing. We don’t want to spend ages indexing only to have issues later on!
In the section above the config function, let's change:
to:
That should speed up our testing by a factor of 100! Once we’ve verified everything works we can set it back to 50000 to index more of our dataset.
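The idea of capping the document count can be sketched in plain Python. This is not the code from app.py: read_lines and the MAX_DOCS environment variable are hypothetical names, and 500 is simply 50000 divided by 100. The demo writes a throwaway file so the sketch runs without the real dataset.

```python
import itertools
import os
import tempfile

# Hypothetical environment variable name; 500 is 50000 / 100
MAX_DOCS = int(os.environ.get("MAX_DOCS", 500))


def read_lines(filepath, max_docs):
    """Yield at most max_docs non-empty lines from a text file."""
    with open(filepath, encoding="utf-8") as handle:
        non_empty = (line.rstrip("\n") for line in handle if line.strip())
        yield from itertools.islice(non_empty, max_docs)


# Throwaway file standing in for data/startrek_tng.csv
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("\n".join(f"DATA!Line number {i}" for i in range(1000)))
demo_path = tmp.name

subset = list(read_lines(demo_path, max_docs=5))
print(len(subset), "of 1000 lines kept")
os.remove(demo_path)
```

Because islice stops reading as soon as the cap is reached, a small cap keeps test runs short even on a large file.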
Now that we’ve got the code to load our data, we’re going to dive into writing our app and running our Flows! Flows are the different tasks our app performs, like indexing or searching the data.
Indexing
First up we need to build up an index of our file. We’ll search through this index when we use the query Flow later.
python app.py index
Your app will show a lot of output in the terminal, but you’ll know it’s finished when you see the line:
Flow@133216[S]:flow is closed and all resources should be released already, current build level is 0
This may take a little while the first time, since Jina needs to download the language model and tokenizer to process the data. You can think of these as the brains behind the neural network that powers the search.
Searching
To start search mode run:
python app.py search
After a while you should see the terminal stop scrolling and display output like:
Flow@85144[S]:flow is started at 0.0.0.0:65481, you can now use client to send request!
Note: Be sure to note down the port number. We’ll need it for curl and jinabox! In our case we'll assume it's 65481, and we use that in the below examples. If your port number is different, be sure to use that instead.
Note: python app.py search doesn't pop up a search interface - for that you'll need to connect via curl, Jinabox, or another client.
Search with Jinabox
- Go to jinabox in your browser
- Ensure you have the server endpoint set to http://localhost:65481/api/search
Search with curl
curl --request POST -d '{"top_k": 10, "mode": "search", "data": ["text:picard to riker"]}' -H 'Content-Type: application/json' 'http://0.0.0.0:65481/api/search'
curl will spit out a lot of information in JSON format - not just the lines you're searching for, but all sorts of metadata about the search and the lines it returns. Look for the lines starting with "matchDoc" to find the matches.
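Scanning raw JSON by eye gets old quickly, so here is one way to pull every "matchDoc" entry out of a parsed response. It is a sketch under assumptions: find_key is a hypothetical helper, and the response below is a made-up, heavily trimmed stand-in (including the "text" field and the nesting), not real output. The helper walks the whole structure, so it does not depend on exactly where "matchDoc" sits.

```python
def find_key(node, key):
    """Recursively collect every value stored under `key` in parsed JSON."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            found.extend(find_key(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_key(item, key))
    return found


# Hypothetical, trimmed-down response; a real one carries far more metadata
response = {
    "search": {
        "docs": [
            {
                "topkResults": [
                    {"matchDoc": {"text": "PICARD!Picard to Riker."}},
                    {"matchDoc": {"text": "RIKER!Riker here."}},
                ]
            }
        ]
    }
}

matches = find_key(response, "matchDoc")
for match in matches:
    print(match["text"])
```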
Congratulations! You’ve just built your very own search engine!
How Does it Actually Work?
For a more general overview of what neural search is and how it works, check one of my other previous articles. Jina itself is just one way to build a neural search engine, and it has a couple of important concepts: Flows and Pods:
The Flow tells Jina what tasks to perform on the dataset, like indexing or searching. Each Flow is built from individual Pods.
The Pods comprise the Flow and tell Jina how to perform each task step by step, like breaking text into chunks, indexing it, and so on. They define the actual neural networks we use in neural search, namely the language models like distilbert-base-cased. (Which we can see in pods/encode.yml)
Flows
Just as a plant manages nutrient flow and growth rate for its branches, a Flow manages the states and context of a group of Pods, orchestrating them to accomplish one task. Whether a Pod is remote or running in Docker, one Flow rules them all!
We define Flows in app.py to index and query the content in our Star Trek dataset.
In this case our Flows are written in YAML format and loaded into app.py with:
It really is that simple! Alternatively you can build Flows in app.py itself without specifying them in YAML.
No matter whether you’re dealing with text, graphics, sound, or video, all datasets need to be indexed and queried, and the steps for doing each (chunking, vector encoding) are more or less the same. How you perform each step may differ, and that’s where Pods come in!
Indexing
Every Flow has, well, a flow to it. Different Pods pass data along the Flow, with one Pod’s output becoming another Pod’s input. Look at our indexing Flow as an example:
If you look at startrek_tng.csv you'll see it's just one big text file. Our Flow processes it into something more suitable for Jina, which is handled by the Pods in the Flow. Each Pod performs a different task.
You can see the following Pods in flows/index.yml:
- crafter - Split the Document into Chunks
- encoder - Encode each Chunk into a vector
- chunk_idx - Build an index of Chunks
- doc_idx - Store the Document content
- join_all - Join the chunk_idx and doc_idx pathways
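The way data passes from Pod to Pod can be pictured as a chain of plain functions. The sketch below is a toy model, not Jina code: the function names only borrow the Pod names from the list, the "encoder" fakes a vector from string statistics instead of running a language model, and feeding doc_idx straight from the raw document is an assumption about how the two pathways split.

```python
def crafter(document):
    """Split a document into chunks (here: one chunk per non-empty line)."""
    return [line for line in document.splitlines() if line.strip()]


def encoder(chunks):
    """Stand-in for a neural encoder: map each chunk to a tiny fake vector."""
    return [(chunk, [len(chunk), chunk.count(" ") + 1]) for chunk in chunks]


def chunk_idx(encoded):
    """Index the chunk vectors by position."""
    return {position: vector for position, (_, vector) in enumerate(encoded)}


def doc_idx(document):
    """Store the original document content."""
    return {"content": document}


def join_all(chunk_index, doc_index):
    """Merge the chunk pathway and the document pathway."""
    return {"chunks": chunk_index, "doc": doc_index}


def index_flow(document):
    chunk_path = chunk_idx(encoder(crafter(document)))  # first pathway
    doc_path = doc_idx(document)  # second pathway
    return join_all(chunk_path, doc_path)


result = index_flow("PICARD!Engage.\nDATA!Aye, sir.")
print(result["chunks"])
```

In the real Flow the two pathways can run in parallel; here they simply run one after the other.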
The full file is essentially just a list of Pods with parameters and some setup at the top of the file:
Luckily, YAML is pretty human-readable. I regularly thank the Great Bird of the Galaxy it’s not in Klingon, or even worse, XML!
- The first couple of lines initialize the Flow and enable the logserver (which we’re not using in this tutorial).
- After that we can see the list of Pods, with their own YAML path and extra parameters being passed to each one.
So, is that all of the Pods? Not quite! We always have another Pod working in silence: the gateway Pod. Most of the time we can safely ignore it because it basically does all the dirty orchestration work for the Flow.
Searching
In the query Flow we’ve got the following Pods:
- chunk_seg - Segments the user query into meaningful Chunks
- tf_encode - Encode each word of the query into a vector
- chunk_idx - Build an index for the Chunks for fast lookup
- ranker - Sort results list
- doc_idx - Store the Document content
Again, flows/query.yml gives some setup options and lists the Pods in order of use:
When we were indexing we broke the Document into Chunks to index it. For querying we do the same, but this time the Document is the query the user types in, not the Star Trek dataset. We’ll use many of the same Pods, but there are a few differences to bear in mind. In the big picture:
- Index has a two-pathway design which deals with both Document and Chunk indexing in parallel, which speeds up message passing
- Query has a single pipeline
And digging into the flows/query.yml, we can see it has an extra Pod and some more parameters compared to flows/index.yml:
- rest_api: true - Use Jina's REST API, allowing clients like jinabox and curl to connect
- port_expose: $JINA_PORT - The port for connecting to Jina's API
- polling: all - Setting polling to all ensures all workers poll the message
- reducing_yaml_path: _merge_topk_chunks - Use _merge_topk_chunks to reduce results from all replicas
- ranker: - Rank results by relevance
How does Jina know whether it should be indexing or searching? In our RESTful API we set the mode field in the JSON body and send the request to the corresponding API:
- api/index - {"mode": "index"}
- api/search - {"mode": "search"}
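The pairing of mode and endpoint can be captured in a few lines. This is a client-side sketch only: ENDPOINTS and build_call are hypothetical names, and the default port is just the one assumed throughout this tutorial.

```python
import json

# The two mode/endpoint pairs listed above
ENDPOINTS = {"index": "api/index", "search": "api/search"}


def build_call(mode, data, port=65481, host="localhost"):
    """Return the URL and JSON body for a given mode."""
    if mode not in ENDPOINTS:
        raise ValueError(f"Unknown mode {mode!r}, expected one of {sorted(ENDPOINTS)}")
    url = f"http://{host}:{port}/{ENDPOINTS[mode]}"
    body = json.dumps({"mode": mode, "data": data})
    return url, body


url, body = build_call("search", ["text:picard to riker"])
print(url)
print(body)
```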
Pods
As we discussed above, a Flow tells Jina what task to perform and is composed of Pods. A Pod tells Jina how to perform that task (i.e. what the right tool for the job is). Both Pods and Flows are written in YAML.
Let’s start by looking at a Pod in our indexing Flow, flows/index.yml. Instead of the first Pod crafter, let's look at encoder which is a bit simpler:
As we can see in the code above, the encoder Pod’s YAML file is stored in pods/encode.yml, and looks like:
The Pod uses the built-in TransformerTorchEncoder as its Executor. Each Pod has a different Executor based on its task, and an Executor represents an algorithm, in this case encoding. The Executor differs based on what's being encoded. For video or audio you'd use a different one. The with field specifies the parameters passed to TransformerTorchEncoder.
- pooling_strategy - Strategy to merge word embeddings into chunk embedding
- model_name - Name of the model we're using
- max_length - Maximum length to truncate tokenized sequences to
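To get a feel for what max_length and pooling_strategy do, here is a toy example with hand-made two-dimensional vectors. It is not how TransformerTorchEncoder is implemented: mean pooling is just one possible strategy picked for illustration, and truncate, mean_pool, and the vectors are all made up.

```python
def truncate(tokens, max_length):
    """Keep only the first max_length tokens."""
    return tokens[:max_length]


def mean_pool(word_vectors):
    """Average word vectors dimension by dimension into one chunk vector."""
    if not word_vectors:
        raise ValueError("Need at least one word vector to pool")
    dimensions = len(word_vectors[0])
    return [
        sum(vector[d] for vector in word_vectors) / len(word_vectors)
        for d in range(dimensions)
    ]


tokens = ["make", "it", "so", "number", "one"]
kept = truncate(tokens, max_length=3)

# Made-up embeddings; a real model produces hundreds of dimensions per token
toy_vectors = {"make": [1.0, 0.0], "it": [0.0, 1.0], "so": [2.0, 2.0]}
chunk_embedding = mean_pool([toy_vectors[token] for token in kept])
print(kept, chunk_embedding)
```

Whatever the strategy, the point is the same: many word vectors go in, and a single vector per Chunk comes out, which is what gets indexed and compared at query time.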
When the Pod runs, data is passed in from the previous Pod, TransformerTorchEncoder encodes the data, and the Pod passes the data to the next Pod in the Flow.
For a deeper dive on Pods, Flows, Executors and everything else, you can refer to Jina 101.
Troubleshooting
Module not found
Be sure to run pip install -r requirements.txt before beginning, and ensure you have lots of RAM/swap and space in your tmp partition (see below issues). This may take a while since there are a lot of prerequisites to install.
If this error keeps popping up, look into the errors that were output onto the terminal to try to find which module is missing, and then run:
pip install <module_name>
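If you'd rather check up front than wait for the error, a few lines of standard-library Python can report which modules are not importable. missing_modules is a hypothetical helper, and the module names in the demo are placeholders; swap in the ones from your own error messages. It only handles top-level module names.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# "json" ships with Python; the second name is deliberately bogus
to_check = ["json", "surely_not_installed_module_xyz"]
missing = missing_modules(to_check)
print("Missing:", missing)
```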
My computer hangs
Machine learning requires a lot of resources, and if your machine hangs this is often due to running out of memory. To fix this, try creating a swap file if you use Linux. This isn’t such an issue on macOS, since it allocates swap automatically.
ERROR: Could not install packages due to an EnvironmentError: [Errno 28] No space left on device
This is often due to your /tmp partition running out of space so you'll need to increase its size.
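Before resizing anything, it can help to see how much room is actually left. The sketch below uses only the standard library; free_gigabytes is a hypothetical helper, and it checks whatever directory Python considers the temp directory, which is usually but not always /tmp.

```python
import shutil
import tempfile


def free_gigabytes(path):
    """Return the free space, in gigabytes, on the filesystem holding path."""
    return shutil.disk_usage(path).free / 1024 ** 3


temp_dir = tempfile.gettempdir()
print(f"{temp_dir}: {free_gigabytes(temp_dir):.1f} GB free")
```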
command not found
For this error you’ll need to install the relevant software package onto your system. In Ubuntu this can be done with:
sudo apt-get install <package_name>
Congratulations! We Did It!
In this tutorial you’ve learned:
- How to install the Jina neural search framework
- How to load and index text data from files
- How to query data with curl and Jinabox
- The nitty-gritty behind Jina Flows and Pods
- What to do if it all goes wrong
Now that you have a broad understanding of how things work, you can try out some more of our example tutorials to build image or video search, or stay tuned for our next set of tutorials that build upon your Star Trek app.
Got an idea for a tutorial covering Star Trek and/or neural search? My commbadge is out of order right now, but you can leave a comment or note on this article for me to assimilate!
Alex C-G is the Open Source Evangelist at Jina AI, and a massive Star Trek geek.
Translated from: https://towardsdatascience.com/build-a-bert-based-semantic-search-system-for-star-trek-7d7d28414cd8