Processing and Indexing Medical Images With Apache Hadoop and Apache Solr

Published: 2025/7/25

Original article: Processing and Indexing Medical Images With Apache Hadoop and Apache Solr
Author: Justin Kestelyn
Translator: 郭芮 (guorui@csdn.net)

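Still struggling to manage images at scale? Read on to see how this team used open source software to index and store high-resolution medical images more efficiently.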

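Medical imaging has rapidly become the best non-invasive way to evaluate a patient and determine whether a medical condition exists. In most cases, imaging used to assist diagnosis is the patient's first step through the modern medical system, and advances in imaging technology let us collect more detailed, higher-resolution 2D, 3D, 4D, and microscopic images that enable faster diagnosis and treatment of certain complex conditions.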

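A high-resolution microscopic scan of a human brain can reach 66TB. In 3D modalities such as computed tomography (CT), each image is about 4MB (2048x2048 pixels). Repositories across the United States currently hold roughly 750PB of medical images, a figure projected to reach 1EB by the end of 2016. Given those numbers, medical imaging is clearly becoming a big data problem.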

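Picture archiving and communication systems (PACS) are the industry standard for storing and retrieving medical images, and they provide proprietary data formats and objects. In this article, we share an alternative solution for storing and retrieving medical image files: an Apache Hadoop cluster (CDH) that provides high-performance, cost-effective distributed image processing. Its advantages include:

- Keeping medical imaging data close to other data sources already in use, such as genomics, electronic medical records, pharmaceuticals, and wearable device data, which reduces the cost of data storage and movement
- Lowering solution cost by using open source software and industry-standard servers
- Allowing medical image analytics to be combined with other forms of data within the same cluster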

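This article sets out to reproduce Intel's work as evidence that, on a CDH cluster, Apache Hadoop and Apache Solr can be used to index DICOM (Digital Imaging and Communications in Medicine) images and to store, manage, and retrieve them.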

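Solution Overview

We focus on a simple but important use case: the ability to store and retrieve DICOM images without relying on any other existing image repository. Hospitals and imaging facilities send copies of their DICOM images to a HIPAA-compliant cloud or locally hosted server cluster for storage. When needed, a physician or caregiver should be able to view, search, and retrieve a patient's DICOM images from a local machine connected to the remote cluster, as shown in Figure 1.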

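Figure 1

Detailed Description

A DICOM image contains two parts: a text header and a binary image. Making the images storable and searchable, as shown in Figures 2 and 3, takes the steps listed below.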

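Figure 2

Figure 3

Note that Figure 2 takes the software developer's perspective, whereas Figure 3 shows how the end user interacts with the solution. As software developers, we have to walk through the process ourselves in order to validate the functionality for the end user.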

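1. Extract the metadata from the DICOM image.
2. Store the original DICOM image in HDFS.
3. Use the metadata to generate an index file (which is also stored in HDFS).
4. Search for and retrieve images through the Hue interface.
5. Download the image to the local system and open it in a DICOM viewer.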

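Compared with the cost of high-end server nodes, industry-standard servers offer more flexibility in performance and storage. Our server nodes are dual-socket machines with Intel Xeon E5 family processors, 12 x 4TB (48TB) of storage, and 192GB of RAM. When processing or storage needs grow, they can be met by scaling out, that is, by adding nodes to the cluster.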

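System Setup and Configuration

Below are the key specifications of our test setup: a 6-node CDH cluster plus one edge node.

Software Requirements

To test this solution, we enabled the following CDH 5.4 services in Cloudera Manager: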

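HDFS
- Stores the DICOM XML files to be indexed
- Stores the indexing output
Apache Solr (SolrCloud mode)
- Indexes the given DICOM XML files according to schema.xml
Apache ZooKeeper
- Coordinates distributed indexing for Solr in SolrCloud mode
Hue (with the Search feature enabled)
- Displays Solr index results in a graphical interface and lets you search for DICOM images by indexed field
Cloudera Search
- Runs offline batch indexing through MapReduceIndexerTool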

Dataset

Our test data consisted of DICOM CT images downloaded from the Visible Human Project.

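Workflow

Figure 4 describes the workflow.

Figure 4

Step 1: Store all DICOM images in a local folder, then use the dcm2xml utility from the DICOM Toolkit (DCMTK) to extract the DICOM metadata from each image and save it in XML format. (See Appendix 1 for an example.)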

Example:
dcm2xml <input-file-path> <output-file-path>
./dcm2xml source.dcm source.xml

To run dcm2xml as in the example above, DCMTK must first be downloaded to the local machine and the following path set in .bashrc.

Example: export DCMDICTPATH="/home/root/Dicom_indexing/dicom-script/dcmtk-3.6.0/s
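Step 1 converts one file per call, so a folder of images needs a loop around dcm2xml. The following is a minimal sketch rather than the authors' script: the folder names dicom-in and dicom-xml and the helper functions are assumptions, and it presumes dcm2xml is on the PATH with DCMDICTPATH already set.

```shell
#!/bin/sh
# Sketch: convert every .dcm file in a folder to XML with DCMTK's dcm2xml.
# IN_DIR and OUT_DIR are hypothetical folders, not paths from the article.
IN_DIR="${IN_DIR:-./dicom-in}"
OUT_DIR="${OUT_DIR:-./dicom-xml}"

# Map an input path such as dir/source.dcm to $OUT_DIR/source.xml.
xml_path_for() {
    printf '%s/%s.xml\n' "$OUT_DIR" "$(basename "$1" .dcm)"
}

# Convert each image; report failures on stderr and keep going.
convert_all() {
    mkdir -p "$OUT_DIR"
    for dcm in "$IN_DIR"/*.dcm; do
        [ -e "$dcm" ] || continue    # the glob matched nothing
        dcm2xml "$dcm" "$(xml_path_for "$dcm")" || echo "failed: $dcm" >&2
    done
}

# Run only when dcm2xml is installed and the input folder exists.
if command -v dcm2xml >/dev/null 2>&1 && [ -d "$IN_DIR" ]; then
    convert_all
fi
```

Keeping the XML next to a predictable file name makes it easier to pair each metadata file with its source image later.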

Step 2: Import the converted DICOM images into HDFS at /user/hadoop/input-dir, and create /user/hadoop/output-dir to store the indexing results.
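As a rough illustration of Step 2, the staging could be scripted with the standard hadoop fs commands. This is a sketch under assumptions: LOCAL_XML_DIR is a hypothetical local folder holding the dcm2xml output, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch: stage converted files in HDFS for Step 2.
# LOCAL_XML_DIR is hypothetical; the two HDFS paths come from the article.
LOCAL_XML_DIR="${LOCAL_XML_DIR:-./dicom-xml}"
HDFS_IN=/user/hadoop/input-dir
HDFS_OUT=/user/hadoop/output-dir

# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

stage_in_hdfs() {
    run hadoop fs -mkdir -p "$HDFS_IN"
    run hadoop fs -mkdir -p "$HDFS_OUT"
    for xml in "$LOCAL_XML_DIR"/*.xml; do
        [ -e "$xml" ] || continue    # nothing to upload
        run hadoop fs -put "$xml" "$HDFS_IN/"
    done
}

stage_in_hdfs
```

Reviewing the printed commands first is a cheap way to catch a wrong path before anything is written to the cluster.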

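Step 3: Perform ETL processing. A Morphlines configuration file is recommended for this step; it parses and extracts the required content and produces files that Solr can index.

Step 4 (see the procedure below): Build the index from the fields of the given XML files, using the schema.xml configuration in Solr. MapReduceIndexerTool can be used for offline batch indexing (see Appendix 2).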

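First, confirm that the Solr service (in SolrCloud mode) is up and running on the cluster (visit http://<your-solr-server-name>:8983/solr).
Then use solrctl to generate the configuration files, including the schema.xml in which the fields to index are defined. The --zk option supplies the ZooKeeper hosts, which can also be found under the ZooKeeper service in Cloudera Manager. Note that the ZooKeeper port follows the last address in the list, and that hostip1, hostip2, and hostip3 stand for your ZooKeeper host IPs.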

solrctl --zk hostip1,hostip2,hostip3:2181/solr instancedir --generate $HOME/solr_config
cd $HOME/solr_config/conf
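Because the --zk value is easy to mistype, it can help to build it from a host list. The helper below is a hypothetical convenience, not part of solrctl; it simply reproduces the form used in this article, with port 2181 and the /solr suffix after the last host.

```shell
#!/bin/sh
# Sketch: build the value passed to solrctl --zk from a list of hosts.
# Output form follows the article: host1,host2,host3:2181/solr
zk_ensemble() {
    port=2181
    znode=/solr
    joined=""
    for host in "$@"; do
        if [ -z "$joined" ]; then
            joined="$host"
        else
            joined="$joined,$host"
        fi
    done
    printf '%s:%s%s\n' "$joined" "$port" "$znode"
}

zk_ensemble hostip1 hostip2 hostip3
# prints hostip1,hostip2,hostip3:2181/solr
```

The result can then be stored once, for example ZK=$(zk_ensemble hostip1 hostip2 hostip3), and reused as solrctl --zk "$ZK" in each of the commands that follow.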

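Next, download schema.xml to the local machine and edit it to include every field name to be indexed; each field name must match the name attribute of the corresponding element in the XML files being indexed. In this example, only 10-15 of the hundreds of fields in a DICOM XML file need to be indexed. Remember to put the modified schema.xml back in the conf folder (see Appendix 1 for a generated DICOM XML file and Appendix 2 for the customized schema.xml).
Clean up any existing collections and ZooKeeper instance directories: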

solrctl --zk hostip1,hostip2,hostip3:2181/solr collection --delete demo-collection >& /dev/null
solrctl --zk hostip1,hostip2,hostip3:2181/solr instancedir --delete demo-collection >& /dev/null

Upload the Solr configuration to SolrCloud.

solrctl --zk hostip1,hostip2,hostip3:2181/solr instancedir --create demo-collection $HOME/solr_config

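Create a Solr collection named demo-collection. The -s 2 option gives the collection two shards, -r 1 sets a replication factor of 1, and -m 2 allows at most two shards per node.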

solrctl --zk hostip1,hostip2,hostip3:2181/solr collection --create demo-collection -s 2 -r 1 -m 2

Create an empty directory to receive the output of the MapReduce batch indexer.

hadoop fs -rm -f -skipTrash -r output-dir
hadoop fs -mkdir -p output-dir

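For the ETL step, a Morphlines configuration file is again recommended (see Appendix 3).
Use MapReduceIndexerTool to index the data and push the index live. ${DICOM_WORKINGDIR} is the location of log4j.properties and morphlines.conf.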

${HDFS_DICOM_OUTDIR} - Location of output dir folder on hdfs (ex: /user/hadoop/output-dir)
${HDFS_DICOM_INDIR} - Location of input dir folder on hdfs (ex: /user/hadoop/input-dir/)

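Solr stores the index results in the output directory created above.

Step 5: View the index results in Hue; the DICOM image URL can be used to download the image locally for viewing.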
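The indexing command itself does not appear in this text, so the following is only a sketch of what a MapReduceIndexerTool run with these variables might look like. The jar location is an assumption that varies by CDH installation, the default working directory is hypothetical, and depending on the cluster the input and output locations may need to be full hdfs:// URIs.

```shell
#!/bin/sh
# Sketch of a MapReduceIndexerTool invocation; not the article's exact command.
# INDEXER_JAR and the DICOM_WORKINGDIR default are assumptions.
INDEXER_JAR="${INDEXER_JAR:-/opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-*-job.jar}"
DICOM_WORKINGDIR="${DICOM_WORKINGDIR:-$HOME/dicom-work}"
HDFS_DICOM_INDIR="${HDFS_DICOM_INDIR:-/user/hadoop/input-dir/}"
HDFS_DICOM_OUTDIR="${HDFS_DICOM_OUTDIR:-/user/hadoop/output-dir}"
ZK_HOST="${ZK_HOST:-hostip1,hostip2,hostip3:2181/solr}"

# Print the full command on one line so it can be reviewed before running.
build_index_cmd() {
    printf '%s ' hadoop jar "$INDEXER_JAR" \
        org.apache.solr.hadoop.MapReduceIndexerTool \
        --log4j "$DICOM_WORKINGDIR/log4j.properties" \
        --morphline-file "$DICOM_WORKINGDIR/morphlines.conf" \
        --output-dir "$HDFS_DICOM_OUTDIR" \
        --verbose --go-live \
        --zk-host "$ZK_HOST" \
        --collection demo-collection \
        "$HDFS_DICOM_INDIR"
    echo
}

build_index_cmd
```

Here --go-live asks the tool to merge the generated shards into the running collection, which matches the "push the index live" step described above. Once the printed command looks right, it can be executed with eval "$(build_index_cmd)".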

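Testing

To test the index results, first log in to the Hue interface (this assumes you already have access to Hue with Search enabled).

1. Click Search and navigate to Indexes > demo-collection > Search. The default view of the index results appears below.

2. Type a patient name or patient ID, or any other data from an indexed field. In this example, the screenshot below shows a patient's name and other identifiers.

3. When you expand a single result, you can see the metadata fields, as shown below.

4. The DICOM URL in the picture above is not clickable, so use Hue's graphical widgets to build a proper dashboard and add a clickable URL.

5. Click the DICOM URL to download the .dcm file to the local machine. In this version, we download the file to the local machine and view it with a free tool called MicroDicom Viewer.

6. View the image in MicroDicom Viewer.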
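The search in step 2 can also be checked without Hue by querying Solr's select handler over HTTP. This is a minimal sketch: SOLR_HOST is a placeholder for your Solr server, the patient ID is the sample value from Appendix 1, and the function names are hypothetical.

```shell
#!/bin/sh
# Sketch: query the demo-collection index directly over HTTP.
# SOLR_HOST is a placeholder; port 8983 is the one used earlier in the article.
SOLR_HOST="${SOLR_HOST:-your-solr-server-name}"

# Build the select URL for a collection.
select_url() {
    printf 'http://%s:8983/solr/%s/select\n' "$SOLR_HOST" "$1"
}

# Look up documents by PatientID and return a few stored fields as JSON.
search_patient() {
    curl --silent --get "$(select_url demo-collection)" \
        --data-urlencode "q=PatientID:$1" \
        --data-urlencode "fl=SOPInstanceUID,PatientName,StudyDate,DicomUrl" \
        --data-urlencode "wt=json"
}

# Example call (needs network access to the cluster):
# search_patient 655111
```

Because the fields in Appendix 2 are declared with type string, a query of this kind matches whole values only, so a partial name would need a wildcard or a different field type.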

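Future Work

We plan to keep developing this reference architecture, using a plugin to offer a more streamlined way of bringing DICOM files directly into the browser. We will also work toward better visualization capabilities and support for downloading multiple images at once.

Contributors:
Karthik Vadla, Software Engineer in the Big Data Solutions Pathfinding Group at Intel.
Abhi Basu, Software Architect in the Data Solutions Pathfinding Group at Intel.
Monica Martinez-Canales, Principal Engineer in the Data Solutions Pathfinding Group at Intel.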

Appendix 1

<?xml version="1.0"?>
<file-format>
  <meta-header xfer="1.2.840.10008.1.2.1" name="Little Endian Explicit">
    <element tag="0002,0000" vr="UL" vm="1" len="4" name="FileMetaInformationGroupLength">216</element>
    <element tag="0002,0001" vr="OB" vm="1" len="2" name="FileMetaInformationVersion" binary="hidden"></element>
    <element tag="0002,0002" vr="UI" vm="1" len="28" name="MediaStorageSOPClassUID">1.2.840.10008.5.1.4.1.1.6.1</element>
    <element tag="0002,0003" vr="UI" vm="1" len="58" name="MediaStorageSOPInstanceUID">1.2.826.0.1.3680043.2.307.111.48712655111.78571.301.34207</element>
    <element tag="0002,0010" vr="UI" vm="1" len="22" name="TransferSyntaxUID">1.2.840.10008.1.2.4.70</element>
    <element tag="0002,0012" vr="UI" vm="1" len="38" name="ImplementationClassUID">1.2.826.0.1.3680043.1.2.100.5.6.2.160</element>
    <element tag="0002,0013" vr="SH" vm="1" len="16" name="ImplementationVersionName">DicomObjects.NET</element>
  </meta-header>
  <data-set xfer="1.2.840.10008.1.2.4.70" name="JPEG Lossless, Non-hierarchical, 1st Order Prediction">
    <element tag="0008,0008" vr="CS" vm="2" len="16" name="ImageType">ORIGINAL\PRIMARY</element>
    <element tag="0008,0012" vr="DA" vm="1" len="8" name="InstanceCreationDate">20091111</element>
    <element tag="0008,0013" vr="TM" vm="1" len="10" name="InstanceCreationTime">164835.000</element>
    <element tag="0008,0014" vr="UI" vm="1" len="30" name="InstanceCreatorUID">1.2.826.0.1.3680043.2.307.111</element>
    <element tag="0008,0016" vr="UI" vm="1" len="28" name="SOPClassUID">1.2.840.10008.5.1.4.1.1.6.1</element>
    <element tag="0008,0018" vr="UI" vm="1" len="58" name="SOPInstanceUID">1.2.826.0.1.3680043.2.307.111.48712655111.78571.301.34207</element>
    <element tag="0008,0020" vr="DA" vm="1" len="8" name="StudyDate">20010215</element>
    <element tag="0008,0023" vr="DA" vm="1" len="8" name="ContentDate">20010215</element>
    <element tag="0008,0030" vr="TM" vm="0" len="0" name="StudyTime"></element>
    <element tag="0008,0033" vr="TM" vm="1" len="10" name="ContentTime">093006.000</element>
    <element tag="0008,0050" vr="SH" vm="0" len="0" name="AccessionNumber"></element>
    <element tag="0008,0060" vr="CS" vm="1" len="2" name="Modality">US</element>
    <element tag="0008,0070" vr="LO" vm="0" len="0" name="Manufacturer"></element>
    <element tag="0008,0090" vr="PN" vm="0" len="0" name="ReferringPhysicianName"></element>
    <element tag="0008,1030" vr="LO" vm="1" len="12" name="StudyDescription">CLR Standard</element>
    <element tag="0008,2111" vr="ST" vm="1" len="66" name="DerivationDescription">From DSR by TomoVision's DICOMatic 2.0 rev-2e (conversion module)</element>
    <element tag="0008,2124" vr="IS" vm="0" len="0" name="NumberOfStages"></element>
    <element tag="0008,212a" vr="IS" vm="0" len="0" name="NumberOfViewsInStage"></element>
    <element tag="0010,0010" vr="PN" vm="1" len="12" name="PatientName">BURRUS^NOLA</element>
    <element tag="0010,0020" vr="LO" vm="1" len="6" name="PatientID">655111</element>
    <element tag="0010,0030" vr="DA" vm="0" len="0" name="PatientBirthDate"></element>
    <element tag="0010,0040" vr="CS" vm="0" len="0" name="PatientSex"></element>
    <element tag="0018,0010" vr="LO" vm="0" len="0" name="ContrastBolusAgent"></element>
    <element tag="0018,1030" vr="LO" vm="1" len="12" name="ProtocolName">CLR Standard</element>
    <element tag="0018,5100" vr="CS" vm="0" len="0" name="PatientPosition"></element>
    <sequence tag="0018,6011" vr="SQ" card="4" len="784" name="SequenceOfUltrasoundRegions">
      ………………………………………
      </item>
    </sequence>
  </data-set>
</file-format>

Appendix 2

<field name="SOPInstanceUID" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="PatientID" type="string" indexed="true" stored="true" multiValued="false" />
<field name="StudyDescription" type="string" indexed="true" stored="true"/>
<field name="PatientName" type="string" indexed="true" stored="true" />
<field name="DicomUrl" type="string" stored="true"/>
<field name="ImageType" type="string" indexed="true" stored="true"/>
<field name="InstanceCreationDate" type="string" indexed="true" stored="true"/>
<field name="InstanceCreationTime" type="string" indexed="true" stored="true"/>
<field name="StudyDate" type="string" indexed="true" stored="true"/>
<field name="ContentDate" type="string" indexed="true" stored="true"/>
<field name="DerivationDescription" type="string" indexed="true" stored="true"/>
<field name="ProtocolName" type="string" indexed="true" stored="true"/>

Declare the unique key along with these fields (remove any existing uniqueKey tag and replace it with this one):

<uniqueKey>SOPInstanceUID</uniqueKey>
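Since each schema field name has to match a name attribute in the dcm2xml output, a quick check before indexing can catch typos. The sketch below is an assumption-laden helper, not part of the original workflow: XML_FILE is a hypothetical path, and the parsing relies on the one-tag-per-element layout shown in Appendix 1 rather than on a real XML parser.

```shell
#!/bin/sh
# Sketch: check that schema.xml field names exist as element names in a
# dcm2xml output file. XML_FILE is a hypothetical path.
XML_FILE="${XML_FILE:-source.xml}"

# Print the name attribute of every <element> tag, one per line.
element_names() {
    tr '<' '\n' < "$1" |
        sed -n 's/^element [^>]*name="\([^"]*\)".*/\1/p' |
        sort -u
}

# Print each requested field that has no matching element name.
missing_fields() {
    xml=$1
    shift
    names=$(element_names "$xml")
    for field in "$@"; do
        printf '%s\n' "$names" | grep -qxF "$field" || echo "$field"
    done
}

if [ -f "$XML_FILE" ]; then
    missing_fields "$XML_FILE" SOPInstanceUID PatientID PatientName \
        StudyDescription DicomUrl
fi
```

Run against the Appendix 1 sample, a check like this would report DicomUrl, which has no element of that name in the sample; the article does not say how that field is populated.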

Appendix 3

SOLR_LOCATOR : {
  # This is the name of the collection which we created with solrctl utility in our earlier steps
  collection : demo-collection
  # Zookeeper host names, you will find this information in Cloudera Manager at ZooKeeper service
  zkHost : "hostip1:2181, hostip2:2181, hostip3:2181/solr"
}

And include this specific XQuery inside the commands tag of morphlines

xquery {
  fragments : [
    {
      fragmentPath : "/"
      queryString : """
        for $data in /file-format/data-set
        return
        <record>
          <SOPInstanceUID>{$data/element[@name='SOPInstanceUID']}</SOPInstanceUID>
          <ImageType>{$data/element[@name='ImageType']}</ImageType>
          <InstanceCreationDate>{$data/element[@name='InstanceCreationDate']}</InstanceCreationDate>
          <InstanceCreationTime>{$data/element[@name='InstanceCreationTime']}</InstanceCreationTime>
          <StudyDate>{$data/element[@name='StudyDate']}</StudyDate>
          <ContentDate>{$data/element[@name='ContentDate']}</ContentDate>
          <DerivationDescription>{$data/element[@name='DerivationDescription']}</DerivationDescription>
          <ProtocolName>{$data/element[@name='ProtocolName']}</ProtocolName>
          <PatientID>{$data/element[@name='PatientID']}</PatientID>
          <PatientName>{$data/element[@name='PatientName']}</PatientName>
          <StudyDescription>{$data/element[@name='StudyDescription']}</StudyDescription>
          <DicomUrl>{$data/element[@name='DicomUrl']}</DicomUrl>
        </record>
      """
    }
  ]
}
}
