CNN Approach: Using an Image of the Polymer to Predict Its Glass Transition Temperature
In this article, we will use an image of the polymer structure to predict its glass transition temperature. The article follows a methodology similar to the one published in a recent research paper by Luis A. Miccio of the Materials Physics Center and the Donostia International Physics Center in Spain [2].
Introduction
The glass transition temperature is one of the crucial properties of polymers. It marks the temperature range below which the atoms of a supercooled liquid are temporarily frozen (without crystallizing) upon cooling. Predicting the glass transition temperature (Tg) provides valuable insight into the properties of polymers whose synthesis may otherwise be costly and time-consuming. Scientists have traditionally been keener to develop machine learning models that predict one property from several others (for instance, using several measured properties to predict tensile strength). During the last few years, the emphasis has shifted to Quantitative Structure-Property Relationships. This opens the possibility of predicting various properties from just the structure of the molecular compound (i.e. just its image), avoiding the need for any additional experimental properties or tedious calculations. In this article, we will use Convolutional Neural Networks to predict the Tg of unknown polymer compounds from an image of the polymer. This sounds cool: it means that drawing the monomeric unit on a whiteboard would be enough to predict its Tg. We do not need any other external information or properties of the polymer.
Importing Relevant Packages
Dataset
The dataset used in our study was gathered from a popular polymer database. It comprises 351 polymers, with their SMILES codes and molecular names as input attributes and their glass transition temperatures as the output variable. A subset of 300 polymers and their Tg values was used for training and validation, whereas the remaining 51 unseen polymers were used to test the results for both models, the CNN and the proposed ANN. The figure below shows the top 5 rows of the dataset. The dataset for this study can be found here.
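The 300/51 split described above can be sketched with NumPy alone. The article does not say how the two subsets were selected, so the random shuffle, the seed, and the function name `split_indices` are assumptions made purely for illustration.

```python
import numpy as np

def split_indices(n_total=351, n_train=300, seed=0):
    """Shuffle row indices and split them into train/validation and test parts.

    Assumption: a uniformly random split; the original selection rule is not stated.
    """
    rng = np.random.default_rng(seed)   # the seed is an arbitrary choice
    order = rng.permutation(n_total)    # random ordering of 0 .. n_total - 1
    return order[:n_train], order[n_train:]

train_idx, test_idx = split_indices()
print(len(train_idx), len(test_idx))  # 300 51
```

The returned index arrays could then be used to select rows from whatever table holds the SMILES codes and Tg values.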
Reading and Cleaning the Dataset
Top 5 rows of the Data Frame
Classifying Polymers
The dataset was explored manually using the Pandas library in Python and classified into eight classes of polymers: acrylates, styrenes, amides, alkenes, ethers, amides, carbonates, and others.
Exploratory Data Analysis
Pie Plot: The pie plot shows the composition of the dataset, with acrylates and styrenes being the largest contributors.
Box Plot: The box plot shows the spread of the underlying Tg distribution for each class of polymers. Styrenes tend to have higher Tg values, whereas acrylates have a fairly mixed distribution.
Distribution of Charge in Monomer
The open-source RDKit [1] Python package was used to render the molecular structures of the polymers in the dataset as drawings. One RDKit [1] function was employed to compute the Gasteiger partial charges of the monomeric units.
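As a rough sketch of the charge calculation, RDKit's `AllChem.ComputeGasteigerCharges` stores a `_GasteigerCharge` property on every heavy atom. The helper name `gasteiger_charges` and the styrene SMILES used here are illustrative assumptions, not taken from the article's code or dataset.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

def gasteiger_charges(smiles):
    """Return (element symbol, Gasteiger partial charge) for each heavy atom."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    AllChem.ComputeGasteigerCharges(mol)  # writes '_GasteigerCharge' on each atom
    return [(atom.GetSymbol(), atom.GetDoubleProp("_GasteigerCharge"))
            for atom in mol.GetAtoms()]

# Styrene monomer, used only as an example input.
for symbol, charge in gasteiger_charges("C=Cc1ccccc1"):
    print(f"{symbol}: {charge:+.3f}")
```

The per-atom values can be mapped onto the 2D drawing of the monomer to visualize how the charge is distributed.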
Feature Engineering
Feature engineering is vital for preparing the data for modeling and presenting the attributes in machine-readable form. As per the problem statement, the Tg prediction is based on images of the polymer chemical structure, which are fed into the CNN architecture in a form encoded from SMILES line notation. The main aim of feature engineering for this problem is to incorporate both the chemical structure and the chemical composition of the monomeric unit to predict Tg. This is achieved using SMILES line notation [2].
Introduction to SMILES Notation: SMILES stands for Simplified Molecular-Input Line-Entry System. It is a way of describing a chemical structure as a line notation using different characters. The image below shows the SMILES notation for the given chemical structure.
Molecular Structure to Image Encoding: We first defined a list containing all the unique characters that can be present in any given SMILES linear string for a polymer.
As the second step, the SMILES strings of the polymers are one-hot encoded into machine-readable binary images using this list of unique SMILES characters. The result is a multi-dimensional array of binary images that can be fed into the CNN architecture. Each binary image is a matrix of dimensions m × n, where n is the number of characters in the unique SMILES list and m is the number of characters in the longest SMILES code in the dataset. The figure below depicts the encoding process for one example polymer, poly(4-biphenyl acrylate).
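The two encoding steps can be sketched as follows. This is a minimal illustration that assumes character-level tokens (so a two-letter element such as Cl would be split into two characters) and zero-padding for strings shorter than m; the function names and the two example SMILES strings are hypothetical and not taken from the dataset.

```python
import numpy as np

def build_charset(smiles_list):
    """Step 1: sorted list of the unique characters across all SMILES strings."""
    return sorted(set("".join(smiles_list)))

def encode_smiles(smiles, charset, max_len):
    """Step 2: one-hot encode one SMILES string as an (m x n) binary image.

    Row i corresponds to the i-th character of the string, column j to the
    j-th entry of the charset. Rows past the end of the string stay all-zero.
    """
    index = {ch: j for j, ch in enumerate(charset)}
    image = np.zeros((max_len, len(charset)), dtype=np.uint8)
    for i, ch in enumerate(smiles):
        image[i, index[ch]] = 1
    return image

smiles_list = ["CCO", "C=Cc1ccccc1"]        # illustrative inputs only
charset = build_charset(smiles_list)        # n = len(charset)
max_len = max(len(s) for s in smiles_list)  # m = longest SMILES string
images = np.stack([encode_smiles(s, charset, max_len) for s in smiles_list])
print(images.shape)  # (2, 11, 5)
```

Stacking the per-polymer images gives an array of shape (number of polymers, m, n); a trailing channel axis can be added before passing it to a 2D convolutional layer.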
Molecular Structure to Image Encoding Process
The generated one-hot encoded image takes into account the chemical structure and the composition of the monomeric unit. The encoded image records, in binary form, how many atoms of each kind are present in the monomeric structure, along with the arrangement of the atoms in the polymer chain relative to each other.
The figure below shows the encoded images of the top five polymers in the data frame.
Encoded images for the first five polymers in the data frame
Model Implementation
The image-encoded molecular structure was fed as input to the CNN, and the target variable was the Tg of the given polymers, which is a continuous variable. The model was implemented using the Keras library, which serves as an Application Programming Interface (API) for TensorFlow.
Schematic of CNN [2]
Proposed Architecture: The final hyper-parameters were chosen by trying various combinations of the different hyper-parameters. The best configuration observed uses 64 filters with a window size of (5,5) in the first convolutional layer and 32 filters with a window size of (3,3) in the second. This is followed by a max-pooling layer with a window size of (3,3). After the max-pooling layer come three dense layers with 32, 10, and 1 neurons respectively, the final dense layer being the output of the model. All layers used the ReLU activation function with L2 regularization. The model achieved its best generalization when trained for up to 180 epochs with a batch size of 64 and a learning rate of 0.03. A validation split of 0.1 was used to monitor performance during training, together with a dropout probability of 0.1.
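A minimal Keras sketch of the architecture described above might look like this. Several details are assumptions because the text does not state them: the Adam optimizer, the L2 strength of 1e-3, the position of the dropout layer, and a linear (rather than ReLU) output unit for the regression target. The function name `build_tg_cnn` is hypothetical.

```python
import tensorflow as tf

def build_tg_cnn(m, n, l2_strength=1e-3, learning_rate=0.03):
    """Build a CNN that maps an (m x n) binary SMILES image to a single Tg value."""
    reg = tf.keras.regularizers.l2(l2_strength)  # assumed strength, not from the article
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(m, n, 1)),  # one channel per binary image
        tf.keras.layers.Conv2D(64, (5, 5), activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Conv2D(32, (3, 3), activation="relu", kernel_regularizer=reg),
        tf.keras.layers.MaxPooling2D(pool_size=(3, 3)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(32, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dropout(0.1),     # assumed placement of the 0.1 dropout
        tf.keras.layers.Dense(10, activation="relu", kernel_regularizer=reg),
        tf.keras.layers.Dense(1),         # assumed linear output for regression
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),  # assumed optimizer
        loss="mean_absolute_error",       # the training loss named in the Results section
    )
    return model

model = build_tg_cnn(m=60, n=40)  # placeholder image size
model.summary()
```

With encoded images `x` of shape (samples, m, n, 1) and targets `y`, training under the stated settings would be `model.fit(x, y, epochs=180, batch_size=64, validation_split=0.1)`, where `x` and `y` are placeholder names.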
Results
The figures below show the experimental and predicted values of the glass transition temperature for the training and the unseen test sets. For an ideal model, we would expect the predicted Tg values to equal the real values exactly, so the points would lie on a straight line through the origin with a slope of 1.
Real vs predicted Tg values for the training set
Real vs predicted Tg values for the unseen test set
Most of the examples are predicted very accurately when compared with the real Tg values. However, a few polymers contribute a significant level of uncertainty to the predictions because of a lack of sufficient training data. These polymers belong to the minority classes of esters and ethers, and because too few training examples are available for either class, their Tg is not learned effectively.
Loss Metrics: We used the mean absolute error as the loss function while training the neural network. For the final evaluation, however, we used the mean relative % error as the evaluation metric for our model. It can be written as

$$\text{Mean relative \% error} = \frac{100}{m} \sum_{i=1}^{m} \frac{|A_i - P_i|}{A_i}$$
where $A_i$ is the actual Tg value and $P_i$ is the predicted Tg value. The relative % error is averaged over the full set of m polymers. After training, we computed the mean relative errors for the training and test sets.
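A small NumPy sketch of this metric, assuming Tg values are positive (for example, expressed in kelvin) so that dividing by the actual value is well defined. The function name and the sample numbers are made up for illustration.

```python
import numpy as np

def mean_relative_percent_error(actual, predicted):
    """Average of |A_i - P_i| / A_i over all polymers, expressed in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(100.0 * np.mean(np.abs(actual - predicted) / actual))

# Hypothetical Tg values: the per-polymer relative errors are 10% and 5%.
print(mean_relative_percent_error([300.0, 400.0], [330.0, 380.0]))
```

Because each error is scaled by its own actual value, a 10 K miss counts for more on a low-Tg polymer than on a high-Tg one.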
The table given below shows the real and predicted Tg values for 4 unseen polymers. We can see that our predictions lie very close to the experimental Tg values signifying that the proposed model shows an excellent generalization ability.
Conclusions
In this study, we demonstrated the feasibility of using a CNN to predict the Tg of a polymer from the molecular structure and chemical composition of its monomeric units. We achieved relative errors of 6% and 7% on the training and test sets respectively. In my next article, I will use fully connected neural networks to predict the glass transition temperature. This new model will incorporate all kinds of intra-molecular interactions along with the chemical composition and molecular structure to predict Tg.
Credits
Special Thanks to Danish for contributing to this project.
[1] G. Landrum et al., “RDKit: cheminformatics and machine learning software,” RDKIT.ORG, 2013.
[2] Luis A. Miccio, Gustavo A. Schwartz, “From chemical structure to quantitative polymer properties prediction through convolutional neural networks,” Polymer, 2018.
Thank you for reading!
Translated from: https://towardsdatascience.com/cnn-approach-using-image-of-the-polymer-to-predict-its-glass-transition-temperature-4a64ee450450