CNN Approach: Using an Image of the Polymer to Predict Its Glass Transition Temperature

Published: 2023/12/15

In this article, we use an image of a polymer's structure to predict its glass transition temperature. The methodology is similar to that of a recent research paper by Luis A. Miccio of the Materials Physics Center and the Donostia International Physics Center in Spain [2].

Introduction

The glass transition temperature (Tg) is one of the crucial properties of a polymer. It marks the temperature range below which the atoms of a supercooled liquid become temporarily frozen in place, without crystallizing, upon cooling. Predicting Tg provides valuable insight into polymers whose synthesis may otherwise be costly and time-consuming. Scientists have long been keen to develop machine learning models that predict one property from others (for instance, using several other properties of a material to predict its tensile strength). Over the last few years, the emphasis has shifted to Quantitative Structure-Property Relationships, which open the possibility of predicting various properties from the structure of the compound alone (i.e., just its image), with no need for additional experimental properties or tedious calculations. In this article, we use Convolutional Neural Networks (CNNs) to predict the Tg of unknown polymers from an image of the polymer. In principle, drawing the monomeric unit on a whiteboard would be enough to predict its Tg; no other external information or properties of the polymer are needed.

Importing Relevant Packages

Dataset

The dataset used in this study was gathered from a popular polymer database. It comprises 351 polymers, with their SMILES codes and molecular names as input attributes and their glass transition temperatures as the output variable. A subset of 300 polymers and their Tg values was used for training and validating the models, whereas the remaining 51 unseen polymers were used to test the results of both models, the CNN and the proposed ANN. The figure below shows the top 5 rows of the dataset. The dataset for this study can be found here.
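The 300/51 split can be sketched as follows. This is a minimal illustration, assuming a plain shuffled split with a fixed seed; the article does not say how the 51 held-out polymers were selected, and `split_indices` is a hypothetical helper rather than code from the original study.

```python
import random


def split_indices(n_total, n_train, seed=0):
    """Shuffle row indices and split them into train/validation and test parts.

    Assumption: a plain random split; the article does not state how the
    held-out polymers were selected.
    """
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)
    return indices[:n_train], indices[n_train:]


# 351 polymers in total: 300 for training/validation, 51 kept unseen for testing.
train_idx, test_idx = split_indices(n_total=351, n_train=300)
print(len(train_idx), len(test_idx))
```

The two index lists can then be used to select rows of the data frame, so that no test polymer is ever seen during training.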

Code: Reading and Cleaning the Dataset

Figure: Top 5 rows of the Data Frame

Classifying Polymers

The dataset was explored manually using the pandas library in Python and classified into eight classes of polymers: acrylates, styrenes, amides, alkenes, ethers, amides, carbonates, and others.

Exploratory Data Analysis

Pie Plot: The pie plot in the figure shows the composition of the dataset by polymer class, with acrylates and styrenes being the largest contributors.

Box Plot: The box plot shows the underlying Tg distribution for each class of polymers. Styrenes tend to have higher Tg values, whereas acrylates have a fairly mixed distribution.

Distribution of Charge in Monomer

The open-source RDKit [1] Python package was used to render the molecular structures of the polymers in the dataset as drawings. One function in the RDKit [1] module was used to compute the Gasteiger partial charges of the monomeric units.

Feature Engineering

Feature engineering is vital in preparing data for modeling and in presenting attributes in machine-readable form. As per the problem statement, the Tg prediction was to be based on images of the polymer's chemical structure, fed into the CNN architecture in a form encoded from SMILES line notation. The main aim of feature engineering for this problem is to capture both the chemical structure and the chemical composition of the monomeric unit in order to predict Tg. This is achieved using SMILES line notation [2].

Introduction to SMILES Notation: SMILES stands for Simplified Molecular-Input Line-Entry System. It is a way of describing a chemical structure as a single line of text built from a small set of characters. The image below shows the SMILES notation for the given chemical structure.

Molecular Structure to Image Encoding: We first defined a list containing all the unique characters that can be present in any given SMILES linear string for a polymer.
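Building that character list can be sketched as below. This is an illustration only: the three SMILES strings (ethanol, benzene, acetic acid) are stand-ins rather than rows of the dataset, `build_vocabulary` is a hypothetical helper, and the sketch assumes each character is treated separately, so a two-letter element symbol such as Cl would contribute two entries.

```python
def build_vocabulary(smiles_list):
    """Return the sorted list of unique characters found in the SMILES strings."""
    return sorted(set("".join(smiles_list)))


# Illustrative SMILES strings (ethanol, benzene, acetic acid), not from the dataset.
example_smiles = ["CCO", "c1ccccc1", "CC(=O)O"]
vocabulary = build_vocabulary(example_smiles)
print(vocabulary)
```

Sorting the characters keeps the column order of the encoded images reproducible from one run to the next.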

As the second step, the SMILES string of each polymer is one-hot encoded into a machine-readable binary image using this list of unique SMILES characters. The result is a stack of binary images, one per polymer, that can be fed into the CNN architecture. Each binary image is a matrix of dimensions m × n, where n is the number of characters in the unique SMILES list and m is the number of characters in the longest SMILES code in the dataset. The figure below depicts the encoding process for one example polymer, poly(4-biphenyl acrylate).
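The encoding step can be sketched in a few lines. This is a minimal version under stated assumptions: the SMILES strings are illustrative stand-ins, `one_hot_encode` is a hypothetical helper, and strings shorter than the longest one are padded with all-zero rows, which the article does not spell out.

```python
def one_hot_encode(smiles, vocabulary, max_length):
    """Encode one SMILES string as a max_length x len(vocabulary) binary matrix.

    Row i has a single 1 in the column of the i-th character; rows beyond the
    end of the string stay all-zero (padding, an assumption of this sketch).
    """
    column = {char: j for j, char in enumerate(vocabulary)}
    image = [[0] * len(vocabulary) for _ in range(max_length)]
    for i, char in enumerate(smiles):
        image[i][column[char]] = 1
    return image


# Illustrative SMILES strings, not taken from the dataset.
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
vocabulary = sorted(set("".join(smiles_list)))   # n unique characters
max_length = max(len(s) for s in smiles_list)    # m = longest SMILES string
images = [one_hot_encode(s, vocabulary, max_length) for s in smiles_list]
print(len(images), len(images[0]), len(images[0][0]))
```

Each entry of `images` is one m × n binary image. Converting the nested lists to an array and adding a trailing channel axis would give the shape a 2-D convolutional layer typically expects.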

Figure: Molecular Structure to Image Encoding Process

The generated one-hot encoded image takes into account both the chemical structure and the composition of the monomeric unit. The encoded image records, in binary form, how many atoms of each kind are present in the monomeric structure, along with the order in which the atoms are arranged relative to each other in the polymer chain.

The figure below shows the encoded images of the first five polymers in the data frame.

Figure: Encoded images for the first five polymers in the data frame

Model Implementation

The image-encoded molecular structure was fed as input to the CNN, and the target variable was the Tg of the given polymer, which is a continuous variable. The model was implemented using the Keras library, which serves as a high-level Application Programming Interface (API) for TensorFlow.

Figure: Schematic of the CNN [2]

Proposed Architecture: The final hyper-parameters were chosen by trying various combinations of the different hyper-parameters. The best configuration observed uses 64 filters with a window size of (5,5) in the first convolutional layer and 32 filters with a window size of (3,3) in the second. This is followed by a max-pooling layer with a window size of (3,3). After the max-pooling layer come three dense layers with 32, 10, and 1 neurons respectively, the final dense layer being the output of the model. All layers used the ReLU activation function with L2 regularization. The model achieved its best generalization when trained for up to 180 epochs with a batch size of 64 and a learning rate of 0.03. A validation split of 0.1 and a dropout probability of 0.1 were used while training the network.
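To see how the layer sizes fit together, the sketch below traces the tensor shape and the trainable-parameter count through the layers listed above. It is plain arithmetic rather than the author's Keras code, and it rests on several assumptions: unpadded convolutions with stride 1, a pooling stride equal to the window size (both are the Keras defaults), and a hypothetical 100 × 40 input image, since the real m and n depend on the dataset.

```python
def conv_output(shape, filters, kernel):
    """Output shape of a stride-1, unpadded ('valid') 2-D convolution."""
    height, width, _ = shape
    return (height - kernel[0] + 1, width - kernel[1] + 1, filters)


def pool_output(shape, window):
    """Output shape of max pooling with stride equal to the window size."""
    height, width, channels = shape
    return (height // window[0], width // window[1], channels)


def conv_params(in_channels, filters, kernel):
    """Trainable weights plus biases of a 2-D convolution layer."""
    return kernel[0] * kernel[1] * in_channels * filters + filters


def dense_params(inputs, units):
    """Trainable weights plus biases of a fully connected layer."""
    return inputs * units + units


# Hypothetical input size: m = 100 rows (longest SMILES), n = 40 columns
# (vocabulary size), one channel. The real m and n depend on the dataset.
shape = (100, 40, 1)
total = 0

total += conv_params(shape[2], 64, (5, 5))
shape = conv_output(shape, 64, (5, 5))      # first conv: 64 filters, 5x5
total += conv_params(shape[2], 32, (3, 3))
shape = conv_output(shape, 32, (3, 3))      # second conv: 32 filters, 3x3
shape = pool_output(shape, (3, 3))          # max pooling, 3x3 window

features = shape[0] * shape[1] * shape[2]   # flatten before the dense layers
for units in (32, 10, 1):                   # dense layers: 32, 10, 1 neurons
    total += dense_params(features, units)
    features = units

print(shape, total)
```

Under these assumptions almost all of the parameters sit in the first dense layer, which is one reason dropout and L2 regularization matter for a dataset of only a few hundred polymers.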

Results

The figures below show the experimental and predicted values of the glass transition temperature for the training set and the unseen test set. For an ideal model, the predicted Tg values would exactly equal the real values, so all points would fall on a straight line of slope 1 passing through the origin.

Figure: Real vs predicted Tg values for the training set

Figure: Real vs predicted Tg values for the unseen test set

Most of the examples are predicted accurately when compared with the real Tg values. However, a few polymers show significant prediction error because of insufficient training data. These polymers belong to the minority classes of esters and ethers; with too few training examples of either, the Tg of these polymers is not learned effectively.

Loss Metrics: We used the mean absolute error as the loss function while training the neural networks. For the final evaluation, however, we used the mean relative % error as the evaluation metric for our model. This can be represented as follows:
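Written out from the definitions that follow, and assuming the absolute difference is taken relative to the actual value, the metric takes the form:

```latex
\text{Mean relative error (\%)} = \frac{100}{m} \sum_{i=1}^{m} \frac{\lvert A_i - P_i \rvert}{A_i}
```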

where Ai is the actual Tg value and Pi is the predicted Tg value. The relative % error was averaged over the full dataset of m polymers. After training, we computed the mean relative errors on the training and test sets.
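A minimal sketch of this metric is shown below. It assumes the absolute difference is divided by the actual value and that Tg is expressed in kelvin, so the denominator is never zero; `mean_relative_error` is a hypothetical helper and the Tg values are made up for illustration.

```python
def mean_relative_error(actual, predicted):
    """Mean relative error in percent over paired actual/predicted Tg values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Made-up Tg values in kelvin, for illustration only.
actual_tg = [300.0, 400.0, 250.0]
predicted_tg = [285.0, 420.0, 250.0]
print(round(mean_relative_error(actual_tg, predicted_tg), 2))
```

Because each error is scaled by the actual value, a 15 K miss counts for more on a low-Tg polymer than on a high-Tg one, which is worth keeping in mind when comparing classes.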

The table below shows the real and predicted Tg values for 4 unseen polymers. The predictions lie close to the experimental Tg values, suggesting that the proposed model generalizes well.

Conclusions

In this study, we demonstrated the feasibility of using a CNN to predict the Tg of a polymer by taking into account the molecular structure and chemical composition of its monomeric units. We achieved mean relative errors of 6% and 7% on the training and test sets, respectively. In my next article, I will use fully connected neural networks to predict the glass transition temperature. That model will incorporate various intra-molecular interactions along with the chemical composition and molecular structure to predict Tg.

Credits

Special Thanks to Danish for contributing to this project.

[1] G. Landrum et al., "RDKit: Cheminformatics and machine learning software," rdkit.org, 2013.

[2] Luis A. Miccio and Gustavo A. Schwartz, "From chemical structure to quantitative polymer properties prediction through convolutional neural networks," Polymer, 2018.

Thank you for reading!

If you like my work and want to support me:

1-The BEST way to support me is by following me on Medium.

2-Follow me on LinkedIn.

Translated from: https://towardsdatascience.com/cnn-approach-using-image-of-the-polymer-to-predict-its-glass-transition-temperature-4a64ee450450