Anticancer Peptides Prediction: A Deep Learning Approach

Anticancer peptides play a vital role in the treatment of cancer and have therefore attracted considerable attention. Several machine learning and deep learning algorithms have been developed for the prediction of anticancer peptides. Machine learning algorithms require feature extraction from the dataset before a model is trained to make predictions. Feature extraction and model training take considerable time and effort, which makes the process complex for biologists and biochemists. Deep learning algorithms, on the other hand, require a large dataset for training and accurate prediction. This study proposes a deep learning algorithm that can be trained on a smaller dataset because it uses a hyperparameter optimization framework for the accurate prediction of anticancer peptides. The deep learning model outperformed all the compared algorithms, achieving 99% accuracy (Acc) and 0.982 MCC on the Main dataset, and 98% Acc and 0.972 MCC on the Alternative dataset. The code is available on GitHub for validation purposes [33].


Introduction
Anticancer peptides (ACPs) are short chains of amino acids (mostly 10-60 residues) joined by peptide bonds that show selective and lethal activity towards cancer cells. Owing to innate properties such as high penetration, high selectivity, ease of modification, and low manufacturing cost, synthetic peptide-based medicines and immunogens represent an encouraging class of therapeutic agents that show decreased drug resistance and also suppress the angiogenesis of cancer cells. One example is their increasing use in the treatment of hepatocellular carcinoma (HCC). Specially formulated ACPs can improve affinity, selectivity, and consistency, leading to greater tumor cell eradication. With regard to cell permeability, the influence of amino acid residues on the anticancer action of ACPs depends on their cationic, hydrophobic, and amphiphilic qualities and on their double stranded structure. In particular, negatively charged amino acid residues (such as glutamic and aspartic acids) exhibit anti-accretion activity against tumor cells, whereas positively charged amino acid residues (such as lysine, arginine, and histidine) can disrupt and enter the cancer cell membrane to cause cytotoxicity. Hydrophobic amino acid residues, such as tryptophan, phenylalanine, and tyrosine, also contribute to the cytotoxic action against malignant cells. Finally, the secondary structure of ACPs, which are composed of both positively charged and hydrophobic amino acids, is crucial for the interaction of peptides with cancer cell membranes [1][2][3][4][5].
Artificial intelligence refers to machines that can learn and perform tasks that otherwise require human involvement. Machine learning (ML) is the branch of artificial intelligence in which machines acquire skills by learning from data. Deep learning is a branch of machine learning that uses artificial neural networks (ANNs) to learn from a dataset. In ML, features must be extracted from the data before they are passed to the model for learning.
In deep learning, raw data is passed to the model, which generates the features by itself. A neural network is a kind of network inspired by the human brain [6]. In deep learning, a model is trained to perform specific tasks by taking raw inputs from labeled data. The dataset may contain sound, text, sequences, or images. Deep learning has achieved results that were not possible before. Trained deep learning models can even exceed human-level performance on some tasks, because they use multi-layered neural network architectures and large labeled datasets [7]. The following figure shows the architecture of a deep learning model.
Several machine learning and deep learning algorithms have been developed to distinguish ACPs from non-ACPs. Ankur et al. conducted an experiment on the prediction of cell-penetrating peptides [8]. A support vector machine (SVM) algorithm was used for the prediction. The dataset used in the study consisted of 708 peptides. During the experiment several motifs were identified in cell-penetrating peptides, and based on these a hybrid model was developed for the prediction. The model achieved 81.31% accuracy and 0.63 MCC (Matthews correlation coefficient). Yu et al. conducted an experiment on the prediction of therapeutic peptides using novel feature encoding and adaptive feature learning [9]. The dataset used in the experiment consisted of eight therapeutic peptides. A RandomForest classifier was used for the prediction. The model achieved 98% accuracy for the ABP therapeutic peptide because it has 1600 training samples and both classes are balanced. Shaherin et al. conducted a study on machine intelligence in peptide therapeutics [10]. The study analyzed the existing prediction algorithms using a well-constructed dataset and compared their accuracy and prediction scores.
The study provides a brief explanation of how to build accurate algorithms for anticancer peptide prediction. Atul et al. conducted an experiment on anticancer peptides using an in silico model (AntiCP) [11]. The dataset used in the experiment was derived from Swiss-Prot. Multiple feature extraction methods were used, including amino acid composition, dipeptide composition, and binary profile. A support vector machine (SVM) algorithm was used for the prediction. The binary-profile-based SVM model achieved a maximum of 91.44% accuracy and 0.83 MCC. Zohre et al. conducted a study on the prediction of anticancer peptides using Chou's amino acid composition [12]. An SVM algorithm was used in the experiment. The model achieved a maximum accuracy of 89.7% based on the local alignment kernel method. Saravanan et al. conducted a study on the prediction of anticancer peptides (ACPP) [13]. Multiple datasets were used in the study, including a newly developed dataset. An SVM classifier was used for the prediction, and the model achieved 96% accuracy and 0.97 MCC. Wei et al. conducted an experiment on identifying anticancer peptides using a sequence-based tool (iACP) [14]. The dataset used in the experiment consisted of 150 positive and 150 negative samples for anticancer peptides. G-gap dipeptide was used for feature extraction and an SVM algorithm was used for the prediction. The model achieved 92.67% accuracy and 0.85 MCC on the independent dataset. Feng et al. conducted an experiment on identifying anticancer peptides using improved hybrid composition [15]. Amino acid composition (AAC), average chemical shifts (acACS), and reduced amino acid composition (RAAC) were used to extract the features, after which an SVM algorithm was used for the prediction. Jackknife testing achieved 93.61% accuracy. To reach that accuracy, all the extracted features were fused together.
Shahid et al. conducted an experiment on anticancer peptide prediction using a hybrid feature space (iACP-GAEnsC) [16]. Amino acid composition (AAC), dipeptide composition (DPC), and reduced amino acid composition (RAAC) were used to extract features. The features were tested separately as well as fused, using genetic-algorithm-based ensemble classification. The model achieved 96.45% accuracy. Muhammad et al. conducted an experiment on the discrimination of anticancer peptides by incorporating sequential and evolutionary profile information (TargetACP) [17]. An SVM algorithm was used in the experiment. With jackknife cross-validation on the benchmark dataset, the model achieved 98.78% accuracy; on the independent dataset it achieved 94.66% accuracy. Nalini et al. conducted an experiment on the prediction and analysis of anticancer peptides using a computational tool (ACPred) [18]. SVM and RandomForest classifiers were used in the experiment. With jackknife cross-validation, the model achieved 95.61% accuracy in identifying anticancer peptides. Balachandran et al. conducted an experiment on anticancer peptides using machine learning algorithms (MLACP) [19]. Amino acid composition (AAC), dipeptide composition (DPC), atomic composition (AC), and physicochemical properties were used for feature extraction. SVM and RandomForest classifiers were used for the prediction. The model achieved 88.7% accuracy and 0.78 MCC. Chuanyan et al. conducted an experiment on the prediction of therapeutic peptides using deep learning and word2vec (PTPD) [20]. Two datasets were used in the experiment: the independent dataset consisted of 138 ACPs and 206 non-ACPs, while the virulent protein dataset consisted of 225 ACPs and 2250 random proteins from Swiss-Prot. The model achieved 96% accuracy on the independent anticancer peptide dataset and 94% on the virulent protein dataset.
Leyi et al. conducted an experiment on the prediction of anticancer peptides using a sequence-based predictor (ACPred-FL) [21]. An SVM algorithm was used in the experiment, and a new dataset was constructed for identifying ACPs. The model achieved significantly higher accuracy in both 10-fold cross-validation and independent testing. Leyi et al. conducted another experiment on the prediction of therapeutic peptides using adaptive feature learning (PEPred-Suite) [22]. Eight benchmark datasets were collected and used in the study. RandomForest classifiers were used for the prediction; eight RandomForest models were trained using the learnt representative features. The best AUC, 93.6%, was achieved on the AVP dataset. Hai et al. conducted an experiment on anticancer peptides using a long short-term memory (LSTM) deep learning classifier (ACP-DL) [23]. Two datasets, ACP740 and ACP240, were used in the experiment. In 5-fold cross-validation, the LSTM model achieved 0.69 MCC on ACP740 and 0.71 MCC on ACP240. Bing et al. conducted an experiment on the prediction of anticancer peptides by fusing multi-view information (ACPred-Fuse) [24]. A machine learning algorithm was used for the prediction. The multi-view features were compared with existing feature descriptors and were found to discriminate the characteristics of ACPs better. Piyush et al. conducted an experiment on anticancer peptide prediction using an updated model (AntiCP 2.0) [25]. Various features were used for the machine learning models. Two datasets were used in the experiment, the Main and Alternative datasets. On the Main dataset, dipeptide composition (DPC) achieved a maximum of 0.51 MCC with the ExtraTree classifier; on the Alternative dataset, the amino acid composition (AAC) based ExtraTree classifier achieved 0.80 MCC. Phasit et al. conducted a study on anticancer peptides using a novel flexible scoring card method (iACP-FSCM) [26].
The benchmark dataset from AntiCP 2.0 [25] was used in that experiment. Three feature extraction methods were used: amino acid composition (AAC), dipeptide composition, and composition of the terminal region. After feature extraction, the model was trained to make predictions on anticancer peptides. On independent testing, the classifier achieved an accuracy of 0.825 on the Main dataset and 0.910 on the Alternative dataset. Our study proposes a deep learning based approach that also outperformed STALLION [27]. STALLION is a model used to predict Kace sites.

Materials and Methods
The benchmark dataset was collected from AntiCP 2.0 [25] and is used for a fair comparison of the proposed model. In the first experiment, machine learning algorithms were used for anticancer peptide prediction. Machine learning algorithms require feature extraction from the dataset. The iFeature [28] live server was used to extract features from the protein sequences, and multiple feature extraction methods were applied to the dataset. Lazypredict [29] was then used for prediction. Lazypredict is an open source Python library that uses 40 machine learning algorithms for classification. The top two algorithms were selected from Lazypredict, and the experiment was conducted with separate features as well as fused features. Data fusion is a process in which all the extracted features are combined and passed to the machine learning classifier. After data fusion the model performed well but failed to outperform iACP-FSCM. A deep learning model was then used in the experiment, and it outperformed all the other predictors in anticancer peptide prediction.
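To make feature extraction and data fusion concrete, the following is a minimal sketch in plain Python. It computes amino acid composition (AAC) and dipeptide composition (DPC) for one peptide and fuses them by concatenation. It is not the iFeature implementation; the function names and the example peptide are hypothetical, and only the 20 standard residues are assumed to occur.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def aac(sequence):
    """Amino acid composition: fraction of each of the 20 residues."""
    n = len(sequence)
    return [sequence.count(a) / n for a in AMINO_ACIDS]


def dpc(sequence):
    """Dipeptide composition: fraction of each of the 400 residue pairs."""
    pairs = [sequence[i:i + 2] for i in range(len(sequence) - 1)]
    total = max(len(pairs), 1)  # guard against sequences shorter than 2
    return [pairs.count(a + b) / total
            for a, b in product(AMINO_ACIDS, repeat=2)]


def fused_features(sequence):
    """Data fusion: concatenate the AAC and DPC vectors (20 + 400 values)."""
    return aac(sequence) + dpc(sequence)


# Hypothetical peptide used only for illustration.
example = "GLFDIVKKVVGALGSL"
vector = fused_features(example)
print(len(vector))  # 420
```

The fused vector is what would be passed to a classifier; other descriptors can be appended in the same way.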
KerasTuner [30], which uses TensorFlow at the back end, is used to tune the model. KerasTuner is a hyperparameter optimization framework that finds the best hyperparameter values for a model. In KerasTuner, multiple unit values are passed to the model so that it is trained and tested with each of the given values. Deep learning models take raw input sequences, compute the features themselves, and then train to make predictions. Figure 2 shows the architecture of the proposed deep learning model.
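A network cannot consume peptide strings directly, so raw sequences are usually converted to fixed-length integer vectors first. The text does not state which encoding was used, so the sketch below is an assumption for illustration only: each residue is mapped to an index from 1 to 20, and shorter sequences are padded with 0. The names `RESIDUE_TO_INDEX` and `encode` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Index 0 is reserved for padding; residues map to 1..20.
RESIDUE_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence, max_len):
    """Map a peptide to a fixed-length list of integer indices."""
    # Unknown residues map to 0, the same value as padding
    # (an illustrative choice, not taken from the source).
    ids = [RESIDUE_TO_INDEX.get(aa, 0) for aa in sequence[:max_len]]
    return ids + [0] * (max_len - len(ids))


encoded = encode("GLFDIVKK", max_len=12)
print(encoded)  # [6, 10, 5, 3, 8, 18, 9, 9, 0, 0, 0, 0]
```

Vectors of this form can be fed to an embedding layer, which lets the network learn its own residue features.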

Computational Environment
Google Colaboratory is used in the development of the proposed approach. It is a cloud-based, online Jupyter Notebook platform that uses the Python programming language. We used Google Colaboratory because the required Python libraries are already available on the server and do not need to be installed manually. The graphics processing unit (GPU) offered by Google Colaboratory also helps deep learning algorithms train and test very quickly.

Architecture of Proposed Deep Learning Model
The best performance is then selected based on the given units. Table 1 contains the parameters used in KerasTuner, with 20 epochs as configured in the script. Figure 3 shows the training accuracy of the proposed deep learning model. The model's training accuracy changes as the units are changed.
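Selecting the best unit count can be illustrated with a plain-Python random search, one of the search strategies KerasTuner offers. This sketch does not call KerasTuner or train a network: `toy_objective` is a hypothetical stand-in for the validation accuracy obtained after training with a given number of units, and the candidate range is likewise an assumption.

```python
import random


def random_search(objective, unit_choices, n_trials, seed=0):
    """Evaluate randomly sampled unit counts and keep the best-scoring one."""
    rng = random.Random(seed)
    best_units, best_score = None, float("-inf")
    for _ in range(n_trials):
        units = rng.choice(unit_choices)
        score = objective(units)
        if score > best_score:
            best_units, best_score = units, score
    return best_units, best_score


def toy_objective(units):
    """Hypothetical stand-in for validation accuracy after training."""
    return 1.0 - abs(units - 128) / 512


# Assumed candidate unit counts: 32, 64, ..., 512.
choices = list(range(32, 513, 32))
best_units, best_score = random_search(toy_objective, choices, n_trials=10)
print(best_units, best_score)
```

In a real run the objective would build, train, and evaluate the model for each sampled value, which is why the number of trials and epochs drives the tuning cost.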

Results
In the first experiment, machine learning algorithms were used for the prediction of anticancer peptides. The fused features were passed to the LGBM and ExtraTree classifiers. On independent testing, the LGBM classifier achieved 0.64 MCC on the Main dataset and 0.67 MCC on the Alternative dataset. After hyperparameter optimization, the LGBM classifier achieved 0.65 MCC on the Main dataset and 0.69 MCC on the Alternative dataset. In the second experiment, the Main and Alternative datasets were split into training and testing samples. 5-fold cross-validation was applied to the datasets to ensure that every portion of the data is passed to the deep learning model for both training and testing. The mean of the results across the folds was then calculated.
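The evaluation protocol can be sketched as follows, using the standard definitions of accuracy and the Matthews correlation coefficient computed from confusion-matrix counts, together with a simple 5-fold index split. The helper names and the example counts are hypothetical and are not results from this study; in practice the samples would normally be shuffled before splitting.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of correctly classified samples."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, test_idx) so each sample is tested exactly once."""
    indices = list(range(n_samples))
    sizes = [n_samples // k + (1 if i < n_samples % k else 0)
             for i in range(k)]
    start = 0
    for size in sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def mean(values):
    """Average of the per-fold scores."""
    return sum(values) / len(values)


# Hypothetical confusion-matrix counts for a single fold.
acc_value = accuracy(tp=90, tn=85, fp=15, fn=10)
mcc_value = mcc(tp=90, tn=85, fp=15, fn=10)
folds = list(k_fold_indices(12, k=5))
print(acc_value, round(mcc_value, 3), [len(test) for _, test in folds])
```

Each fold's model is scored on its held-out portion, and the per-fold accuracy and MCC values are averaged with `mean`.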
The proposed deep learning model outperformed iACP-FSCM on both the Main and Alternative datasets. Table 2 shows the comparison between KerasTuner and iACP-FSCM.

Computational Environment
In this section, the same experimental setting is used to distinguish ACPs from non-ACPs. KerasTuner is used to tune the deep learning model; it is a hyperparameter optimization framework that searches for the hyperparameter values that yield the most accurate predictions. It provides the Hyperband, Random Search, and Bayesian Optimization algorithms; Random Search is used here for the prediction of anticancer peptides. The proposed deep learning model is trained on the training dataset and tested on the independent dataset. The model out-

Conclusion
In this study, a deep learning approach based on a hyperparameter optimization framework is proposed for the accurate prediction of anticancer peptides. Previously, several machine learning and deep learning models were used for the prediction of anticancer peptides. Machine learning algorithms require feature extraction from the dataset, which is a complex and time-consuming process; multiple feature extraction methods were used in this study. After extracting the features, Lazypredict was used to identify the classifiers with the highest accuracy.
The LGBM and ExtraTree classifiers achieved the highest accuracy, but both failed to outperform iACP-FSCM. We found that machine learning approaches require feature extraction from the protein sequences, data fusion, and model training; this is a complex and time-consuming process, and the results were not satisfactory. Deep learning algorithms, on the other hand, require a large dataset to train the classifier, but large datasets of anticancer peptides are not currently available. KerasTuner, a hyperparameter optimization framework that uses TensorFlow at its back end, was used to address this problem. The deep learning model tuned with KerasTuner outperformed all the other predictors developed for anticancer peptide prediction.