Cancer Epitope Classification


Manon Boonbangyang
Sarayut Nonsiri


Cancer is a leading cause of death worldwide. In 2020, the World Health Organization (WHO) reported approximately 10 million deaths caused by cancer, a number expected to increase in the coming years. This paper studies the prediction of cancer epitopes, using machine learning to distinguish epitopes on cancer cell surfaces from epitopes on healthy cell surfaces. A comparison of several machine learning algorithms is presented. This work can support training T cells to recognize cancer cells and release enzymes that kill them (targeted therapy). The experimental results show that, on the imbalanced data, the Support Vector Machine (SVM) model built on Dipeptide Composition (DPC) features achieved the best accuracy, 79%, with a sensitivity of 16% and a specificity of 100% on the test dataset. On data balanced with SMOTE, the Random Forest (RF) model built on DPC features achieved the best accuracy, 80%, with a sensitivity of 28% and a specificity of 96% on the same test dataset. In conclusion, SVM and RF models built on DPC features can be employed to predict cancer epitopes on the imbalanced and the balanced datasets, respectively.
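As an illustration of the two quantities the abstract relies on, the sketch below computes a Dipeptide Composition vector for a peptide and derives accuracy, sensitivity, and specificity from binary predictions. It is not the authors' pipeline: the function names are hypothetical, the normalization of each dipeptide count by the number of overlapping residue pairs (L - 1) is an assumed, commonly used definition, and the labels in the usage note are invented purely to show the arithmetic.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
# All 400 ordered residue pairs, in a fixed order (AA, AC, ..., YY).
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(peptide):
    """Return a 400-element list of dipeptide frequencies (hypothetical helper).

    Assumption: each count is divided by the number of overlapping
    residue pairs in the peptide, len(peptide) - 1.
    """
    peptide = peptide.upper()
    total = len(peptide) - 1
    if total <= 0:
        return [0.0] * len(DIPEPTIDES)
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(total):
        pair = peptide[i:i + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [counts[d] / total for d in DIPEPTIDES]


def classification_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity); label 1 = cancer epitope."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    return accuracy, sensitivity, specificity


# Invented labels: one of two positives is found, no negative is misclassified.
features = dipeptide_composition("ACAC")
metrics = classification_metrics([1, 1, 0, 0, 0], [1, 0, 0, 0, 0])
```

With the invented labels above, half of the positives are missed while every negative is classified correctly, so accuracy stays high even though sensitivity is low. The same pattern explains how a model on imbalanced data can report high accuracy and specificity alongside low sensitivity.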

Article Details

How to Cite
M. Boonbangyang and S. Nonsiri, “Cancer Epitope Classification”, JIST, vol. 11, no. 2, pp. 72–83, Dec. 2021.
Academic Article: Soft Computing (Detail in Scope of Journal)

