FluLeap provides host tropism prediction for 11 influenza proteins: the glycoproteins HA and NA; the nucleoprotein NP; the matrix proteins M1 and M2; the non-structural proteins NS1 and NS2; and the remaining viral polymerase proteins PA, PB1, PB1-F2, and PB2. It also provides host tropism prediction for influenza virus strains. The prediction models were built with the Random Forest machine learning algorithm, trained on influenza protein sequences isolated from avian and human samples.
Influenza protein sequences were obtained from the Influenza Research Database (IRD). The dataset consisted of sequences isolated from avian and human samples. Each protein sequence was then transformed into a feature vector based on its amino acid composition and physicochemical properties. The total numbers of positive and negative samples for each prediction model can be accessed here.
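As an illustration of the sequence-to-vector step, the sketch below computes the amino acid composition of a sequence as a 20-element vector of residue fractions. It covers only the composition part. The physicochemical descriptors used by FluLeap are not specified here, so they are left out. The function name `amino_acid_composition` and the toy sequence are hypothetical and are not taken from the FluLeap code.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence: str) -> list[float]:
    """Return the fraction of each standard amino acid in `sequence`.

    Hypothetical helper for illustration; non-standard symbols (e.g. 'X')
    are ignored rather than counted.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        # Empty or fully ambiguous sequence: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Toy sequence for demonstration only, not a real influenza protein.
vector = amino_acid_composition("MKAILVVLLYTFATANADTL")
print(len(vector))
```

Because the fractions are normalized by sequence length, sequences of different lengths map to vectors of the same size, which is what a classifier such as Random Forest requires.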
A Random Forest model was then trained on the dataset for each influenza protein, with optimized parameters including the number of trees and the number of features. All prediction models were trained using 10-fold cross-validation to guard against overfitting. Further information about Random Forest can be found here.
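The partitioning behind 10-fold cross-validation can be sketched as follows: the samples are divided into 10 folds, and each fold serves once as the held-out set while the other nine are used for training. This sketch assumes stratified folds (each fold keeps roughly the same avian-to-human ratio), which is not stated above. The function name `stratified_k_folds` and the toy labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_k_folds(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is held out exactly once.

    Stratification is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal indices round-robin so every fold keeps a similar class ratio.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Toy labels for demonstration only.
labels = ["avian"] * 60 + ["human"] * 40
splits = list(stratified_k_folds(labels, k=10))
print(len(splits))
```

In practice a library would usually handle both steps. In scikit-learn, for example, `StratifiedKFold` produces such splits, and `RandomForestClassifier` exposes `n_estimators` and `max_features`, which correspond to the number of trees and number of features mentioned above. Whether FluLeap used that library is not stated.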
To construct the prediction model for virus strains, information from the protein sequences of all influenza proteins was combined. Feature selection was performed to select the relevant features for each protein, and the selected features were then combined into a single dataset for training the prediction model.
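One way to picture the combination step is a concatenation: for each protein, keep only the selected feature positions, then join the results in a fixed protein order to form one strain-level vector. The sketch below assumes the selected indices are already known. The feature selection method itself is not described above, and the names `build_strain_vector`, `protein_features`, and `selected_indices` are hypothetical.

```python
# Fixed protein order so every strain vector has the same layout.
PROTEINS = ["HA", "NA", "NP", "M1", "M2", "NS1", "NS2", "PA", "PB1", "PB1-F2", "PB2"]


def build_strain_vector(protein_features, selected_indices):
    """Concatenate the selected features of each protein into one vector.

    protein_features: dict mapping protein name -> full feature vector (list).
    selected_indices: dict mapping protein name -> indices kept by feature
                      selection (assumed to be precomputed).
    """
    vector = []
    for protein in PROTEINS:
        features = protein_features[protein]
        vector.extend(features[i] for i in selected_indices[protein])
    return vector


# Toy data for demonstration only: 5 features per protein, keep positions 0 and 3.
toy_features = {p: [float(n + i) for i in range(5)] for n, p in enumerate(PROTEINS)}
toy_selected = {p: [0, 3] for p in PROTEINS}
strain_vector = build_strain_vector(toy_features, toy_selected)
print(len(strain_vector))
```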
Performance of prediction models
The prediction models were first evaluated by their performance in 10-fold cross-validation training. All models achieved high predictive performance. Among the protein prediction models, the NS2 model had the lowest accuracy at 96.57% (AUC=0.980; MCC=0.916), while the HA model had the highest at 98.62% (AUC=0.998; MCC=0.972). The combined prediction model for strain host tropism prediction achieved 99.72% accuracy (AUC=0.999; MCC=0.994).
The performance of the prediction models was further validated independently with separate testing datasets containing sequences not used in the training process. All prediction models performed similarly well, with the lowest accuracy achieved by the M2 model at 97.09% (AUC=0.993; MCC=0.939) and the highest by the HA model at 98.78% (AUC=0.997; MCC=0.976). The combined prediction model was also validated with a separate independent testing dataset, achieving 99.83% accuracy (AUC=0.998; MCC=0.997). These results support the models' ability to predict novel sequences accurately.
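For readers unfamiliar with the reported metrics, accuracy and the Matthews correlation coefficient (MCC) can both be computed from the four cells of a binary confusion matrix, as sketched below. The counts are made up for illustration and are not FluLeap results. AUC is omitted because it requires the models' continuous prediction scores rather than a confusion matrix.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to 1.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Made-up counts for demonstration only.
tp, tn, fp, fn = 95, 90, 10, 5
print(f"accuracy = {accuracy(tp, tn, fp, fn):.3f}")
print(f"MCC      = {mcc(tp, tn, fp, fn):.3f}")
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes differ in size.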
Further details on the performance of the prediction models can be found here.