Intellegens has applied our proprietary neural network approach to compound-protein bioactivity prediction. The data set contained incomplete empirical measurements of the activity changes of 10,019 proteins against 2,672,814 compounds. The empirical data was originally 0.032% complete, so over 99.9% of the compound-protein activity values were missing. After the neural network was trained on the data set, it was applied to fill in the missing data by predicting the likelihood of a protein activity change for every compound. 20% of the missing values were filled in, representing an increase in the available data of ×625 for the client. In the following two sections we first analyze the distribution of the predicted likelihood of protein activity, and then perform a cross-validation analysis to probe the accuracy of the predictions.
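As a rough check on the figures quoted above, the following Python sketch reproduces the arithmetic behind the ×625 increase. The variable names are illustrative only, and the sketch assumes that "20% of the missing values" is measured against the missing entries, which make up almost the whole matrix.

```python
# Figures quoted in the text.
n_proteins = 10_019
n_compounds = 2_672_814
observed_fraction = 0.032 / 100    # 0.032% of entries were measured
filled_share_of_missing = 20 / 100 # assumption: 20% of the *missing* entries

total_entries = n_proteins * n_compounds
missing_fraction = 1 - observed_fraction

# Filled entries expressed as a fraction of the full matrix.
filled_fraction = filled_share_of_missing * missing_fraction

# Ratio of newly filled entries to originally observed entries.
increase_factor = filled_fraction / observed_fraction

print(f"total compound-protein entries: {total_entries:,}")
print(f"fraction missing: {missing_fraction:.3%}")
print(f"increase in available data: about x{increase_factor:.0f}")
```

Under this reading the ratio comes out within a fraction of a percent of 625, because the missing entries are so close to 100% of the matrix.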
The neural network is probabilistic, so we can select a level of reliability for the predicted points: accepting a lower level of reliability allows more points to be filled in. In this section we analyze the distribution of the predicted likelihood of protein activity in the completed data set. The following cumulative distribution function shows the fraction of compound-protein pairs whose likelihood of activity exceeds a given value:
This shows that if the likelihood threshold for activity is set to 1 then fewer than 1% of samples will be selected as active, whereas if the threshold is set to 0.5 then 65% of samples will be put forward as active. The bias toward activity reflects the training data, which contained 71% active and just 29% inactive proteins, and this also gives the cumulative fraction curve its convex shape.
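The thresholding described here can be sketched in a few lines of Python. This is a minimal illustration, not the Intellegens implementation: the function name and the example likelihoods are hypothetical, and a sample is counted as selected when its likelihood is at or above the threshold.

```python
def fraction_at_or_above(likelihoods, threshold):
    """Fraction of predictions whose likelihood of activity is >= threshold."""
    if not likelihoods:
        raise ValueError("likelihoods must be non-empty")
    return sum(p >= threshold for p in likelihoods) / len(likelihoods)


# Hypothetical predicted likelihoods, for illustration only.
example_likelihoods = [0.05, 0.20, 0.35, 0.55, 0.60, 0.70, 0.80, 0.90, 0.95, 1.00]

# Raising the threshold selects fewer samples as active.
for threshold in (0.5, 0.9, 1.0):
    selected = fraction_at_or_above(example_likelihoods, threshold)
    print(f"threshold {threshold:.1f}: {selected:.0%} selected as active")
```

Evaluating this function over a sweep of thresholds gives the kind of cumulative fraction curve discussed above.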
To validate the accuracy of the results we performed a 4-fold cross-validation test. The data set was split into four; each quarter in turn was withheld for validation, and the other three quarters were used to train the neural network. The accuracy of the neural network was then assessed by comparing its predictions against the unseen data. The procedure was repeated for each quarter of the data to give four sets of accuracy estimates. The predictions are ordered by likelihood of protein activity/inactivity, that is, by how close their values are to either 1 or 0. The fraction of the data (set by the x-axis of the graph below) that is most likely to be active/inactive is taken and compared to the original data set. In the graph below we show the fraction of the activity entries present in the unseen data that were correctly predicted. Results are averaged over the four cross-validation runs. The line shows the mean fraction predicted and the light blue lines show the standard deviation estimated from the variance across the four cross-validation runs.
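The splitting step of this procedure can be sketched as follows. This is a generic k-fold split written with the Python standard library, not the code used in the study; the function name, the random seed, and the toy entries are all hypothetical.

```python
import random


def k_fold_splits(entries, k=4, seed=0):
    """Yield (train, held_out) pairs so that each entry is held out exactly once."""
    shuffled = list(entries)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled entries into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        yield train, held_out


# Hypothetical observed entries: (compound_id, protein_id, active_flag).
observed = [(c, p, (c + p) % 2) for c in range(6) for p in range(4)]

for train, held_out in k_fold_splits(observed, k=4):
    # A model would be trained on `train` and scored on `held_out` here.
    print(f"train on {len(train)} entries, validate on {len(held_out)}")
```

Only the measured entries are split, since the missing entries have no ground truth to validate against.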
The left-hand side of the graph shows that if the tolerance for accepting a point as active/inactive is set low, so that just 0.005 of the points are completed, then all of the points predicted to be active/inactive match the original data set. As the tolerance is increased until all data is binned as either active or inactive, the fraction of correctly completed points falls to 75%, as some of the points predicted to be active are in fact inactive and vice versa. The accuracy determined by this testing procedure should carry across directly to the unknown data. This predicts that if the data set is completed to 1% the predictions should be 88% accurate, and if it is completed to 10% they should be 80% accurate. The uncertainty is ~±3%, showing that the results from the four cross-validation runs are consistent, which gives confidence in the accuracy of the predictions.
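The trade-off between the fraction of points completed and their accuracy can be illustrated with a short Python sketch. It assumes, as one plausible reading of the procedure, that confidence is measured by distance from 0.5 and that a point is called active when its likelihood is at least 0.5. The function name and the toy data are hypothetical and are not the study's results.

```python
def accuracy_at_fraction(likelihoods, truths, fraction):
    """Accuracy over the most confident `fraction` of held-out predictions."""
    # Most confident first: values nearest to 1 or 0 are farthest from 0.5.
    ranked = sorted(zip(likelihoods, truths),
                    key=lambda pair: abs(pair[0] - 0.5),
                    reverse=True)
    n_keep = max(1, round(fraction * len(ranked)))
    kept = ranked[:n_keep]
    # A prediction is correct when its thresholded label matches the truth.
    correct = sum((p >= 0.5) == bool(t) for p, t in kept)
    return correct / n_keep


# Hypothetical held-out predictions and their measured labels (1 = active).
likelihoods = [0.99, 0.02, 0.90, 0.15, 0.70, 0.40, 0.55, 0.48]
truths = [1, 0, 1, 0, 1, 1, 0, 0]

for fraction in (0.25, 0.5, 1.0):
    acc = accuracy_at_fraction(likelihoods, truths, fraction)
    print(f"complete {fraction:.0%} of points: {acc:.0%} correct")
```

In a full analysis this function would be evaluated on each of the four held-out quarters, with the mean and standard deviation across the four runs giving the line and the error band.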