RS is one of the approaches 12 based on multitask learning and allows one to realize multitask prediction, occasionally referred to as multitarget prediction. RS methods are classified, based on the information used for model creation, into collaborative filtering (CF), content-based filtering (CBF), hybrid approaches, and others.
Collaborative filtering (CF) 16, 21, 22 is one of the most common RS methods, popularized during the Netflix competition due to its simplicity. If users have similar preferences, then they have similar profiles and vice versa.
A CF RS model recommends new content to a user based on the evaluations made by other users with similar preference profiles. In the drug discovery context, CF methods may rely on the similarity between compound or target interaction profiles to predict interaction values and select compound-target pairs with higher interaction scores.
CF methods are the easiest to implement but have several limitations. The first limitation is the cold-start (CS) problem: a CF model cannot make a prediction for a new compound or target that has no known interaction values. The second limitation is the sparsity problem: the fewer the interaction values known, the more complicated it is to calculate similarity. The third limitation is scalability: the computational and memory complexities of CF algorithms are generally quadratic.
Content-based filtering (CBF) RS methods are more advanced and allow one to predict interaction values based on additional feature information, also called side-channel information, which characterizes both compounds and targets. In drug discovery, CBF may employ similarity based on features, or descriptors, of compounds or targets.
Feature information allows one to overcome the disadvantages inherent to CF methods, namely, the inability to predict for new compounds or targets and the poor handling of very sparse data matrices. Also, a valuable advantage of CBF algorithms is the possibility of interpreting the model by analysis of important features. The disadvantages of CBF include the risk of overfitting and the need for feature calculation, which may be complicated in the case of target characterization.
Among the rapidly growing number of multitask prediction applications in drug discovery, only a couple of dozen studies regarded their approaches as RS. For example, the RS approaches were used for automatic detection of omissions in medication lists 42, 43 as well as for treatment optimization in the context of the information overload problem, by suggesting knowledge-based items of interest to clinicians for specific diseases.
The search for new antivirals is an attractive field for the application of the RS approach. It is rather different from other medicinal chemistry fields because the majority of primary antiviral activity data are obtained from phenotypic antiviral assays, usually cell-based. Thus, the search for broad-spectrum antivirals or antivirals against less-studied viruses reduces to the application of common molecules or privileged classes, such as nucleosides, as could be seen during the current coronavirus disease pandemic.
In this paper, we present an attempt to apply RS approaches in the antiviral drug discovery context. To address these challenges, we developed scenarios for prediction of new point interactions between compounds and viruses that were used for model building, and for prediction of interaction profiles for new compounds or viruses that were not used for model building. We used the CF implementations of the Surprise Python package. Model hyperparameters are given in Supporting Information Table S1.
The SGIMC method is based on the inductive matrix completion (IMC) method and allows one to filter out noninformative features. In our case, the first feature matrix contains descriptors of the compounds and the second contains virus features (here, taxa); n_1 and n_2 are the numbers of compounds and species, and d_1 and d_2 are the numbers of their features, respectively. Then, the penalized minimization problem is solved. The SGIMC algorithm shares the idea of the IMC approach of matrix completion by combining feature vectors associated with rows and columns of an interaction matrix with a low-rank matrix.
The method differs by application of the sparse-group penalty for selection of side features, in addition to the classic ridge and lasso regularizations used in IMC. The algorithm relies on single penalty functions or their combinations, selected by setting the proper regularization coefficients C_lasso, C_ridge, and C_group. We investigated the influence of the regularization coefficients, the rank of the low-rank matrix W, and the number of training iterations on the predictive ability of RS in the case of antiviral activity data.
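The bilinear form behind IMC-type models can be illustrated with a minimal sketch. It assumes that the low-rank matrix W (d_1 by d_2) is stored as a product of two thin factors and that the score of a compound-virus pair is the product of the compound feature vector, W, and the virus feature vector. The names `imc_scores`, `A`, and `B` are hypothetical, the data are random placeholders, and the loss and penalties of the actual SGIMC implementation are not reproduced here.

```python
import numpy as np

def imc_scores(X, Y, A, B):
    """Score every compound-virus pair as x_i^T W y_j with W = A @ B.T.

    X: (n1, d1) compound features; Y: (n2, d2) virus features.
    A: (d1, rank) and B: (d2, rank) are hypothetical low-rank factors of W.
    """
    W = A @ B.T          # (d1, d2) matrix of rank at most `rank`
    return X @ W @ Y.T   # (n1, n2) matrix of interaction scores

# Random placeholder data, used only to check shapes.
rng = np.random.default_rng(0)
n1, d1, n2, d2, rank = 5, 4, 3, 2, 2
X = rng.normal(size=(n1, d1))
Y = rng.normal(size=(n2, d2))
A = rng.normal(size=(d1, rank))
B = rng.normal(size=(d2, rank))

scores = imc_scores(X, Y, A, B)
print(scores.shape)  # (5, 3)
```

Because the score depends only on feature vectors, a row of `X` for an unseen compound yields a prediction without any known interaction values, which is what makes this family of models usable for new compounds or viruses.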
The ranges of the investigated hyperparameter values are provided in Supporting Information Table S1. We used ViralChEMBL 45 as the source of information about compound-virus interactions to create data sets for cross-validation and model training.
The descriptive statistics of the data sets are given in Table 1. Two-dimensional descriptors of chemical structures were calculated with Dragon 7. The selected features were standardized with the StandardScaler class of the sklearn. For an additional cross-validation test, we reduced the number of features to investigate their influence on the predictivity. It was not expected that such features should contain sufficient predictive information. The utilization of viral feature matrices was mandatory in the SGIMC method, while the models were designed with mostly compound feature-based prediction in mind.
We assessed the performance of the methods in all three scenarios, where possible. Addressed challenges: (a) prediction of point compound-virus interactions, (b) compoundwise CS prediction, and (c) specieswise CS prediction. Matrix of interactions, green; matrix of species features, pink; matrix of compound features, yellow; and unknown compound-virus interactions, white.
CF algorithms process only the interaction matrix and thereby cannot perform CS predictions. Model selection was performed based on a grid search of hyperparameters (Supporting Information Table S1) and cross-validation. We used sklearn. Model building and evaluation were performed by excluding the activity profiles of each species one by one from the model building and applying them as external test sets.
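The species-by-species exclusion described above can be sketched as a simple generator. This is an illustration only: it assumes that the interaction matrix is a float array with compounds in rows, species in columns, and NaN for unknown values, and the function name `leave_one_species_out` is hypothetical.

```python
import numpy as np

def leave_one_species_out(M):
    """Yield (species index, training matrix, held-out profile) triples.

    The whole activity profile (column) of one species is hidden from the
    training matrix and returned separately as an external test set.
    """
    for j in range(M.shape[1]):
        train = M.copy()
        test = train[:, j].copy()
        train[:, j] = np.nan  # hide every known value for species j
        yield j, train, test

# Toy matrix: 1 = active, 0 = inactive, NaN = not measured.
M = np.array([[1.0, 0.0, np.nan],
              [np.nan, 1.0, 0.0],
              [0.0, np.nan, 1.0]])

splits = list(leave_one_species_out(M))
print(len(splits))  # 3
```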
Due to a small number of interaction values for the majority of species, the assessment of their prediction power could not be accurate. We carried out an additional test to investigate the influence of the number of features on the predictive power of the SGIMC algorithm. Varied hyperparameters and their values are shown in Supporting Information Table S1. We used the receiver operating characteristic area under the curve (ROC AUC) score and two metrics based on it to assess prediction quality.
Standard deviation (SD) was calculated by numpy. The mean and median ROC AUC scores were used to demonstrate the difference in prediction quality for the separate viral species. For the comparison of models, we used the median ROC AUC as the main measure because it is not skewed by extremely large or small values and therefore better describes the real prediction quality. The robustness of the models was assessed by y-scrambling.
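The specieswise summary of ROC AUC scores can be sketched as follows. The sketch assumes that true labels are stored in a matrix with NaN for unknown values and that one ROC AUC is computed per species column. The helper names are hypothetical, and the pairwise ROC AUC below is a plain NumPy stand-in rather than the routine used in the study.

```python
import numpy as np

def roc_auc(y_true, y_score):
    """ROC AUC as the fraction of (active, inactive) pairs ranked correctly.

    Ties count as half a correct pair. Returns NaN when only one class is present.
    """
    y_true = np.asarray(y_true, dtype=bool)
    y_score = np.asarray(y_score, dtype=float)
    pos, neg = y_score[y_true], y_score[~y_true]
    if len(pos) == 0 or len(neg) == 0:
        return np.nan
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def per_species_auc(M_true, M_pred):
    """One ROC AUC per species (column), using only the known cells."""
    aucs = []
    for j in range(M_true.shape[1]):
        known = ~np.isnan(M_true[:, j])
        aucs.append(roc_auc(M_true[known, j] == 1, M_pred[known, j]))
    return np.array(aucs)

# Toy labels and scores, for illustration only.
M_true = np.array([[1, 0], [0, 1], [1, np.nan], [0, 0]], dtype=float)
M_pred = np.array([[0.9, 0.2], [0.1, 0.7], [0.8, 0.5], [0.3, 0.8]])

aucs = per_species_auc(M_true, M_pred)
mean_auc = np.nanmean(aucs)      # sensitive to a few extreme species
median_auc = np.nanmedian(aucs)  # robust to extreme values
sd_auc = np.nanstd(aucs)
```

Species with a single class among their known values yield NaN and are ignored by the NaN-aware summaries.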
The y-scrambling was performed for the 10 best models in each scenario according to the median ROC AUC score and was applied for both cross-validation and external validation (Supporting Information Tables S2-S9). To assess the applicability of constructed models, we compared training and test data sets based on the similarity distance between their compounds. The distance between each pair of compounds in the training and test data sets was computed based on their feature values.
The cosine distance was calculated with the spatial. The similarity between the training and test sets was assessed based on the distribution of distances between every i-th compound in a test set and all of the compounds in the training set (DIST_i), calculated according to the equation.
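The equation for DIST_i is not reproduced in this text, so the following is only a sketch of the underlying distance calculation: cosine distances between every test compound and every training compound, computed from feature vectors. The function name is hypothetical, feature rows are assumed to be nonzero, and the row mean shown at the end is an illustrative summary, not necessarily the definition of DIST_i used in the study.

```python
import numpy as np

def cosine_distance_matrix(A, B):
    """Cosine distance (1 - cosine similarity) between rows of A and rows of B.

    Assumes that no row is an all-zero vector.
    """
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return 1.0 - A_unit @ B_unit.T

# Toy feature vectors, for illustration only.
train = np.array([[1.0, 0.0], [0.0, 1.0]])
test = np.array([[1.0, 0.0], [1.0, 1.0]])

dist = cosine_distance_matrix(test, train)  # row i: test compound i vs. all training compounds
mean_dist = dist.mean(axis=1)               # hypothetical per-compound summary
```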
We represented compound-virus interactions as two classes, active and inactive, and encoded them in the interaction matrix as 1 and 0, respectively. In the case of a lack of experimental measurements, the corresponding value was kept empty.
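This encoding can be sketched with a small helper. It assumes that the raw data are available as (compound, virus, activity) records and that an empty value is represented by NaN. The function name and the identifiers in the toy records are hypothetical placeholders, not entries from the data set.

```python
import numpy as np

def build_interaction_matrix(records, compounds, viruses):
    """Encode active as 1, inactive as 0, and unmeasured pairs as NaN.

    records: iterable of (compound, virus, is_active) triples.
    """
    row = {c: i for i, c in enumerate(compounds)}
    col = {v: j for j, v in enumerate(viruses)}
    M = np.full((len(compounds), len(viruses)), np.nan)
    for compound, virus, is_active in records:
        M[row[compound], col[virus]] = 1.0 if is_active else 0.0
    return M

# Placeholder identifiers, for illustration only.
compounds = ["cpd_A", "cpd_B"]
viruses = ["virus_X", "virus_Y"]
records = [("cpd_A", "virus_X", True), ("cpd_B", "virus_Y", False)]

M = build_interaction_matrix(records, compounds, viruses)
```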
To understand the performance and robustness of the RS approaches, we investigated four scenarios. Thus, the ROC AUC scores could be used only to illustrate how precise the prediction was for all of the interaction values in the test set. It was calculated based on all predicted values and did not take into account the specifics of each viral species.
We performed fold cross-validation and optimized hyperparameters of models by grid search. To prove the lack of impact of data set imbalance on the prediction results, the y-scrambling test was performed for the 10 best models under each scenario in both cross-validation and external validation settings (Supporting Information Tables S2-S9).
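A y-scrambling step can be sketched as a permutation of the known labels. The sketch assumes that labels are shuffled among the measured cells only, so that the class balance and the pattern of missing values are preserved. The exact scrambling scheme of the study is not specified here, and the function name is hypothetical.

```python
import numpy as np

def y_scramble(M, rng):
    """Return a copy of M with its known labels randomly permuted.

    NaN cells stay in place, so only the assignment of labels to
    measured compound-virus pairs is randomized.
    """
    scrambled = M.copy()
    known = ~np.isnan(M)
    scrambled[known] = rng.permutation(M[known])
    return scrambled

# Toy matrix: 1 = active, 0 = inactive, NaN = not measured.
M = np.array([[1.0, 0.0, np.nan],
              [np.nan, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

M_scrambled = y_scramble(M, np.random.default_rng(0))
```

A model retrained on `M_scrambled` is expected to score close to random. If it keeps scoring well, the original result may reflect the class imbalance rather than a learned relationship.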
Upon y-randomization, the quality of models decreased, providing compelling evidence of the relevance of our prediction model. Thus, we did not use them for the model assessment. We also did not assess the accuracy of our models because it is easy to get high accuracy even for a poor model on an imbalanced data set. We also evaluated the similarity of data sets by comparing the distance from each compound in the test set to all of the compounds in the training sets.
External test sets were found to consist of compounds that are more distant from the training set compounds compared with the compounds in the training and test sets during cross-validation. We explored three collaborative filtering techniques: k-nearest neighbors, coclustering, and matrix factorization. Dotted lines inside the violins represent the quartiles of the distribution. It should be noted that the cross-validation prediction results in Surprise suffer from the CS problem.
Predicted values for these compound-virus pairs during cross-validation will be equal to the mean of all interactions from their viral species profile. KNNBasic methods are directly derived from the k-nearest-neighbors approach and follow the basic paradigm of chemoinformatics: similar compounds possess similar properties.
In our case, this statement can also be extended as follows: similar compounds interact with similar viruses and similar viruses are inhibited by similar compounds. The similarity is calculated for the interaction profiles of compounds or viruses. The performance of models varies depending on the similarity metric as well as the direction of similarity calculation: virus- or compound-based similarity. Compound-based models demonstrate better predictive power (Table 2). However, the similarity calculation is both the key factor and the bottleneck of this algorithm.
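A compound-based nearest-neighbor prediction on interaction profiles can be sketched as follows. The sketch assumes the mean squared difference (MSD) similarity in the form 1 / (1 + MSD), computed over the cells known in both profiles, and a similarity-weighted average over the k most similar compounds. It is a simplified stand-in, not the code of the Surprise package, and the function names are hypothetical.

```python
import numpy as np

def msd_similarity(a, b):
    """1 / (1 + mean squared difference) over cells known in both profiles."""
    common = ~np.isnan(a) & ~np.isnan(b)
    if not common.any():
        return 0.0
    return float(1.0 / (1.0 + np.mean((a[common] - b[common]) ** 2)))

def knn_predict(M, i, j, k=2):
    """Predict M[i, j] from the k most similar compounds with a known value for virus j."""
    neighbors = [
        (msd_similarity(M[i], M[v]), M[v, j])
        for v in range(M.shape[0])
        if v != i and not np.isnan(M[v, j])
    ]
    neighbors = [(s, r) for s, r in neighbors if s > 0]
    neighbors.sort(key=lambda pair: pair[0], reverse=True)
    top = neighbors[:k]
    if not top:
        return np.nan  # cold start: no usable neighbor
    sims = np.array([s for s, _ in top])
    vals = np.array([r for _, r in top])
    return float(sims @ vals / sims.sum())

# Toy matrix: rows are compounds, columns are viruses, NaN = not measured.
M = np.array([[1.0, 0.0, np.nan],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0],
              [1.0, np.nan, 1.0]])

pred_k2 = knn_predict(M, 0, 2, k=2)
pred_k3 = knn_predict(M, 0, 2, k=3)
```

Every prediction requires similarities between the query profile and all other profiles, which is why storing the full similarity matrix becomes the bottleneck for large data sets.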
Upon an increase in the number of interaction profiles N, the predictive power of the model increased, probably due to the increase of the information capacity of the similarity matrix, but at the same time, the space and time complexity is O(N^2). This limits the applicability of similarity-based CF methods for large data sets. Methods based on coclustering and matrix factorization do not rely on profile similarity; therefore, they do not need large RAM resources no more than 1.
In coclustering, rows and columns of an interaction matrix are simultaneously grouped to compare the profiles and complete the missing values. Both compound- and virus-based msd also perform better than coclustering in the cross-validation median ROC AUCs of 0. Thus, coclustering may be used in place of kNN if the computational resources are limited. The matrix factorization approach solves the problem of matrix completion by finding latent features that determine the internal relationship in data (in our case, between compounds and viruses).
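The idea of latent features can be illustrated with a bare-bones matrix factorization trained by stochastic gradient descent on the known cells only. This is a minimal sketch under stated assumptions (squared error, L2 regularization, no bias terms). The hyperparameter values and function names are hypothetical and do not correspond to the models of the study.

```python
import numpy as np

def factorize(M, rank=2, lr=0.05, reg=0.02, epochs=500, seed=0):
    """Fit M ~ P @ Q.T on the known (non-NaN) cells by SGD."""
    rng = np.random.default_rng(seed)
    P = rng.normal(scale=0.1, size=(M.shape[0], rank))  # latent compound features
    Q = rng.normal(scale=0.1, size=(M.shape[1], rank))  # latent virus features
    known = np.argwhere(~np.isnan(M))
    for _ in range(epochs):
        rng.shuffle(known)  # visit known cells in random order
        for i, j in known:
            err = M[i, j] - P[i] @ Q[j]
            p_old = P[i].copy()
            P[i] += lr * (err * Q[j] - reg * P[i])
            Q[j] += lr * (err * p_old - reg * Q[j])
    return P, Q

def known_mse(M, P, Q):
    """Mean squared error over the known cells."""
    known = ~np.isnan(M)
    return float(np.mean((M - P @ Q.T)[known] ** 2))

# Toy matrix: 1 = active, 0 = inactive, NaN = not measured.
M = np.array([[1.0, 0.0, np.nan],
              [1.0, np.nan, 0.0],
              [0.0, 1.0, 1.0],
              [np.nan, 1.0, 1.0]])

P0, Q0 = factorize(M, epochs=0)  # untrained factors, same initialization
P, Q = factorize(M)
completed = P @ Q.T              # scores for all cells, including unknown ones
```

Only the factor matrices are stored, so memory grows linearly with the number of compounds and viruses rather than quadratically.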
The problem of matrix completion is considered as an optimization procedure using the features of compounds and viruses. The SGIMC algorithm shares the idea of the IMC approach of matrix completion by combining feature vectors, associated with row and column entities of the interaction matrix, with a low-rank matrix. Three matrices are required to train an SGIMC model: a partially filled interaction matrix and full feature matrices for compounds and viruses.
By design, SGIMC has an option for feature selection, which is implemented through a sparsity-inducing penalty and its regularization coefficient C_group. Also, coefficients C_ridge and C_lasso, representing the squared Frobenius norm and the matrix L1-norm penalties, respectively, are involved in regularization.
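The three penalty terms can be made concrete with a short sketch. The ridge and lasso terms follow the description above (squared Frobenius norm and matrix L1-norm). The group term is assumed here to be the sum of row-wise L2 norms, with one row per feature, which is a common form of a sparsity-inducing group penalty. The exact scaling used by SGIMC may differ, and the function name is hypothetical.

```python
import numpy as np

def sparse_group_penalty(W, c_lasso, c_group, c_ridge):
    """Weighted sum of lasso, group, and ridge penalties for a factor matrix W."""
    lasso = np.abs(W).sum()                  # matrix L1-norm
    group = np.linalg.norm(W, axis=1).sum()  # assumed: sum of row-wise L2 norms
    ridge = (W ** 2).sum()                   # squared Frobenius norm
    return float(c_lasso * lasso + c_group * group + c_ridge * ridge)

# Toy factor matrix: the second feature row is already zeroed out.
W = np.array([[3.0, 4.0],
              [0.0, 0.0]])

penalty_equal = sparse_group_penalty(W, 1.0, 1.0, 1.0)    # 7 + 5 + 25
penalty_mixed = sparse_group_penalty(W, 0.1, 2.0, 0.01)   # 0.7 + 10 + 0.25
```

A zeroed row contributes nothing to any of the three terms, so the group term rewards dropping whole features rather than individual coefficients.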
These regularization coefficients were varied along with the rank of the internal low-rank matrix W and the number of training iterations to choose the best SGIMC model. The cold-start problem is a possible lack of performance of a recommender system applied to a new compound or virus, for which there is no experimental data.
In particular, the problem is critical for collaborative filtering methods, based on the interaction matrix only. To tackle this issue, CBF approaches e. We established the hyperparameters for the best SGIMC model for the compoundwise CS prediction based on a cross-validation grid search. Test set efficiency of the model based on these hyperparameter values was assessed by median ROC AUC, which was equal to 0. A substantial decrease in predictive quality for external test sets is a result of differences between their compounds and compounds in the training set.
The prediction was assessed in cross-validation (light blue and coral) and external validation (dark blue, red, and green). For example, the top quartile is more than 0. The decrease of the predictive power in the specieswise CS was apparently caused by the insufficient virus features, represented by the genus assignment only.
We hope that the proper introduction of virus features will improve the results for all scenarios. Error bars represent the SD. The models based on unit vectors were not predictive at all. SGIMC allows one to select features using the C_group penalty coefficient to filter out the noninformative ones. The most significant features are selected by increasing the C_group coefficient: as it increases, the number of selected features decreases.
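The link between a larger group coefficient and fewer selected features can be illustrated with row-wise soft-thresholding, the standard proximal step for a group-lasso penalty. Whether SGIMC uses exactly this update internally is an assumption. The function name, the toy matrix, and the threshold values are hypothetical.

```python
import numpy as np

def group_soft_threshold(W, threshold):
    """Shrink each row of W toward zero; rows with norm <= threshold become zero."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - threshold / np.maximum(norms, 1e-12))
    return W * scale

def count_zeroed_rows(W):
    """Number of features (rows) removed entirely."""
    return int((np.linalg.norm(W, axis=1) == 0).sum())

# Toy factor matrix with row norms 5, 1, and 0.5.
W = np.array([[3.0, 4.0],
              [1.0, 0.0],
              [0.0, 0.5]])

thresholds = [0.2, 0.7, 2.0, 6.0]  # stand-ins for increasing group coefficients
zeroed = [count_zeroed_rows(group_soft_threshold(W, t)) for t in thresholds]
print(zeroed)  # [0, 1, 2, 3]
```

Weak feature rows vanish first, and a sufficiently large threshold removes every row, which mirrors the loss of feature information at high C_group values.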
Continuous and dashed red lines indicate the mean and median ROC AUC, and continuous and dashed blue lines indicate the mean and median number of zeroed features. Shaded areas represent the corresponding standard deviations. It was a result of an application of the C_group coefficient for both compound and species feature selections, i.
The strategy of simultaneous feature selection is sensible when there is a huge number of noisy features, which is not the case in our task, characterized by the insufficiency of species features. It led to a critical loss of feature information and deterioration of model quality.
Separate determination of regularization coefficients for both compound and species feature matrices should be a solution to this problem. Multitask prediction algorithms have been gaining ground rapidly with the appearance of databases storing multitarget data.
The recommender system (RS) as an approach of multitask prediction may be a powerful tool for compound-target interaction prediction. These methods allow one to predict the activity class for all combinations of compounds and targets in a data set and select the best of them for further experimental investigations. However, the current experience in this domain is limited and far from complete. Our experiments demonstrated that RS algorithms based on collaborative and content-based filtering, applied to a sparse matrix of antiviral activity data, can achieve sustainable performance for the antiviral activity class prediction.
Collaborative filtering (CF) methods demonstrate high performance, but they possess several crucial limitations. The models based on the calculation of compound profile similarity demonstrate the best predictive ability among the investigated CF methods, but the application of these methods is challenging due to the requirement of a huge amount of RAM for the similarity calculation and storage.
Improvement of the algorithm by reducing the required RAM during model building would allow the wider use of these methods for data sets with thousands of compounds. The matrix factorization methods lead to models with moderate predictive ability. Their preference over other CF methods is determined by the simplicity of their application. The main disadvantage of all CF methods is a limited applicability domain: we can make a prediction only for compounds or viruses whose interaction profiles were used during model creation.
The application of content-based filtering (CBF) algorithms is preferable because of the possibility of using feature information for compounds and viruses. The main disadvantage of the approach is the requirement of generation and processing of additional feature information, which can be a challenging task in the case of viruses, and may require a lot of computational resources. The SGIMC method allows one to reduce the number of used features to several thousand, which was not possible in our case with only features.
Using this algorithm, we demonstrated that the prediction of antiviral activity for both new and known compounds against known viruses can be performed with rather high accuracy, while prediction of the antiviral activity of known compounds against new viruses was less accurate due to the insufficient characterization of the viruses in our data set. We believe that the development of appropriate virus features could solve the problem; however, it may be a tricky issue by itself.