From a biological point of view, only a small set of genes is related to a given disease. Data from the majority of genes therefore act as background noise, which can mask the effect of that small but important subset. Hence, concentrating on smaller sets of gene expression data gives a better explanation of the role of the informative genes. There is also a major problem, named “multicollinearity”, in a data matrix with highly correlated features. If there is no linear
relationship between the regressors, they are said to be orthogonal. Multicollinearity is the case in multiple regression in which the predictor variables are themselves highly correlated. If the goal is to understand how the various X variables impact Y, then multicollinearity is a serious problem, because the individual effects of correlated predictors cannot be separated reliably. Multicollinearity is a matter of degree, not a matter of presence or absence.[7] The first important step in analyzing microarray data is removing the noninformative genes or, in other words, selecting genes for the classification task. In general, three feature (gene) selection models exist.[8] The first is the filter model, which carries out feature selection and classification in two separate steps. This model selects as effective genes those that have high discriminative ability.
It is independent of the classification or training algorithm and is simple and fast. The second is the wrapper model, which carries out feature selection and classification in one process. This model uses the classifier while selecting the effective genes. In other words,
the wrapper model uses the training algorithm to test the selected gene subset. The accuracy of the wrapper model is higher than that of the filter model. Different wrapper-based methods for selecting appropriate subsets have been presented in the literature. Evolutionary algorithms have been used with the K-nearest neighbor classifier for this aim.[9] Parallel genetic algorithms have been extended by applying adaptive operations.[10] A hybrid of a genetic algorithm and a support vector machine (SVM) has also been used to select a set of genes.[11] Gene selection and classification has been treated as a multiobjective optimization problem[12] in which the number of features and the number of misclassified samples are reduced simultaneously. Finally, in hybrid models, the selection of a set of effective genes is done during the training process by a particular classifier. An example of this model is an SVM with recursive feature elimination. The idea of this method is to eliminate the genes one by one and examine the effect of each elimination on the expected error.[13] The recursive feature elimination algorithm is a backward feature ranking method.
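The backward ranking idea described above can be illustrated with a minimal sketch. It is not the SVM-based procedure of [13]: to stay dependency-free, a nearest-centroid classifier is substituted for the SVM, the leave-one-out error stands in for the expected error, and the criterion is the change in that error rather than the SVM weight magnitudes. The function names (`centroid_predict`, `loo_error`, `backward_ranking`) and the toy expression matrix are hypothetical and serve only as an illustration.

```python
from statistics import mean


def centroid_predict(train_X, train_y, x, features):
    """Label x by the nearest class centroid, using only the given gene indices.

    Assumes every class has at least one training sample.
    """
    best_label, best_dist = None, float("inf")
    for label in sorted(set(train_y)):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        centroid = [mean(r[f] for r in rows) for f in features]
        dist = sum((x[f] - c) ** 2 for f, c in zip(features, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loo_error(X, y, features):
    """Leave-one-out error rate of the classifier restricted to `features`."""
    mistakes = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        if centroid_predict(train_X, train_y, X[i], features) != y[i]:
            mistakes += 1
    return mistakes / len(X)


def backward_ranking(X, y):
    """Eliminate genes one by one; return gene indices from most to least useful."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > 1:
        # Remove the gene whose absence leaves the lowest error
        # (ties are broken in favour of the earliest index).
        drop = min(
            remaining,
            key=lambda f: loo_error(X, y, [g for g in remaining if g != f]),
        )
        remaining.remove(drop)
        eliminated.append(drop)
    # The last surviving gene is ranked first; the first eliminated gene is last.
    return remaining + eliminated[::-1]


# Hypothetical toy data: 6 samples x 3 genes; only gene 0 separates the classes.
X = [
    [1.0, 3.0, 2.0],
    [1.2, 1.0, 4.0],
    [0.8, 2.0, 3.0],
    [5.0, 2.9, 2.1],
    [5.2, 1.1, 3.9],
    [4.8, 2.1, 3.0],
]
y = [0, 0, 0, 1, 1, 1]

ranking = backward_ranking(X, y)
print(ranking)  # gene 0 is eliminated last and therefore ranked first
```

Each elimination step retrains and re-evaluates the classifier once per remaining gene, so the cost grows quickly with the number of genes. This is one reason such procedures are usually applied to a gene set that has already been reduced.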