Biomarkers are biological molecules that may be monitored with minimal invasiveness and whose presence or level provides valuable information about a disease process or its treatment. The term “biomarker” covers a broad range of biological phenomena, from blood pressure to single nucleotide polymorphisms to circulating T3 levels. Recently, genes and microRNAs have emerged as important biomarkers for diagnosing diseases or predicting clinical outcomes. Advances in high-throughput, multiplexed, massively parallel sequencing technologies have substantially lowered the barrier to identifying genes or microRNAs whose expression levels can be easily monitored and may be linked to disease. Leveraging the large data sets generated by massively parallel sequencing to accelerate drug discovery, advance disease understanding, and improve patient outcomes requires robust and reproducible statistical methods to select the most promising predictive biomarkers.

Biomarkers are used clinically and in the drug-discovery process for diverse purposes. They can be used in patient stratification, to select patients who are likely to respond to a drug, increasing the likelihood of clinical trial success. Biomarkers can serve as surrogates for drug efficacy, which could reduce clinical trial duration and cost. They can help differentiate several candidate drugs within a related structural class. Biomarkers can serve as indicators of toxicity, to exclude certain patient groups from clinical trials or drug therapy. They are also useful in patient screening for early detection of a disease, in clinical trial recruitment, and in prognostic evaluation.

Ocean Ridge Biosciences provides customized predictive modeling packages designed to best achieve your research goals. Standard statistical models are described below. For more information about analyses that are best suited for your biomarker discovery program, contact us.

Statistical Models for Biomarker Selection

A number of models and algorithms can be used to search for predictive biomarkers. Three popular methods are logistic regression, random forests, and support vector machines.

Logistic Regression

Logistic regression models the probability of an outcome (e.g. drug response) given a set of predictive markers. The probability of the outcome is equal to the "logistic" function of a weighted sum of the marker values, which in this case might be the expression levels of given genes or microRNAs.
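As a minimal sketch of the logistic function for a single marker, the Python snippet below maps an expression level to a response probability. The intercept and slope are hypothetical values chosen only for illustration; in practice they would be estimated from your data.

```python
import math

def response_probability(expression, intercept, slope):
    """Logistic function of a linear predictor for one marker."""
    linear_predictor = intercept + slope * expression
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical coefficients (not fitted to any real data set).
INTERCEPT = -4.0
SLOPE = 0.8

for level in (2.0, 5.0, 8.0):
    p = response_probability(level, INTERCEPT, SLOPE)
    print(f"expression={level:.1f} -> P(response)={p:.3f}")
```

With a positive slope, higher expression gives a higher predicted probability of response; the probability is 0.5 where the linear predictor equals zero.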

Random Forest

Random forest models combine many decision trees, each of which repeatedly splits or stratifies the biomarker levels into a number of simple regions. At each split, a random subset of the biomarkers is searched to find the biomarker and cutoff value that best split the data into two groups.
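The split search can be sketched in a few lines of Python. This is an illustration of a single split only, not a full random forest: it assumes binary 0/1 outcomes and uses Gini impurity as the measure of how well a split separates the groups (other criteria are possible). The function names and the toy data are hypothetical.

```python
import random

def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)

def best_split(rows, labels, n_candidates, rng):
    """Search a random subset of biomarkers for the best cutoff value.

    rows: list of samples, each a list of biomarker levels.
    Returns (biomarker_index, threshold, weighted_impurity).
    """
    n_markers = len(rows[0])
    candidates = rng.sample(range(n_markers), n_candidates)
    best = None
    for j in candidates:
        values = sorted(set(row[j] for row in rows))
        # Candidate cutoffs lie midway between adjacent observed levels.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2.0
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            impurity = (len(left) * gini(left)
                        + len(right) * gini(right)) / len(labels)
            if best is None or impurity < best[2]:
                best = (j, threshold, impurity)
    return best

# Toy data: 3 hypothetical biomarkers; only the one at index 1 tracks response.
rows = [[5.1, 1.0, 7.2], [4.9, 1.2, 3.3], [5.0, 0.9, 6.8],
        [5.2, 3.1, 3.9], [4.8, 3.4, 7.0], [5.0, 2.9, 3.5]]
labels = [0, 0, 0, 1, 1, 1]

# Each split considers only 2 of the 3 biomarkers, chosen at random.
print(best_split(rows, labels, n_candidates=2, rng=random.Random(0)))
```

Restricting each split to a random subset of biomarkers makes the individual trees differ from one another, which is what lets the forest average over them.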

Support Vector Machine

In a support vector machine, a hyperplane (e.g. a line if there are 2 biomarkers, or a plane if there are 3 biomarkers) is constructed that divides the outcomes based on the biomarker levels.

During fitting of these models, the parameters of the hyperplane (e.g. β1, β2, ..., βp) are found to best separate the outcomes.
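Once the hyperplane parameters are known, classifying a new sample only requires checking which side of the hyperplane it falls on. The sketch below assumes a linear hyperplane with an intercept term (called beta0 here, an assumption not named in the text) and uses made-up parameter values rather than fitted ones.

```python
def svm_predict(levels, betas, beta0):
    """Classify one sample by its side of the hyperplane.

    Returns (label, score), where label is +1 or -1 and score is the
    signed value beta0 + beta1*x1 + ... + betap*xp.
    """
    score = beta0 + sum(b * x for b, x in zip(betas, levels))
    label = 1 if score >= 0 else -1
    return label, score

# Hypothetical parameters for 2 biomarkers (the hyperplane is a line).
BETAS = [1.0, -1.0]
BETA0 = -0.5

for sample in ([2.0, 1.0], [1.0, 2.0]):
    label, score = svm_predict(sample, BETAS, BETA0)
    print(sample, "->", label, f"(score {score:+.2f})")
```

The fitting step, which chooses the parameters so that the margin between the two outcome groups is as wide as possible, is not shown here.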

Feature Selection

One of the main analytical challenges in identifying predictive biomarkers is the large number of potential genes/microRNAs which could be examined. As with the statistical models, several methods have been used to narrow down the list of potential biomarkers. These approaches usually combine content knowledge of the biological phenomenon with data-driven methods, such as forward stepwise selection, backward stepwise selection, and lasso regression.

Forward Stepwise Selection

In forward stepwise selection, biomarkers are added to the model one at a time. At each step, the biomarker whose addition gives the best model is kept. A final model is then selected from among the “best” models of all steps.
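A minimal sketch of the forward procedure is given below. It assumes you supply a scoring function (for example, cross-validated accuracy of a model fitted on the chosen markers); the toy scoring function, marker names, and gain values here are invented purely so the example runs.

```python
def forward_stepwise(candidates, score):
    """Greedy forward selection.

    candidates: iterable of biomarker names.
    score: function mapping a list of names to a number (higher is better).
    Returns the best-scoring marker list among the per-step winners.
    """
    selected, remaining = [], list(candidates)
    step_winners = []
    while remaining:
        # Try adding each remaining marker and keep the best addition.
        best_marker = max(remaining, key=lambda m: score(selected + [m]))
        selected = selected + [best_marker]
        remaining.remove(best_marker)
        step_winners.append((score(selected), list(selected)))
    # Final model: the best among the "best" models from each step.
    return max(step_winners, key=lambda pair: pair[0])[1]

# Hypothetical per-marker gains; each extra marker costs a small penalty.
GAIN = {"gene_A": 0.20, "gene_B": 0.15, "miR_C": 0.01, "gene_D": 0.00}

def toy_score(markers):
    return 0.5 + sum(GAIN[m] for m in markers) - 0.02 * len(markers)

print(forward_stepwise(GAIN, toy_score))
```

In this toy setting the two informative markers are kept and the two uninformative ones are left out, because adding them lowers the score.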

Backward Stepwise Selection or Recursive Feature Elimination

In contrast to forward stepwise selection, backward stepwise selection starts with all biomarkers (or all differentially expressed biomarkers) in the model and removes them one at a time. At each step, the least important biomarker is removed. The model with the best performance, and its associated biomarkers, is chosen.
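The backward procedure can be sketched the same way. Here the "least important" biomarker is taken to be the one whose removal leaves the best-scoring model, which is one of several reasonable definitions; the scoring function and marker names are hypothetical.

```python
def backward_stepwise(markers, score):
    """Greedy backward elimination.

    markers: iterable of biomarker names to start from.
    score: function mapping a list of names to a number (higher is better).
    Returns the best-scoring marker list seen at any step.
    """
    current = list(markers)
    history = [(score(current), list(current))]
    while len(current) > 1:
        # Least important marker: removing it leaves the best score.
        least_important = max(
            current, key=lambda m: score([k for k in current if k != m]))
        current.remove(least_important)
        history.append((score(current), list(current)))
    return max(history, key=lambda pair: pair[0])[1]

# Hypothetical per-marker gains; each extra marker costs a small penalty.
GAIN = {"gene_A": 0.20, "gene_B": 0.15, "miR_C": 0.01, "gene_D": 0.00}

def toy_score(markers):
    return 0.5 + sum(GAIN[m] for m in markers) - 0.02 * len(markers)

print(backward_stepwise(GAIN, toy_score))
```

Because every step starts from a model containing many biomarkers, backward elimination generally needs more samples than biomarkers for the full model to be fitted reliably, which is one reason to start from differentially expressed biomarkers only.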

Lasso Regression

Lasso regression works differently from stepwise selection in that the selection of important biomarkers is built into the model. In this regression, not only is the error of the model minimized, but so is the sum of the absolute values of the coefficients for each biomarker. Thus, the least important biomarkers will have coefficients of zero or near zero, and can be excluded from the model.
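To make the penalty concrete, the sketch below evaluates a lasso objective for two candidate sets of coefficients. It assumes squared error as the error term and a tuning parameter (called lam here) that sets the strength of the penalty; the data and coefficient values are invented for illustration.

```python
def lasso_objective(X, y, beta, intercept, lam):
    """Squared-error loss plus lam times the sum of |coefficients|."""
    n = len(y)
    squared_error = 0.0
    for row, target in zip(X, y):
        prediction = intercept + sum(b * x for b, x in zip(beta, row))
        squared_error += (target - prediction) ** 2
    penalty = lam * sum(abs(b) for b in beta)
    return squared_error / (2 * n) + penalty

# Toy data: the outcome depends only on the first of two biomarkers.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 0.0, 1.0]

sparse = lasso_objective(X, y, beta=[1.0, 0.0], intercept=0.0, lam=0.1)
dense = lasso_objective(X, y, beta=[0.9, 0.1], intercept=0.0, lam=0.1)
print(f"sparse objective={sparse:.4f}, dense objective={dense:.4f}")
```

Fitting a lasso model means searching for the coefficients that minimize this objective; larger values of the tuning parameter push more coefficients to zero.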

Measuring Model Performance

When classifying binary outcomes such as drug response, each prediction from a model has one of four possible outcomes: true positive (TP), in which the patient response is positive and the model predicts a positive response; true negative (TN), in which the patient response is negative and the model predicts a negative response; false positive (FP), in which the model predicts a positive response but the patient response is actually negative; and false negative (FN), in which the patient has a positive response but the model predicts a negative response. These possibilities are shown in the table below.

Prediction    Actual Response
              R                 PD
R             True positive     False positive
PD            False negative    True negative


Sensitivity, Specificity, and Accuracy

A model's performance can be quantified by sensitivity, specificity, and accuracy.

Sensitivity = TP / (TP + FN)
Specificity = TN / (TN + FP)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
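As a worked example, the Python sketch below counts the four outcomes from paired actual and predicted responses and then computes the three measures. It assumes responses are coded 1 for positive and 0 for negative; the ten patients are made up.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP, FN for binary labels (1 = positive response)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)
    return tp, tn, fp, fn

def performance(tp, tn, fp, fn):
    """Sensitivity, specificity, and accuracy from the four counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical outcomes for 10 patients.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

counts = confusion_counts(actual, predicted)
print("TP, TN, FP, FN =", counts)
print(performance(*counts))
```

Here 3 of the 4 true responders are identified (sensitivity 0.75), 4 of the 6 non-responders are identified (specificity about 0.67), and 7 of the 10 predictions are correct (accuracy 0.7).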

ROC Curves

Receiver operating characteristic (ROC) curves plot the true positive rate (or sensitivity) against the false positive rate, which is 1 - specificity, across the range of possible classification thresholds. An ideal model will have a true positive rate of 1 and a false positive rate of 0. In this case the area under the ROC curve will be 1. A predictive model that does no better than guessing will have an area under the curve of 0.5.
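The following sketch shows one way to trace an ROC curve and compute the area under it. It assumes each patient has a model score (higher meaning more likely positive), that outcomes are coded 1 or 0 with both classes present, and it uses the trapezoidal rule for the area; the scores are invented.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Lower the threshold through each distinct score, highest first.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels)
                 if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels)
                 if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical scores: one model separates the groups, one does not fully.
labels = [1, 1, 0, 0]
perfect_scores = [0.9, 0.8, 0.3, 0.1]
imperfect_scores = [0.9, 0.4, 0.6, 0.1]

print("perfect AUC:  ", auc(roc_points(perfect_scores, labels)))
print("imperfect AUC:", auc(roc_points(imperfect_scores, labels)))
```

The area under the curve can also be read as the probability that a randomly chosen positive patient receives a higher score than a randomly chosen negative patient.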

Receiver Operating Characteristic Curve for Subarachnoid Hemorrhage Outcome Based on S100B

Advance your research and biomarker discovery initiatives with predictive modeling: contact us for a complimentary consultation!