data:image/s3,"s3://crabby-images/c5c5e/c5c5ef23c03ec7e3d4358896c31dddf21c85d256" alt="Mastering Scientific Computing with R"
Descriptive statistics
A useful tool to evaluate your data before you begin your analysis is the summary()
function, which provides a summary of the non-parametric descriptors of a sample, as follows:
> summary(probeA) Min. 1st Qu. Median Mean 3rd Qu. Max. 4.211 4.645 4.774 4.774 4.892 5.231
You can also get a summary for each column of the matrix using the summary()
function, as follows:
> summary(MLLpartner.mx) #output truncated GSM487973 GSM487972 GSM487971 Min. : 2.112 Min. : 1.805 Min. : 1.994 1st Qu.: 3.412 1st Qu.: 3.410 1st Qu.: 3.411 Median : 4.736 Median : 4.745 Median : 4.731 Mean : 5.342 Mean : 5.346 Mean : 5.355 3rd Qu.: 6.870 3rd Qu.: 6.851 3rd Qu.: 6.933 Max. :14.449 Max. :14.453 Max. :14.406
We can also get this information by calling the individual functions used to determine these parameters, namely the mean
, median
, min
, max
, and quantile
functions, as follows:
> min(probeA) [1] 4.21085 > max(probeA) [1] 5.231199 > mean(probeA) [1] 4.773866 > median(probeA) [1] 4.774236 > quantile(probeA) 0% 25% 50% 75% 100% 4.210850 4.644994 4.774236 4.892259 5.231199
You can also specify which probabilities to use with the probs
argument, as follows:
> quantile(probeA, probs = c(0.1, 0.2, 0.6, 0.9)) 10% 20% 60% 90% 4.375377 4.576501 4.821101 5.118735
To avoid getting a string of numbers in your output, you can specify the number of decimal places to be displayed using the round()
function, as follows:
> round(mean(probeA), 2) [1] 4.77
Suppose we also had information on the response to the drugA
treatment for those 42 patient samples, we could include it to the probeA
and probeB
gene expression levels, as follows:
> df <- data.frame(expr_probeA=probeA, expr_probeB=probeB, drugA_response= factor(rep(c("success", "fail"), 21))) > head(df) expr_probeA expr_probeB drugA_response GSM487973 4.372242 8.176033 success GSM487972 4.293080 8.541016 fail GSM487971 4.210850 8.338475 success GSM487970 4.707231 7.935021 fail GSM487969 4.345101 7.868985 success GSM487968 4.586062 7.909702 fail
Now, we can get a summary for each column by response to the drugA
treatment using the by()
function, as follows:
> by(df, df$drugA_response, summary) df$drugA_response: fail expr_probeA expr_probeB drugA_response Min. :4.293 Min. :6.960 fail :21 1st Qu.:4.687 1st Qu.:7.935 success: 0 Median :4.766 Median :8.245 Mean :4.786 Mean :8.201 3rd Qu.:4.895 3rd Qu.:8.575 Max. :5.218 Max. :8.926 ------------------------------------------------- df$drugA_response: success expr_probeA expr_probeB drugA_response Min. :4.211 Min. :6.652 fail : 0 1st Qu.:4.571 1st Qu.:7.597 success:21 Median :4.776 Median :7.921 Mean :4.762 Mean :7.950 3rd Qu.:4.885 3rd Qu.:8.338 Max. :5.231 Max. :9.033
Data variability
In addition to general information from the mean
, median
, and quantiles
functions, we are often interested in knowing how variable our data points are from each other. The simplest measure of variability is the range, which is the difference between the largest and smallest value. We can obtain the range by subtracting the maximum value from the minimum value, or simply using the range()
function, as shown in the following code:
> max(probeA) - min(probeA) [1] 1.02035 > range(probeA) [1] 4.210850 5.231199
Other measures of variability include the variance and standard deviation. The variance is defined as the average squared deviation of the data values from the mean. More formally, the population variance is defined as:
data:image/s3,"s3://crabby-images/b1a18/b1a18339d1b72ea07911c25a6087b90c77140499" alt="Data variability"
In the preceding formula, N is the size and is the mean of the population given by each data point
. The sample variance is defined as:
data:image/s3,"s3://crabby-images/c6a33/c6a338436f280e487367fd666bf10060e74aa130" alt="Data variability"
In the preceding formula, n is the number of samples from the population, n is less than N and is the sample mean.
In other words, the sum of squares divided by the number of data values (n) for a population and the degrees of freedom (n-1) for a sample. We can find the mean using the mean()
function, as follows:
> mean(probeA) [1] 4.773866
We can calculate the sum of squares in R using the sum()
function, as follows:
> probeA.soq <- sum((probeA-mean(probeA))^2) [1] 2.734039
Now, we can easily calculate the unbiased sample variance by dividing the sum of squares by the number of degrees of freedom defined as (n-1), where n is length()
of the probeA
vector, as shown in the following command:
> d.f <- length(probeA) - 1 > probeA.soq/(d.f) [1] 0.06668388
A quicker way to get the variance would be to use the var()
function, as shown in the following command:
> var(probeA) [1] 0.06668388
Another measure of variability is the standard deviation defined as the square root of the sample variance. To get the standard deviation for our probeA
data, we can write the formula using the sqrt()
function to get the square root of the variance or, more simply, use the sd()
function, as shown in the following command:
> sqrt(var(probeA)) [1] 0.2582322 > sd(probeA) [1] 0.2582322
Often, it is important to determine how reliable our measures are by calculating the standard error of the mean, defined as the square root of the variance divided by the number of samples, as shown in the following command:
> sqrt(var(probeA)/length(probeA)) [1] 0.0398461
Another way we can assess the reliability of our measurements is by determining the confidence intervals for the mean of our data points. Confidence intervals (CI) estimate the range the mean would fall in, should the experiment or exercise be repeated. We can calculate the confidence intervals for the mean of our sample distribution by multiplying the standard error by the t value associated with a significance level of equal to 0.025 or 0.975 quantile of the t-distribution with 41 degrees of freedom. The
qt()
function gives us the quantiles of the t-distribution for n degrees of freedom, which we can apply in the formula to calculate the confidence interval, as shown in the following code:
> std.err.s2A <- sqrt(var(probeA)/length(probeA)) > qt(.975, d.f)* std.err.s2A [1] 0.08047082
Therefore, the mean of the gene expression values for probeA
is 4.77 ± 0.080 for the 42 samples with a 95 percent confidence interval.