Descriptive statistics_Mastering Scientific Computing with R-QQ阅读中文古言网

上QQ阅读APP看书，第一时间看更新

Descriptive statistics

A useful tool to evaluate your data before you begin your analysis is the summary() function, which provides a summary of the non-parametric descriptors of a sample, as follows:

> summary(probeA)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  4.211   4.645   4.774   4.774   4.892   5.231

You can also get a summary for each column of the matrix using the summary() function, as follows:

> summary(MLLpartner.mx) #output truncated
   GSM487973        GSM487972        GSM487971     
 Min.   : 2.112   Min.   : 1.805   Min.   : 1.994  
 1st Qu.: 3.412   1st Qu.: 3.410   1st Qu.: 3.411  
 Median : 4.736   Median : 4.745   Median : 4.731  
 Mean   : 5.342   Mean   : 5.346   Mean   : 5.355  
 3rd Qu.: 6.870   3rd Qu.: 6.851   3rd Qu.: 6.933  
 Max.   :14.449   Max.   :14.453   Max.   :14.406

We can also get this information by calling the individual functions used to determine these parameters, namely the mean, median, min, max, and quantile functions, as follows:

> min(probeA)
[1] 4.21085
> max(probeA)
[1] 5.231199
> mean(probeA)
[1] 4.773866
> median(probeA)
[1] 4.774236
> quantile(probeA)
      0%      25%      50%      75%     100%
4.210850 4.644994 4.774236 4.892259 5.231199

You can also specify which probabilities to use with the probs argument, as follows:

> quantile(probeA, probs = c(0.1, 0.2, 0.6, 0.9))
     10%      20%      60%      90%
4.375377 4.576501 4.821101 5.118735

To avoid getting a string of numbers in your output, you can specify the number of decimal places to be displayed using the round() function, as follows:

> round(mean(probeA), 2)
[1] 4.77

Suppose we also had information on the response to the drugA treatment for those 42 patient samples, we could include it to the probeA and probeB gene expression levels, as follows:

> df <- data.frame(expr_probeA=probeA, expr_probeB=probeB, drugA_response= factor(rep(c("success", "fail"), 21)))
> head(df)
          expr_probeA expr_probeB drugA_response
GSM487973    4.372242    8.176033        success
GSM487972    4.293080    8.541016           fail
GSM487971    4.210850    8.338475        success
GSM487970    4.707231    7.935021           fail
GSM487969    4.345101    7.868985        success
GSM487968    4.586062    7.909702           fail

Now, we can get a summary for each column by response to the drugA treatment using the by() function, as follows:

> by(df, df$drugA_response, summary)
df$drugA_response: fail
  expr_probeA     expr_probeB    drugA_response
 Min.   :4.293   Min.   :6.960   fail   :21    
 1st Qu.:4.687   1st Qu.:7.935   success: 0    
 Median :4.766   Median :8.245                 
 Mean   :4.786   Mean   :8.201                 
 3rd Qu.:4.895   3rd Qu.:8.575                 
 Max.   :5.218   Max.   :8.926                 
-------------------------------------------------
df$drugA_response: success
  expr_probeA     expr_probeB    drugA_response
 Min.   :4.211   Min.   :6.652   fail   : 0    
 1st Qu.:4.571   1st Qu.:7.597   success:21    
 Median :4.776   Median :7.921                 
 Mean   :4.762   Mean   :7.950                 
 3rd Qu.:4.885   3rd Qu.:8.338                 
 Max.   :5.231   Max.   :9.033

Data variability

In addition to general information from the mean, median, and quantiles functions, we are often interested in knowing how variable our data points are from each other. The simplest measure of variability is the range, which is the difference between the largest and smallest value. We can obtain the range by subtracting the maximum value from the minimum value, or simply using the range() function, as shown in the following code:

> max(probeA) - min(probeA)
[1] 1.02035
> range(probeA)
[1] 4.210850 5.231199

Other measures of variability include the variance and standard deviation. The variance is defined as the average squared deviation of the data values from the mean. More formally, the population variance is defined as:

In the preceding formula, N is the size and Data variability is the mean of the population given by each data point . The sample variance is defined as:

In the preceding formula, n is the number of samples from the population, n is less than N and Data variability is the sample mean.

In other words, the sum of squares divided by the number of data values (n) for a population and the degrees of freedom (n-1) for a sample. We can find the mean using the mean() function, as follows:

> mean(probeA)
[1] 4.773866

We can calculate the sum of squares in R using the sum() function, as follows:

> probeA.soq <- sum((probeA-mean(probeA))^2)
[1] 2.734039

Now, we can easily calculate the unbiased sample variance by dividing the sum of squares by the number of degrees of freedom defined as (n-1), where n is length() of the probeA vector, as shown in the following command:

> d.f <- length(probeA) - 1
> probeA.soq/(d.f)
[1] 0.06668388

A quicker way to get the variance would be to use the var() function, as shown in the following command:

 > var(probeA)
[1] 0.06668388

Another measure of variability is the standard deviation defined as the square root of the sample variance. To get the standard deviation for our probeA data, we can write the formula using the sqrt() function to get the square root of the variance or, more simply, use the sd() function, as shown in the following command:

> sqrt(var(probeA))
[1] 0.2582322
> sd(probeA)
[1] 0.2582322

Often, it is important to determine how reliable our measures are by calculating the standard error of the mean, defined as the square root of the variance divided by the number of samples, as shown in the following command:

> sqrt(var(probeA)/length(probeA))
[1] 0.0398461

Confidence intervals

Another way we can assess the reliability of our measurements is by determining the confidence intervals for the mean of our data points. Confidence intervals (CI) estimate the range the mean would fall in, should the experiment or exercise be repeated. We can calculate the confidence intervals for the mean of our sample distribution by multiplying the standard error by the t value associated with a significance level of equal to 0.025 or 0.975 quantile of the t-distribution with 41 degrees of freedom. The qt() function gives us the quantiles of the t-distribution for n degrees of freedom, which we can apply in the formula to calculate the confidence interval, as shown in the following code:

> std.err.s2A <- sqrt(var(probeA)/length(probeA))
> qt(.975, d.f)* std.err.s2A
[1] 0.08047082

Therefore, the mean of the gene expression values for probeA is 4.77 ± 0.080 for the 42 samples with a 95 percent confidence interval.