Posted on

In mulitple testing experiments, the p values from a regular t-test will introduce a lot of false positive. In R, the function p.adjust provides serveral ways to adjust the p-values.

However, the speed of Benjamini–Hochberg procedure in p.adjust is not optimized. In this case, I rewrote it in Rcpp, which is basically c++. The algorithm is explained here.

The cpp function of FDR control:

library(Rcpp)
cppFunction(' NumericVector FDRcontrol(NumericVector p, double alpha){
	// the input should be ascending order p value // 
	int size = p.size();
	double threshold1, threshold2;
	NumericVector passed(size);
	int i = 0;
	while ( i < size){
		threshold1 = alpha * (i+1) / size;
		threshold2 = alpha * (i+2) / size;
		if (threshold1 > p[i]){
			if (threshold2 < p[i+1]){
				for (int j = 0; j <= i; j++){
					passed[j] = 1;
				}
			}
		} 
		i++;
	}
	return(passed);
}')

Now, start benchmarking,

library(dplyr)
library(rbenchmark)
library(data.table)

set.seed(100) #Again, reporoducible

df <- data.table(p = rnorm(100000,-0.5:0.5)) %>%
		filter(p < 1 & p > -1) %>%
		mutate(p = abs(p))	%>% 
		tbl_df
df
## Source: local data frame [62,512 x 1]
## 
##            p
## 1  0.6315312
## 2  0.5789171
## 3  0.3830287
## 4  0.8186301
## 5  0.1401379
## 6  0.4101139
## 7  0.5962745
## 8  0.7016340
## 9  0.3766205
## 10 0.4706833
## ..       ...
# The goal is to put a 1 if FDR < alpha
# here set FDR threshold as 0.5, which is half the time is random
benchmark(df %>% arrange(p) %>% mutate(fdr = FDRcontrol(p,0.5)),
		df %>% mutate(padj = p.adjust(p,method='BH'),FDR=ifelse(padj<0.5,1,0)))
##                                                                               test
## 1                           df %>% arrange(p) %>% mutate(fdr = FDRcontrol(p, 0.5))
## 2 df %>% mutate(padj = p.adjust(p, method = "BH"), FDR = ifelse(padj < 0.5, 1, 0))
##   replications elapsed relative user.self sys.self user.child sys.child
## 1          100   3.272    1.000     3.267    0.004          0         0
## 2          100   7.304    2.232     7.159    0.142          0         0
df1 = df %>% 
		arrange(p) %>% 
		mutate(fdr = FDRcontrol(p,0.5)) %>% 
		mutate(padj = p.adjust(p,method='BH'),FDR=ifelse(padj<0.5,1,0))
df1
## Source: local data frame [62,512 x 4]
## 
##               p fdr      padj FDR
## 1  6.717138e-06   1 0.4199017   1
## 2  2.243902e-05   0 0.5541057   0
## 3  2.659197e-05   0 0.5541057   0
## 4  3.975755e-05   0 0.5898058   0
## 5  4.863623e-05   0 0.5898058   0
## 6  7.776861e-05   0 0.5898058   0
## 7  7.790700e-05   0 0.5898058   0
## 8  8.057130e-05   0 0.5898058   0
## 9  8.491573e-05   0 0.5898058   0
## 10 1.065339e-04   0 0.6312590   0
## ..          ... ...       ... ...

Although the Rcpp function is faster, it does not have the flexibility to choose a thershold for FDR after adjusting the p values.

And also, Rcpp is a good way to increase the speed of R.