Posted on

Occasionally, I would encounter a problem in R where I want to split a string in a character columns with the same separator. However, there's is no function in R that is capable of doing that, and the strsplit function always return a list which I have to unlist it.

So today, I finally typed up a Rcpp function to ease my work and the code is as followed.

The function takes three inputs:

  • The character vector that is being split
  • separator
  • the piece that is desired

+++

Test the code:

library(Rcpp)
sourceCpp('~/scripts/R/Rcpp/string_split.cpp')

testVector <- rep('I~am~a~boy',10)
for (i in 1:4){
	print(string_split(testVector,'~',i))
}
##  [1] "I" "I" "I" "I" "I" "I" "I" "I" "I" "I"
##  [1] "am" "am" "am" "am" "am" "am" "am" "am" "am" "am"
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
##  [1] "boy" "boy" "boy" "boy" "boy" "boy" "boy" "boy" "boy" "boy"

Benchmarking:

library(rbenchmark)
library(ggplot2)
## Loading required package: methods
r_string_split <- function(x){
	sapply(x,function(y) unlist(strsplit(x,'~'))[2])
}

bm <- benchmark(string_split(testVector,'~',2),r_string_split(testVector))
bm
##                               test replications elapsed relative user.self
## 2       r_string_split(testVector)          100   0.050       25     0.049
## 1 string_split(testVector, "~", 2)          100   0.002        1     0.001
##   sys.self user.child sys.child
## 2        0          0         0
## 1        0          0         0
ggplot(data = bm,aes(x = test, y = relative)) +
		geom_bar(stat='identity') +
		theme(axis.text.x = element_text(angle=90,
										hjust = 1,
										vjust = 0.5))+
		labs(y = 'relative speed',title = 'benchmarking result')

![plot of chunk unnamed-chunk-2]({{ site.url }}/assets/article_images/string/unnamed-chunk-2-1.png)

The c++ function is ~25x faster.