Occasionally I run into a problem in R where I want to split every string in a character column on the same separator and keep only one piece. However, I could not find a function in R that does this directly, and strsplit always returns a list, which I then have to unlist.

So today I finally wrote an Rcpp function to make this easier. The code is as follows.

#include <Rcpp.h>
#include <sstream>
#include <string>
#include <vector>
using namespace Rcpp;
using namespace std;
typedef vector<string> stringList;
typedef vector<int> numList;
// split a line on the desired delimiter
stringList split(const string &s, char delim)
{
    stringList result;
    stringstream ss(s);
    string item;
    while (getline(ss, item, delim))
    {
        result.push_back(item);
    }
    return result;
}
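// Return piece `start` (1-based) of every string in x, followed by the next
// `frag` pieces, re-joined with sep. Only the first character of sep is used
// as the delimiter. frag defaults to 0, i.e. a single piece.
// [[Rcpp::export]]
stringList string_split(stringList x, string sep, int start, int frag = 0)
{
    if (sep.empty() || start < 1 || frag < 0)
    {
        Rcpp::stop("sep must be non-empty, start >= 1 and frag >= 0");
    }
    int size = x.size();
    stringList result(size);
    char delim = sep[0];
    for (int i = 0; i < size; i++)
    {
        stringList splittedString = split(x[i], delim);
        int splittedSize = splittedString.size();
        // the last piece used is number start + frag
        if (splittedSize < start + frag)
        {
            Rcpp::stop("start + frag exceeds the number of pieces");
        }
        string str = splittedString[start - 1];
        for (int j = start; j < start + frag; j++)
        {
            str = str + sep + splittedString[j];
        }
        result[i] = str;
    }
    return result;
}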
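// Split each string in x on sep and re-join the pieces with delim.
// [[Rcpp::export]]
stringList changeDelim(stringList x, char sep, string delim)
{
    int size = x.size();
    stringList result(size);
    for (int i = 0; i < size; i++)
    {
        stringList splittedString = split(x[i], sep);
        if (splittedString.empty())
        {
            // empty input string: split() returns no pieces,
            // so result[i] stays ""
            continue;
        }
        string str = splittedString[0];
        for (size_t j = 1; j < splittedString.size(); j++)
        {
            str = str + delim + splittedString[j];
        }
        result[i] = str;
    }
    return result;
}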
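
One detail of the getline-based split is worth knowing before relying on these functions: an empty string yields no pieces and a trailing separator does not produce a trailing empty piece, while leading or doubled separators do produce empty pieces. Here is a minimal standalone sketch with the Rcpp dependency removed; the names split_fields and check_edge_cases are hypothetical and only used for illustration:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Same getline loop as split() in the post, minus Rcpp.
// split_fields is a hypothetical name used only in this sketch.
std::vector<std::string> split_fields(const std::string &s, char delim)
{
    std::vector<std::string> result;
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, delim))
    {
        result.push_back(item);
    }
    return result;
}

// Edge cases of getline-based splitting (hypothetical helper).
void check_edge_cases()
{
    assert(split_fields("I~am~a~boy", '~').size() == 4);
    assert(split_fields("", '~').empty());          // no pieces at all
    assert(split_fields("a~b~", '~').size() == 2);  // trailing "~" adds nothing
    assert(split_fields("~a", '~').size() == 2);    // leading "~" gives "" then "a"
    assert(split_fields("a~~b", '~').size() == 3);  // "a", "", "b"
}
```

Any caller that indexes the first piece therefore needs to handle the empty-string case.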

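string_split takes these inputs:

  • x: the character vector to split
  • sep: the separator (only its first character is used for splitting)
  • start: the 1-based index of the piece that is desired
  • frag: the number of following pieces to append after piece start (0 returns just that one piece)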
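
To make the start and frag semantics concrete, here is a small standalone sketch of the same selection logic for a single string, without Rcpp. The names pick_fields and demo are hypothetical, and the default frag = 0 is an assumption made for this sketch:

```cpp
#include <cassert>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical standalone helper (not part of the post's code): returns
// piece `start` (1-based) of s plus the next `frag` pieces, joined by delim.
std::string pick_fields(const std::string &s, char delim, int start, int frag = 0)
{
    std::vector<std::string> parts;
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, delim))
    {
        parts.push_back(item);
    }

    // the last piece used is number start + frag
    if (start < 1 || frag < 0 || static_cast<int>(parts.size()) < start + frag)
    {
        throw std::out_of_range("requested pieces are out of range");
    }

    std::string out = parts[start - 1];
    for (int j = start; j < start + frag; j++)
    {
        out += delim;
        out += parts[j];
    }
    return out;
}

// Results for the test string used in this post.
void demo()
{
    assert(pick_fields("I~am~a~boy", '~', 2) == "am");
    assert(pick_fields("I~am~a~boy", '~', 2, 1) == "am~a");
    assert(pick_fields("I~am~a~boy", '~', 1, 3) == "I~am~a~boy");
}
```

Asking for more pieces than the string contains, for example start = 4 with frag = 1 here, throws instead of reading past the end of the vector.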

Test the code:

library(Rcpp)
sourceCpp('~/scripts/R/Rcpp/string_split.cpp')

testVector <- rep('I~am~a~boy',10)
for (i in 1:4){
	print(string_split(testVector,'~',i))
}
##  [1] "I" "I" "I" "I" "I" "I" "I" "I" "I" "I"
##  [1] "am" "am" "am" "am" "am" "am" "am" "am" "am" "am"
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
##  [1] "boy" "boy" "boy" "boy" "boy" "boy" "boy" "boy" "boy" "boy"

Benchmarking:

library(rbenchmark)
library(ggplot2)
## Loading required package: methods
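r_string_split <- function(x){
	# note: strsplit(x, '~') splits the whole vector on every iteration;
	# strsplit(y, '~') would split only the current element
	sapply(x,function(y) unlist(strsplit(x,'~'))[2])
}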

bm <- benchmark(string_split(testVector,'~',2),r_string_split(testVector))
bm
##                               test replications elapsed relative user.self
## 2       r_string_split(testVector)          100   0.050       25     0.049
## 1 string_split(testVector, "~", 2)          100   0.002        1     0.001
##   sys.self user.child sys.child
## 2        0          0         0
## 1        0          0         0
ggplot(data = bm,aes(x = test, y = relative)) +
		geom_bar(stat='identity') +
		theme(axis.text.x = element_text(angle=90,
										hjust = 1,
										vjust = 0.5))+
		labs(y = 'relative speed',title = 'benchmarking result')

![plot of chunk unnamed-chunk-2]({{ site.url }}/assets/article_images/string/unnamed-chunk-2-1.png)

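With this 10-element test vector, the C++ function is ~25x faster than the R version. Part of that gap comes from the R baseline calling strsplit on the whole vector x inside every sapply iteration, rather than on the single element y.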