Character vector splitting in R.
Occasionally, I run into a problem in R where I want to split every string in a character column on the same separator and keep one particular piece. As far as I know, base R has no single function that does this directly, and strsplit always returns a list, which I then have to unlist.
So today, I finally wrote an Rcpp function to ease my work, and the code is as follows.
#include <Rcpp.h>
#include <sstream>
#include <string>
#include <vector>

using namespace Rcpp;
using namespace std;

typedef vector<string> stringList;
typedef vector<int> numList;

// Split a line on the desired single-character delimiter.
stringList split(const string &s, char delim)
{
    stringList result;
    stringstream ss(s);
    string item;
    while (getline(ss, item, delim))
    {
        result.push_back(item);
    }
    return result;
}

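// Return piece `start` (1-based) of each element of x, followed by the next
// `frag` pieces, rejoined with sep. Only the first character of sep is used
// for splitting. frag defaults to 0, which returns a single piece.
// [[Rcpp::export]]
stringList string_split(stringList x, string sep, int start, int frag = 0)
{
    if (sep.empty() || start < 1 || frag < 0)
    {
        Rcpp::stop("sep must be non-empty, start >= 1 and frag >= 0\n");
    }
    int size = x.size();
    stringList result(size);
    char delim = sep[0];
    for (int i = 0; i < size; i++)
    {
        stringList splittedString = split(x[i], delim);
        int splittedSize = splittedString.size();
        // The last piece read has index start + frag - 1, so it must exist.
        if (splittedSize < start + frag)
        {
            Rcpp::stop("Wrong end value\n");
        }
        string str = splittedString[start - 1];
        for (int j = start; j < start + frag; j++)
        {
            str = str + sep + splittedString[j];
        }
        result[i] = str;
    }
    return result;
}
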
// Split each element of x on sep and rejoin the pieces with delim instead.
// [[Rcpp::export]]
stringList changeDelim(stringList x, char sep, string delim)
{
    int size = x.size();
    stringList result(size);
    for (int i = 0; i < size; i++)
    {
        stringList splittedString = split(x[i], sep);
        // An empty input yields no pieces; leave result[i] as "".
        if (splittedString.empty())
        {
            continue;
        }
        string str = splittedString[0];
        for (size_t j = 1; j < splittedString.size(); j++)
        {
            str = str + delim + splittedString[j];
        }
        result[i] = str;
    }
    return result;
}

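The string_split function takes the following inputs:
- x: the character vector that is being split
- sep: the separator (only its first character is used for splitting)
- start: the 1-based position of the piece that is desired
- frag: the number of additional pieces to append after start (0 returns only the piece at start)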
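
To make the piece selection concrete, here is a minimal standalone sketch in plain C++ (no Rcpp) of the same idea. The names split_pieces and pick_pieces are hypothetical and only for illustration; the sketch assumes a single-character delimiter and throws a standard exception where the Rcpp version would call Rcpp::stop.

```cpp
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Split s on a single-character delimiter, using the same getline approach
// as the split helper in the gist.
std::vector<std::string> split_pieces(const std::string &s, char delim)
{
    std::vector<std::string> pieces;
    std::stringstream ss(s);
    std::string item;
    while (std::getline(ss, item, delim))
    {
        pieces.push_back(item);
    }
    return pieces;
}

// Hypothetical helper: return piece `start` (1-based) plus the next `frag`
// pieces, rejoined with delim. Throws if the requested range does not exist.
std::string pick_pieces(const std::string &s, char delim, int start, int frag)
{
    const std::vector<std::string> pieces = split_pieces(s, delim);
    if (start < 1 || frag < 0 || static_cast<int>(pieces.size()) < start + frag)
    {
        throw std::out_of_range("requested pieces are out of range");
    }
    std::string out = pieces[start - 1];
    for (int j = start; j < start + frag; ++j)
    {
        out += delim;
        out += pieces[j];
    }
    return out;
}
```

Under these assumptions, pick_pieces("I~am~a~boy", '~', 2, 0) gives "am" and pick_pieces("I~am~a~boy", '~', 2, 1) gives "am~a".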
Test the code:
library(Rcpp)
sourceCpp('~/scripts/R/Rcpp/string_split.cpp')
testVector <- rep('I~am~a~boy',10)
for (i in 1:4){
    print(string_split(testVector,'~',i))
}
## [1] "I" "I" "I" "I" "I" "I" "I" "I" "I" "I"
## [1] "am" "am" "am" "am" "am" "am" "am" "am" "am" "am"
## [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
## [1] "boy" "boy" "boy" "boy" "boy" "boy" "boy" "boy" "boy" "boy"
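
The test vector above has no empty fields, so it does not show how a getline-based split treats them. The following is a small standalone sketch in plain C++ (no Rcpp); split_fields is a hypothetical name for a copy of the same splitting loop, and it is meant only to illustrate the behaviour, not to replace the gist's helper.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Same getline-based loop as the split helper, under a hypothetical name.
std::vector<std::string> split_fields(const std::string &s, char delim)
{
    std::vector<std::string> fields;
    std::stringstream ss(s);
    std::string item;
    // getline stops at each delimiter; an empty field between two adjacent
    // delimiters is kept, but nothing is produced after a trailing delimiter.
    while (std::getline(ss, item, delim))
    {
        fields.push_back(item);
    }
    return fields;
}
```

With this loop, "a~~b" yields three fields (the middle one empty), "a~b~" yields only two because the trailing empty field is dropped, and an empty string yields no fields at all. That last case is worth keeping in mind when indexing into the result.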
Benchmarking:
library(rbenchmark)
library(ggplot2)
## Loading required package: methods
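r_string_split <- function(x){
    # strsplit(x, '~') splits the whole vector on every iteration of sapply;
    # strsplit(y, '~') would split only the current element.
    sapply(x,function(y) unlist(strsplit(x,'~'))[2])
}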
bm <- benchmark(string_split(testVector,'~',2),r_string_split(testVector))
bm
##                               test replications elapsed relative user.self sys.self user.child sys.child
## 2       r_string_split(testVector)          100   0.050       25     0.049        0          0         0
## 1 string_split(testVector, "~", 2)          100   0.002        1     0.001        0          0         0
ggplot(data = bm, aes(x = test, y = relative)) +
    geom_bar(stat = 'identity') +
    theme(axis.text.x = element_text(angle = 90,
                                     hjust = 1,
                                     vjust = 0.5)) +
    labs(y = 'relative speed', title = 'benchmarking result')

In this benchmark (a 10-element vector, 100 replications), the C++ function is about 25x faster than the R version.