knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library("gpuMagic")
The gpuMagic package is designed for ultilizing the computing power of modern GPU. The package has the ability to compile R code and implement the code on a GPU kernel. Therefore, it is easy for the user to design their own algorithm in pure R language and run the code in GPU. The package uses openCL as its parallel language so parallelization using the CPU is also supported.
To get started, it is always a good idea to check your GPU avalability.
getDeviceList()
The devices listed above show you the current avalable devices, if you cannot see the device, you should install the openCL driver obtained from the corresponding vender. The default device is the device 1, if you have multiple devices, you can query and change your device by
getDeviceInfo(1) setDevice(1)
The package also has a GPU resources manager. You can check your memory usage at any time by
gpuMagic.getMemUsage()
After the proper device has been selected, you are ready to write your GPU code.
If you are using a Nvidia graphics card, it is a good practice to change your Timeout Detection and Recovery(TDR) setting as the driver will reset the graphics card if the function takes too long to run the code. The default setting is 2 seconds, which is too short for most operations. If you are using another device such as an Intel graphic card, you are safe to continue.
The setting can be found in NVIDIA Nsight. Change it to a larger number such as 20 seconds will make most programs work.
The GPU code should be written as an R function and should be compatible with the sapply function. Here is an example of matrix multiplication
#C=A%*%B #Each time the function will compute a column of C matMul<function(ind,A,B){ C=A%*%B[,ind] return(C) }
You can test your function by using the sapply function
n=10 m=50 k=100 A=matrix(runif(n*m),n,m) B=matrix(runif(m*k),m,k) C_sapply=sapply(1:k,matMul,A,B) #The internal matrix operation function in R C_internel=A%*%B #Compute error max(abs(C_sapplyC_internel))
As you see here, the result from sapply is consistent with the result from the R internal matrix multiplication function. To run the above function in GPU, it can be simply achieved by changing sapply to gpuSapply
C_gpu=gpuSapply(1:k,matMul,A,B) #Compute error max(abs(C_gpuC_internel))
That is all you need to do to run the GPU code! We can try a larger sample size to see the speed differences between GPU and R code.
n=1000 m=1000 k=1000 A=matrix(runif(n*m),n,m) B=matrix(runif(n*m),m,k) system.time( gpuSapply(1:k,matMul,A,B) ) system.time( A%*%B )
In my test environment(GF 940m), the above code takes about 0.1 secs and CPU(i56300U) takes 0.6 secs to finish, a 6 times speed up in the matrix operation!
Here we discuss some common questions that you will have when using the package. Some are due to the limitations of openCL language and some are due to the lazyness of the package author.
NA
and NULL
values are not supported, I did not test them yet but it should be OK to pass them to the device, it just may not work as you expect. This feature will be added in the near future.A[1:2]=A[2:1]
does not work, since the opencl does not have a temporary space to store the intermediate results and therefore the assignments will be excuted in sequence, so the above code is equivalent to A[1]=A[2];A[2]=A[1]
. The value in A[2]
will be incorrect.In R language, it is fairly easy to create a matrix without knowing its size in advance, for example,
n=sample(1:10,1) A=matrix(0,n,n)
will create an nbyn matrix A. The size of A is undetermined until the first line of the code is evaluated. The ability to create a matrix without knowing its size before running the code is called dynamic allocation. Unfortunately, due to the language limitation, openCL does not support this feature, which means the size of every matrix in the openCL code should be known before its excution. Therefore, it is impossible to create a matrix dynamically. To solve this problem, the package provides two alternative ways to make a dynamiclike matrix.
The gpuSapply
function has an argument .macroParms
to specify the macro variable. Therefore, the value of the macro will be plugged into the function and is known for the compiler. for example
dynamic_example1<function(ind,A,n){ B=matrix(0,n) C=A+B return(C) } n=10 A=matrix(runif(n),n) #The code below wouldn't work, #the value of n is unknown in the compilation stage #res_device=gpuSapply(1,dynamic_example1,A,n) #works, the value of n is knwown as a macro in the compilation stage res_device=gpuSapply(1,dynamic_example1,A,n,.macroParms = c("n"))
Please note that when you specify a variable as a macro, you cannot change the value of it inside the function since it is fixed
If you do need a dynamic feature in your code, you can use the reference function to create it. The reference function works like a macro. For example
#Perform the column sum of the first n elements in each column of A dynamic_example2<function(ind,A,n){ #Dynamically subsetting the first n elements in the ind column of the matrix A #Almost equivalent to A_dyn=A[1:n,ind] #The size of A_dyn is unknown in the compilation stage since the value of n is unknown A_dyn=subRef(A,1:n,ind) C=sum(A_dyn) return(C) } n=10 M=100 A=matrix(runif(M),M,M) res_device=gpuSapply(1:M,dynamic_example2,A,n) res_cpu=sapply(1:M,dynamic_example2,A,n) #check error range(res_deviceres_cpu)
With the reference function, you can allocate a large matrix and then dynamically subset it to create the matrix with the size you need. However, you need to be careful when you do the operations on the reference object. In the example above, any changes to the reference object A_dyn
will cause the change in its target object A
. Please refer to the function list to see the available reference functions.
Since dynamic allocation is not possible, you may wonder if returning an object with dynamic size is possible. The answer is yes but it is depreciated. If you write multiple return
functions in your code. The package will first compute the maximum length of your return variables, and then allocate the memory according to the maximum length. If the length of one return variable cannot be correctly infered, you will get a warning message, and the length of that variable will be ignored. Therefore, in order to correctly return a variable of undetermined length, you should create a matrix which size is always larger than the return variable, then you return that matrix in the end. Therefore the compiler will always have enough space for your return variable( but with redundant space). For example:
dynamic_return<function(ind,A,n){ #Dynamically subsetting the first n elements in the ind column of the matrix A #equivalent to A_dyn=A[1:n,ind] A_dyn=subRef(A,1:n,ind) #return A_dyn, the size is unknown return(A_dyn) #The largest length of the variable A_dyn tmp=matrix(0,nrow(A)) #tell the compiler the return size should be at least the same as the variable tmp return(tmp) } n=10 M=20 A=matrix(runif(M*M),M,M) #Expect a warning message: Undetermined return size has been found suppressWarnings({ res_device=gpuSapply(1:M,dynamic_return,A,n) }) #retrieve the result (The first n rows) res_device=res_device[1:n,] #check error range(res_deviceA[1:n,])
In the future, the package will provide a more concise way to specify the return size.
The language used in the package is only a subset of the R language, it fully supports scalar to scalar, matrix to matrix, scalar to matrix, and matrix to scalar opration. However, for the matrix to matrix operations, they must have the same size, otherwise an error will be given(eg. matrix + vector is not allowed). An alternative solution for the matrix to vector operations is using the sweep
function, please refer to the R documentation to see the usage.
For performance reasons, the package only performs the error checking in the compilation stage. Therefore, when an error occurs on GPU, the code will continue to run silently, and crash R at some point without any hint. It is always a good idea to run a sample code with thesapply
function to check for possible errors before you launch the rocker with your precious data.
For the sequence variable, you can create it without limitation. As you see in the dynamic_example2
function, we have implicitly created a sequence variable 1:n
, whose size is unknown. You can use the function seq
or :
to create the sequence. the only exception is that you cannot change the value of the sequence variable, because the variable is in a compact format and does not own a memory. If you do want to create a sequence variable in memory, an alternative way is to create the sequence first and assign it to a new variable, then the compiler will allocate the memory for the new variable.
The current version does not support calling userdefined R functions inside your main function, this feature is under development and will come in the near future.
Here we provide more examples to show the usage of the package. Please note that not all the functions will be efficient when running on the GPU, it is possible that the CPU code can be faster than the GPU. Depending on the job type, you should decide which device you want to run on by calling setDevice()
.
largestK<function(ind,A,k){ #get a column of A A_col=subRef(A,,ind) #the largest k values and indices of A_col largest_value=matrix(0,k) largest_ind=matrix(0,k) #loop over A_col to find the largest k values for(i in 1:length(A_col)){ for(j in 1:k){ if(A_col[i]>largest_value[j]){ if(j!=k){ #Backward assignment to avoid self assignment problem #(see section: Important note) ind_src=(k1):j largest_value[ind_src+1]=largest_value[ind_src] largest_ind[ind_src+1]=largest_ind[ind_src] } largest_value[j]=A_col[i] largest_ind[j]=i break } } } return_nocpy(largest_ind) } N=1000 M=1000 k=10 A=matrix(runif(N*M),N,M) #Warmup warm_res=gpuSapply(1:M,largestK,A=A,k=k,.macroParms = "k") warm_res=sapply(1:M,largestK,A=A,k=k) #count the time system.time({res_device=gpuSapply(1:M,largestK,A=A,k=k,.macroParms = "k")}) system.time({res_cpu=sapply(1:M,largestK,A=A,k=k)}) #Check the error range(res_deviceres_cpu)
computeLoss<function(ind,x,y,parms){ #Find the parameters for the thread parm=parms[ind,] #Compute y hat, use nocopy transpose y_hat=x%*%t_nocpy(parm) #absolute loss value(L1 loss) loss=sum(abs(yy_hat)) return(loss) } #Sample size n=1000 #Number of parameters p=2 beta=c(2,3) #Generate data x=matrix(runif(n*p),n,p) e=runif(n) y=x%*%beta+e #Bruteforce search #The seaching range and precision search_range=seq(0,10,0.1) parms=expand.grid(search_range,search_range) #Warmup warm_res=gpuSapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms) warm_res=sapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms) #count the time system.time({res_device=gpuSapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)}) system.time({res_cpu=sapply(seq_len(nrow(parms)),computeLoss,x=x,y=y,parms=parms)}) #print out the parameters which minimize the loss function parms[which.min(res_device),] parms[which.min(res_cpu),]
The package currently supports a limited number of functions, the usage of them are similar but not promised to be the same as the R functions. For example, the drop argument in [ function is not supported yet. Therefore, in the current stage, it is better to only call the simplest form of the function. The current available functions are:
 Function  Note    nrow  the value can be used as a dimensional parameter to create a matrix   ncol  the value can be used as a dimensional parameter to create a matrix   length    matrix    [  drop does not work   abs    floor    ceiling    t    sum  Supports one argument only   rowSums  Supports one argument only   colSums  Supports one argument only   rowMeans  Supports one argument only   colMeans  Supports one argument only   sort  Supports one argument only, ascending sort   sweep    +        *    /    ^    >    >=    <    <=    ==    %*%    return  
 Explicit declaration function  note    subRef  Create a reference object to a subset of a matrix, the usage is same as [ function   seq  Support from, to, by,length.out arguments only   :    Matrix  Declare a matrix(Depreciated)   Scalar  Declare a scalar(Depreciated) 
Besides traditional R function, the package also provides a few nocopy methods. These methods do not allocate memory for the result. Instead, they return a referent(AKA pointer) pointing to the target matrix. Therefore, any changes made in the reference object will cause the changes in the target matrix. This feature is for saving memory usage, but it is could be very useful when you need "dynamic" allocation. You can first allocate a matrix with fixed dimension and then create a reference object pointing to that memory space, the size of the reference object can be undetermined in the compilation stage.
 Function  Note 

 subRef  Nocopy subsetting, the usage is same as [

 t_nocpy  Nocopy transpose, it return a reference of the transpose of a matrix 
 return_nocpy  Nocopy return, return the variable without an additional copy, it does not support reference object 
Here are the platform information used to compile this vignette.
version
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.