```r
library(bis557)
library(reticulate)
source_python("my-python-functions.py")

dt <- iris
dt$new_Petal.Width <- 2 * dt$Petal.Width  # create a collinear variable
X <- model.matrix(Sepal.Length ~ Petal.Width + Petal.Length + new_Petal.Width, data = dt)
Y <- matrix(dt$Sepal.Length, ncol = 1)

# result from python
ridge_py_beta <- my_ridge_py(X = X, Y = Y, lam = 2.0)

# result from R
ridge_r_fit <- my_lm_ridge(form = Sepal.Length ~ Petal.Width + Petal.Length + new_Petal.Width,
                           data = dt, lambda = 2)
ridge_r_beta <- ridge_r_fit$coefficients

rslt1 <- as.data.frame(cbind(ridge_py_beta, ridge_r_beta))
colnames(rslt1) <- c("python", "R")
rslt1
```
From the comparison above, the Python ridge regression function gives coefficients that closely match my R implementation, even with the collinear variable in the design matrix.
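For reference, both functions are assumed to solve the same penalized least-squares problem (exact conventions for the intercept and the scaling of $\lambda$ may differ between implementations), which has the closed-form solution

$$\hat{\beta}_{\text{ridge}} = \underset{\beta}{\arg\min}\ \|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2 = (X^\top X + \lambda I)^{-1} X^\top Y.$$

The $\lambda I$ term keeps $X^\top X + \lambda I$ invertible even though the collinear new_Petal.Width column makes $X^\top X$ singular, which is why ridge regression can handle this design matrix at all.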
```r
# data simulation
n <- 100000
p <- 4
b_true <- c(2, -3, 5, 4)
X <- matrix(rnorm(n * p, mean = 1, sd = 1), nrow = n, ncol = p)
Y <- X %*% b_true + rnorm(n, mean = 0, sd = 1)

# read 10 samples at a time (batch_size = 1 gives pure stochastic gradient descent)
batch_size <- 10

# initialize betas
betas <- matrix(rep(0, p), ncol = 1)
n_iter <- n / batch_size
for (i in seq_len(n_iter)) {
  row_beg <- (i - 1) * batch_size + 1
  row_end <- i * batch_size
  X_batch <- X[row_beg:row_end, ]
  Y_batch <- as.matrix(Y[row_beg:row_end, 1])
  betas <- my_ridge_py_by_batch(X_batch, Y_batch, betas, lam = 2.0, alpha = 0.02)
}

rslt2 <- cbind(betas, b_true)
colnames(rslt2) <- c("Estimated betas", "True betas")
rslt2
```
From the above result, where the data are read in batches of 10, the final estimates are very close to the true betas. The estimated values are slightly smaller in magnitude than the true betas because we use an L2 penalty, i.e., a ridge regression model.
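To make the updating rule explicit, here is a minimal R sketch of what my_ridge_py_by_batch is assumed to do: a single gradient step on the mini-batch's ridge loss. The actual implementation lives in my-python-functions.py and is not shown here, so the details (step scaling, penalty handling) are assumptions.

```r
# Hypothetical R analogue of my_ridge_py_by_batch (an assumption about the
# Python code, which is not shown): one gradient step on the ridge loss
# of a single mini-batch.
ridge_batch_step <- function(X_batch, Y_batch, betas, lam = 2.0, alpha = 0.02) {
  # gradient of ||Y - Xb||^2 / 2 + lam * ||b||^2 / 2 with respect to b
  grad <- t(X_batch) %*% (X_batch %*% betas - Y_batch) + lam * betas
  betas - alpha * grad
}
```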
Next, I compare my Python lasso implementation against the 'casl' package.

```r
# data simulation
n <- 1000
p <- 4
b_true <- c(2.0, -3.0, 5.0, 4.0)
X <- matrix(rnorm(n * p, mean = 0, sd = 1), nrow = n, ncol = p)
Y <- X %*% b_true + rnorm(n, mean = 0, sd = 1)

# result with the casl package
# my R version is too old to install the casl package, so I source the functions directly
source("casl.R")
lasso_casl_beta <- casl_lenet(X, Y, lambda = 0.01, alpha = 0,
                              b = matrix(0, nrow = ncol(X), ncol = 1),
                              tol = 1e-10, maxit = 1000L,
                              W = rep(1, length(Y)) / length(Y))
lasso_casl_beta
```
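Both the casl routine above and my Python function below target the lasso objective. My function minimizes it by subgradient descent, using $\mathrm{sign}(\beta)$ as a subgradient of the $\ell_1$ penalty:

$$\min_\beta\ \tfrac{1}{2}\|Y - X\beta\|_2^2 + \lambda\|\beta\|_1, \qquad \beta^{(t+1)} = \beta^{(t)} - \alpha\left(X^\top(X\beta^{(t)} - Y) + \lambda\,\mathrm{sign}(\beta^{(t)})\right).$$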
```python
import numpy as np

def my_lasso_py(X, Y, lam=1.0, maxit=10000, alpha=0.001, tol=1e-10):
    """
    Return the beta coefficients of a lasso regression model.

    Args:
        X: design matrix
        Y: outcome matrix with one column
        lam: the tuning parameter lambda in lasso regression
        maxit: maximum number of iterations
        alpha: learning rate
        tol: convergence tolerance

    Returns:
        beta coefficients of the lasso regression model
    """
    p = X.shape[1]
    betas = np.random.randn(p, 1)
    for t in range(maxit):
        betas_old = betas
        Y_hat = X.dot(betas)
        # subgradient of 0.5 * ||Y - X b||^2 + lam * ||b||_1
        delta = X.T.dot(Y_hat - Y) + lam * np.sign(betas)
        betas = betas - alpha * delta
        if np.sum(np.abs(betas - betas_old)) <= tol:
            break
    return betas

n = 1000
p = 4
b_true = np.array([[2.0, -3.0, 5.0, 4.0]]).T
X = np.random.randn(n, p)
Y = X.dot(b_true) + np.random.randn(n, 1)

# result with the python function
lasso_py_beta = my_lasso_py(X, Y, lam=0.01, maxit=1000, alpha=0.001, tol=1e-10)
print(lasso_py_beta)
```
From the above results, we can see that the estimates from my Python function and the 'casl' package are very similar.
My project comes from a Kaggle competition: https://www.kaggle.com/c/nlp-getting-started/overview. I will use NLP techniques to predict whether a tweet describes a real disaster. I will apply several classification algorithms and evaluate their performance, and I hope to modify some existing methods to improve on them, for example in algorithm efficiency or prediction accuracy.