1. In Python, implement a numerically stable ridge regression that takes into account collinear (or nearly collinear) regression variables. Show that it works by comparing it to the output of your R implementation.

library(bis557)
library(reticulate)
source_python("my-python-functions.py")

dt <- iris
dt$new_Petal.Width <- 2*dt$Petal.Width # create a perfectly collinear variable
X <- model.matrix(Sepal.Length ~ Petal.Width + Petal.Length + new_Petal.Width, data = dt)
Y <- matrix(dt$Sepal.Length, ncol = 1)

# result from python 
ridge_py_beta <- my_ridge_py(X = X, Y = Y, lam = 2.0)

# result from R
ridge_r_fit <- my_lm_ridge(form = Sepal.Length ~ Petal.Width + Petal.Length + new_Petal.Width, data = dt, lambda = 2)

ridge_r_beta <- ridge_r_fit$coefficients

rslt1 <- as.data.frame(cbind(ridge_py_beta, ridge_r_beta))
colnames(rslt1) <- c("python", "R")
rslt1

From the comparison above, the Python ridge regression produces essentially the same coefficients as my R implementation.
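For reference, the sourced file my-python-functions.py is not shown above. Below is a minimal sketch of how my_ridge_py could be written, using the SVD so the solve stays numerically stable when columns are exactly or nearly collinear; the body is a sketch under assumed argument names, not necessarily the actual sourced code.

import numpy as np

# Sketch only: ridge estimate via the SVD of X.
# beta = V diag(d / (d^2 + lam)) U' Y, so small singular values
# produced by collinear columns are stabilized by lam rather
# than inverted directly.
def my_ridge_py(X, Y, lam=1.0):
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    shrink = d / (d**2 + lam)  # shrunken inverse singular values
    return Vt.T @ (shrink[:, None] * (U.T @ Y))

Because lam > 0, the denominator d^2 + lam never vanishes even when a singular value is exactly zero, which is precisely the situation created by new_Petal.Width = 2 * Petal.Width.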

2. Create an "out-of-core" implementation of the linear model that reads contiguous rows of a data frame from a file and updates the model. You may read the data from R and send it to your Python functions for fitting.

# data simulation
n <- 100000
p <- 4
b_true <- c(2, -3, 5, 4)
X <- matrix(rnorm(n*p, mean = 1, sd = 1), nrow = n, ncol = p)
Y <- X %*% b_true + rnorm(n, mean = 0, sd = 1)

# read 10 rows at a time
batch_size <- 10 # a batch size of 1 gives stochastic gradient descent

# initialize the betas at zero
betas <- matrix(rep(0, p), ncol = 1)
n_iter <- n/batch_size

for(i in seq_len(n_iter)){
  row_beg <- (i-1)*batch_size + 1
  row_end <- i*batch_size
  X_batch <- X[row_beg : row_end, ]
  Y_batch <- as.matrix(Y[row_beg : row_end, 1])


  betas <- my_ridge_py_by_batch(X_batch, Y_batch, betas, lam = 2.0, alpha = 0.02)
}

rslt2 <- cbind(betas, b_true)
colnames(rslt2) <- c("Estimated betas", "True betas")
rslt2

Reading the data in batches of 10 rows, the final estimates are very close to the true betas. The estimated values are slightly smaller than the true ones because we use an L2 penalty, i.e., a ridge regression model.
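The batch update my_ridge_py_by_batch also lives in my-python-functions.py. A minimal sketch of one way to write it is a single gradient step on the penalized least-squares objective for the current batch; the argument names follow the call above, but the body is an assumption.

import numpy as np

# Sketch only: one gradient-descent step on
# (1/m) * ||Y - X b||^2 + lam * ||b||^2 for a batch of m rows;
# the constant factor 2 in the gradient is absorbed into alpha.
def my_ridge_py_by_batch(X_batch, Y_batch, betas, lam=1.0, alpha=0.01):
    m = X_batch.shape[0]
    grad = X_batch.T @ (X_batch @ betas - Y_batch) / m + lam * betas
    return betas - alpha * grad

Each call refines betas using only the rows currently in memory, which is what makes the fit out-of-core.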

3. Implement your own LASSO regression function in Python. Show that the results are the same as the function implemented in the casl package.

# data simulation
n <- 1000
p <- 4
b_true <- c(2.0, -3.0, 5.0, 4.0)
X <- matrix(rnorm(n*p, mean = 0, sd = 1), nrow = n, ncol = p)
Y <- X %*% b_true + rnorm(n, mean = 0, sd = 1)

# result with casl package
# my R version is too old to install the casl package, so I source its functions directly.
source("casl.R") 

lasso_casl_beta <- casl_lenet(X, Y, lambda = 0.01, alpha = 0, 
                              b=matrix(0, nrow=ncol(X), ncol=1), 
                              tol = 1e-10, maxit=1000L, 
                              W=rep(1, length(Y))/length(Y))
lasso_casl_beta
import numpy as np

def my_lasso_py(X, Y, lam = 1.0, maxit=10000, alpha=0.001, tol = 1e-10):

    """ Return beta coefficients of a lasso regression model,
    fit by (sub)gradient descent.

    Args:
        X: design matrix
        Y: outcome matrix with one column
        lam: the tuning parameter lambda in lasso regression
        maxit: maximum number of iterations
        alpha: learning rate
        tol: convergence tolerance

    Returns:
        beta coefficients of the lasso regression model
    """

    p = X.shape[1]
    betas = np.random.randn(p, 1)  # random initialization

    for t in range(maxit):
        betas_old = betas.copy()
        Y_hat = X.dot(betas)
        # subgradient of ||Y - X b||^2 / 2 + lam * ||b||_1,
        # using sign(b) as the subgradient of |b|
        delta = X.T.dot(Y_hat - Y) + lam*(np.sign(betas))
        betas = betas - alpha*delta

        if np.sum(np.abs(betas - betas_old)) <= tol:
            break

    return betas


n = 1000
p = 4
b_true = np.array([[2.0, -3.0, 5.0, 4.0]]).T
X = np.random.randn(n, p)
Y = X.dot(b_true) + np.random.randn(n, 1)

# result with python function
lasso_py_beta = my_lasso_py(X, Y, lam = 0.01, maxit=1000, alpha=0.001, tol = 1e-10)
print(lasso_py_beta)

From the comparison above, the estimates from my Python function and the casl package are very close.
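For context, casl_lenet solves the penalized problem by cyclic coordinate descent with a soft-thresholding update, while my function above takes (sub)gradient steps; both target the same lasso objective. Below is a minimal Python sketch of the coordinate-descent style of update, my own sketch rather than the package's code.

import numpy as np

def soft_threshold(a, b):
    # sign(a) * max(|a| - b, 0)
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

# Sketch only: cyclic coordinate descent for
# (1 / (2n)) * ||Y - X b||^2 + lam * ||b||_1.
def lasso_cd(X, Y, lam=0.01, maxit=1000, tol=1e-10):
    n, p = X.shape
    y = Y.ravel()
    betas = np.zeros(p)
    for _ in range(maxit):
        betas_old = betas.copy()
        for j in range(p):
            # partial residual with coordinate j removed
            r = y - X @ betas + X[:, j] * betas[j]
            rho = X[:, j] @ r / n
            betas[j] = soft_threshold(rho, lam) / (X[:, j] @ X[:, j] / n)
        if np.sum(np.abs(betas - betas_old)) <= tol:
            break
    return betas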

4. Propose a final project for the class.

My project comes from a Kaggle competition: https://www.kaggle.com/c/nlp-getting-started/overview. I will use NLP techniques to predict whether a tweet describes a real disaster based on its content. I will apply several classification algorithms and evaluate their performance. Hopefully, I can modify some existing methods to improve on them in areas such as algorithm efficiency and prediction accuracy.


