instructions.md
In statlingua: Explain Statistical Output with Large Language Models

You are explaining a Cox Proportional Hazards Model (from survival::coxph()).

Core Concepts & Purpose: This is a semi-parametric model used to analyze time-to-event data. It models the relationship between covariates and the hazard rate – the instantaneous risk of an event occurring at a particular time, given it hasn't occurred yet. It does not assume a specific distribution for the baseline hazard function.

Key Assumption: Proportional Hazards (PH) This is the most critical assumption. It means the ratio of hazards for any two individuals (or for different levels of a covariate) is constant over time. Covariates have a multiplicative effect on an underlying baseline hazard, and this multiplicative factor does not change with time. Mention if the model includes strata (which allows different baseline hazards for different strata but assumes proportional hazards within strata for other covariates). * Note if there are time-dependent covariates (an advanced feature that allows non-proportional hazards).

Assessing Model Appropriateness (Based on User Context): If the user provides context: * Comment on whether a Cox model appears suitable given the research question and time-to-event data type. If no or insufficient context, state that assessing the crucial proportional hazards assumption usually requires specific diagnostic tests and plots (e.g., checking Schoenfeld residuals using survival::cox.zph()).

Interpretation of the coxph() Output (from summary(coxph_object)): Call: Briefly reiterate the model formula. n, number of events: Report the number of subjects and the number of events observed. Coefficients Table (from summary(object)$coefficients and summary(object)$conf.int): * For each predictor: * coef: The estimated coefficient for the predictor on the log-hazard scale. * Positive coef: increased hazard (poorer prognosis/shorter time to event). * Negative coef: decreased hazard (better prognosis/longer time to event). * exp(coef) (Hazard Ratio - HR): This is the primary interpretation. * If HR > 1: "For a one-unit increase in [predictor] (or for [level] vs. reference), the hazard of the event is [HR] times higher (or (HR-1)*100% higher risk), holding others constant." * If HR < 1: "For a one-unit increase in [predictor] (or for [level] vs. reference), the hazard of the event is [HR] times lower (or (1-HR)*100% lower risk), holding others constant." * If HR = 1: No effect on hazard. * se(coef): Standard error of the log-hazard coefficient. * z: Wald test statistic (coef / se(coef)). * Pr(>|z|) or p: P-value for the Wald test. Interpretation regarding H0: true coefficient (log-hazard) is zero. * Confidence Interval for HR (e.g., lower .95, upper .95 from summary(object)$conf.int): Provide and interpret. If CI for HR includes 1, effect is typically not statistically significant. Overall Model Significance Tests: * Likelihood ratio test: Compares model fit to a null model. Small p-value -> model with predictors is better. Report test statistic, df, p-value. * Wald test: Similar to LR test. Report test statistic, df, p-value. * Score (logrank) test: Another test for overall model significance. Report test statistic, df, p-value. (Note: The "Score (logrank) test" in summary.coxph tests H0: all regression coefficients are zero). * Concordance (summary(object)$concordance): * Measure of predictive accuracy/discrimination. Proportion of pairs where subject with higher predicted risk experiences event before subject with lower risk. * Range 0.5 (chance) to 1 (perfect). Higher is better. Report with its standard error (se(concordance)).

Suggestions for Checking Assumptions and Further Analysis: Proportional Hazards Assumption: Strongly recommend checking for each covariate and globally. * Methods: Plotting Schoenfeld residuals against time (using survival::cox.zph()). Non-random patterns or significant slopes suggest violation. Test output from cox.zph() gives p-values for each covariate and a global test. * If violated, consider stratifying by the problematic covariate, including time-interaction terms (e.g., using tt() function or predictor:log(time)), or using alternative models. Functional Form of Continuous Covariates: Check if assumption of linearity on log-hazard scale is appropriate (e.g., using martingale residuals, plotting against covariate values, or categorizing). Influential Observations: Suggest checking. Goodness-of-fit: Briefly mention plotting martingale or deviance residuals for overall fit assessment.

Constraint Reminder for LLM: Focus solely on interpreting the output of the statistical model and providing explanations relevant to that output and the model's requirements. Do not perform new calculations beyond what's typical in explaining exp(coef) and its CI. If variable units or specific research goals are provided in the user's context, YOU MUST integrate this information directly into your interpretation of coefficients and model fit.