  • Calculating RMSE or adjusted R2 after k-fold cross-validation for linear regression

    I would like to build a prediction model to see whether age, sex, hand grip strength, and physical function can predict activities of daily living (ADL). My outcome, ADL, is a continuous variable, and the other four variables are my predictors. I fit a linear regression and would then like to check the model's prediction performance using RMSE or R2 after k-fold cross-validation. However, since I am a new Stata user, I am struggling to make it work. I am not sure whether this code is correct. Also, I get an "unknown syntax" error after the loop, which I am not sure how to fix. Any help with this, or better code, would be appreciated. Thanks.
    I have also attached the .do file.


    // Step 1: Generate simulated dataset
    clear
    set seed 12345 // for reproducibility

    // Simulate age, sex, hand grip strength, physical function, and ADL
    set obs 1000
    gen age = rnormal(50, 10)
    gen sex = rbinomial(1, 0.5)
    gen grip_strength = rnormal(30, 5)
    gen physical_function = rnormal(50, 10)
    gen ADL = 10 + 0.3*age - 2*sex + 0.1*grip_strength + 0.5*physical_function + rnormal(0, 5)

    // Summarize the dataset
    summarize

    // Step 2: Perform linear regression with k-fold cross-validation
    * Set the number of folds for cross-validation
    local k = 5

    * Initialize variables to store cross-validation results
    matrix results = J(`k', 2, .)

    * Split the data into k folds
    preserve
    tempfile folds
    egen fold = seq(), block(`k')
    save "`folds'"

    * Perform k-fold cross-validation
    forvalues i = 1/`k' {
        use "`folds'", clear
        drop if fold == `i'

        regress ADL age sex grip_strength physical_function

        predict pred`i'

        summarize ADL pred`i', meanonly
        matrix results[`i', 1] = sqrt(r(sum_w_sq)) // Store RMSE

        corr ADL pred`i'
        matrix results[`i', 2] = r(rho)^2 // Store R-squared
    }

    * Calculate average RMSE and R-squared
    // Initialize variables to store sum of RMSE and R-squared
    scalar sum_rmse = 0
    scalar sum_r2 = 0

    // Calculate sum of RMSE and R-squared
    forval i = 1/`k' {
        scalar sum_rmse = sum_rmse + results[`i',1]
        scalar sum_r2 = sum_r2 + results[`i',2]
    }

    // Calculate average RMSE and R-squared
    scalar avg_rmse = sum_rmse / `k'
    scalar avg_r2 = sum_r2 / `k'

    // Display average RMSE and R-squared
    di "Average RMSE: " avg_rmse
    di "Average R-squared: " avg_r2
    Last edited by Niko keys; 05 Apr 2024, 08:29.

  • #2
    Hi Niko
    If you are open to using other commands, I would suggest `cv_kfold` from SSC.

    You can use it as follows:

    Code:
    qui:regress ADL age sex grip_strength physical_function
    set seed 1
    cv_kfold
    k-fold Cross validation
    Number of Folds     :          5
    Number of Repetions :          1
    Avg Root Mean SE    :    5.04613
    And if you are using regression only, you could also do the following:

    Code:
    qui:regress ADL age sex grip_strength physical_function
     
    ssc install cv_regress
    cv_regress
    
    
    Leave-One-Out Cross-Validation Results 
    -----------------------------------------
             Method          |    Value
    -------------------------+---------------
    Root Mean Squared Errors |       5.0352
    Log Mean Squared Errors  |       3.2329
    Mean Absolute Errors     |       4.0286
    Pseudo-R2                |      0.55606
    -----------------------------------------
    HTH



    • #3
      Hi Fernando, many thanks for your great response. This is certainly an efficient way. One question: can I use similar code for logistic regression or Cox regression? Do you have a code example for illustration?



      • #4
        cv_kfold should work with logistic regression. In that case you compare the log-likelihood. However, the command does not know how to deal with Cox regression.
        If you can identify the statistic that should be used to measure fit, you can modify cv_kfold to work with Cox regression too.
        F
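
        To make the log-likelihood comparison concrete, here is a minimal sketch that computes an out-of-fold log-likelihood for a logit model by hand, using only built-in commands. It is not how cv_kfold is implemented. The binary outcome low_adl, its cutoff of 50, and the names u_logit, fold_logit, cv_pr, and cv_ll are assumptions made up for illustration, built on the simulated data from #1. Run it as one block from a do-file so the local macros stay defined.

        ```stata
        * Hypothetical binary outcome, created only so there is something to fit
        gen byte low_adl = ADL < 50

        * Assign each observation to one of k folds at random
        set seed 12345
        local k = 5
        gen double u_logit = runiform()
        sort u_logit
        gen fold_logit = mod(_n - 1, `k') + 1

        * Fit on k-1 folds, predict the probability for the held-out fold
        gen double cv_pr = .
        forvalues i = 1/`k' {
            quietly logit low_adl age sex grip_strength physical_function if fold_logit != `i'
            tempvar p
            quietly predict double `p' if fold_logit == `i', pr
            quietly replace cv_pr = `p' if fold_logit == `i'
            drop `p'
        }

        * Log-likelihood contribution of each held-out observation, then the total
        gen double cv_ll = low_adl*ln(cv_pr) + (1 - low_adl)*ln(1 - cv_pr)
        quietly summarize cv_ll
        display "Out-of-fold log-likelihood: " %9.3f r(sum)
        ```

        A higher (less negative) total indicates better out-of-sample fit when two models are compared on the same outcome and the same folds.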



        • #5
          Hi Fernando, thank you. How can I modify the code to also get other prediction-performance metrics, such as adjusted R2? And do you know whether the root mean squared error of about 5 in the example is good or not? Thanks for your great help.



          • #6
            I'm not aware of an adjusted R2 for cross-validation. But if you look at the model's predicted values and the observed values, you can construct any goodness-of-fit statistic you want.
            On your second point: the MSE is useful for comparing models, but on its own it says little about whether a given model is good.
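
            As an illustration of building statistics from observed and predicted values, here is a minimal sketch of a manual 5-fold cross-validation for the linear model in #1, using only built-in commands. Unlike the loop in #1, each model is fitted on the training folds and used to predict only the held-out fold. The names u, fold, cv_pred, sq_err, and abs_err are made up for this example, and the squared correlation is just one possible out-of-sample R2 measure, not the one cv_regress reports. Run it as one block from a do-file so the local macros stay defined.

            ```stata
            * Assign each observation to one of k folds at random
            set seed 12345
            local k = 5
            gen double u = runiform()
            sort u
            gen fold = mod(_n - 1, `k') + 1

            * Out-of-fold predictions: fit on k-1 folds, predict the held-out fold
            gen double cv_pred = .
            forvalues i = 1/`k' {
                quietly regress ADL age sex grip_strength physical_function if fold != `i'
                tempvar xb
                quietly predict double `xb' if fold == `i', xb
                quietly replace cv_pred = `xb' if fold == `i'
                drop `xb'
            }

            * Any fit statistic can now be built from ADL and cv_pred
            gen double sq_err = (ADL - cv_pred)^2
            gen double abs_err = abs(ADL - cv_pred)

            quietly summarize sq_err, meanonly
            local rmse = sqrt(r(mean))
            quietly summarize abs_err, meanonly
            local mae = r(mean)
            quietly correlate ADL cv_pred
            local r2 = r(rho)^2

            display "CV RMSE: " %9.4f `rmse'
            display "CV MAE: " %9.4f `mae'
            display "CV squared correlation: " %9.4f `r2'
            ```

            Because the RMSE is in the units of ADL, it is easiest to read next to the spread of the outcome itself, for example the standard deviation reported by summarize ADL.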
