I would like to build a prediction model to see if age, sex, hand grip strength, physical function can predict activities of daily living (ADL). So, my outcome is ADL a continuous variable and the other 4 variables are my predictors. I build a linear regression and then would like to check the prediction performance of my model using RMSE or R2 after k-fold cross-validation. However, since I am a new stata user, I am struggling to make it to work. I am not sure if this codes are correct. Also, I get an error after loop as "unknown syntax" which I am not sure how to fix it. Any help with this or better codes will be appreciated. thanks.
I have also attached the .do file.
/ Step 1: Generate simulated dataset
clear
set seed 12345 // for reproducibility
// Simulate age, sex, hand grip strength, physical function, and ADL
set obs 1000
gen age = rnormal(50, 10)
gen sex = rbinomial(1, 0.5)
gen grip_strength = rnormal(30, 5)
gen physical_function = rnormal(50, 10)
gen ADL = 10 + 0.3*age - 2*sex + 0.1*grip_strength + 0.5*physical_function + rnormal(0, 5)
// Summarize the dataset
summarize
// Step 2: Perform linear regression with k-fold cross-validation
* Set the number of folds for cross-validation
local k = 5
* Initialize variables to store cross-validation results
matrix results = J(`k', 2, .)
* Split the data into k folds
preserve
tempfile folds
egen fold = seq(), block(`k')
save "`folds'"
* Perform k-fold cross-validation
forvalues i = 1/`k' {
use "`folds'", clear
drop if fold == `i'
regress ADL age sex grip_strength physical_function
predict pred`i'
summarize ADL pred`i', meanonly
matrix results[`i', 1] = sqrt(r(sum_w_sq)) // Store RMSE
corr ADL pred`i'
matrix results[`i', 2] = r(rho)^2 // Store R-squared
}
* Calculate average RMSE and R-squared
// Initialize variables to store sum of RMSE and R-squared
scalar sum_rmse = 0
scalar sum_r2 = 0
// Calculate sum of RMSE and R-squared
forval i = 1/`k' {
scalar sum_rmse = sum_rmse + results[`i',1]
scalar sum_r2 = sum_r2 + results[`i',2]
}
// Calculate average RMSE and R-squared
scalar avg_rmse = sum_rmse / `k'
scalar avg_r2 = sum_r2 / `k'
// Display average RMSE and R-squared
di "Average RMSE: " avg_rmse
di "Average R-squared: " avg_r2
I have also attached the .do file.
/ Step 1: Generate simulated dataset
clear
set seed 12345 // for reproducibility
// Simulate age, sex, hand grip strength, physical function, and ADL
set obs 1000
gen age = rnormal(50, 10)
gen sex = rbinomial(1, 0.5)
gen grip_strength = rnormal(30, 5)
gen physical_function = rnormal(50, 10)
gen ADL = 10 + 0.3*age - 2*sex + 0.1*grip_strength + 0.5*physical_function + rnormal(0, 5)
// Summarize the dataset
summarize
// Step 2: Perform linear regression with k-fold cross-validation
* Set the number of folds for cross-validation
local k = 5
* Initialize variables to store cross-validation results
matrix results = J(`k', 2, .)
* Split the data into k folds
preserve
tempfile folds
egen fold = seq(), block(`k')
save "`folds'"
* Perform k-fold cross-validation
forvalues i = 1/`k' {
use "`folds'", clear
drop if fold == `i'
regress ADL age sex grip_strength physical_function
predict pred`i'
summarize ADL pred`i', meanonly
matrix results[`i', 1] = sqrt(r(sum_w_sq)) // Store RMSE
corr ADL pred`i'
matrix results[`i', 2] = r(rho)^2 // Store R-squared
}
* Calculate average RMSE and R-squared
// Initialize variables to store sum of RMSE and R-squared
scalar sum_rmse = 0
scalar sum_r2 = 0
// Calculate sum of RMSE and R-squared
forval i = 1/`k' {
scalar sum_rmse = sum_rmse + results[`i',1]
scalar sum_r2 = sum_r2 + results[`i',2]
}
// Calculate average RMSE and R-squared
scalar avg_rmse = sum_rmse / `k'
scalar avg_r2 = sum_r2 / `k'
// Display average RMSE and R-squared
di "Average RMSE: " avg_rmse
di "Average R-squared: " avg_r2
Comment