Predicting values using Simple Linear Regression on Categorical Data

Simwinga Simwinga

Join Date: Apr 2022

Posts: 36
#1

Predicting values using Simple Linear Regression on Categorical Data

17 Apr 2023, 21:06

Hi everyone,

I have student scores for two test components (Component1 & Component2) from 485 schools. Using a linear regression model of Component1 on Component2, I want to predict scores for students who missed Component1 or whose marks were not recorded. The regression equation will be school-specific and used to generate predicted scores, recorded in a variable called "Predicted_Scores." The plan is for Stata to regress Component1 on Component2, predict missing Component1 scores for each school one-by-one, and store the predictions in a single variable named "Predicted_Scores." I've written the code below to accomplish this, but Stata is applying the same regression coefficients to all schools. Can you assist me in resolving this issue?

Code:

gen Predicted_Scores = .
forvalues i = 1/ 485 {
regress Component1 Component2 i.School_Code if School_Code == `i'
predict predicted_values, xb
replace Predicted_Scores = predicted_values if School_Code == `i'
}

Last edited by Simwinga Simwinga; 17 Apr 2023, 21:09.
Tags: categorical, foreach, loop, regression
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#2

17 Apr 2023, 21:40

I do not understand how this code can run at all past the first (`i' == 1) iteration. On the first iteration, you create a variable, predicted_values. When you come around to `i' == 2, the -predict- command will terminate execution with an error message because the variable predicted_values already exists. So I do not see how you can be getting the results you describe, or any results at all except when School_Code == `i'.

For more specific advice, please post back showing the exact code you are running. Copy/paste it from your do-file, log file, or Results window and do not edit it in any way--there is no such thing as a minor change. Also show the complete and exact output that Stata is giving you. Finally, use the -dataex- command to show example data. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Simwinga Simwinga

Join Date: Apr 2022

Posts: 36
#3

18 Apr 2023, 01:56

You are right, the execution is terminated on the first iteration (`i' == 1001) and an error message "variable predicted_values already defined" pops up. However, I want the execution to run through all the values of `i' == 1001/9962. The code I am running is as follows:

Code:

* gen Predicted_Scores = .
forvalues i = 1001/9962 {
regress Component1 Component2 i.School_Code if School_Code == `i'
predict predicted_values, xb
replace Predicted_Scores = predicted_values if School_Code == `i'
}
end

I have copied it as it is in my do-file. I should state that I am just learning programming in Stata; I am not a pro. Kindly advise me on how to write code that will produce my desired results.

Last edited by Simwinga Simwinga; 18 Apr 2023, 02:19.

Simwinga Simwinga

Join Date: Apr 2022
Posts: 36

18 Apr 2023, 02:15

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input long Id int(School_Code Subject_Code) byte(Component1 Component2)
 1001 4024 13  0
 1001 4024  6  1
 1001 4024  3  5
 1001 4024  2  5
 1001 4024  3  5
 1001 4024 19  6
 1001 4024  5  8
 1001 4024  0  8
 1001 4024  2  9
 1001 4024  2  9
 1001 4024 12  9
 1001 4024  8 10
 1001 4024  0 10
 1001 4024  1 10
 1001 4024  6 12
 1001 4024  6 12
 1001 4024  7 14
 1001 4024  4 14
 1001 4024  2 14
 1001 4024  8 15
 1001 4024 11 15
 1001 4024 21 15
 1001 4024  8 15
 1001 4024  2 16
 1001 4024  8 17
 1001 4024  4 17
 1001 4024  1 18
 1001 4024 27 18
 1001 4024 15 18
 1001 4024  7 19
 1001 4024  7 19
 1001 4024  6 19
 1001 4024  6 20
 1001 4024 37 20
 1001 4024 11 21
 1001 4024 18 21
 1001 4024  5 22
 1001 4024  6 22
 1001 4024 22 23
 1001 4024  4 23
 1001 4024 15 23
 1001 4024 16 23
 1001 4024  7 24
 1001 4024 21 24
 1001 4024  3 25
 1001 4024  5 25
 1001 4024 16 26
 1001 4024 15 26
 1001 4024 13 26
 1001 4024  8 26
 1001 4024  4 27
 1001 4024 15 27
 1001 4024 13 27
 1001 4024 11 27
 1001 4024 19 27
 1001 4024 17 28
 1001 4024 13 28
 1001 4024 16 28
 1001 4024 21 29
 1001 4024 10 29
 1001 4024 10 29
 1001 4024  5 29
 1001 4024 30 30
 1001 4024  6 30
 1001 4024 17 31
 1001 4024 18 31
 1001 4024 31 31
 1001 4024 10 31
 1001 4024 21 31
 1001 4024 24 32
 1001 4024  6 32
 1001 4024 19 32
 1001 4024 11 32
 1001 4024 14 32
 1001 4024 16 32
 1001 4024 14 33
 1001 4024 16 33
 1001 4024  9 33
 1001 4024 21 34
 1001 4024 14 34
 1001 4024 19 34
 1001 4024 15 34
 1001 4024 17 35
 1001 4024 11 35
 1001 4024 10 35
 1001 4024  7 36
 1001 4024 19 36
 1001 4024  9 36
 1001 4024 12 37
 1001 4024 13 37
 1001 4024 14 38
 1001 4024 15 39
 1001 4024 23 39
 1001 4024 15 39
 1001 4024 15 39
 1001 4024 12 39
 1001 4024 25 40
 1001 4024 19 40
 1001 4024 25 40
 1001 4024 28 40
end


Andrew Musau

Join Date: Oct 2014
Posts: 10223

18 Apr 2023, 05:44

-predict- creates a new variable, so you need to drop it at the end of each iteration before the next iteration creates it again. Otherwise, give each prediction a distinct name (one per iteration).

Code:

gen Predicted_Scores = .
forvalues i = 1001/9962 {
    regress Component1 Component2 i.School_Code if School_Code == `i'
    predict predicted_values, xb
    replace Predicted_Scores = predicted_values if School_Code == `i'
    drop predicted_values
}

Code:

gen Predicted_Scores = .
forvalues i = 1001/9962 {
    regress Component1 Component2 i.School_Code if School_Code == `i'
    predict predicted_values`i', xb
    replace Predicted_Scores = predicted_values`i' if School_Code == `i'
}


Simwinga Simwinga

Join Date: Apr 2022

Posts: 36
#6

18 Apr 2023, 08:31

Many thanks, Andrew Musau, for your prompt response. After executing both codes, I have noticed some progress. The codes are now capable of running beyond the second iteration (`i' == 1002) before stopping and displaying an error message stating "no observations." Would it be possible for me to share my dataset with you so that you can try running the codes?
Clyde Schechter

Join Date: Apr 2014

Posts: 30121
#7

18 Apr 2023, 09:39

This means that at some point Stata encountered a School_Code for which, after removing observations with missing values for either Component1 or Component2, there were no observations left to calculate a regression. Remember that in any estimation command, observations with missing values of any variable mentioned in the command are excluded. Sometimes the pattern of missing data means that nothing is left, or not enough observations are left to do the regression.
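
One way to see which schools are affected is to count the complete cases per school before running any regression. The following is only a sketch: the toy data are made up for illustration, and n_complete and first_in_school are hypothetical variable names that do not appear in the original data set.

```stata
* Toy data (hypothetical values), for illustration only
clear
input int School_Code byte(Component1 Component2)
1001  5 10
1001  9 20
1001  . 30
1002  .  8
1002  7  .
end

* Complete cases per school: both Component1 and Component2 non-missing
egen n_complete = total(!missing(Component1, Component2)), by(School_Code)

* Flag one observation per school, then list the schools where fewer
* than 2 complete cases remain for the regression
egen byte first_in_school = tag(School_Code)
list School_Code n_complete if first_in_school & n_complete < 2, noobs
```

Note that a check like this only covers schools that are present in the data; a School_Code value that never occurs at all will not show up in the list.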

There are two possibilities here. One is that this condition should never arise in your data. In that case, it means that your data set is incorrect. You will need to review the data management that created it, and fix whatever errors led to this malformed data set.

But it may be that the data set can be reasonably expected to have schools where there is no, or insufficient, data to regress. In that case, the following code will allow Stata to skip over those, providing a message but continuing on. The same code will, however, stop execution if any other, unanticipated error condition is found.

Code:

gen Predicted_Scores = .
forvalues i = 1001/9962 {
    capture regress Component1 Component2 i.School_Code if School_Code == `i'
    if c(rc) == 0 { // SUCCESSFUL REGRESSION
        quietly predict predicted_values, xb
        quietly replace Predicted_Scores = predicted_values if School_Code == `i'
        drop predicted_values
    }
    else if inlist(c(rc), 2000, 2001) { // REGRESSION NOT POSSIBLE; INSUFFICIENT DATA
        display in red "Insufficient Observations: School_Code `i'"
    }
    else { // UNANTICIPATED PROBLEM
        display in red "Unexpected Regression Failure: School_Code `i'"
        exit c(rc)
    }
}
Simwinga Simwinga

Join Date: Apr 2022

Posts: 36
#8

19 Apr 2023, 01:24

Thank you, Clyde Schechter, for providing the code. However, I encountered errors when I ran it. As per your advice, I reviewed the data and realized that some School_Code values did not follow the natural number sequence pattern; there were some missing values. After modifying the data to ensure that the School_Code values followed the natural number sequence, the code worked perfectly well. Once again, thank you for the assistance.
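
An alternative to renumbering School_Code, offered as a sketch rather than as the approach used in this thread, is to loop only over the codes that actually occur in the data by using -levelsof-. The toy data below are made up, and the names schools and xb_tmp are hypothetical. The sketch also leaves out i.School_Code, on the assumption that within a single school the indicator is constant and so contributes nothing to the regression.

```stata
* Toy data (hypothetical values): two schools with non-consecutive codes
clear
input int School_Code byte(Component1 Component2)
1001  5 10
1001  9 20
1001 14 30
1001  . 25
1005  8 10
1005 12 20
1005 20 40
1005  . 30
end

gen Predicted_Scores = .

* Loop over the codes that occur, so gaps in the numbering do not matter
levelsof School_Code, local(schools)
foreach s of local schools {
    capture regress Component1 Component2 if School_Code == `s'
    if c(rc) == 0 {
        * School-specific fitted values, kept only for that school
        quietly predict double xb_tmp if School_Code == `s', xb
        quietly replace Predicted_Scores = xb_tmp if School_Code == `s'
        drop xb_tmp
    }
    else if inlist(c(rc), 2000, 2001) {
        display as error "Insufficient observations: School_Code `s'"
    }
    else {
        exit c(rc)
    }
}
```

Because the loop never visits a code that is absent from the data, the "no observations" error can then only arise from missing values within a school, which the -capture- branch reports and skips.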
