How to Correct Data Errors in Longitudinal Format Data in Stata?

smith Jason

Join Date: Sep 2020

Posts: 380
#1

25 Jul 2022, 16:36

I have a long-format dataset with errors in the variables kg5, k68, and k912, like this:
clear
input byte (id year gr kg5 k68 k912)
1 1 0 0 0 0
1 2 1 0 0 0
1 3 2 0 0 0
1 4 3 0 0 0
1 5 4 0 0 0
1 6 5 0 0 0
1 7 6 0 0 0
1 8 7 0 0 0
1 9 8 0 0 0
1 10 9 0 0 0
1 11 . 0 0 0
1 12 9 0 0 0
1 13 10 0 0 0
2 1 0 0 0 0
2 2 1 0 0 0
2 3 2 0 0 0
2 4 3 0 0 0
2 5 4 0 0 0
2 6 5 0 0 0
2 7 6 0 0 0
2 8 7 0 0 0
2 9 8 0 0 0
2 10 9 0 0 0
2 11 10 0 0 0
2 12 . 0 0 0
2 13 9 0 0 0
3 1 0 0 0 0
3 2 . 0 0 0
3 3 . 0 0 0
3 4 . 0 0 0
3 5 . 0 0 0
3 6 . 0 0 0
3 7 . 0 0 0
3 8 . 0 0 0
3 9 . 0 0 0
3 10 9 0 0 0
3 11 . 0 0 0
3 12 . 0 0 0
3 13 9 0 0 0
4 1 0 0 0 0
4 2 . 0 0 0
4 3 . 0 0 0
4 4 . 0 0 0
4 5 . 0 0 0
4 6 . 0 0 0
4 7 . 0 0 0
4 8 . 0 0 0
4 9 8 0 0 0
4 10 . 0 0 0
4 11 10 0 0 0
4 12 9 0 0 0
4 13 10 0 0 0
5 1 0 0 0 0
5 2 1 0 0 0
5 3 2 0 0 0
5 4 3 0 0 0
5 5 4 0 0 0
5 6 . 0 0 0
5 7 4 0 0 0
6 1 0 0 0 0
6 2 1 0 0 0
6 3 2 0 0 0
6 4 3 0 0 0
6 5 4 0 0 0
6 6 5 0 0 0
6 7 6 0 0 0
6 8 . 0 0 0
6 9 . 0 0 0
6 10 . 0 0 0
6 11 6 0 0 0
end

As the dataset shows, all values of the variables that start with "k" are zero. However, that is not entirely correct.
The correct rule is:
When "gr" repeats a grade within id, the corresponding variable starting with "k" should equal 1.
For example, for the person with id==1, gr repeats the value 9 in year 12, so k912 should be 1 (because the student was retained somewhere in grades 9 to 12).
Likewise, for the person with id==6, gr repeats the value 6 in year 11, so k68 should be 1 (because the student was retained somewhere in grades 6 to 8).
How can I write Stata code to correct these data errors?
Thank you for your help!
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#2

25 Jul 2022, 18:07

Code:

xtset id year
gen last_gr = gr if year == 1, after(gr)
replace last_gr = cond(missing(gr), L1.last_gr, gr) if year > 1
gen byte left_back = gr < L1.last_gr + 1 & year > 1, after(last_gr)
replace kg5 = 1 if left_back & inrange(gr, 0, 5)
replace k68 = 1 if left_back & inrange(gr, 6, 8)
replace k912 = 1 if left_back & inrange(gr, 9, 12)
foreach v of varlist kg5 k68 k912 {
    by id (`v'), sort: replace `v' = `v'[_N]
}
sort id year

Note: It is unclear from your question whether you want kg5, k68, and k912 set to 1 just in the year that they are left back, or set to 1 for that id's entire block of observations (so that it is an attribute of the id itself). I'm assuming you want the latter. If you don't want that, then leave out everything from -foreach- to the end of the code.
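To make the two readings in the note concrete, here is a minimal sketch on the last five years of id 1 from #1. It is not the code above: it swaps the L1. operator for by-group subscripting, which gives the same answer only if every id has one observation per consecutive year (an assumption). The names k912_year and k912_any are hypothetical and used only for illustration.

```stata
* Sketch only: last_gr and left_back follow the thread; k912_year and
* k912_any are hypothetical names used for illustration.
clear
input byte (id year gr)
1  9  8
1 10  9
1 11  .
1 12  9
1 13 10
end

* Carry the last known grade forward within each id
bysort id (year): gen last_gr = gr
by id: replace last_gr = last_gr[_n-1] if missing(last_gr) & _n > 1

* Left back: grade is known and does not exceed the previous known grade
by id: gen byte left_back = _n > 1 & gr <= last_gr[_n-1]

* Year-level flag, then the same flag spread over the whole id
gen byte k912_year = left_back & inrange(gr, 9, 12)
by id: egen byte k912_any = max(k912_year)

list, sepby(id) noobs
```

In this sketch k912_year is 1 only in year 12, where grade 9 is repeated, while k912_any is 1 in every row for id 1. Which of the two is wanted depends on whether retention is treated as an event in a year or as an attribute of the student.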
smith Jason

Join Date: Sep 2020

Posts: 380
#3

26 Jul 2022, 07:53

Professor, Thank you!
For the second line, I found that the results are the same whether or not "after(gr)" is added.
So, I don't understand why professor used it. Could you explain?
Thank you!
Rich Goldstein

Join Date: Mar 2014

Posts: 4546
#4

26 Jul 2022, 08:41

This is fully explained in the help file; see

Code:

h gen
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#5

26 Jul 2022, 10:12

So, I don't understand why professor used it. Could you explain?

You are correct in noticing that specifying the -after()- option has no effect on the results. I did it because I developed this code one line at a time and wanted to check that each line was working the way I expected it to. It's easier for me, at least, to do that if gr, last_gr, and left_back are all right next to each other when I open the browser to look. So I added the -after()- options to make it so. I often do this when developing code. Usually, I go back and remove those inessential elements before I post it here. In this case, I forgot to do that.
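A small sketch of that point, assuming a Stata version in which -generate- accepts -after()-, as the code in #2 does. The variables a and b are hypothetical: a is created with -after(gr)-, b is created without it and then moved with -order-, and both hold identical values.

```stata
* Sketch only: a and b are hypothetical variables used for illustration.
clear
input byte (id year gr kg5)
1 1 0 0
1 2 1 0
end

* Same expression, with and without after()
gen byte a = gr + 1, after(gr)
gen byte b = gr + 1

* The values agree; only the column position differed
assert a == b

* -order- moves an existing variable to the same place
order b, after(a)
ds
```

So -after()- changes only where the new column sits in the dataset, which matters for browsing but not for any later calculation.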
smith Jason

Join Date: Sep 2020

Posts: 380
#6

26 Jul 2022, 10:43

Thank you, Professor!
smith Jason

Join Date: Sep 2020

Posts: 380
#7

28 Jul 2022, 12:24

Professor Clyde,
I have to trouble you again because I found that the data I provided above were not quite correct. The correct data should look as follows:
clear
input byte (id year gr kg5 k68 k912)
1 1 0 0 0 0
1 2 1 0 0 0
1 3 2 0 0 0
1 4 3 0 0 0
1 5 4 0 0 0
1 6 5 0 0 0
1 7 6 0 0 0
1 8 7 0 0 0
1 9 8 0 0 0
1 10 9 0 0 0
1 11 . 0 0 0
1 12 9 0 0 0
1 13 10 0 0 0
2 1 0 0 0 0
2 2 1 0 0 0
2 3 2 0 0 0
2 4 3 0 0 0
2 5 4 0 0 0
2 6 5 0 0 0
2 7 6 0 0 0
2 8 7 0 0 0
2 9 8 0 0 0
2 10 9 0 0 0
2 11 10 0 0 0
2 12 . 0 0 0
2 13 9 0 0 0
3 1 0 0 0 0
3 2 . 0 0 0
3 3 . 0 0 0
3 4 . 0 0 0
3 5 . 0 0 0
3 6 . 0 0 0
3 7 . 0 0 0
3 8 . 0 0 0
3 9 . 0 0 0
3 10 9 0 0 0
3 11 . 0 0 0
3 12 . 0 0 0
3 13 9 0 0 0
4 1 0 0 0 0
4 2 . 0 0 0
4 3 . 0 0 0
4 4 . 0 0 0
4 5 . 0 0 0
4 6 . 0 0 0
4 7 . 0 0 0
4 8 . 0 0 0
4 9 8 0 0 0
4 10 . 0 0 0
4 11 10 0 0 0
4 12 9 0 0 0
4 13 10 0 0 0
5 1 0 0 0 0
5 2 1 0 0 0
5 3 2 0 0 0
5 4 3 0 0 0
5 5 4 0 0 0
5 6 . 0 0 0
5 7 4 0 0 0
6 1 0 0 0 0
6 2 1 0 0 0
6 3 2 0 0 0
6 4 3 0 0 0
6 5 4 0 0 0
6 6 5 0 0 0
6 7 6 0 0 0
6 8 8 0 0 0
6 9 . 0 0 0
6 10 7 0 0 0
end

I know your original code does not work with these data, but after several attempts I still don't know how to write code that fixes the data errors.
For the observations with id==6, k68==1 is clearly incorrect, because the student was not retained in a grade.
Thank you for your help!

Last edited by smith Jason; 28 Jul 2022, 12:27.
smith Jason

Join Date: Sep 2020

Posts: 380
#8

28 Jul 2022, 12:33

Originally posted by Clyde Schechter

Code:

xtset id year
gen last_gr = gr if year == 1, after(gr)
replace last_gr = cond(missing(gr), L1.last_gr, gr) if year > 1
gen byte left_back = gr < L1.last_gr + 1 & year > 1, after(last_gr)
replace kg5 = 1 if left_back & inrange(gr, 0, 5)
replace k68 = 1 if left_back & inrange(gr, 6, 8)
replace k912 = 1 if left_back & inrange(gr, 9, 12)
foreach v of varlist kg5 k68 k912 {
    by id (`v'), sort: replace `v' = `v'[_N]
}
sort id year

Note: It is unclear from your question whether you want kg5, k68, and k912 set to 1 just in the year that they are left back, or set to 1 for that id's entire block of observations (so that it is an attribute of the id itself). I'm assuming you want the latter. If you don't want that, then leave out everything from -foreach- to the end of the code.

Professor Clyde,
I have to trouble you again because I found that the data I provided in #1 were not quite correct. The corrected data are the ones listed in #7 above.

It is obvious that the result of k68 ==1 is incorrect for the observations with id==6, because the student is not retained in grades.
I have tried several times, but my code still doesn't work, and I don't know how to write correct code to fix the data errors.
Thank you for your help!

Last edited by smith Jason; 28 Jul 2022, 12:35.
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#9

28 Jul 2022, 13:18

It is obvious that the result of k68 ==1 is incorrect for the observations with id==6, because the student is not retained in grades.

Well, I don't see it that way. This student is in 8th grade in year 8, is, it seems, not in school in year 9, and then in year 10 turns up in grade 7. So this student was actually demoted after 8th grade. To me that qualifies for being left back between 6th and 8th grade.

But it's your call as to how you want to define things for your project. I have to ask, though, that you then make clear to me what you think the correct classification for this student should be, because I don't know what you would call this if not k68 == 1. In addition to indicating what you would like the result for this student to be, please explain how you want the code to handle demotions generally so that I can write code that will cover situations that are similar but differ in the specifics.
smith Jason

Join Date: Sep 2020

Posts: 380
#10

28 Jul 2022, 13:39

My understanding is that the student (ID==6) chose to go back to grade 7 because he or she felt that his or her academic performance was not good enough after finishing grade 8.
Thank you!

Last edited by smith Jason; 28 Jul 2022, 13:43.
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#11

28 Jul 2022, 13:55

But what in the data tells us that? If another student were involuntarily sent back to grade 7 after poor performance in grade 8, what in the data would distinguish that student from ID 6?

And, in any case, what do you consider to be the correct classification for ID 6, and how do you arrive at that conclusion?
smith Jason

Join Date: Sep 2020

Posts: 380
#12

28 Jul 2022, 14:02

Thank you for your explanation! I need to spend some time thinking this issue over.
I understand your point now. However, for this "demoted student" case, I don't know how to create a new binary variable "FAIL" indicating his or her current grade status in this long-format data.
Could you please help me with the Stata code?
Thank you!
Clyde Schechter

Join Date: Apr 2014

Posts: 30355
#13

28 Jul 2022, 14:41

The following, used after the previous code, will create a new variable, demoted, to denote the situation where a student is demoted.

Code:

gen byte demoted = left_back & gr < L1.last_gr
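As a rough illustration of how a demotion differs from a plain repeat, the sketch below uses the last years of id 1 and id 6 from the corrected data in #7. It again replaces L1. with by-group subscripting, which assumes one observation per consecutive year within id. The variable repeated is a hypothetical companion to demoted and is not part of the thread's code.

```stata
* Sketch only: "repeated" is a hypothetical name; the other variables
* follow the thread.
clear
input byte (id year gr)
1 10  9
1 11  .
1 12  9
1 13 10
6  7  6
6  8  8
6  9  .
6 10  7
end

* Carry the last known grade forward within each id
bysort id (year): gen last_gr = gr
by id: replace last_gr = last_gr[_n-1] if missing(last_gr) & _n > 1

* Left back: grade is known and does not exceed the previous known grade
by id: gen byte left_back = _n > 1 & gr <= last_gr[_n-1]

* Split left_back into a lower grade (demoted) and the same grade (repeated)
by id: gen byte demoted  = left_back & gr <  last_gr[_n-1]
by id: gen byte repeated = left_back & gr == last_gr[_n-1]

list, sepby(id) noobs
```

Here id 1 is flagged as repeated in year 12 (grade 9 after grade 9), and id 6 is flagged as demoted in year 10 (grade 7 after grade 8). Nothing in these variables says whether a demotion was voluntary; that would need information not present in the data.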
smith Jason

Join Date: Sep 2020

Posts: 380
#14

28 Jul 2022, 14:42

Thank you!