Duplicate observations

Ninap

Join Date: Apr 2014

Posts: 2
#1

Duplicate observations

23 Apr 2014, 06:43

I am working on the data with test scores. Some scores are given as total score and some as sub-scores, like this:
Stu. ID Test Date

1 X 21102010

1 Y 21102010

1 Z 21102010

2 Z 11122010

3 Z 15122010

4 X 19021011

4 Y 19021011

Where Z is the sum score of X and Y tests.
1/ I want to get rid of the sub-scores (X and Y) if the total scores are already given by comparing the test dates (Stu. ID 1)
2/ Is there any way to sum the sub-scores for which the total score is not given? (Stu. ID 4)

It seems very intuitive, but could not find any solution anywhere in FAQ. If you have any suggestions, I would really appreciate it!
Tags: None

Joe Canner

Join Date: Mar 2014
Posts: 580

23 Apr 2014, 07:25

There are probably a couple of ways to do this. Here is one possibility:

Code:

reshape wide score, i(StuID Date) j(Test) string
replace scoreZ=scoreX+scoreY if mi(scoreZ)
reshape long score, i(StuID Date) j(Test) string
keep if Test=="Z"

Comment

Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#3

23 Apr 2014, 09:04

Is Test the name of the test and score a numeric variable you are not showing?

Is there some indicator variable that sets apart the subtotals and totals?

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
Comment
Ninap

Join Date: Apr 2014

Posts: 2
#4

23 Apr 2014, 09:56

Joe,
Thank you very much for your suggestion, but I can't use reshape command because there are multiple observations for the same test types (For example, student 1 took test with similar name in different grades, like mathematics(Z)=algebra(X)+geometry(Y) test taken from grade 7 onward).

Roberto,
The name of the test is string but the score is numeric variable.
I have dummies for each type of test (Z, X, Y).
Comment
Dave Airey

Join Date: Apr 2014

Posts: 398
#5

23 Apr 2014, 10:32

So if I understand, for those students who don't have to sum total, you want to the sum total, and for those who have a sum total, you want to drop subscore rows. Seems like you meant to say your final data set is of students and sum totals only (meaning you left out a step of dropping X and Y for those students for which you calculated the Z)? You want a way to mark those students that have Z and those who don't. For those with a Z you want to keep the Z row and drop X and Y. For those without Z, you want to sum X and Y to get Z and then drop X and Y. Did you say you already have a dummy for those students who have sum totals? You said you had a dummy for Z and Z was a sum total. Seems you already have a good marker.
Comment

Dave Airey

Join Date: Apr 2014
Posts: 398

23 Apr 2014, 11:00

Here's one way to do it, if my understanding of your data set and questions were correct,
leaving you with a data set of one student id per row and a column of test totals.

Code:

clear
input int id str1 test score byte sumtotal
1 X 20 0
1 Y 20 0
1 Z 40 1
2 Z 40 1
3 Z 40 1
4 X 19 0
4 Y 19 0
end

sort id test
bysort id: egen marker = total(sumtotal)
drop if sumtotal == 0 & marker == 1
bysort id: egen sumscore = total(score)
keep id sumscore
duplicates drop id, force
list

Comment

Roberto Ferrer

Join Date: Apr 2014

Posts: 449
#7

23 Apr 2014, 11:46

Code:

clear input int id str1 test score byte sumtotal 1 X 15 0 1 Z 35 1 1 Y 20 0 2 Z 40 1 3 Z 40 1 4 X 19 0 4 Y 19 0 5 X 17 0 end list, sepby(id) bysort id: gen tot = sum(score) if !sumtotal | (_N==1) collapse (max) tot, by(id) list

You should:

1. Read the FAQ carefully.

2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.
1 like
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#8

23 Apr 2014, 11:54

Ninap,

Regardless of whether you use my method (or Dave's method or Roberto's method) you will need to have a way to uniquely identify the tests that go together. I assumed from your original post that Student ID and Date would uniquely identify the set of scores that you want. I would also assume that these variables would take care of the situation where the same test is given in different grades (since the dates are different). If that is not the case, then you need to explain your data set better.

If there are indeed additional variables that are needed to uniquely identify test sets, you can modify the above code (regardless of method) by changing id to the list of variables (say, id date subject)

Regards,
Joe
Comment

Stu. ID	Test	Date
1	X	21102010
1	Y	21102010
1	Z	21102010
2	Z	11122010
3	Z	15122010
4	X	19021011
4	Y	19021011

Announcement

Duplicate observations

Comment

Comment

Comment

Comment

Comment

Comment

Comment