Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Duplicate observations

    I am working on the data with test scores. Some scores are given as total score and some as sub-scores, like this:
    Stu. ID Test Date
    1 X 21102010
    1 Y 21102010
    1 Z 21102010
    2 Z 11122010
    3 Z 15122010
    4 X 19021011
    4 Y 19021011
    Where Z is the sum score of X and Y tests.
    1/ I want to get rid of the sub-scores (X and Y) if the total scores are already given by comparing the test dates (Stu. ID 1)
    2/ Is there any way to sum the sub-scores for which the total score is not given? (Stu. ID 4)

    It seems very intuitive, but could not find any solution anywhere in FAQ. If you have any suggestions, I would really appreciate it!


  • #2
    There are probably a couple of ways to do this. Here is one possibility:

    Code:
    reshape wide score, i(StuID Date) j(Test) string
    replace scoreZ=scoreX+scoreY if mi(scoreZ)
    reshape long score, i(StuID Date) j(Test) string
    keep if Test=="Z"

    Comment


    • #3
      Is Test the name of the test and score a numeric variable you are not showing?

      Is there some indicator variable that sets apart the subtotals and totals?
      You should:

      1. Read the FAQ carefully.

      2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

      3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

      4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

      Comment


      • #4
        Joe,
        Thank you very much for your suggestion, but I can't use reshape command because there are multiple observations for the same test types (For example, student 1 took test with similar name in different grades, like mathematics(Z)=algebra(X)+geometry(Y) test taken from grade 7 onward).

        Roberto,
        The name of the test is string but the score is numeric variable.
        I have dummies for each type of test (Z, X, Y).

        Comment


        • #5
          So if I understand, for those students who don't have to sum total, you want to the sum total, and for those who have a sum total, you want to drop subscore rows. Seems like you meant to say your final data set is of students and sum totals only (meaning you left out a step of dropping X and Y for those students for which you calculated the Z)? You want a way to mark those students that have Z and those who don't. For those with a Z you want to keep the Z row and drop X and Y. For those without Z, you want to sum X and Y to get Z and then drop X and Y. Did you say you already have a dummy for those students who have sum totals? You said you had a dummy for Z and Z was a sum total. Seems you already have a good marker.

          Comment


          • #6
            Here's one way to do it, if my understanding of your data set and questions were correct,
            ​leaving you with a data set of one student id per row and a column of test totals.

            Code:
            clear
            input int id str1 test score byte sumtotal
            1 X 20 0
            1 Y 20 0
            1 Z 40 1
            2 Z 40 1
            3 Z 40 1
            4 X 19 0
            4 Y 19 0
            end
            
            sort id test
            bysort id: egen marker = total(sumtotal)
            drop if sumtotal == 0 & marker == 1
            bysort id: egen sumscore = total(score)
            keep id sumscore
            duplicates drop id, force
            list

            Comment


            • #7
              Code:
              clear
              input int id str1 test score byte sumtotal
              1 X 15 0
              1 Z 35 1
              1 Y 20 0
              2 Z 40 1
              3 Z 40 1
              4 X 19 0
              4 Y 19 0
              5 X 17 0
              end
              
              list, sepby(id)
              
              bysort id: gen tot = sum(score) if !sumtotal | (_N==1)
              collapse (max) tot, by(id)
              
              list
              You should:

              1. Read the FAQ carefully.

              2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

              3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

              4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

              Comment


              • #8
                Ninap,

                Regardless of whether you use my method (or Dave's method or Roberto's method) you will need to have a way to uniquely identify the tests that go together. I assumed from your original post that Student ID and Date would uniquely identify the set of scores that you want. I would also assume that these variables would take care of the situation where the same test is given in different grades (since the dates are different). If that is not the case, then you need to explain your data set better.

                If there are indeed additional variables that are needed to uniquely identify test sets, you can modify the above code (regardless of method) by changing id to the list of variables (say, id date subject)

                Regards,
                Joe

                Comment

                Working...
                X