Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Merge to dataset for a diff-in-diff

    Hy everybody.
    I am a graduate student working on the thesis. I'm studying how the quality of school infrastructure affects students performances (test score, attendance) in rural Angola. I have data collected in two phases (baseline and endline). I want to perform a diff-in-diff, but before doing that I need to properly organize my dataset. I have two data sets I want to merge (variable in common in the two dataset is schhol_id):

    - one dataset in a long format, 59,000 obs, at a student level characteristics:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long school_id int student_id float gender_S long age_S
    1 35 0  .
    1 23 0  .
    1 27 0  .
    1 30 0  .
    1  2 0  .
    1 24 0  .
    1 24 0  .
    1  2 0  .
    1 43 0  .
    1 42 0  .
    1 18 0  .
    1  9 0  .
    1 29 0  .
    1 13 0  .
    1  7 0  .
    1 23 0  .
    1 19 0  .
    1  7 0  .
    1 42 0  .
    1  8 0  .
    1  . 0  .
    1 36 0  .
    1 16 0  .
    1 36 0  .
    1 12 0  .
    1 16 0  .
    1  5 0  .
    1 30 0  .
    1 11 0  .
    1 26 0  .
    1 40 0  .
    1  . 0 12
    1 27 0  .
    1  7 0  .
    1  . 0  .
    1  . 0  .
    1 14 0  .
    1  . 0  .
    1 25 0  .
    1 37 0  .
    1 34 0  .
    1 23 0  .
    1 22 0  .
    1 16 0  .
    1 24 0  .
    1  2 0  .
    1  7 0  .
    1 44 0  .
    1 26 0  .
    1 16 0  .
    1 10 0  .
    1  3 0  .
    1 19 0  .
    1  6 0  .
    1 36 0  .
    1 40 0  .
    1 13 0  .
    1  4 0  .
    1 48 0  .
    1 36 0  .
    1 41 0  .
    1 33 0  .
    1 47 0  .
    1 28 0  .
    1 22 0  .
    1 42 0  .
    1 11 0  .
    1 13 0  .
    1 22 0  .
    1  7 0  .
    1 13 0  .
    1 10 0  .
    1  8 0  .
    1 12 0  .
    1 39 0  .
    1 42 0  .
    1 38 0  .
    1 26 0  .
    1 19 0  .
    1 42 0  .
    1 23 0  .
    1  6 0  .
    1 36 0  .
    1  2 0  .
    1 33 0  .
    1 26 0  .
    1  1 0  .
    1 10 0  .
    1 44 0  .
    1  4 0  .
    1 31 0  .
    1 35 0  .
    1 38 0  .
    1  9 0  .
    1 37 0  .
    1 33 0  .
    1 38 0  .
    1 16 0  .
    1 24 0  .
    1 40 0  .
    end

    - one dataset in a long format containing a the test score averages of each school both in the baseline(2016) and endline phases(2018), collected in one variable called 'test'. I also have a variable for the year(2016/2018) and the school_id:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long school_id float(test time)
     1 35.693993 0
     1  51.16608 1
     2  48.04455 0
     2  59.34124 1
     3  48.02045 0
     3  58.74098 1
     4  46.77065 0
     4  49.02344 1
     5   37.7081 0
     5  48.65749 1
     6  59.57922 0
     6  68.95295 1
     7   22.8125 0
     7  69.07217 1
     8 31.959213 0
     8  44.28571 1
     9      44.7 0
     9  71.31195 1
    10  56.50097 0
    10  59.56289 1
    11  41.18038 0
    11  39.90138 1
    12  52.53336 0
    12 69.228966 1
    13  53.78653 0
    13  75.81396 1
    14 37.916664 0
    14  81.23595 1
    15  65.38461 0
    15  61.28039 1
    16  39.66835 0
    16  47.20472 1
    17 33.210228 0
    17  56.44737 1
    18 36.833332 0
    18 30.465117 1
    19  38.98624 0
    19  63.75571 1
    20  57.44048 0
    20  62.63514 1
    21  49.22457 0
    21  61.26515 1
    22  60.18293 0
    22  70.76655 1
    23  45.59028 0
    23        41 1
    24  58.56349 0
    24  74.27728 1
    25  22.71739 0
    25  23.76923 1
    26  51.91993 0
    26  50.95238 1
    27   42.2586 0
    27        54 1
    28 25.069445 0
    28  32.97872 1
    29   39.3125 0
    29  44.29824 1
    30  47.72187 0
    30  58.64078 1
    31    48.375 0
    31  65.25641 1
    32  44.61121 0
    32  67.17391 1
    33  60.23297 0
    33  46.11465 1
    34  56.14584 0
    34  52.40385 1
    35  59.94257 0
    35  80.03456 1
    36  52.78233 0
    36  47.45614 1
    37  55.39412 0
    37  29.23729 1
    38  46.10526 0
    38  68.13067 1
    39 33.499027 0
    39 36.913185 1
    40  56.37134 0
    40  71.58184 1
    41  28.48035 0
    41  48.61371 1
    42 25.462963 0
    42  47.13636 1
    43  33.82611 0
    43  56.21519 1
    44 34.721645 0
    44        57 1
    45  54.30785 0
    45  69.21283 1
    46 38.115253 0
    46  41.32308 1
    47  53.17122 0
    47 32.455357 1
    48 36.628788 0
    48     32.75 1
    49  76.92538 0
    49  56.68317 1
    50  63.04464 0
    50  60.88123 1
    end

    My regression in stata woud be:
    diff test, t(treated*) p(time**)

    *treated is a dummy=1 if the school is trated
    ** time=1 if year==2018

    I need the averages of test scores in 2016 and 2018 to be organized in one variable('test'), since it will be my outcome variable. However I am having trouble merging the two dataset, since m:m merge is not a good idea, 1:1 merge won't work because in both dataset the school_id is not uniquely identified. How shoud I merge thetwo datasets?
    Thank you in advance

  • #2
    Well, I can think of a couple of ways to put these data sets together. But it's not clear to me that you should want to do that.

    The outcome data are defined at the school level. You apparently intend to attribute the school's average test results to every student in the school. It is also not clear whether the students in your data set were in the school in both 2016 and 2018. So it isn't clear what you gain by having the individual level observations. Perhaps you wish to adjust for differences in the age and gender distributions of the schools? If so, that can be done with an aggregated data set that contains summary statistics for those things (e.g. mean age and percent female). You can get that by -collapse-ing the student data set--which will reduce it to one observation per school that is then amenable to -merge 1:m school_id- with the school data set.

    If, on the other hand, there is some reason to maintain individual level observations in your analysis that I am missing, then you can take a different approach:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long school_id int student_id float gender_S long age_S
    1 35 0  .
    1 23 0  .
    1 27 0  .
    1 30 0  .
    1  2 0  .
    1 24 0  .
    1 24 0  .
    1  2 0  .
    1 43 0  .
    1 42 0  .
    1 18 0  .
    1  9 0  .
    1 29 0  .
    1 13 0  .
    1  7 0  .
    1 23 0  .
    1 19 0  .
    1  7 0  .
    1 42 0  .
    1  8 0  .
    1  . 0  .
    1 36 0  .
    1 16 0  .
    1 36 0  .
    1 12 0  .
    1 16 0  .
    1  5 0  .
    1 30 0  .
    1 11 0  .
    1 26 0  .
    1 40 0  .
    1  . 0 12
    1 27 0  .
    1  7 0  .
    1  . 0  .
    1  . 0  .
    1 14 0  .
    1  . 0  .
    1 25 0  .
    1 37 0  .
    1 34 0  .
    1 23 0  .
    1 22 0  .
    1 16 0  .
    1 24 0  .
    1  2 0  .
    1  7 0  .
    1 44 0  .
    1 26 0  .
    1 16 0  .
    1 10 0  .
    1  3 0  .
    1 19 0  .
    1  6 0  .
    1 36 0  .
    1 40 0  .
    1 13 0  .
    1  4 0  .
    1 48 0  .
    1 36 0  .
    1 41 0  .
    1 33 0  .
    1 47 0  .
    1 28 0  .
    1 22 0  .
    1 42 0  .
    1 11 0  .
    1 13 0  .
    1 22 0  .
    1  7 0  .
    1 13 0  .
    1 10 0  .
    1  8 0  .
    1 12 0  .
    1 39 0  .
    1 42 0  .
    1 38 0  .
    1 26 0  .
    1 19 0  .
    1 42 0  .
    1 23 0  .
    1  6 0  .
    1 36 0  .
    1  2 0  .
    1 33 0  .
    1 26 0  .
    1  1 0  .
    1 10 0  .
    1 44 0  .
    1  4 0  .
    1 31 0  .
    1 35 0  .
    1 38 0  .
    1  9 0  .
    1 37 0  .
    1 33 0  .
    1 38 0  .
    1 16 0  .
    1 24 0  .
    1 40 0  .
    end
    tempfile student_file
    save `student_file'
    
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input long school_id float(test time)
     1 35.693993 0
     1  51.16608 1
     2  48.04455 0
     2  59.34124 1
     3  48.02045 0
     3  58.74098 1
     4  46.77065 0
     4  49.02344 1
     5   37.7081 0
     5  48.65749 1
     6  59.57922 0
     6  68.95295 1
     7   22.8125 0
     7  69.07217 1
     8 31.959213 0
     8  44.28571 1
     9      44.7 0
     9  71.31195 1
    10  56.50097 0
    10  59.56289 1
    11  41.18038 0
    11  39.90138 1
    12  52.53336 0
    12 69.228966 1
    13  53.78653 0
    13  75.81396 1
    14 37.916664 0
    14  81.23595 1
    15  65.38461 0
    15  61.28039 1
    16  39.66835 0
    16  47.20472 1
    17 33.210228 0
    17  56.44737 1
    18 36.833332 0
    18 30.465117 1
    19  38.98624 0
    19  63.75571 1
    20  57.44048 0
    20  62.63514 1
    21  49.22457 0
    21  61.26515 1
    22  60.18293 0
    22  70.76655 1
    23  45.59028 0
    23        41 1
    24  58.56349 0
    24  74.27728 1
    25  22.71739 0
    25  23.76923 1
    26  51.91993 0
    26  50.95238 1
    27   42.2586 0
    27        54 1
    28 25.069445 0
    28  32.97872 1
    29   39.3125 0
    29  44.29824 1
    30  47.72187 0
    30  58.64078 1
    31    48.375 0
    31  65.25641 1
    32  44.61121 0
    32  67.17391 1
    33  60.23297 0
    33  46.11465 1
    34  56.14584 0
    34  52.40385 1
    35  59.94257 0
    35  80.03456 1
    36  52.78233 0
    36  47.45614 1
    37  55.39412 0
    37  29.23729 1
    38  46.10526 0
    38  68.13067 1
    39 33.499027 0
    39 36.913185 1
    40  56.37134 0
    40  71.58184 1
    41  28.48035 0
    41  48.61371 1
    42 25.462963 0
    42  47.13636 1
    43  33.82611 0
    43  56.21519 1
    44 34.721645 0
    44        57 1
    45  54.30785 0
    45  69.21283 1
    46 38.115253 0
    46  41.32308 1
    47  53.17122 0
    47 32.455357 1
    48 36.628788 0
    48     32.75 1
    49  76.92538 0
    49  56.68317 1
    50  63.04464 0
    50  60.88123 1
    end
    
    tempfile school_file
    save `school_file'
    
    
    use `student_file', clear
    joinby school_id using `school_file'
    By the way, I notice that the variable student_id has some missing values. Does that reflect a data management error? How do you know a student's age or sex if you don't know who the student is?

    Comment


    • #3
      Thank you for the answer Clyde.

      In the diff-in-diff specification my outcome is set at a school level. This is because I don't have precise student-level observation for the baseline, so I decided working with averages. Also the students in 2016 and 2018 are not necessarely the same students.

      The reason why I want to merge the two datasets is that the first dataset (student level obs) also contains teacher-level and parent-level observation and village-level obs that I want to control for in my specification. Namely, I would like to control for municipality, student gender, parent gender, parents education, teacher experince, teacher education.

      Another reason I woul dlike to have the two datasets together is for simple convenience, since I am running other regressions (i.e OLS, ANCOVA) where I exploit the student-level information.

      Regarding the missing values, I am facing problems with the baseline test score obs, since the data for a couple of schools have been lost. I was thinking about replacing these missing values with a simple mean or median values. Is that a good idea?

      Comment


      • #4
        Regarding the missing values, I am facing problems with the baseline test score obs, since the data for a couple of schools have been lost. I was thinking about replacing these missing values with a simple mean or median values. Is that a good idea?
        Probably not a good idea. Single imputations of any kind tend to bias regression results. There are no really good solutions to the problem of missing data. It becomes a question of choosing the least bad one. It's a very broad topic. https://pdfs.semanticscholar.org/58d...c218e126e4.pdf is a good, slightly technical discussion of the issues and some of the options.

        Comment

        Working...
        X