
  • How to append a subset of observations

    Hi everyone,

    So, I have a very large yearly database on students of different educational levels (3 million observations per month).

    What I need to do is to append all months together in order to form a balanced panel. In month 1 (t = 1), I have a database that looks like:

    Id    Year    Educational Level
    100   1       1
    101   1       1
    102   1       1
    103   1       2
    104   1       2
    100   2       2
    101   2       2
    102   2       2
    103   2       3

    So what I want to do is to use an append command that only appends the subset of ids that had an educational level equal to 1 in the baseline year. In the example above, I want to append people with ids 100 through 102, but not 103 and 104.

    This is because the database is very large, so I need to do the append with only the essential observations, instead of appending everything and then dropping the observations I don't need.

    Any help with this?

    Regards,
    Alejandro



  • #2
    You don't provide very much information about your data sets. I can think of a few ways to approach this, but the choice among them would depend on things you haven't disclosed.

    What are these different data sets? Specifically, to what extent do they all resemble the arrangement you've shown? In particular, do they repeat the Year and Educational Level data that you've shown, or is that present only in this first data set, so that it has to be inferred from the ID with reference back to the first data set?

    Second, about how many files do you need to append?

    And about how large are these files, not in terms of number of observations, but in megabytes/gigabytes?



    • #3
      Hi Clyde,

      So, I have monthly datasets on school absenteeism for the period 2011 - 2017. Each month is a separate dataset which contains the id of the student, its educational level, year, month, and a number of other characteristics. So what I presented was merely an arbitrary example of one dataset appended to another.

      So these are 7 files - appending months isn't an issue, since the educational level changes yearly, not monthly. Each file is about 1 GB. This is why I want to drop the observations I'm not going to use first and then append.



      • #4
        I'm still not clear on some things. Do I understand correctly that you want the final data set to include only those records pertaining to those with education level 1 in 2011? And is the same id maintained consistently for the same person across all years?

        And if they are monthly files, why are there only 7 and not 84 of them?



        • #5
          Hi Alejandro,

          There are probably a few ways to do this. If it were me, I would do it as follows:

          1) Create a master list of students from the baseline file (I'll assume this is year==2011). Keep only the id variable, drop duplicate ids with duplicates drop, and save this as master_student_list.dta. Let's assume there are 50K 1st-graders (or whatever edu_level==1 means). In that file, generate a variable in_master = 1 (i.e. it is set to 1 for all 50K first graders).

          2) For each of the later monthly files, merge in the in_master variable. After the merge, in_master == 1 for the 50K students you want to track, missing for the others. Then keep if in_master == 1. Save this reduced file.

          3) Now create a loop to append these reduced monthly files to the 2011-month1 file (or whichever file will become the master file with the monthly data.)
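
          The three steps above might be sketched roughly as follows. This is only a sketch: the file names (month1_2011 etc.) and the variable name edu_level are assumptions you would need to adapt to your actual data.

          ```stata
          * 1) Build the master list of baseline students with edu_level == 1
          use month1_2011, clear                  // hypothetical baseline file name
          keep if edu_level == 1
          keep id
          duplicates drop
          generate in_master = 1
          save master_student_list, replace

          * 2) Reduce each later monthly file to the tracked students
          local files month2_2011 month3_2011     // extend with your remaining files
          foreach f of local files {
              use `f', clear
              * keep(match) retains only students found on the master list
              merge m:1 id using master_student_list, keep(match) nogenerate
              drop in_master
              save `f'_reduced, replace
          }

          * 3) Append the reduced files to the baseline month
          use month1_2011, clear
          keep if edu_level == 1
          foreach f of local files {
              append using `f'_reduced
          }
          save tracked_panel, replace
          ```

          Because the merge happens before the append, only the roughly 50K tracked students are ever carried through, which keeps memory use down compared with appending everything and dropping afterwards.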

