  • Splitting a data set with specific IDs

    Hello,
    I want to split the data into two files, each containing the observations for specific IDs. Here is an example with only 17 observations.

    copy starting from the next line ---------- ------------
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input int id str9 date int var1 byte(var2 var3) str4 var4 str1 var5
    1003 "31-Jul-86" 1986 2 1 "INDL" "C"
    1003 "31-Oct-86" 1986 3 1 "INDL" "C"
    1003 "31-Jan-87" 1986 4 1 "INDL" "C"
    1003 "30-Apr-87" 1987 1 1 "INDL" "C"
    1126 "31-Jul-87" 1987 2 1 "INDL" "C"
    1126 "31-Oct-87" 1987 3 1 "INDL" "C"
    1126 "31-Jan-88" 1987 4 1 "INDL" "C"
    3298 "30-Apr-88" 1988 1 1 "INDL" "C"
    3298 "31-Jul-88" 1988 2 1 "INDL" "C"
    3298 "31-Oct-88" 1988 3 1 "INDL" "C"
    3677 "31-Jan-89" 1988 4 1 "INDL" "C"
    3677 "30-Apr-89" 1989 1 1 "INDL" "C"
    5674 "31-Jul-89" 1989 2 1 "INDL" "C"
    5674 "31-Oct-89" 1989 3 1 "INDL" "C"
    5674 "31-Jan-90" 1989 4 1 "INDL" "C"
    6666 "30-Apr-90" 1990 1 1 "INDL" "C"
    6666 "31-Jul-90" 1990 2 1 "INDL" "C"
    end
    copy up to and including the previous line ----- ------------

    For this example I used simple commands such as
    drop if id == 1003
    drop if id == 6666
    After dropping those observations, I saved the rest of the data set under a different name. In a second step I did the same with other observations and saved the rest, which gave me two files.
    Is there another way to do this? My actual data set has around a million observations and around 15,000 such IDs.
    Any advice would be appreciated. I am new to Stata and I hope I explained the query properly.
    Thanks

  • #2
    Welcome to Statalist.

    The important question is, what is the rule that lets you know which dataset each id belongs in?

    You are asking for code that implements a decision you are making yourself. To do that, we need to know what rule you are applying, so we can translate that into Stata code.



    • #3
      Depending on how many IDs you want to drop or exclude, I can imagine doing this one of three ways:
      1) If you are dropping IDs within a sequence (drop if id>=1000 & id<=5000), you could use drop if inrange(id, 1000, 5000). (You could also use inlist() for a short list of specific IDs.)

      2) If you have many different lists, you could save each list as a separate dataset and then merge those datasets into the master (see below).

      3) You may not even need to split the data into separate datasets: merge as in 2) and then run your analysis if group==1, if group==2, etc.
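
      As a rough illustration of option 1, here is a minimal sketch that splits a dataset into two files with inrange(). The toy ids, the cutoffs 1000 and 3298, and the file names split_first and split_second are assumptions made up for the example, not anything from the original data.

      ```stata
      * Toy data standing in for the real file: two observations for each of three ids
      clear
      set obs 6
      generate int id = cond(_n <= 2, 1003, cond(_n <= 4, 3298, 6666))

      * Flag the ids that belong in the first file; inrange() includes both endpoints
      generate byte in_first = inrange(id, 1000, 3298)

      * Save the flagged ids as one file, then the remaining ids as the other
      preserve
      keep if in_first
      save split_first, replace
      restore

      keep if !in_first
      save split_second, replace
      ```

      This only works when the rule can be written as a range or a short list. With thousands of ids and no such rule, a lookup file of ids is probably the more practical route.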

      Code:
       list, sepby(group) noobs
      
        +--------------+
        |   id   group |
        |--------------|
        | 1003       1 |
        | 3298       1 |
        | 3677       1 |
        | 6666       1 |
        |--------------|
        | 1003       2 |
        | 1126       2 |
        | 3298       2 |
        | 5674       2 |
        | 6666       2 |
        |--------------|
        | 1003       3 |
        | 1126       3 |
        | 3298       3 |
        | 5674       3 |
        |--------------|
        | 1003       4 |
        | 1126       4 |
        | 3677       4 |
        | 5674       4 |
        +--------------+
      
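      * If the above is your list of groups to drop or keep, saved in a dataset called groups_to_drop.dta
      use master_data, clear
      * id repeats in both files (several rows per id in the master, several groups per id in the list),
      * so merge 1:1 would stop with an error; joinby forms every id-by-group pairing instead
      joinby id using groups_to_drop.dta, unmatched(master) _merge(merge_group)
      * Each observation now appears once per group its id belongs to, so you can keep if group==1, keep if group==2, etc.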
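
      A minimal sketch of option 2, assuming a group list in which one id can appear under several groups, as in the listing above. It uses joinby rather than merge because id is then not unique in either file. The toy data, the tempfiles, and the data_group file names are illustrative assumptions.

      ```stata
      * Toy master data: two observations for each of three ids
      clear
      set obs 6
      generate int id = cond(_n <= 2, 1003, cond(_n <= 4, 1126, 3677))
      tempfile master
      save `master'

      * Toy group list: id 1003 sits in groups 1 and 2, 1126 in group 2, 3677 in group 1
      clear
      set obs 4
      generate int id = 1003 in 1/2
      replace id = 1126 in 3
      replace id = 3677 in 4
      generate byte group = 1
      replace group = 2 in 2/3
      tempfile groups
      save `groups'

      * Pair every master observation with every group its id belongs to
      use `master', clear
      joinby id using `groups', unmatched(master)

      * Write one file per group found in the joined data
      levelsof group, local(groups_found)
      foreach g of local groups_found {
          preserve
          keep if group == `g'
          save data_group`g', replace
          restore
      }
      ```

      If separate files are not needed (option 3), the loop can be skipped and the joined data analysed directly with an if group == qualifier.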
