Splitting a data set with specific IDs

Azhar Mughal

Join Date: Feb 2019

Posts: 7
#1

14 Feb 2019, 09:35

Hello,
I want to split the data into two files, each containing the observations for specific IDs. Below is a small example of the data.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int id str9 date int var1 byte(var2 var3) str4 var4 str1 var5
1003 "31-Jul-86" 1986 2 1 "INDL" "C"
1003 "31-Oct-86" 1986 3 1 "INDL" "C"
1003 "31-Jan-87" 1986 4 1 "INDL" "C"
1003 "30-Apr-87" 1987 1 1 "INDL" "C"
1126 "31-Jul-87" 1987 2 1 "INDL" "C"
1126 "31-Oct-87" 1987 3 1 "INDL" "C"
1126 "31-Jan-88" 1987 4 1 "INDL" "C"
3298 "30-Apr-88" 1988 1 1 "INDL" "C"
3298 "31-Jul-88" 1988 2 1 "INDL" "C"
3298 "31-Oct-88" 1988 3 1 "INDL" "C"
3677 "31-Jan-89" 1988 4 1 "INDL" "C"
3677 "30-Apr-89" 1989 1 1 "INDL" "C"
5674 "31-Jul-89" 1989 2 1 "INDL" "C"
5674 "31-Oct-89" 1989 3 1 "INDL" "C"
5674 "31-Jan-90" 1989 4 1 "INDL" "C"
6666 "30-Apr-90" 1990 1 1 "INDL" "C"
6666 "31-Jul-90" 1990 2 1 "INDL" "C"
end

For this example I used simple commands such as
drop if id == 1003
drop if id == 6666
After dropping those observations, I saved the rest of the data set under a different name. In a second step I did the same with the other observations and saved the remainder, which gave me two files.
Is there another way to do this? My actual data set has around a million observations and around 15,000 such IDs.
Any advice, please! I am new to Stata and I hope I have explained the query properly.
Thanks
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

14 Feb 2019, 10:13

Welcome to Statalist.

The important question is, what is the rule that lets you know which dataset each id belongs in?

You are asking for code that implements a decision you are making yourself. To do that, we need to know what rule you are applying, so we can translate that into Stata code.
David Benson

Join Date: Oct 2018
Posts: 489

14 Feb 2019, 11:38

Depending on how many IDs you want to drop or exclude, I can imagine doing this one of three ways:
1) If you are dropping IDs within a sequence (drop if id>=1000 & id<=5000), you could use drop if inrange(id, 1000, 5000). (You could also use inlist() for a short list of specific IDs.)

2) If you have many different lists, you could save each list as a separate dataset and then merge those datasets into the master (see below).

3) You may not even need to split the data into separate datasets: merge as in 2), and then just run your analysis if group==1, group==2, etc.
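As a rough illustration of option 1, the sketch below writes one file from a range of IDs and another from a short explicit list. The bounds, the listed IDs, and the output file names (outside_range, selected_ids) are made up for the example, and master_data stands in for whatever your full dataset is called. Note that inlist() accepts only a limited number of arguments, so it will not scale to thousands of IDs.

```stata
* Hypothetical sketch: id values and file names are illustrative only
use master_data, clear
* inrange() is true for every id between the two bounds, inclusive
drop if inrange(id, 1000, 5000)
save outside_range, replace

use master_data, clear
* inlist() is true when id equals any one of the listed values
keep if inlist(id, 1003, 6666)
save selected_ids, replace
```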

Code:

 list, sepby(group) noobs

  +--------------+
  |   id   group |
  |--------------|
  | 1003       1 |
  | 3298       1 |
  | 3677       1 |
  | 6666       1 |
  |--------------|
  | 1003       2 |
  | 1126       2 |
  | 3298       2 |
  | 5674       2 |
  | 6666       2 |
  |--------------|
  | 1003       3 |
  | 1126       3 |
  | 3298       3 |
  | 5674       3 |
  |--------------|
  | 1003       4 |
  | 1126       4 |
  | 3677       4 |
  | 5674       4 |
  +--------------+

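* If the above is your list of groups to drop or keep, saved in a dataset called groups_to_drop.dta
use master_data, clear
* The master data has several observations per id, so the merge must be m:1 rather than 1:1.
* m:1 also needs id to be unique in the using file, so save one group's list per file
* (or use -joinby id- if an id can belong to several groups).
merge m:1 id using groups_to_drop.dta, keep(match master) gen(merge_group)
* Then you can drop if group==1, drop if group==2, etc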
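If the goal is simply two output files, one possible way to finish the job is sketched below. It assumes a lookup file, here called id_groups.dta (a hypothetical name), with exactly one row per id and a variable group coded 1 or 2; the output names split_1 and split_2 are also placeholders. Because the example data holds several observations per id, the sketch uses merge m:1.

```stata
* Assumed: id_groups.dta holds one row per id plus a variable group (1 or 2)
use master_data, clear
merge m:1 id using id_groups, keep(match master) nogenerate

* Write each group to its own file without reloading the master data
forvalues g = 1/2 {
    preserve
    keep if group == `g'
    save split_`g', replace
    restore
}
```

Observations whose id does not appear in the lookup file end up with group missing and are written to neither file, so it is worth checking count if missing(group) before saving.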
