keeping/dropping data based on a specific list of names.

Zaeen de Souza

Join Date: Aug 2018

Posts: 3
#1

keeping/dropping data based on a specific list of names.

30 Oct 2018, 00:41

Hi! I'm working with a dataset that consists of around 500 schools, that gets updated every month. There is a specific set of schools (200) that I need to separate out of this total dataset, every month.

Here's a small example of how school list looks like:

Code:

clear input str35 center_name "17 Ward 3" "169 Ghanoli 2" "170 Namri 1" "170 Namri 1" "170 Namri 1" "17 Ward 3" "169 Ghanoli 2" "17 Ward 3" "170 Namri 1" "169 Ghanoli 2" end

I've been trying different ways to do it. Currently using this:

Code:

keep if inlist(center, "169 Ghanoli 2", "17 Ward 3", "170 Namri 1")

and so on so forth, adding the other schools I want to keep, however, I need to drop all but 200 of these schools, and this might take a while. (Although I realise i'd only have to write it out once)

Is there a better or more efficient way of doing this, instead of having to type out all 200 of the centers I want?

Thanks!
Tags: None
David Benson

Join Date: Oct 2018

Posts: 489
#2

30 Oct 2018, 01:24

Hi Zaeen,

Welcome to Statalist!

Easiest way to accomplish what you want is probably to create a master file that only has the 200 schools that you want to keep. In that file, you could create a variable in that called called to_keep == 1 (i.e. it is set ==1 for all 200 schools). The, when you receive the new monthly file comes out, you merge in the to_keep variable from the good_200_schools file. After the merge, to_keep ==1 for the 200 that you care about, to_keep==. for the remainder. Then keep if to_keep==1.
Is there no numeric school_id that you can match on or that comes with the file? (i.e. are you going to have to merge on variable center_name?)

Are their duplicates in the data that you receive every month? Or is "170 Namri 1" in the sample data above 4 times because for the data that you keep, you have monthly attendance or something, so it's in 4x because you have attendance for September, October, November, and December?

For the code below I assume the following:
that you have a numeric school_id variable. (If not, then match on center_name).

I am also assuming that there are no duplicates in either dataset so you can do a 1:1 match.

Code:

*Sorry, this is using the old merge syntax. I still prefer it, so I do it that way. Someone else can add code using the new merge syntax * New monthly update file == "new_monthly_update.dta" * File with the 200 schools you care about == "200_good_schools.dta" use "new_monthly_update.dta", clear merge school_id using "200_good_schools.dta", unique sort nonotes nokeep keep(to_keep) _merge(merge_200) tabulate merge_200 keep if to_keep==1 // alternatively, keep if merge_200 == 3 (which means school is in both datasets).

Hope that helps!
--David
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#3

30 Oct 2018, 01:28

See also https://www.stata.com/support/faqs/d...s-for-subsets/ as an FAQ in addition to david Benson's helpful suggestions.
Comment
Zaeen de Souza

Join Date: Aug 2018

Posts: 3
#4

03 Nov 2018, 01:32

David and Nick, the merging and matching with a master list is working. Thank you so much!
Comment

Announcement

keeping/dropping data based on a specific list of names.

Comment

Comment

Comment