Extracting unique strings from multiple variables for each observation

Umair Ali

Join Date: Mar 2018

Posts: 17
#1

Extracting unique strings from multiple variables for each observation

14 Dec 2018, 16:21

I have school level data with the name of different counties they have belonged to across a number of years. Each year has a column for each school, some are missing, some schools might fall into different counties.

1. I want to check how many schools have consistent counties over the year (ignoring any . (missing) values).
2. If schools have consistent counties I want to merge all these 10-15 variables into one giving me the county name if no conflict, and returning "conflict" when conflict, . if data is just missing.

Screenshot is attached.
Tags: None
David Benson

Join Date: Oct 2018

Posts: 489
#2

14 Dec 2018, 16:28

Hi Umair and welcome to Statalist!

It will be a *lot* easier to help if you can post a small snippet of your data using Stata's dataex command (SSC install dataex). If you need help with that, I created a YouTube video to help walk people through the process.

See Statalist - Using CODE delimiters and Stata's dataex command (I would watch at speed 1.5x or 2x ).
Part on using code delimiters on Statalist (1:30 to 3:00)
Part on using -dataex- (3:00 to 5:30, I keep going on, but you can stop there).

We don't need to see all of the variables, but if there is a school_id variable, be sure to include it. It looks like to combine the 15 the data will end up being reshaped from wide to long (which people here can help you do).

Hope that helps!
1 like
Comment

David Benson

Join Date: Oct 2018
Posts: 489

17 Dec 2018, 15:55

Umair Ali

See if this works. You will end up with 1 line per school that you can merge into your master dataset.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str9(county201415 county201314) str7 county201213 str9 county201112
1 "Middlesex" "Middlesex" ""        "Middlesex"
2 ""          "Alameda"   "Alameda" "Alameda"  
3 "Essex"     "Essex"     ""        "A77"      
4 "Carver"    "Carver"    "Essex"   ""         
end

Code:

ssc install egenmore  // in case you don't have it already
egen distinct_county = rowsvals( county201415 - county201112)  // counts unique values of county, ignores missing

. list, noobs abbrev(15)

  +----------------------------------------------------------------------------------+
  | id   county201415   county201314   county201213   county201112   distinct_county |
  |----------------------------------------------------------------------------------|
  |  1      Middlesex      Middlesex                     Middlesex                 1 |
  |  2                       Alameda        Alameda        Alameda                 1 |
  |  3          Essex          Essex                           A77                 2 |
  |  4         Carver         Carver          Essex                                2 |
  +----------------------------------------------------------------------------------+


reshape long county, i(id) j(year)
drop if county==""
drop year
duplicates drop
sort id
by id: gen count = _n
reshape wide county, i(id) j(count)
egen county_count = rowsvals( county1 county2)  // Add county3, 4, 5, etc as necessary
gen is_conflicted = ( county_count >=2)

. list, abbrev(14) noobs

  +--------------------------------------------------------------------------+
  | id     county1   county2   distinct_cou~y   county_count   is_conflicted |
  |--------------------------------------------------------------------------|
  |  1   Middlesex                          1              1               0 |
  |  2     Alameda                          1              1               0 |
  |  3         A77     Essex                2              2               1 |
  |  4       Essex    Carver                2              2               1 |
  +--------------------------------------------------------------------------+

NOTES:
1) I've re-shaped this back to wide so that you have a dataset with 1 line per school that you can merge into your master dataset. You can also filter on is_conflicted==1 to determine the desired answer for schools with >1 county listed. However, for most regression analysis, you will want your master dataset in a long format.

Comment

Umair Ali

Join Date: Mar 2018

Posts: 17
#4

25 Jan 2019, 16:37

Thanks David, sorry I could not thank you earlier, I just got back to this paper after a while, you know how it goes with research. Yes, I will ultimately run a panel data regression so need it in the long form. Let me try what you have suggested above and I will get back to you, thanks a lot again for all this effort!
Comment
Umair Ali

Join Date: Mar 2018

Posts: 17
#5

19 Feb 2019, 13:53

It worked David, I used it to extract unique observations over several variables. Thanks. I got into newer problems though. My data has yearly samples of schools but in each yearly sample there is some missing data for different schools. There is no pattern but schools tend not to be consistent on reporting data on all variables each year.

Currently I used data from 2000 and 2010 only to reduce the data loss, but I do not like that my variability is going down this much!
Comment

Announcement

Extracting unique strings from multiple variables for each observation

Comment

Comment

Comment

Comment