Choice sets and data structure for conditional logistic regression?

Sophia Seifert

Join Date: Jun 2021
Posts: 2

21 Jun 2021, 10:43

Hello,

I have panel data showing which schools students choose to attend, and I would like to run a conditional logistic regression to see which school characteristics are related to enrollment decisions. My data are currently in long format, similar to the hypothetical selection below, which shows a student transferring schools in 2014. Each student is identified by student_id, schl_code identifies where that student enrolled, and year is my time variable. I also have a number of school characteristics, a mix of dummies (e.g., magnet, an indicator for attending a magnet school) and continuous variables (e.g., distance_schl, the distance in miles between the student's home and their enrolled school).

student_id	Year	schl_code	district	distance_schl	magnet	schl_lat	schl_lon	stud_lat	stud_lon
256534	2012	8576	6475839	3.22	0	39.9	-72.2	39.8	-76.6
256534	2013	8576	6475839	3.22	0	39.9	-72.2	39.8	-76.6
256534	2014	4040	6475839	2.16	1	40.7	-75.3	39.8	-76.6
256534	2015	4040	6475839	2.16	1	40.7	-75.3	39.8	-76.6

First, I have identified the criteria I would like to use for constructing students' choice sets. They are largely based on identifying schools within a certain distance (stud_lat and stud_lon give the location of the centroid of each student's residence zip code, and schl_lat and schl_lon give school locations) and on the student's home school district (identified by the district variable). I will be limiting my sample to students who make transfers and will only be using data from the transfer year (some lagged), so in this example, data from 2014.

However, I do not know how to actually restructure my data so that each choice (alts) is listed as a separate observation for the case (student_id). The desired structure is shown below for a student who hypothetically had 4 schools to choose from. Choice set sizes will vary: some students, like those living in cities, will have many schools in their choice sets, whereas others may have only a few.

student_id	alts	chose	distance_schl	magnet
256534	9765	0	0.75	0
256534	8576	0	3.22	0
256534	4040	1	2.16	1
256534	3795	0	8.53	0
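For reference, a table in this layout (one row per student-alternative pair, with chose equal to 1 for exactly one row per student) is the shape that -clogit- works with. The following is only a minimal sketch on simulated data: the 200 students, the 4 alternatives each, and the coefficients used to generate choices are all made up for illustration.

```stata
* Simulated data only: every value below is hypothetical
clear
set seed 12345
set obs 200
gen long student_id = _n
expand 4                                   // 4 alternatives per student
bysort student_id: gen int alts = _n
gen double distance_schl = 10 * runiform()
gen byte magnet = runiform() < 0.3

* Latent utility with a Gumbel error; the chosen school has the highest utility
gen double u = -0.5 * distance_schl + 0.8 * magnet - ln(-ln(runiform()))
bysort student_id (u): gen byte chose = (_n == _N)

* Conditional logit: one choice per student_id group
clogit chose distance_schl magnet, group(student_id)
```

Because the grouping is by student_id, the number of rows per student is free to differ, so choice sets of unequal size are not a problem for the layout.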

Any help on how to create this dataset would be greatly appreciated.
Thank you!
-Sophia

Last edited by Sophia Seifert; 21 Jun 2021, 11:39.

Tags: choice model, choice set, logistic regression, panel data

Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#2

21 Jun 2021, 11:39

Either you're not describing your data correctly or you are seeking the impossible. In your desired results, schools 9765 and 3795 magically appear from nowhere. Unless somewhere in your data set there is other information about these schools that identifies them as possible choices for this student, there is no way to do what you ask. If there is such information, you need to show example data that includes that information so somebody can figure out how to bring it to bear on the solution to your problem.

When posting back, please use the -dataex- program to show example data. If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Sophia Seifert

Join Date: Jun 2021

Posts: 2
#3

21 Jun 2021, 12:07

Clyde--thanks for your tip about dataex, I am new to these forums and have not heard of it before. To your point about the other school codes appearing, this is actually part of where I am stuck--I have these codes (as I will explain) but am not sure how to link them to the student_id variable as a choice set.

My original data has all the students in a state and, along with them, the school codes for all the schools in the state. I have created a school-level dataset for each year by collapsing by schl_code--because schools open and close yearly, the list of schools and students' choice sets will also vary by year.

The dataex example below is from 2017 school-level data and shows the schl_code, latitude, longitude, and a few example school variables: number of students enrolled in the school (schl_pop) and share of students in the school who receive English Language Learner services (ell). Distance from home and school district id (aun_loc) are the primary criteria for whether a school is an option for a particular student. Because distance is calculated between a school's lat/long and the lat/long of the centroid of the student's residence zip code, all students who live in the same zip code, are in the same school district, and are in the same grade level (indicated by dummies such as elem below) should have exactly the same choice set. A few other variables also influence inclusion in the choice set, such as a dummy for a type of online school available to all students in the state.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input int schl_code float elem double(aun_loc lat_s longit_s) float(schl_pop ell)
7607 0 101260303 39.88497199999986 -79.87193199999913 415 0
6002 1 101260303 39.8482339999999 -79.9029190000001 244 .0040983604
4921 1 101260303 39.80259599999996 -79.80469499999975 335 .0029850747
7608 0 101260303 39.778341999999725 -79.9185099999997 404 0
2129 1 101260303 39.884786000000105 -79.86332899999982 333 .003003003
8364 1 101260303 39.832198000000055 -79.74447000000059 405 0
4922 1 101260303 39.77978199999983 -79.91804699999963 213 0
6001 0 101260303 39.827298999999556 -79.78491599999985 1057 0
2154 0 101260803 40.00699300000041 -79.89329100000026 413 .004842615
8384 1 101260803 40.0086889999994 -79.89454500000087 716 .002793296
end
Clyde Schechter

Join Date: Apr 2014

Posts: 30117
#4

21 Jun 2021, 15:27

I can point you in a direction, but as I have no experience working with spatial data in Stata, I can't bring you home.

The command you will need to combine the school and student data sets is -cross-. Since the school data set is actually a series of data sets, one for each year, you will have to do that one at a time and then put all the results together. So it will look something like this:

Code:

use student_dataset, clear
// KEEP ONLY YEARS WHEN STUDENT CHANGES SCHOOLS
// (_n > 1 skips each student's first year, where schl_code[_n-1] is missing
// and would otherwise count as a change)
by student_id (year), sort: keep if _n > 1 & schl_code != schl_code[_n-1]

capture program drop one_year
program define one_year
    local y = year[1]
    cross using school_dataset_for_year_`y'
    // HERE INSERT CODE TO REMOVE ALL OBSERVATIONS THAT PAIR A STUDENT
    // WITH A SCHOOL THAT IS NOT IN THE CHOICE SET
    exit
end

runby one_year, by(year)

Notes:

-runby- is written by Robert Picard and me, and is available from SSC. It is similar to using a -foreach- loop to iterate over the values of the variable year, but it is faster and also enables you to simplify the code by omitting -if- conditions on the year variable.
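To illustrate the comparison, a -foreach- version of the same per-year pairing might look roughly like the sketch below. It runs on toy data: the tempfiles stand in for the school_dataset_for_year_* files, the school variable is called alts here, and all values are made up.

```stata
* Toy student data: one transfer record per student (hypothetical values)
clear
input long student_id int(year schl_code)
1 2016 10
2 2017 20
end
tempfile students
save `students'

* Toy per-year school files, each listing two schools (10 and 20)
foreach y in 2016 2017 {
    clear
    set obs 2
    gen int alts = 10 * _n
    tempfile sch`y'
    save `sch`y''
}

* Loop over years: pair that year's students with that year's schools,
* then stack the results
use `students', clear
levelsof year, local(years)
tempfile combined
local first 1
foreach y of local years {
    use if year == `y' using `students', clear
    cross using `sch`y''
    if `first' == 0 {
        append using `combined'
    }
    save `combined', replace
    local first 0
}
use `combined', clear
sort student_id alts
list
```

The explicit -if year == ...- condition and the append bookkeeping are the parts that -runby- makes unnecessary.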

I do not include any code that removes observations that are outside the student's choice set because 1) I don't grasp what all the specific criteria are, and 2) I don't know how to work with the longitude and latitude variables to calculate distance. Concerning the latter, if you are running version 16 or 17, there are a bunch of commands specifically for working with spatial data, and I imagine that therein you will find ways to resolve this part of your task. I have no experience with those commands, however, so I can't advise you more specifically. If you are using an older version of Stata, or if nothing there will suit your needs, I suggest you use Stata's -search- command to find user-written programs that might be what you need.
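As an illustration of the missing step, here is a minimal sketch that computes great-circle distance with the haversine formula using only built-in functions, then keeps the schools within a cutoff. The student's coordinates, the 10-mile radius, and the names h, dist_mi, and alts are assumptions for illustration, not the criteria from this thread. District or grade-level conditions would go into the same -keep if- line.

```stata
* Sketch only. Variable names follow the thread; the toy student record and
* the 10-mile cutoff are assumptions, not part of the original posts.
clear
input long student_id int(year schl_code) double(stud_lat stud_lon)
256534 2017 6002 39.85 -79.88
end
tempfile students
save `students'

* Three schools taken from the -dataex- example in #3
clear
input int schl_code double(lat_s longit_s)
6002 39.8482339999999 -79.9029190000001
7607 39.88497199999986 -79.87193199999913
8384 40.0086889999994 -79.89454500000087
end
rename schl_code alts            // keep it distinct from the student's schl_code
tempfile schools
save `schools'

* Pair the student with every school, then compute distance for each pair
use `students', clear
cross using `schools'

* Haversine great-circle distance; 3958.8 = Earth's mean radius in miles
gen double h = sin((lat_s - stud_lat) * _pi / 360)^2 ///
    + cos(stud_lat * _pi / 180) * cos(lat_s * _pi / 180) ///
    * sin((longit_s - stud_lon) * _pi / 360)^2
gen double dist_mi = 2 * 3958.8 * asin(min(1, sqrt(h)))

keep if dist_mi <= 10            // assumed radius; substitute the real criteria
gen byte chose = (alts == schl_code)
sort student_id alts
list student_id alts chose dist_mi
```

What remains is one observation per student-alternative pair with a 0/1 chose indicator, which is the layout that -clogit ..., group(student_id)- works with. Inside the one_year program, the same lines would follow the -cross- command. The user-written -geodist- command from SSC is an alternative for the distance calculation.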