Pooling Data in Longitudinal/Cross-Sectional Survey

Paige LaPierre

Join Date: Oct 2020

Posts: 33
#1


17 Oct 2021, 16:42

Hello everyone,

I am working with the National Longitudinal Survey of Children and Youth (NLSCY), a study conducted every 2 years. There are 8 reference periods, named Cycle 1, Cycle 2, Cycle 3, ..., Cycle 8.

I am looking at the impact of participation in regulated childcare (daycare or licensed family-care center) on the school readiness of low-income children. School-readiness is measured using a test score called PPVT-R.

My dependent variable is the standard score on the PPVT-R, which is administered once to children aged 4-5 across all cycles. My independent variable is participation in regulated childcare. I am looking at the effects on school readiness before and after Quebec's subsidized daycare (2000), and comparing them to Ontario.

I would need to pool Cycles 1, 2 and 3 together, and Cycles 4, 5, 6, 7 and 8 together, so that I have information on the child's childcare arrangements. If I do not do this, the 4-5-year-olds either (a) don't have information on childcare because they are in school, or (b) their childcare is not relevant.

Therefore, I want to know if there is a way to pool the variables from participants across all cycles.

Example: A child who is aged 3 in Cycle 2 has information on childcare arrangements (e.g., attends daycare). At Cycle 3 they are 5 and took the PPVT-R (the childcare arrangement is not applicable or has changed since they are in kindergarten). I would need to use the observation of the childcare variable from Cycle 2 instead of Cycle 3 so that my regression analysis will make sense.

The PERSUK variable is the child's unique identifier; each child is given a unique number that is the same across cycles. I am guessing I merge 1:1 using PERSUK?

If there isn't enough information in this, I am sorry. I also emailed Statistics Canada for help; I just thought I would also try here.
Clyde Schechter

Join Date: Apr 2014

Posts: 30115
#2

17 Oct 2021, 17:04

The PERSUK variable is the child's unique identifier; each child is given a unique number that is the same across cycles. I am guessing I merge 1:1 using PERSUK?

No, that would, at best, be a syntax error. To be syntactically correct it would be -merge 1:1 PERSUK using name_of_file_to_be_merged_in-. Even this presupposes that there is only one observation in each file for any given child. If for any reason that is not the case, then a 1:1 merge will fail and some other approach, which would depend on the specifics of how the data is organized, would be needed. The -isid- command will help you identify whether or not this uniqueness of observations is satisfied in your data sets. (See -help isid- for details.)
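
A minimal sketch of the uniqueness check and the syntactically correct merge described above. The file names cycle2.dta and cycle3.dta are hypothetical, since the actual NLSCY file names are not given in this thread.

```stata
* Hypothetical file names; substitute the actual cycle files.

* Check the file to be merged in: -isid- exits with an error
* if PERSUK does not uniquely identify the observations.
use cycle2, clear
isid PERSUK

* Check the master file the same way, then merge.
use cycle3, clear
isid PERSUK
merge 1:1 PERSUK using cycle2

* _merge: 1 = only in cycle3, 2 = only in cycle2, 3 = in both files
tabulate _merge
```

If either -isid- call stops with an error, the 1:1 merge would fail too, and the layout of that file needs a closer look first.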

But, most likely, this is not a -merge- at all. -merge- is used to combine data sets for the purpose of adding new variables to an existing data set. But it sounds like each data set you have is a wave of the survey and those files will contain, for the most part, the same variables (which have been assessed repeatedly). So more likely you will want to -append- the files to combine these data sets. -append- will tolerate the addition of new variables as well, but, unlike -merge-, it will leave you with a long layout data set, one observation per child per cycle, which is the layout you need to work with this data in Stata. Again, the exact code for combining these files will depend on details of the data layouts.
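
A minimal sketch of the -append- approach, assuming one file per cycle with the hypothetical names cycle1.dta through cycle8.dta and assuming the variables carry the same names in every file. The variable cycle is created here for illustration and is not part of the survey data.

```stata
* Hypothetical file names cycle1.dta ... cycle8.dta.
use cycle1, clear
generate byte cycle = 1

forvalues c = 2/8 {
    append using cycle`c'
    * Newly appended observations have cycle missing; tag them.
    replace cycle = `c' if missing(cycle)
}

* In a long layout, child and cycle together should identify each observation.
isid PERSUK cycle
```

If the final -isid- fails, some child appears more than once within a cycle, and that would need to be resolved before any analysis.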

As for your other question about pooling data from different waves, there are many different ways to do that, and which one is appropriate for your purposes is neither a Stata question nor a statistical question. It is a matter of the substantive science of child development and learning. For example, one might want to use the most recent available childcare arrangement, or one might want to count out the number of cycles in which the child was in regulated child care, or the proportion of cycles for which we have information in which the child was in regulated childcare, or perhaps some weighted proportion with more recent years counted more (or, for that matter, a weighted proportion with more recent years counted less). There are also other approaches that might make sense. Choosing among them requires a set of beliefs, grounded in theory or prior research, about how participation in regulated child care might affect the development of school readiness. The details of that process would enable you to identify a way of combining the available data that reflects that. I don't know if there are any child development specialists on Statalist. We do have some people in education, psychologists, and psychometricians--some of whom might have insight into this. But if you don't hear from anybody here on this in a few days, I would suggest asking this question in a Forum of child development specialists, or perhaps even directly asking a colleague in that field in your own "shop."
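
To make a few of these options concrete, here is a rough sketch in a long layout (one observation per child per cycle). The variable names cycle and regcare (1 = in regulated childcare, 0 = not, missing if not applicable) are hypothetical, and nothing here says which summary is the right one scientifically.

```stata
* Assumes a long layout with PERSUK, cycle, and a hypothetical
* 0/1 indicator regcare that is missing when not applicable.

* Number of cycles in regulated care, and the number of cycles
* for which childcare information is available.
bysort PERSUK: egen n_regcare = total(regcare)
bysort PERSUK: egen n_known   = count(regcare)

* Proportion of cycles with information in which the child was in regulated care.
generate prop_regcare = n_regcare / n_known if n_known > 0

* Childcare status carried forward from the immediately preceding cycle,
* e.g. the Cycle 2 arrangement attached to the Cycle 3 observation.
bysort PERSUK (cycle): generate regcare_prev = regcare[_n-1] if cycle == cycle[_n-1] + 1
```

The regression could then be run on the observations where the test score is recorded, using whichever of these summaries matches the theory of how childcare affects school readiness.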
Paige LaPierre

Join Date: Oct 2020

Posts: 33
#3

17 Oct 2021, 17:07

This was extremely helpful! I was thinking of append as well; I think it makes the most sense for my own dataset.

Thank you Clyde!
Weiwen Ng

Join Date: Jun 2015

Posts: 1241
#4

19 Oct 2021, 13:34

I've nothing substantive to add to the question and answer. I'll only add that to my limited knowledge, when you pool multiple years of a survey with complex sampling weights, some survey providers will give you a different set of weights. I am not sure if there's a standard name for this type of weight, but I'd guess it's "longitudinal weight" or something like that. For example, the United States Medicare Current Beneficiary Survey has different weights for when you pool 2, 3, and 4 years of data (this one is a rotating panel survey; everyone is in for 4 years but their induction dates are staggered). This might apply to your survey or it might not.

The other pitfall you might run into is that sometimes, the variable names may change between years, and it can be a huge pain to harmonize them. A competent survey provider would minimize this, but there might be some inconsistencies.
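
As a rough illustration of that harmonization step, suppose the same childcare item were named carearr_c2 in the Cycle 2 file and carearr_c3 in the Cycle 3 file. Both variable names and the file names are made up for this example. Each file would be renamed to a common name before combining.

```stata
* Hypothetical variable and file names, for illustration only.
use cycle2, clear
rename carearr_c2 carearr
save cycle2_h, replace

use cycle3, clear
rename carearr_c3 carearr
save cycle3_h, replace

* The harmonized copies now share the variable name carearr
* and can be combined with -append-.
```

Saving harmonized copies under new names leaves the original files untouched.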

Be aware that it can be very hard to answer a question without sample data. You can use the dataex command for this. Type help dataex at the command line.

When presenting code or results, please use the code delimiters to format them. Use the # button on the formatting toolbar, between the " (double quote) and <> buttons.