Insufficient observations with xtreg

George Kollias

Join Date: Apr 2018

Posts: 24
#1

Insufficient observations with xtreg

10 May 2018, 10:19

I am running a regression for the returns of education on gross wage. The variables of interest are spread out among 4 different datasets therefore I had to merge all the datasets together by each wave. So merge dataset 1 through 4 for wave 1, wave 2, wave 3 each in a seprate stata file. I ended up with 7 waves each representing a different year. I then appended all the waves (which included all the variables I needed) together ending up with a single dataset with all the variables and with the 7 waves appended.
The variable names are different for each dataset (apart from the 2 id variables, obviously) but they are also different for each wave as well, for example for the year of birth for 2009, 2010 and 2011.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input double(cw09b002 cw10c002 cw11d002) . . . . . . end

The ID variables are:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input double(nomem_encr nohouse_encr) end

I tried running the regression for a random effects model separately for each year but I keep getting an insufficient observations error. Note that my data set does not have a year variable to group each wave since I have merged 4 data sets together therefore I have for each of the 4 data sets for each wave a year and month of fieldwork period variable:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input double(cw09b_m ch09c_m ci09b_m cf09b_m cw10c_m ch10d_m cf10c_m ci10c_m) end

where the 09 variables are for 2009 and the 10 variables are for 2010. (I have variables till 2016)

Is there anyway to run the regression using a random effects model and stop getting an observations error? Perhaps grouping the variables by year? And how is this possible?

Appreciate the help, thank you
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#2

10 May 2018, 11:21

The problem is that your dataset is not properly organized for the task. You have your data in wide layout, with one observation per subject and separate variables for the same thing in different years. You need to -reshape- it into long layout so that you have one observation for each subject-year combination, and just one variable for bithdates, one variables for whatever.

People often have difficulty using the -reshape- command at first. Once you get the hang of it, it's really easy, initially it can be hard to understand what's called for in the i() and j() options. Try reading -help reshape- and the chapter of the PDF manuals that is linked therein. Then give it a try. If you need additional assistance, post back showing example data (not just names of variables).
Comment
George Kollias

Join Date: Apr 2018

Posts: 24
#3

13 May 2018, 10:29

After trying to convert the data from wide to long using:

. reshape long loggross education yearsofeduc age age@sq experience experience@sq alcohol MS schoolsati gender, i(nomem_encr) j(year)

(note: j = 2009 2010 2011 2012 2013 2015 2016)
variable id does not uniquely identify the observations
Your data are currently wide. You are performing a reshape long. You
specified i(nomem_encr) and j(year). In the current wide form, variable
nomem_encr should uniquely identify the observations.

When I type reshape error I get:

. reshape error
(note: j = 2009 2010 2011 2012 2013 2015 2016)

i (nomem_encr) indicates the top-level grouping such as subject id.

The data are currently in the wide form; there should be a single
observation per i.

44769 of 46721 observations have duplicate i values:
....

Any suggestions as to why the ID does not identify the observations since it has worked with merging all the waves and data sets together?

Kind regards, appreciate all the help
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

13 May 2018, 11:02

The i() option should include both of your id variables, which I understand from post #1 to be nomem_encr and nohouse_encr. That is, each combination of nomem_encr and nohouse_encr should identify a single observation: if you run

Code:

isid nomem_encr nohouse_encr

it should not report an error.
2 likes
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30122
#5

13 May 2018, 11:29

Following up on William's advice, if he has correctly interpreted your #1 post, so that nomem_encr and nohouse_encr are jointly identifying your variables, then it is no surprise that Stata found duplicate observations with just nomemcr in the -i()- option of -reshape. So if he is right about that, his -isid- will give you no errors. In that case just use -i(nomem_encr nohouse_encr)- as your -i()- option in the -reshape- command and you should be fine.

If you do get errors with -isid nomem_encr nohouse_encr-, then you won't be able to use that for the -i()- option, and it sounds like it also means there are some errors in your data set. So to find them:

Code:

duplicates tag nomem_encr nohouse_encr, gen(flag) browse if flag

Then you will have to figure out what to do about them. (Or perhaps you will realize that the data are fine but that yet a third, or even more, variables are needed to uniquely identify your observations.)
1 like
Comment
George Kollias

Join Date: Apr 2018

Posts: 24
#6

13 May 2018, 13:14

Thank you for the help! At first I had the 2 IDs(nomem_encr and nohouse_encr) for the years 2009 and 2010 when I was merging the different data sets. I then re-merged them using only one (nomem_encr) since in the later years 2013, 2015 and 2016 the nohouse_encr is missing, and it worked. But now I get:

isid nomem_encr nohouse_encr
variables nomem_encr nohouse_encr should never be missing
r(459);

. isid nomem_encr
variable nomem_encr does not uniquely identify the observations
r(459);

. isid nohouse_encr
variable nohouse_encr should never be missing
r(459);

So what is illogical is the fact that nomem_encr works seperately as an ID for the merging of all datasets for each year together but does not uniquely identify all observations now.
Is there any logical explanation for this?
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#7

13 May 2018, 15:28

The organization and content of your data are a complete mystery to me. I overlooked this in earlier posts, but let me now recommend that you review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post, looking especially at sections 9-12 on how to best pose your question. In particular, please read FAQ #12 and help those whose help you seek by posting example data using the dataex command. If you are running Stata 15.1 or later, it is already installed. For earlier versions of Stata, install dataex by typing ssc install dataex. Type help dataex to read the simple instructions for using it. Using dataex will enable those who want to help you to quickly and easily create a 100% faithful replica of your situation to test their ideas and code on.

The more you help others understand your problem, the more likely others are to be able to help you solve your problem.
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#8

14 May 2018, 05:28

Perhaps the following will help you examine your duplicates and understand the source of your problem.

Code:

sort nomem_encr duplicates tag nomem_encr, generate(flag) browse if flag

This will display in the data browser window every set of observations for which the value of nomem_encr is the same.
1 like
Comment
George Kollias

Join Date: Apr 2018

Posts: 24
#9

14 May 2018, 08:21

Thank you for the help William Lisowski! Apologies for the unclear explanation, still learning how to use the dataex command in addition to Stata.

If my first post was somewhat clear enough, the problem is that I do have multiple observations for the nomem_encr with the appended data-set but not with the merged data-sets for each year.

I have 4 datasets for every year (containing all the variables for my regression) merged together from 2009-2016 (without 2014). So dataset 1+2+3+4(all merged together) for 2009 then dataset 1+2+3+4 for 2010 until 2016. The ID, nomem_encr uniquely identifies all observations in those merged data-sets. The problem is that when I append all the years together and then run the isid nomem_encr command the variable now does not uniquely identify all the observations therefore the reshape command as suggested by Clyde Schechter does not work unfortunately.

Is there any other way to manage to reshape my data from wide to long so to be able to run the xtreg which is my end goal?
These are my variables for 2009, with all the rest years having just a different year term:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input double(schoolsati2009 education2009) float yearsofeduc2009 double(alcohol2009 grossyear2009 gender2009) float(MS2009 age2009 loggross2009 experience2009 age2009sq experience2009sq)
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#10

14 May 2018, 09:00

Variable names without associated data are not particularly helpful. In the future, please supply data when data is requested. I do not understand why you show only the first few lines of output from dataex. Normally, it will provide 100 observations of sample data.

In the absence of data, I'm going to guess that your data looks something like this, where I have specifically chosen not to use your variable names to emphasize that it is a guess.

Code:

+----------------------------+ | id x2009 x2010 x2011 | |----------------------------| | 1 11 . . | | 1 . 12 . | | 1 . . 13 | |----------------------------| | 2 21 . . | | 2 . 22 . | | 2 . . 23 | +----------------------------+

If not, then you need to show what your data is actually like.

If so, then reshape is not necessarily what you need.

The best approach would be to recreate your dataset correctly starting with the collection of four datasets for each year.

After each year's four datasets are merged, you should then generate a variable "year" with the year represented by that dataset, and you should rename each of the variables to remove the year from the name. Your failure to do so is what has led you to the confused dataset you now have.

Then, when you append the datasets from the years, you will have something that looks like this.

Code:

+----------------+ | id year x | |----------------| | 1 2009 11 | | 1 2010 12 | | 1 2011 13 | |----------------| | 2 2009 21 | | 2 2010 22 | | 2 2011 23 | +----------------+

That is what you should do. It would have the benefit of giving you practice in the correct techniques for merging and appending data.

If instead you want a quick fix to your current dataset, the following code reads in the first dataset shown above and transforms it to the second dataset. Note the example of preesntation of data using dataex.

Code:

clear input float(id x2009 x2010 x2011) 1 11 . . 1 . 12 . 1 . . 13 2 21 . . 2 . 22 . 2 . . 23 end list, noobs sepby(id) generate year = . generate x = . foreach y in 2009 2010 2011 { replace year = `y' if x`y' != . replace x = x`y' if x`y' != . } drop x???? list, noobs sepby(id)
1 like
Comment
George Kollias

Join Date: Apr 2018

Posts: 24
#11

14 May 2018, 13:43

Thank you for the reply and detailed code! I will try to generate year variables to specify each data-set then append and try to correctly use dataex next time or if any other problems persist.
Comment

Announcement

Insufficient observations with xtreg

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment