Impossible values in Stata example Dataset

Charlie Joyez

Join Date: Dec 2014

Posts: 421
#1

30 Aug 2016, 03:21

While looking for an example dataset to introduce Stata to some students, I found the highschool.dta (or multistage.dta) dataset among the examples available online at Stata Press (http://www.stata-press.com/data/r14/svy.html).

Code:

use http://www.stata-press.com/data/r14/multistage, clear
* Or more simply: webuse multistage.dta
svyset county [pw=sampwgt], strata(state) fpc(ncounties) || school, fpc(nschools)
su weight height, de

Observations are supposed to be individuals, with information on their gender, weight and height (among other variables).
You can see that the individuals' weight is quite normal; the height, however, is clearly impossible (for those who, like me, are more familiar with the metric system, the mean height is around 11 meters).

I first thought that the data had been transformed on purpose, so as not to reveal confidential data (but why use such values?).
I then thought it was the wrong unit, but I found no other height unit that fits these values. Nor is it a misplaced decimal point (43 inches would be too small), a squared value, etc.

However, I found some material (here, p.13) that uses the same dataset and reports a mean height of around 67 inches, not 430.
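As a quick sanity check on the unit reasoning above, here is a minimal sketch that converts the three candidate means to meters. It assumes the stored values are meant to be inches and takes 430, 43 and 67 as the round figures quoted in this post, not as exact sample means.

```stata
* 1 inch = 0.0254 m (exact by definition)
* Mean height as stored in the dataset, read as inches
display 430 * 0.0254
* Same value with the decimal point shifted one place
display 43 * 0.0254
* Mean height reported in the other material
display 67 * 0.0254
```

These come to about 10.9 m, 1.09 m and 1.70 m respectively, so only the 67-inch figure is plausible for an adult height.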

I checked the r14, r13, r12, r11, r10 and r9 datasets and always got the same abnormal height values.
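The check across releases can be sketched as a loop. This assumes that every release serves the file under the same URL pattern as the r14 link above, which is an assumption of this sketch and not something confirmed here; `capture noisily` lets the loop continue if a release lacks the file.

```stata
* Summarize height in each release's copy of multistage.dta
* (URL pattern for releases other than r14 is assumed, not verified)
foreach r of numlist 9/14 {
    display as text "Release `r'"
    capture noisily use "http://www.stata-press.com/data/r`r'/multistage", clear
    if _rc == 0 {
        summarize height
    }
}
```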

In fact, I get the same values as in the Survey Data Reference Manual (http://www.stata.com/manuals14/svy.pdf), p.13, where only the variable weight is described but height is used as a regressor of weight, with an abnormal result:

Code:

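. svy: regress weight height
(running regress on estimation sample)

Survey: Linear regression

Number of strata   =        50                 Number of obs      =       4071
Number of PSUs     =       100                 Population size    =    8000000
                                               Design df          =         50
                                               F(   1,     50)    =     593.99
                                               Prob > F           =     0.0000
                                               R-squared          =     0.2787

------------------------------------------------------------------------------
             |             Linearized
      weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
      height |   .7163115   .0293908    24.37   0.000     .6572784    .7753447
       _cons |  -149.6183   12.57265   -11.90   0.000    -174.8712   -124.3654
------------------------------------------------------------------------------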

Here height is supposed to be in inches and weight in lbs. The regression predicts a negative weight for every observation with a height below about 209 inches (5.3 meters).
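The threshold follows from setting the fitted weight to zero, i.e. height = -_cons / coefficient on height. A minimal sketch using the coefficients printed above; the last line assumes the `svy: regress weight height` command has just been run so that the stored results are available.

```stata
* Height (inches) at which the fitted weight crosses zero
display 149.6183 / .7163115
* Same threshold in meters (1 inch = 0.0254 m)
display (149.6183 / .7163115) * 0.0254
* Equivalent, from stored results after -svy: regress weight height-
display -_b[_cons] / _b[height]
```

This gives roughly 208.9 inches, or about 5.3 meters.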

I was just wondering whether this was done on purpose. If so, how come the first reference I found had "normal" values?
If not, perhaps someone could update the example files.

Thanks,
Charlie

Last edited by Charlie Joyez; 30 Aug 2016, 03:25.
Nick Cox

Join Date: Mar 2014

Posts: 35725
#2

30 Aug 2016, 03:41

Good question. Let's add that no height or weight values I've ever seen use so many significant figures.
Andrew Musau

Join Date: Oct 2014

Posts: 10218
#3

30 Aug 2016, 04:23

My guess is that someone just created a normally distributed variable with an arbitrary mean and variance. For instruction purposes, I personally would not be bothered if the values were not "correct" or realistic. You also have the warning from Stata:

Datasets used in the Stata documentation were selected to demonstrate how to use Stata. Some datasets have been altered to explain a particular feature. Do not use these datasets for analysis.
Charlie Joyez

Join Date: Dec 2014

Posts: 421
#4

30 Aug 2016, 04:37

Andrew, I've seen the warning on the Stata Press page, and in my post I allowed that this could have been done on purpose. However, I haven't yet seen which feature is being highlighted by such inflated values in the SVY reference manual, and I like having strange things explained to me.

Moreover, I'm not bothered at all by these values for instruction purposes; that's not my point. It was just a heads-up, because it seemed to me that this dataset changed between the Gutierrez document in 2008 and the current dataset, and I wanted to make sure this change was intentional.

PS: I've just noticed that Gutierrez was a StataCorp member, which could explain why that document has the "true" or "initial" values of the statistics, but I still don't see the point in changing them.
Andrew Musau

Join Date: Oct 2014

Posts: 10218
#5

30 Aug 2016, 05:03

You are correct, it appears that the dataset did change at some point in time. I do not have an inside track on this.