  • Impossible values in Stata example Dataset

    Looking for an example dataset to introduce Stata to some students, I found the highschool.dta (or multistage.dta) dataset among the examples available online at Stata Press (http://www.stata-press.com/data/r14/svy.html).
    Code:
    use http://www.stata-press.com/data/r14/multistage ,clear
    * Or more simply :  webuse multistage.dta
    svyset county [pw=sampwgt], strata(state) fpc(ncounties) || school, fpc(nschools)
    
    su weight height,de
    Observations are supposed to be individuals, with information on their gender, weight and height (among others).
    The individuals' weights look quite normal; however, the heights are clearly impossible (for those who, like me, are more familiar with the metric system, the mean height is around 11 meters).
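
    A quick way to check the metric figure, as a minimal sketch that assumes height is recorded in inches (the factor 0.0254 meters per inch is the only value not taken from the dataset):
    Code:
    summarize height, meanonly
    * mean height converted from inches to meters
    display r(mean)*0.0254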

    I first thought that the data had been transformed on purpose, so as not to reveal confidential data (but why use such values?).
    I then thought it was the wrong unit, but I found no other height unit that fits these values. Also, it is not a misplaced decimal point (43 inches would be too small), nor a squared value, etc.

    However, I found some material (here, p.13) where, using the same dataset, they get a mean height of around 67 inches, not 430.

    I have checked the r14, r13, r12, r11, r10 and r9 datasets, and always got the same abnormal values of height.

    Actually, I get the same values as in the Survey Data Reference Manual (http://www.stata.com/manuals14/svy.pdf), p.13, where only the variable weight is described, but height is used as a regressor of weight, and one can see an abnormal result:

    Code:
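    . svy: regress weight height
    (running regress on estimation sample)

    Survey: Linear regression

    Number of strata   =        50        Number of obs     =       4071
    Number of PSUs     =       100        Population size   =    8000000
                                          Design df         =         50
                                          F(   1,     50)   =     593.99
                                          Prob > F          =     0.0000
                                          R-squared         =     0.2787

    ------------------------------------------------------------------------------
                 |             Linearized
          weight |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
          height |   .7163115   .0293908    24.37   0.000     .6572784    .7753447
           _cons |  -149.6183   12.57265   -11.90   0.000    -174.8712   -124.3654
    ------------------------------------------------------------------------------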
    Here height is supposed to be in inches and weight in lbs. The regression predicts a negative weight for every observation with a height below about 209 inches (5.3 meters).
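
    The threshold can be read off the fitted coefficients directly. A minimal sketch, to be run right after the svy: regress command above and assuming height is in inches:
    Code:
    * height at which the fitted line crosses zero weight: -_cons / slope
    display -_b[_cons]/_b[height]
    * the same threshold in meters
    display -_b[_cons]/_b[height]*0.0254
    With the coefficients reported above this is 149.6183/.7163115, roughly 209 inches or 5.3 meters.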

    I was just wondering whether this was done on purpose or not. If it was, how come the first reference I found had "normal" values?
    If it was not, perhaps someone could update the example files.

    Thanks,
    Charlie
    Last edited by Charlie Joyez; 30 Aug 2016, 03:25.

  • #2
    Good question. Let's add that no height or weight values I've ever seen use so many significant figures.


    • #3
      My guess is that someone just created a normally distributed variable with an arbitrary mean and variance. For instruction purposes, I personally would not be bothered if the values were not "correct" or realistic. You also have the warning from Stata:

      Datasets used in the Stata documentation were selected to demonstrate how to use Stata. Some datasets have been altered to explain a particular feature. Do not use these datasets for analysis.


      • #4
        Andrew, I've seen the warning on the Stata Press page, and I assumed in my post that this could have been done on purpose. However, I haven't yet seen which feature is being highlighted by such inflated values in the SVY reference manual, and I like it when strange things are explained to me.

        Moreover, I'm not bothered at all by these values for instruction purposes; that's not my point. It was just a heads-up, because it seemed to me that this dataset changed between the Gutierrez document in 2008 and the current dataset, and I wanted to make sure this change was intentional.

        PS: I've just noticed that Gutierrez was a StataCorp member, which could explain why that document has the "true" or "initial" values of the statistics, but I still don't see the point in changing them.


        • #5
          You are correct, it appears that the dataset did change at some point in time. I do not have an inside track on this.
