  • Plausible transformation of ordinal into continuous with similar distribution

    Dear Statalisters,

    for teaching purposes, I would like to generate a continuous variable (with a range of, say, 0-100) that is based on the distribution of a 5-point Likert-scale ordinal variable. I could simply use the runiform() function to generate a variable with the desired range, but it is important that the new variable is created within the existing dataset and correlates (highly) with its ordinal base. Given this, I could use the cumulative distribution function of the ordinal variable to generate the continuous variable, but the uniform distribution that results from the cumul command is not well suited to what I am trying to accomplish with the data (a similar problem occurs when I try to predict it as a latent response variable via ordered logit).

    Ideally, the procedure I am looking for would generate random values with a pre-defined standard deviation around the five categories of the existing ordinal variable. Drawing from a uniform distribution separately for all five categories rather than for the whole variable might also do the trick (a corresponding procedure has been proposed in the linked paper, but I can't implement it in Stata).

    If anyone has an idea how to solve this problem, the procedure could be demonstrated using the rep78 variable from the auto2 dataset. I am using Stata 13.1.

    sysuse auto2.dta
    histogram rep78, discrete // This is roughly the distribution I would like to end up with, but with (randomly distributed) deviations from the realized values.

    Any help is greatly appreciated!

    Timm

  • #2
    Just a few thoughts.

    1. The whole thing seems to be kind of "making up" data. In other words, it appears to be a single(!) imputation method that replaces observed (integer) values with estimated/drawn (real) values. I have no problem with the general idea of replacing observed values with imputed values, but if one does this, I guess one needs to account for the fact that these values are estimated/drawn, probably by generating multiple imputations. Otherwise the confidence intervals estimated for coefficients based on such variables will probably be inappropriate, as they do not reflect our uncertainty about the "true" values.

    2. The many "probably"s above make me kind of nervous. I would much rather see the "complicated likelihoods" and a mathematical proof of unbiasedness than rely solely on (inappropriate, see 3 below) simulations. This is not to say simulations are a bad idea, and I am aware of their increasing importance (one of the best examples is the chained equations approach in MI, which is based solely on simulation results).

    3. The simulation shown is, in my view, inappropriate and does not really relate to the problem. Starting from an ordered model, I would like to see results for a transformed ordinal response, not a transformed predictor as in the linked paper. Ordinal predictors can be used in any type of regression model and there is little need to transform them. So even if it can be shown that such transformations give unbiased estimates (under specific circumstances) for predictors, this tells us little about how a transformed response will behave, i.e. about the difference between ordered models and plain vanilla linear regression.

    Edit
    3.a By the way, regressing a continuous response on an ordinal predictor while treating it as continuous, i.e. estimating a single coefficient instead of creating indicator variables for the different levels, implicitly assumes what the transformation later does more explicitly. So the relatively small differences shown in the simulation are not very surprising, given that the data are created in such a way that the ordinal predictor indeed has a linear effect.

    This raises the question of why a transformation should be needed in such a situation at all. But maybe it is just me not getting it.
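To make the contrast in 3.a concrete, here is a minimal sketch using the auto data mentioned in the original post. The choice of price as the response is an assumption made purely for illustration; it is not taken from the simulation in the linked paper.

```stata
sysuse auto, clear

// One slope: rep78 is treated as if it were continuous
regress price rep78

// One indicator per level: no linearity assumption
regress price i.rep78

// Linear term plus indicators: Stata omits one more indicator because of
// collinearity, and testparm jointly tests the remaining ones, i.e. the
// departure from a linear effect
regress price c.rep78 i.rep78
testparm i.rep78
```

If the joint test gives little evidence against linearity, the single-coefficient model and the indicator model will tell much the same story, which is the situation described in 3.a.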


    Best
    Daniel
    Last edited by daniel klein; 13 Jan 2015, 04:01.



    • #3
      Here is a try, anyway

      Code:
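      sysuse auto ,clear
      
      // continuous version of rep78; stays missing where rep78 is missing
      g rep78c = .
      qui levelsof rep78 ,l(lvls)
      foreach l of loc lvls {
          // uniform draw on [l - .5, l + .5) around each observed level
          loc a = `l' - .5
          loc b = `l' + .5
          replace rep78c = ///
          `a' + (`b' - `a') * runiform() if (rep78 == `l')
      }
      
      hist rep78c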
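The original post also asked for draws with a pre-defined standard deviation around each category and for a 0-100 range. A minimal sketch of that variant follows; the standard deviation of 0.3, the seed, and the variable names rep78n and rep78s are assumptions chosen for illustration only.

```stata
sysuse auto, clear
set seed 12345                      // arbitrary seed, for reproducibility only

// Normal noise with an assumed SD of 0.3 around each observed category;
// observations with missing rep78 stay missing
generate double rep78n = rep78 + rnormal(0, 0.3)

// Rescale to the 0-100 range asked for in the original post
summarize rep78n, meanonly
generate double rep78s = (rep78n - r(min)) / (r(max) - r(min)) * 100

// Check how closely the new variable tracks its ordinal base
correlate rep78 rep78s
histogram rep78s
```

A smaller standard deviation keeps the five clusters more distinct and the correlation higher; a larger one blurs the categories into each other.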



      • #4
        Here's another one. You'd need to install the user-written package JNSN from SSC (search jnsn), which allows fitting Johnson-type distributions to the ordered-categorical variable (with either the command jnsn or jnsw, both in the package) and then (with the command ajv in the same package) to generate a continuous random variable that emulates the distribution.
        Code:
        version 13.1
        
        clear *
        set more off
        set seed `=date("2015-01-13", "YMD")'
        quietly set obs 500
        
        generate double randu = rpoisson(4)
        
        summarize randu, meanonly
        quietly replace randu = (randu - r(min)) / (r(max) - r(min)) * 100
        scalar define segment_width = 20
        
        generate byte discretized = 1
        forvalues cut = 1/5 {
            replace discretized = discretized + 1 if `cut' * segment_width < randu
        }
        
        *
        * Begin here (ordered-categorical variable is named "discretized")
        *
        jnsn discretized
        ajv , distribution(`r(johnson_type)') generate(suv) gamma(`r(gamma)') ///
            delta(`r(delta)') xi(`r(xi)') lambda(`r(lambda)')
        
        // Rescale 0 - 100
        summarize suv, meanonly
        quietly replace suv = (suv - r(min)) / (r(max) - r(min)) * 100
        
        histogram randu, name(Original)
        histogram discretized, discrete name(Discrete)
        histogram suv, name(Regenerated)
        graph combine Original Discrete Regenerated
        
        exit
        I'm curious about what kind of teaching purposes this exercise is for.



        • #5
          Thank you very much to you both for the helpful suggestions. While Joseph's approach assigned the values associated with the distribution of the ordered variable randomly over my dataset, Daniel's code resulted in a high correlation between the ordered variable and the new continuous one and thus preserved the relationships with other covariates. This is what I wanted to accomplish.

          Because both of you seemed to find the question a bit odd, let me clarify why I am doing this. Of course I would never use this procedure for any kind of research but, as I alluded to, it is for teaching purposes. More specifically, I am currently designing the end-of-term exam for an introductory class in quantitative methods. Students will get a dataset with which they must estimate both an OLS and a logit regression (and compare the results). For the logit, the depvar will be a dummy with the highest category of the ordinal variable coded as 1. For the OLS, it will be the newly transformed continuous variable. The reasons for transforming the variable, then, are a) to be able to use the same base variable for both estimation procedures, b) to give the illusion of 'natural' data and to avoid confusing students with distributions they haven't seen (like the uniform distribution derived from the cumulative distribution function), and c) to not complicate regression diagnostics.
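For anyone wanting to reproduce the exam setup described above, here is a rough sketch on the auto data. The predictors mpg, weight and foreign, the seed, and the variable names are assumptions picked for illustration, not the actual exam variables.

```stata
sysuse auto, clear
set seed 12345                      // arbitrary seed, for reproducibility only

// Continuous version: uniform draw within +/- 0.5 of each category, as in #3
generate double rep78c = rep78 - 0.5 + runiform()

// Binary version: 1 for the highest category, 0 otherwise, missing kept missing
generate byte rep78top = (rep78 == 5) if !missing(rep78)

// Same predictors for both models so that students can compare the results
regress rep78c mpg weight foreign
logit rep78top mpg weight foreign
```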

          I hope that clears things up. Again, thank you very much to both of you for helping me out!



          • #6
            Thanks for the clarification.

            As an alternative approach, you might want to start with a continuous variable and create your binary response using a median split. While still usually a bad idea in research, this is easier to implement.
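A median split of that kind could be sketched as follows. The use of mpg as the continuous variable and the name mpg_high are assumptions for illustration only.

```stata
sysuse auto, clear

// The median is returned in r(p50) after summarize with the detail option
summarize mpg, detail

// 1 if above the median, 0 otherwise; missing values stay missing
generate byte mpg_high = (mpg > r(p50)) if !missing(mpg)

tabulate mpg_high
```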

            Best
            Daniel



            • #7
              I forgot about the correlation requirement, sorry. Daniel's answered your query already, but just as an aside you can induce a high correlation by adding the following to the code that I show above.
              Code:
              tempfile tmpfil0
              preserve
              drop suv
              sort discretized
              generate int row = _n
              quietly save `tmpfil0'
              
              restore
              drop discretized
              sort suv
              generate int row = _n
              merge 1:1 row using `tmpfil0', assert(match) nogenerate noreport
              
              spearman discretized suv
              
              exit
              This snippet, when appended to the rest of the code above, gives a Spearman's rho of 0.96 between the ordered-categorical variable and the new continuous one.
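The sort-and-merge above pairs the k-th smallest category with the k-th smallest continuous value. The same rank matching can be sketched without a temporary file by using explicit subscripting. The stand-in data below (uniform draws in place of the jnsn/ajv output) and the names pos and suv_matched are assumptions for illustration.

```stata
clear
set seed 12345                      // arbitrary seed, for reproducibility only
quietly set obs 500

// Stand-in data: a five-category variable and an unrelated continuous one
generate byte discretized = 1 + floor(5 * runiform())
generate double suv = 100 * runiform()

// Position of each observation when ordered by the categorical variable
sort discretized
generate long pos = _n

// With the data sorted by suv, suv[pos] is the pos-th smallest value, so
// each observation receives the continuous value of matching rank
sort suv
generate double suv_matched = suv[pos]

spearman discretized suv_matched
```

Ties within a category are broken arbitrarily by sort, so which continuous value lands on which observation inside a category is random, but the ordering across categories is preserved.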



              • #8
                Thanks, Joseph, that works too.

                Originally posted by daniel klein
                As an alternative approach, you might want to start with a continuous variable and create your binary response using a median split. While still usually a bad idea in research, this is easier to implement.
                I have done so before, but this time it just so happened the theoretically relevant variable was only available in ordered form. Also, solving the problem became an end in itself at some point, so thanks again for the help!
