  • How to save quantiles as new variables?

    Hi!

    I am trying to create new variables using quantiles. This is what I have done so far:

    Using xtile, I created 3 quantiles and tabulated to show the mean of each quantile. Like this:

    xtile quantiles_3 = data, nq(3)

    tabstat data, stat(n mean) by(quantiles_3)


    Creating a dummy variable will allow me to manually input the ranges for the quantiles. For instance,
    • Quantile 1 is between 0 and 12
    • Quantile 2 is between 12 and 28
    • Quantile 3 is between 28 and 87
    • Quantile 4 is anything greater than 87
    Question: How can I make the dummy variable with those ranges? Is there a quicker way to save the quantiles as independent variables?

    Thank you!
    Last edited by Lindsey Bates; 11 Sep 2020, 11:59.

  • #2
    Well, if you really need to create indicator variables for the quantiles, there are several ways to do it. Perhaps the simplest would be:
    Code:
    tab quantiles_3, gen(q)
    But do you really need to do that? For most things where you would use dummy variables, you can, instead, use factor variable notation. For example, if you wanted to use the quantile as a predictor variable in a regression, you could run that as:
    Code:
    regression_command outcome_variable i.quantiles_3
    and Stata will internally create the necessary indicator variables "on the fly" without cluttering up your data set with them. And using factor-variable notation also enables you to use the -margins- command after regressions to calculate all sorts of interesting group-level statistics.

    Do read -help fvvarlist- for details.

    It is only occasionally useful in Stata to create indicator variables to work with multi-level categorical variables.
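
    A minimal sketch of both approaches side by side. The auto dataset shipped with Stata, price standing in for data, and mpg as the outcome are all assumptions made purely for illustration:

    ```stata
    * Illustrative data only: price plays the role of "data" in the question
    sysuse auto, clear
    xtile quantiles_3 = price, nq(3)

    * Approach 1: explicit indicator variables q1, q2, q3, one per bin
    tabulate quantiles_3, generate(q)

    * Approach 2: factor-variable notation, no indicator variables stored
    regress mpg i.quantiles_3

    * Predicted mean of the outcome within each bin
    margins quantiles_3
    ```

    After the first approach the data set holds three extra variables; after the second it holds none, and -margins- still reports results bin by bin.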


    • #3
      Replying to Clyde Schechter's post #2:

      Thanks Clyde!

      Instead of using factor variable notation, how could a dummy variable be set up using the ranges? For instance, egen p25 = pctile(data), p(25) gives the 25th percentile and saves it as a new variable. I want to save each quantile as its own variable in the same way, something like egen Q1 = [range from the table of quantiles produced by xtile].

      Say this is the table generated by "xtile quantiles_3 = data, nq(3)" and "tabstat data, stat(n mean) by(quantiles_3)":
      Quantile_3     N    Mean
      1             23      12
      2             45      28
      3             78      87
      Using these results, I want to create dummy variables for ranges bounded by the means. How can such variables be created from the specific numbers in the table?
      Q1=0-12, Q2=12-28, Q3=28-87, Q4=>87
      Last edited by Lindsey Bates; 11 Sep 2020, 12:47.


      • #4
        There is some confusion here -- for which much literature is to blame -- between (a) quantile meaning a level defined by an associated cumulative probability and (b) quantile meaning bin, class or interval delimited by quantiles in sense (a).

        Usually this confusion doesn't bite hard but -- despite the excellent answer from Clyde Schechter in #2 -- I am at a loss to know what you want to appear in your new variable. For example, the pctile() function of egen generates quantiles in sense (a) and xtile generates quantiles in sense (b), so what else is there?



        More generally, I am puzzled at the idea that a coarse grouping of bins is more informative or predictive than the variable they bin: not only is there a loss of information, but the bins so produced often fail even to be equally populated.

        I suggest that you create a worked example showing some data and what it is that you want to produce.
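
        To make the two senses concrete, here is a small sketch. It assumes the auto dataset shipped with Stata, with price standing in for your variable, and uses four bins rather than three only so that the cut points are the familiar quartiles:

        ```stata
        * Illustrative data only
        sysuse auto, clear

        * Sense (a): a quantile is a level. Each new variable is a constant.
        egen p25 = pctile(price), p(25)
        egen p50 = pctile(price), p(50)
        egen p75 = pctile(price), p(75)

        * Sense (b): a quantile is a bin. The new variable takes values 1 to 4.
        xtile bin4 = price, nq(4)

        * Count and range of price within each bin
        tabstat price, statistics(n min max) by(bin4)
        ```

        The variables p25, p50 and p75 hold the same value in every observation, while bin4 records which interval each observation falls in.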


        • #5
          Replying to Nick Cox's post #4:

          Sorry for the confusion!

          Okay, here it is:

          Say I have a data set that I have split into 3 quantile groups. The following is the code I used:
          Code:
          xtile quantiles_3 = data, nq(3)
          tabstat data, stat(n mean) by(quantiles_3)
          That code produced this table:
          Quantile_3     N    Mean
          1             23      12
          2             45      28
          3             78      87
          What I want is to create new dummy variables using the means in the table above. So:
          Var1 = [range of 0 to 12]
          Var2 = [range of 12 to 28]
          Var3 = [range of 28 to 87]
          Var4 = [range of anything greater than 87]

          I don't want to manually make new variables with the numeric values because I need to use each new variable to then find the frequencies by a different variable known as cost.

          So I would need something like:
          gen Var1 = 1 if between 0 and (Mean(Quantile_3) of 1)
          gen Var2 = 1 if between (Mean(Quantile_3) of 1) and (Mean(Quantile_3) of 2)
          gen Var3 = 1 if between (Mean(Quantile_3) of 2) and (Mean(Quantile_3) of 3)
          gen Var4= 1 if greater than (Mean(Quantile_3) of 3)

          ^ obviously not written like that, but something similar. I think I am overthinking it and just need some simple code to get the new variables.

          Does that make sense?


          • #6
            Thanks for #5. What you want is very idiosyncratic and raises real questions for me about whether it is ever used in research or why it improves on direct quantile binning.
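
            For the record, the construction described in #5 could be sketched roughly as follows. This is an illustration, not a recommendation. It assumes the auto dataset shipped with Stata, with price standing in for data and foreign standing in for the cost variable. Putting each bin mean in the lower of the two ranges it borders is also an assumption, since #5 does not say how the boundaries should be treated:

            ```stata
            * Illustrative data only
            sysuse auto, clear
            xtile quantiles_3 = price, nq(3)

            * Mean of price within each bin, stored in scalars m1, m2, m3
            forvalues k = 1/3 {
                summarize price if quantiles_3 == `k', meanonly
                scalar m`k' = r(mean)
            }

            * Indicators for the ranges bounded by those means
            generate byte Var1 = (price <= scalar(m1)) if !missing(price)
            generate byte Var2 = (price > scalar(m1) & price <= scalar(m2)) if !missing(price)
            generate byte Var3 = (price > scalar(m2) & price <= scalar(m3)) if !missing(price)
            generate byte Var4 = (price > scalar(m3)) if !missing(price)

            * The same information as a single categorical variable
            generate byte meanbin = 1 + (price > scalar(m1)) + (price > scalar(m2)) ///
                + (price > scalar(m3)) if !missing(price)

            * Frequencies of the new classes by another variable
            tabulate meanbin foreign
            ```

            The single variable meanbin is usually easier to tabulate than four separate indicators, and it can be used with factor-variable notation as in #2.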
