  • Generating Percentiles and storing them as a variable

    Hi,

    My first time posting on this forum. I am using Stata 14.

    I am trying to create a variable that holds the percentile value corresponding to each value of a given variable. For example, if the 10th percentile has a value of 15.5 then all temperatures less than 20.2 are reflected as 20.2. Similarly, temperatures between 20.2 and 23.2 are reflected as 23.2. I intend to include this variable (along with others) in a table.

    The code I have been using (unsuccessfully) follows.

    sysuse citytemp4
    pctile jan_perc = tempjan, nq(10)
    levelsof jan_perc

    * This gives values for the different percentiles.

    gen jan_var = 0

    forvalues i of jan_perc {
    replace jan_var = `i' if tempjan > `i'
    }

    I get an error message r(198)


    Any help on this simple issue would be much appreciated.

    Thanks,


  • #2

    It's not a good idea to tag any problem as simple until you've solved it!

    Your forvalues syntax is an illegal mishmash of ideas from forvalues and foreach, which is why you got an error message.
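
    For reference, a legal version of that loop might look like the sketch below. It assumes the cut points are first copied into a local macro with levelsof (the macro name cuts is chosen only for illustration) and then walked with foreach ... of local. It repairs the syntax only: each temperature ends up tagged with the largest cut point below it, which is probably not the tagging that was intended.

    ```stata
    sysuse citytemp4, clear
    * nq(10) yields 9 cut points, stored in the first 9 observations of jan_perc
    pctile jan_perc = tempjan, nq(10)
    * copy the distinct cut points into the local macro cuts (illustrative name)
    levelsof jan_perc, local(cuts)
    gen jan_var = .
    foreach c of local cuts {
        * levelsof returns values in ascending order, so later replaces overwrite earlier ones
        * missing values count as larger than any number, hence the explicit exclusion
        replace jan_var = `c' if tempjan > `c' & !missing(tempjan)
    }
    ```

    forvalues, by contrast, only steps through a numeric range such as 1/9; it cannot iterate over the values of a variable.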

    But I guess that you don't need a loop at all.

    I don't understand this:

    if the 10th percentile has a value of 15.5 then all temperatures less than 20.2 are reflected as 20.2
    What's the connection between 15.5 and 20.2?

    I guess what you want is

    1. equal-sized bins.

    2. each bin to be tagged with its upper limit.

    Code:
     
    sysuse citytemp4, clear
    xtile tempbin = tempjan, nq(10)
    egen tempmax = max(tempjan), by(tempbin)
    would be an answer to that.
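
    One way to check what that produces, sketched here with tabstat, is to list the size and limits of each bin. The closing assert is only a sanity check added for illustration.

    ```stata
    sysuse citytemp4, clear
    xtile tempbin = tempjan, nq(10)
    egen tempmax = max(tempjan), by(tempbin)
    * size, lower limit and upper limit of each bin; sizes need not be exactly equal if there are ties
    tabstat tempjan, by(tempbin) statistics(n min max)
    * tempmax should be constant within each bin
    bysort tempbin (tempmax): assert tempmax[1] == tempmax[_N]
    ```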

    The warnings about binning in many threads (e.g. just yesterday http://www.statalist.org/forums/foru...-into-terciles) may apply to your project.

    Binning is throwing away information. You really need to do that?
    There is some discussion of the limits on binning in http://www.stata-journal.com/sjpdf.h...iclenum=pr0054 Section 4



    • #3
      Thanks Nick for the prompt reply - the alternative code and links provided are very useful.

      Given the limits with binning, I have derived the following code from your reply. I have given each individual observation a unique number that ascends with the tempjan value, so I can then 'bin' on the observation number for a given set of requirements. Although the code is cumbersome, it doesn't seem to make a material difference in the citytemp4 data (compare tab tempmax28 with tab tempmax).

      Code:
      bys tempjan: gen seq18 = 1
      sort seq18
      replace seq18 = int((_n-1)+1)
      xtile bin18 = seq18, nq(10)
      tab bin18
      egen tempmax28 = max(tempjan), by (bin18)
      tab tempmax28
      tab tempmax
      Similarly, the code can be tweaked for binning within particular variables (say region). The code I came up with is below.


      Code:
      bys region tempjan: gen seq1 = 1 if region == 4
      sort seq1
      replace seq1 = int((_n-1)+1) if region == 4
      
      xtile bin1 = seq1, nq(10)
      tab bin1
      egen tempmax2 = max(tempjan) if region == 4, by (bin1)
      tab tempmax2

      Would I be right in thinking that it avoids the pairing issue and some of the other warnings, but that the choice is primarily dependent on my research needs?

      Apologies - another 'new to Stata question'. If I were to apply weights to the variables - would this be possible using the code I derived?
      Last edited by Dann Morgan; 11 Oct 2016, 08:29.



      • #4
        Binning is potentially a bad method, but replacing it with a home-grown method could be even worse. I think what you are doing here is just equivalent to

        Code:
        sort tempjan
        gen rank = _n
        but now if you bin on those ranks, then

        1. tied temperatures may just be put in different bins as they will receive different ranks;

        2. that won't be reproducibly consistent with respect to other variables as tied temperatures say 42, 42, 42, ... won't necessarily correspond to identical values on other variables (typically: won't).
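
        A rough way to see point 1 in the data is sketched below. The variable names rankbin and split are made up for the illustration; whether any temperatures are flagged depends on where the ties fall, and the result can change from run to run because sort orders tied values arbitrarily.

        ```stata
        sysuse citytemp4, clear
        sort tempjan
        gen rank = _n
        xtile rankbin = rank, nq(10)
        * flag temperatures whose tied observations ended up in more than one bin
        bysort tempjan (rankbin): gen byte split = rankbin[1] != rankbin[_N]
        count if split & !missing(tempjan)
        ```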

        Advice: Don't do this ever. It's not a solution to problems with ties that makes any sense. It just entails a fiction that the tied values differ, and an arbitrary fiction to boot.

        Otherwise put, if this were a solution to problems with ties, it would have been thought of long ago and implemented.

        (When you say "pairing" I assume you mean ties, but ties could be two-fold, three-fold, and so forth.)

        Note that in any case, egen includes rank functions, so there is no need to write your own code to rank data, which is often helpful, even if not so here. There is a unique option in there, which I wrote originally, but it was not included with your problem and your proposal in mind!
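
        As a brief sketch, the rank() function of egen and its documented options for handling ties can be compared side by side (the new variable names are illustrative):

        ```stata
        sysuse citytemp4, clear
        * default: tied values share the mean of the ranks they span
        egen r_default = rank(tempjan)
        * field: 1 + number of values that are higher
        egen r_field = rank(tempjan), field
        * track: 1 + number of values that are lower
        egen r_track = rank(tempjan), track
        * unique: ranks 1, ..., N with ties broken arbitrarily
        egen r_unique = rank(tempjan), unique
        list tempjan r_* in 1/10
        ```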
        Last edited by Nick Cox; 11 Oct 2016, 09:14.



        • #5
          We read in #2:

          It's not a good idea to tag any problem as simple until you've solved it!
          Simply put (sorry for the pun), the sentence could well become an axiom.

          I always wished to say this, but words failed me.

          Indeed, it is up to the one who solves the problem to "tag" it as simple or not. Even so, he or she may often tag it as "simple" precisely because of the "much complexity" amassed and digested over a long time.

          The thread is emblematic of this point of view. Pitfalls in data management are found galore.

          Best regards,

          Marcos



          • #6
            Thanks again.

            If I wanted to obtain the percentiles of temperatures across regions (and the respective frequencies), I would therefore use the following code. It seems to produce something similar to the code I created, but looks like a better alternative.

            Code:
            xtile tempbin90 = tempjan if region == 4, nq(10)
            egen tempmax90 = max(tempjan) if region == 4, by(tempbin90)
            tab tempmax90

            (I am unable to install -egenmore-).



            • #7
              I don't understand why that code is described as "across regions" when the code is restricted to region 4. Otherwise it is precisely what I suggested in #2.

              Nobody in this thread mentioned egenmore (SSC) before. I don't think any function there is needed for or even relevant to your questions so far.



              • #8
                I think I might have been barking up the wrong tree. Essentially, I was trying to create the relevant variables which I could then use to create a table.

                For example, the percentiles for Jan: both temperature and frequency for each region. Percentiles would be going down the table, while temp, freq (by region) across.

                I'll have to revisit my train of thought.



                • #9
                  Now I too am lost. But percentiles are defined by how many values are below them. In principle they don't have frequencies themselves. In practice each percentile only occurs more than once if there are ties in the data, which for continuous variables capriciously depends on data resolution, such as the number of decimal places reported.
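
                  To make that concrete, here is a small sketch that computes the deciles with _pctile and counts how many observations sit exactly on the first one. The local macro p10 is just an illustrative name, and float() is used because tempjan is stored as a float.

                  ```stata
                  sysuse citytemp4, clear
                  * _pctile leaves the 9 deciles in r(r1), ..., r(r9)
                  _pctile tempjan, nquantiles(10)
                  local p10 = r(r1)
                  display "10th percentile of tempjan: " `p10'
                  * observations equal to that percentile; more than one only if there are ties
                  count if tempjan == float(`p10')
                  ```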



                  • #10
                    Dann: are you confusing percentiles and percentile groups? There are 99 of the former; they are values that divide an ordered set of values into 100 percentile groups. (Similarly, there are 9 deciles and 10 decile groups.) Given your mention now of a table, and assuming I've guessed right, you'd want to create a percentile group membership variable for temperature within each region, label the group values appropriately using -label-, and then do your tabulation. (This is notwithstanding the remarks about creation of quantile groups with small samples and treatment of ties that have been made in other posts.)
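
                    A minimal sketch of those steps follows, using decile groups to match the nq(10) used earlier in the thread. The names decgrp and decgrp_lbl are invented for the example, xtile is run once per region inside a loop, and the final -table- call uses the contents() syntax of Stata 16 and earlier, so it would need adapting in later versions.

                    ```stata
                    sysuse citytemp4, clear
                    gen byte decgrp = .
                    levelsof region, local(regions)
                    foreach r of local regions {
                        * decile groups of tempjan within this region only
                        tempvar g
                        xtile `g' = tempjan if region == `r', nq(10)
                        replace decgrp = `g' if region == `r'
                        drop `g'
                    }
                    * attach a value label to each group
                    forvalues k = 1/10 {
                        label define decgrp_lbl `k' "Decile group `k'", add
                    }
                    label values decgrp decgrp_lbl
                    * groups down the rows, regions across: frequency and upper limit of each group
                    table decgrp region, contents(freq max tempjan)
                    ```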



                    • #11
                      Thanks Stephen. I was thinking along those lines, as the number in each group would vary by region. I'll look into -label-.
