
  • egen rowmax produces additional decimals

    Hello. I'm working with US Census data in Stata 15.1. Each observation is a geographic area and the variables are the percentage of the population placed in several race/ethnicity categories. The percentages are in the format 12.34. I want a new variable with the percentage from the category with the highest percentage. I'm using the following command:
    Code:
    egen rowmax=rowmax(pctapi pctblack pctaian pctwhite pct2prace pcthispanic)
    However, if "pctwhite", for instance, is 55.19, "rowmax" = 55.189999. This creates a problem when I want to create a variable that contains "pctwhite" as the value to indicate the race/ethnicity category that "rowmax" came from. Hope that makes sense. I don't know if -egen rowmax- has a fix for this. I tried rounding manipulations with -int- but that seemed to produce a few records with "rowmax" off by 0.01 from the source variable. Thank you for any suggestions.

  • #2
    See -help precision-. Here is the short summary:

    Code:
        Justifications for all statements made appear in the sections below.  In summary,
    
            1.  It sometimes appears that Stata is inaccurate.  That is not true and, in fact, the appearance of inaccuracy happens in part because
                Stata is so accurate.
    
            2.  You can cover up this appearance of inaccuracy by storing all your data in double precision.  This will double (or more) the size of
                your dataset, and so we do not recommend the double-precision solution unless your dataset is small relative to the amount of memory
                on your computer.  In that case, there is nothing wrong with storing all your data in double precision.
    
                The easiest way to implement the double-precision solution is by typing set type double.  After that, Stata will default to creating
                all new variables as doubles, at least for the remainder of the session.  If all your datasets are small relative to the amount of
                memory on your computer, you can set type double, permanently; see [D] generate.
    
            3.  The double-precision solution is needlessly wasteful of memory.  It is difficult to imagine data that are accurate to more than float
                precision.  Regardless of how your data are stored, Stata does all calculations in double precision, and sometimes in quad precision.
    
                The issue of 1.1 not being equal to 1.1 arises only with "nice" decimal numbers.  You just have to remember to use Stata's float()
                function when dealing with such numbers.
    The whole helpfile is an instructive read.
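
    The float() advice in point 3 can be tried directly. A minimal sketch, assuming a scratch session with no data worth keeping in memory (the variable name x is only illustrative):

    Code:
    clear
    set obs 1
    * store 1.1 in a float, as -generate- does by default
    generate float x = 1.1
    * the literal 1.1 is evaluated in double precision, so this should count 0
    count if x == 1.1
    * rounding the literal to float precision first should count 1
    count if x == float(1.1)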



    • #3
      I couldn't reproduce your problem with a simple example:

      Code:
      * Example generated by -dataex-. For more info, type help dataex
      clear
      input float(var1 var2)
      12.34 56.78
      90.12 34.56
      end
      
      egen max = rowmax(var1 var2)
      
      list 
      
           +-----------------------+
           |  var1    var2     max |
           |-----------------------|
        1. | 12.34   56.78   56.78 |
        2. | 90.12   34.56   90.12 |
           +-----------------------+
      
      describe 
      
      Contains data
       Observations:             2                  
          Variables:             3                  
      -------------------------------------------------------------------------------
      Variable      Storage   Display    Value
          name         type    format    label      Variable label
      -------------------------------------------------------------------------------
      var1            float   %9.0g                 
      var2            float   %9.0g                 
      max             float   %9.0g
      However, clicking on any cell in the Data Editor shows more decimal places, which appear spurious until, as nicely explained by @Ali Atia's quotation, we recall that Stata necessarily works in binary.

      The "fix" is just to use an appropriate display format. %3.2f would insist on two decimal places.
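
      A short sketch of that display-format approach, assuming the var1, var2 and max variables from the example above are still in memory. Here %5.2f is used as a width that comfortably fits values such as 56.78. Note that -format- changes only how values are shown, not what is stored, so it does not affect equality tests:

      Code:
      * show two decimal places in -list- and in the Data Editor
      format var1 var2 max %5.2f
      list
      * the stored float is unchanged; a wider format shows its binary approximation
      display %20.15f max[1]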



      • #4
        Here's what I was doing. I'm sorry I wasn't clear the first time; I was posting at 11pm.

        It seems that -destring, replace- creates doubles by default, and unless I specify -egen double [varname]-, -egen- gives the additional decimal places.

        Changing the display format wouldn't allow me to match cells (see bottom of code).

        Now that it works, is there a way to reduce the file size by -recast- without getting the additional decimal points? Not a big problem with fast computers, but I'm wondering.

        Code:
        * Data from https://api.census.gov/data/2010/surname?get=PCTAPI,PCTBLACK,PCTAIAN,PCTWHITE,COUNT,PCT2PRACE,PCTHISPANIC,NAME&RANK=:100000
        * Reduce 100 000 to 1000 for a simpler exercise
        
        import delimited "E:\Surname Race BG\2010 Census Surnames race top100 000.txt", clear
        
        drop v10
        
        foreach var in pctapi pctblack pctaian pctwhite pct2prace {
         replace `var'=trim(`var')
            }
        foreach var in pctapi pctblack pctaian pctwhite pct2prace pcthispanic {
         replace `var'=usubinstr(`var',"("," ",.)
            }
            
        foreach var in pctapi pctblack pctaian pctwhite pct2prace pcthispanic {
         replace `var'=usubinstr(`var',")"," ",.)
            }
        
        foreach var in pctapi pctblack pctaian pctwhite pct2prace pcthispanic {
         replace `var'=usubinstr(`var',"S"," ",.)
            }
            
        destring, replace
        
        foreach var in pctapi pctblack pctaian pctwhite pct2prace pcthispanic {
         replace `var'=0 if `var' ==.
            }
        
        * -egen- double solves the problem;  -egen- alone creates this problem
        
        egen double rowmax=rowmax(pctapi pctblack pctaian pctwhite pct2prace pcthispanic)
        
        egen rowmax_2=rowmax(pctapi pctblack pctaian pctwhite pct2prace pcthispanic)
        
        * without -egen double- the values don't match
        gen match_1=pctwhite==rowmax
        
        gen match_2=pctwhite==rowmax_2
        
        codebook match*
        /* output:
        -----------------------------------------------------------------------------------------------------------------------------
        match_1                                                                                                           (unlabeled)
        -----------------------------------------------------------------------------------------------------------------------------
        
                          type:  numeric (float)
        
                         range:  [0,1]                        units:  1
                 unique values:  2                        missing .:  0/100,301
        
                    tabulation:  Freq.  Value
                                17,547  0
                                82,754  1
        
        -----------------------------------------------------------------------------------------------------------------------------
        match_2                                                                                                           (unlabeled)
        -----------------------------------------------------------------------------------------------------------------------------
        
                          type:  numeric (float)
        
                         range:  [0,1]                        units:  1
                 unique values:  2                        missing .:  0/100,301
        
                    tabulation:  Freq.  Value
                                96,640  0
                                 3,661  1
        */
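
        A possible alternative to -egen double-, following the float() advice quoted in #2. This is a sketch only, assuming the variables created in the code above; match_3 is a hypothetical name:

        Code:
        * rowmax_2 is a float, so round pctwhite to float precision before comparing
        gen match_3 = float(pctwhite) == rowmax_2
        * if the reasoning holds, this should report 0 disagreements with match_1
        count if match_3 != match_1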



        • #5
          In short, the problem, although linked to precision, was really about relying on Stata's default in egen to create a float when you are taking a maximum over doubles. In this case you are better advised to specify a double.

          It's a murky area. In my own code downloadable by others I've probably covered the entire spectrum from watching out for these problems to using Stata's defaults and not even letting the user specify otherwise. Official code should be more careful. Certainly with egen it is documented that you can and should spell out a preferred storage type if you prefer it.
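
          A small self-contained illustration of that point, with made-up values (a, b, m_double and m_float are illustrative names, not taken from the poster's data):

          Code:
          clear
          input double(a b)
          55.19 12.34
          end
          * same maximum, stored at two different precisions
          egen double m_double = rowmax(a b)
          egen float m_float = rowmax(a b)
          * the double copy equals its source; the float copy has been rounded
          assert a == m_double
          assert a != m_float
          * comparing at float precision on both sides restores the match
          assert float(a) == m_float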



          • #6
            Thank you. The documentation on precision and decimals could also be clearer. It seems to be written for people familiar with low-level computer software or with other computer languages, rather than for the median (and below) user. Perhaps I should take a look again now that I'm rested.



            • #7
              I have been a minor contributor to the explanatory literature, although I don't take it that you are commenting in my direction. The essence is that in order to understand something that is often puzzling, you need to learn a little about what is going on under the hood, which starts with something like "even 0.1 can't be held exactly in binary".

              Oddly enough, my first lesson on computers, around 1964, was when a mathematics teacher decided to break off from the rather dusty syllabus and teach us a little about binary arithmetic.

              I am not clear whether my own students ever got that lesson. In some countries it seems that nothing serious is taught before graduate school and even then the presumption is that you are learning one subject.
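
              The starting point mentioned above, that even 0.1 cannot be held exactly in binary, can be seen by asking for more decimal places than usual. A minimal sketch; neither of the first two lines should print as exactly one tenth:

              Code:
              * 0.1 as a double: close to, but not exactly, one tenth
              display %21.18f 0.1
              * 0.1 rounded to float precision: visibly further from one tenth
              display %21.18f float(0.1)
              * the hexadecimal format shows the binary representation itself
              display %21x 0.1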



              • #8
                The only comment directed at you was "Thank you."



                • #9
                  Hello,
                  I am using the following command to multiply across 3 variables in the same row. I keep getting 0 instead of the actual result, which should be a number between 0 and 1.

                  Code:
                  egen var1= prod(var2*var3*var4), by(id) pmiss(ignore)

                  Any advice would be much appreciated. I read -help precision- and I tried the command -set type double-, but it didn't help.



                  • #10
                    The advice given most often in this Forum is to read the FAQ before posting anything. This also applies here.

                    If we can't see your data, we can't give you better advice, because we cannot know why you get a result that differs from what you expect. Hence, read the FAQ and try to follow its instructions closely, especially with respect to #12.



                    • #11
                      lana chahine in #9: Your question has nothing to do with the title of the thread and would have been better as a new thread.

                      Dirk Enzmann gave excellent advice. A data example would be really helpful. Indeed, you should have explained, as below, that you are using a community-contributed function.

                      I can nevertheless hazard a guess at what is happening. The egen function prod() is community-contributed. It will in your case first calculate the product var2*var3*var4 in each observation and then calculate the product of that across observations for the same identifier. If any zero is met anywhere, then zero is the resulting product.

                      I am guessing that you really want the row product, period, but if so it would seem simpler to work with

                      Code:
                      gen double wanted = var2 * var3 * var4
                      unless you want some special behaviour, such as ignoring zeros, ignoring missings, or whatever.
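
                      As one example of such special behaviour, here is a sketch that ignores missing values by treating them as a factor of 1 in the product. The name wanted2 is hypothetical, and whether this treatment is sensible depends on the data:

                      Code:
                      * missing values contribute a factor of 1, so they do not affect the product
                      gen double wanted2 = cond(missing(var2), 1, var2) * cond(missing(var3), 1, var3) * cond(missing(var4), 1, var4)
                      * note: an observation with all three missing gets 1, which may need separate handling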


                      Code:
                      STB-60  dm87  . . . . . . . . . .  Calculating the row product of observations
                              (help rprod if installed) . . . . . . . . . . . . . . . . . .  P. Ryan
                              3/01    pp.3--4; STB Reprints Vol 10, pp.39--41
                              generates new variable whose values are the row product of
                              observations
                      
                      STB-51  dm71  . . . . . . . . . . . .  Calculating the product of observations
                              (help prod if installed)  . . . . . . . . . . . . . . . . . .  P. Ryan
                              9/99    pp.3--4; STB Reprints Vol 9, pp.45--48
                              extension to egen for producing the product of observations



                      • #12
                        Hello. In my notes, the variable "cell" created using the egen command takes on the values one to six, but mine gave me 9, and the assumption of equal variances is violated. Could someone help, please?



                        • #13
                          My first inclination was to simply ignore your post. On second thoughts, however, here are some comments that may be helpful to you:
                          • You did not explain what you want to achieve.
                          • You did not show us your data.
                          • You did not show us the commands you actually used.
                          • Why equality of variances is important to you, and why this should be related to the egen command (which function, precisely?), is a mystery to me.
                          • It is not clear whether your question really continues the topic (thread) to which you responded.
                          To sum up: please read the FAQ of the Stata Forum thoroughly and then try to reformulate your question. If it has nothing to do with "additional decimals of egen rowmax", you will need a new topic with an appropriate title.

