egen rowmax produces additional decimals

Doug Hess

Join Date: Nov 2016

Posts: 58
#1

31 Aug 2021, 21:05

Hello. I'm working with US Census data in Stata 15.1. Each observation is a geographic area and the variables are the percentage of the population placed in several race/ethnicity categories. The percentages are in the format 12.34. I want a new variable with the percentage from the category with the highest percentage. I'm using the following command:

Code:

egen rowmax=rowmax(pctapi pctblack pctaian pctwhite pct2prace pcthispanic)

However, if "pctwhite," for instance, is 55.19, "rowmax" = 55.189999. This creates a problem when I want to create a variable that contains "pctwhite" as the value to indicate the race/ethnicity category that "rowmax" came from. Hope that makes sense. I don't know if -egen rowmax- has a fix for this. I tried rounding manipulations with -int- but that seemed to produce a few records with "rowmax" off by 0.01 from the source variable. Thank you for any suggestions.

Ali Atia

Join Date: May 2020
Posts: 737

31 Aug 2021, 21:09

See -help precision-. Here is the short summary:

Code:

Justifications for all statements made appear in the sections below. In summary,

1. It sometimes appears that Stata is inaccurate. That is not true and, in fact, the appearance of inaccuracy happens in part because
Stata is so accurate.

2. You can cover up this appearance of inaccuracy by storing all your data in double precision. This will double (or more) the size of
your dataset, and so we do not recommend the double-precision solution unless your dataset is small relative to the amount of memory
on your computer. In that case, there is nothing wrong with storing all your data in double precision.

The easiest way to implement the double-precision solution is by typing set type double. After that, Stata will default to creating
all new variables as doubles, at least for the remainder of the session. If all your datasets are small relative to the amount of
memory on your computer, you can set type double, permanently; see [D] generate.

3. The double-precision solution is needlessly wasteful of memory. It is difficult to imagine data that are accurate to more than float
precision. Regardless of how your data are stored, Stata does all calculations in double precision, and sometimes in quad precision.

The issue of 1.1 not being equal to 1.1 arises only with "nice" decimal numbers. You just have to remember to use Stata's float()
function when dealing with such numbers.

The whole helpfile is an instructive read.
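
The float() advice in the last quoted paragraph can be illustrated with a minimal sketch. It assumes the default storage type has not been changed from float (that is, no earlier -set type double-); the variable name x is made up for the illustration.

```stata
* Assumption: default storage type is float (Stata's out-of-the-box setting)
clear
set obs 1

* x is stored as a float, so it holds the nearest float to 1.1
generate x = 1.1

* the literal 1.1 is evaluated in double precision, so this reports 0
count if x == 1.1

* rounding the literal to float precision first makes the comparison succeed (reports 1)
count if x == float(1.1)
```

The same pattern applies to any "nice" decimal that is stored in a float variable and compared with a literal or with a double.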


Nick Cox

Join Date: Mar 2014
Posts: 35645

01 Sep 2021, 03:00

I couldn't reproduce your problem with a simple example:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input float(var1 var2)
12.34 56.78
90.12 34.56
end

egen max = rowmax(var1 var2)

list 

     +-----------------------+
     |  var1    var2     max |
     |-----------------------|
  1. | 12.34   56.78   56.78 |
  2. | 90.12   34.56   90.12 |
     +-----------------------+

describe 

Contains data
 Observations:             2                  
    Variables:             3                  
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
var1            float   %9.0g                 
var2            float   %9.0g                 
max             float   %9.0g

However, clicking on any cell in the Data Editor shows more decimal places, which appear spurious until, as nicely explained in @Ali Atia's quotation, we recall that Stata necessarily works in binary.

The "fix" is just to use an appropriate display format. %3.2f would insist on two decimal places.
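
As a minimal sketch of the display-format approach, here is the toy data again with a two-decimal format applied. The variable name rmax and the width 5 in %5.2f are choices made for this illustration only. Note that -format- changes only how values are shown, not what is stored.

```stata
* Same toy data as above
clear
input float(var1 var2)
12.34 56.78
90.12 34.56
end

egen rmax = rowmax(var1 var2)

* show exactly two decimal places; the stored values are unchanged
format var1 var2 rmax %5.2f
list

* a wider format reveals the binary approximation that is actually stored
display %20.15f rmax[1]
```

Because the stored values are untouched, a display format tidies listings but has no effect on comparisons made with ==.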


Doug Hess

Join Date: Nov 2016
Posts: 58

01 Sep 2021, 08:23

Here's what I was doing. I'm sorry I wasn't clear the first time; I was posting at 11pm.

It seems that -destring, replace- creates doubles by default, and without -egen double [varname]- the new variable is stored as a float, which gives the additional decimal places.

Changing the display format wouldn't allow me to match cells (see bottom of code).

Now that it works, is there a way to reduce the file size by -recast- without getting the additional decimal places? Not a big problem with fast computers, but I'm wondering.

Code:

* Data from https://api.census.gov/data/2010/surname?get=PCTAPI,PCTBLACK,PCTAIAN,PCTWHITE,COUNT,PCT2PRACE,PCTHISPANIC,NAME&RANK=:100000
* Reduce 100 000 to 1000 for a simpler exercise

import delimited "E:\Surname Race BG\2010 Census Surnames race top100 000.txt", clear

drop v10

foreach var in pctapi pctblack pctaian pctwhite pct2prace {
    replace `var' = trim(`var')
}

* replace "(", ")" and "S" with blanks
foreach var in pctapi pctblack pctaian pctwhite pct2prace pcthispanic {
    replace `var' = usubinstr(`var', "(", " ", .)
    replace `var' = usubinstr(`var', ")", " ", .)
    replace `var' = usubinstr(`var', "S", " ", .)
}
    
destring, replace

foreach var in pctapi pctblack pctaian pctwhite pct2prace pcthispanic {
    replace `var' = 0 if `var' == .
}

* -egen double- solves the problem; -egen- alone creates this problem

egen double rowmax=rowmax(pctapi pctblack pctaian pctwhite pct2prace pcthispanic)

egen rowmax_2=rowmax(pctapi pctblack pctaian pctwhite pct2prace pcthispanic)

* without -egen double- the values don't match
gen match_1=pctwhite==rowmax

gen match_2=pctwhite==rowmax_2

codebook match*
/* output:
-----------------------------------------------------------------------------------------------------------------------------
match_1                                                                                                           (unlabeled)
-----------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/100,301

            tabulation:  Freq.  Value
                        17,547  0
                        82,754  1

-----------------------------------------------------------------------------------------------------------------------------
match_2                                                                                                           (unlabeled)
-----------------------------------------------------------------------------------------------------------------------------

                  type:  numeric (float)

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/100,301

            tabulation:  Freq.  Value
                        96,640  0
                         3,661  1
*/

Nick Cox

Join Date: Mar 2014

Posts: 35645
#5

01 Sep 2021, 08:48

In short, the problem, although linked to precision, was really about relying on Stata's default in egen to create a float when you are taking a maximum over doubles. In this case you are better advised to specify a double.

It's a murky area. In my own code downloadable by others I've probably covered the entire spectrum from watching out for these problems to using Stata's defaults and not even letting the user specify otherwise. Official code should be more careful. Certainly with egen it is documented that you can and should spell out a preferred storage type if you prefer it.
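
A minimal sketch of the difference, using made-up values rather than the Census file. It assumes the default storage type is float (no earlier -set type double-); all variable names other than pctwhite and pctblack are hypothetical.

```stata
* Two double source variables, as -destring- would create them
clear
input double(pctwhite pctblack)
55.19 12.34
end

* default storage type (float, assuming -set type- has not been changed)
egen rowmax_f = rowmax(pctwhite pctblack)

* storage type spelled out
egen double rowmax_d = rowmax(pctwhite pctblack)

* double source versus float copy: not equal, so match_f is 0
generate byte match_f = pctwhite == rowmax_f

* double source versus double copy: equal, so match_d is 1
generate byte match_d = pctwhite == rowmax_d

* rounding the source to float precision before comparing also matches (1)
generate byte match_ff = float(pctwhite) == rowmax_f

list, noobs
```

Specifying double keeps the copy identical to its source. Wrapping the source in float() is the alternative when the new variable has to stay a float.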
Doug Hess

Join Date: Nov 2016

Posts: 58
#6

01 Sep 2021, 09:05

Thank you. The documentation on precision and decimals could also be clearer. It seems to be written for people familiar with low-level computer software or with other computer languages, rather than for the median (and below) user. Perhaps I should take a look again now that I'm rested.
Nick Cox

Join Date: Mar 2014

Posts: 35645
#7

01 Sep 2021, 09:25

I have been a minor contributor to the explanatory literature, although I don't take it that you are commenting in my direction. The essence is that in order to understand something that is often puzzling, you need to learn a little about what is going on under the hood, which starts with something like "even 0.1 can't be held exactly in binary".
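
A quick way to look under the hood, offered as a sketch: ask Stata to display 0.1 with more digits than usual, and in hexadecimal (the %21x format), to see what is actually held.

```stata
* 0.1 as held in double precision, shown to 18 decimal places
display %21.18f 0.1

* 0.1 rounded to float precision, which is a different binary approximation
display %21.18f float(0.1)

* the same double shown in Stata's hexadecimal (binary-exact) format
display %21x 0.1

* the two approximations differ, so this displays 0
display 0.1 == float(0.1)
```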

Oddly enough, my first lesson on computers, around 1964, was when a mathematics teacher decided to break off from the rather dusty syllabus and teach us a little about binary arithmetic.

I am not clear whether my own students ever got that lesson. In some countries it seems that nothing serious is taught before graduate school and even then the presumption is that you are learning one subject.
Doug Hess

Join Date: Nov 2016

Posts: 58
#8

01 Sep 2021, 10:36

The only comment directed at you was "Thank you."
lana chahine

Join Date: Apr 2023

Posts: 9
#9

23 Feb 2024, 14:30

Hello,
I am using the following command to multiply across 3 variables in the same row. I keep getting 0 instead of the actual result, which should be a number between 0 and 1.
egen var1= prod(var2*var3*var4), by(id) pmiss(ignore)

Any advice would be much appreciated. I read -help precision- and I tried the command -set type double-, and it didn't help.
Dirk Enzmann

Join Date: Apr 2014

Posts: 528
#10

25 Feb 2024, 02:57

The advice given most often in this forum is to read the FAQ before posting anything. This also applies here.

If we can't see your data we can't give you better advice, because we cannot know why you get a result that differs from what you expect. Hence, read the FAQ and try to follow its instructions closely, especially with respect to #12.
Nick Cox

Join Date: Mar 2014

Posts: 35645
#11

25 Feb 2024, 04:31

lana chahine in #9: Your question has nothing to do with the title of the thread and would have been better as a new thread.

Dirk Enzmann gave excellent advice. A data example would be really helpful. Indeed, you should have explained, as below, that you are using a community-contributed function.

I can nevertheless hazard a guess at what is happening. The egen function prod() is community-contributed. It will in your case first calculate the product var2*var3*var4 in each observation and then calculate the product of that across observations for the same identifier. If any zero is met anywhere, then zero is the resulting product.

I am guessing that you really want the row product, period, but if so it would seem simpler to work with

Code:

gen double wanted = var2 * var3 * var4

unless you want some special behaviour, such as ignoring zeros, ignoring missings, or whatever.
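
As a sketch of one such special behaviour, the following compares the plain row product with a version that skips missing factors. The data values and the name wanted_skip are invented for the illustration, and treating a missing factor as 1 is an assumption about what "ignoring missings" should mean.

```stata
* Made-up example data
clear
input id var2 var3 var4
1 0.5 0.4 0.2
1 0.9 .   0.5
2 0.3 0   0.7
end

* plain row product: missing whenever any factor is missing
generate double wanted = var2 * var3 * var4

* row product that skips missing factors (equivalent to treating them as 1)
generate double wanted_skip = 1
foreach v of varlist var2 var3 var4 {
    replace wanted_skip = wanted_skip * `v' if !missing(`v')
}

list, noobs
```

With this approach a row in which every factor is missing would get 1, so such rows may need to be set to missing afterwards.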

Code:

STB-60  dm87  . . . . . . . . . .  Calculating the row product of observations
        (help rprod if installed) . . . . . . . . . . . . . . . . . .  P. Ryan
        3/01    pp.3--4; STB Reprints Vol 10, pp.39--41
        generates new variable whose values are the row product
        of observations

STB-51  dm71  . . . . . . . . . . . .  Calculating the product of observations
        (help prod if installed)  . . . . . . . . . . . . . . . . . .  P. Ryan
        9/99    pp.3--4; STB Reprints Vol 9, pp.45--48
        extension to egen for producing the product of observations
Said Mohamed

Join Date: Aug 2022

Posts: 8
#12

29 Aug 2024, 07:19

Hello. In my notes, the variable "cell" created using the egen command takes on the values one to six, but mine gave me 9 and the assumption of equal variances is violated. Could someone help, please?
Dirk Enzmann

Join Date: Apr 2014

Posts: 528
#13

29 Aug 2024, 07:52

My first inclination was to simply ignore your post. On second thoughts, however, here are some comments that may be helpful to you:

You did not explain what you want to achieve.

You did not show us your data.

You did not show us the commands you actually used.

Why equality of variances is important to you, and why this should be related to the egen command (which one precisely?), is a mystery to me.

It is not clear whether your question really continues the topic (thread) to which you responded.

To sum up: Please read the FAQ of the Stata Forum thoroughly and then try to reformulate your question. If it has nothing to do with "additional decimals of egen rowmax" you will need a new topic with an appropriate title.