Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Precision of Numeric Variables Imported From Excel (.xls)

    Hi
    I have a question regarding the precision of numeric variables imported from excel files. I originally imported about 10 different excel files into Stata 13. (something like import excel "~/stores/coffee. xls", sheet("worksheet1") cellrange(B104:H110) all string clear.

    All the excel files had the same variables in the same cell (so cell A11 would be the same variable, i.e., cups of coffee sold on June 24, 2011 for example, etc.) And I sought to make a .dta file that had all the cups of coffee from 10 different vendors sold on June 24, 2011. So, I imported each file and appended, but I noticed that some files imported the same variable as string and other as a numeric variable (byte, float, or double). I chose to import all the variables as strings, then destring them and format them as doubles, and then append them. However, when I did this I noticed I was losing precision with my variables. Specifically, if cell A11 was one variable A12 was another and A13 was the final variable, where in excel A11+A12=A13, I found that in excel if I summed the data from A11 and A12 I got a variable that was every slightly different from A13. If I used the assert command to test whether my sum in state equaled A13 I got messages than in some cases the assertion was false. The difference in variables was small, perhaps 10^-8. But I was wondering why this was happening at all and what I could do to avoid it?

    Thank you
    Margot

  • #2
    I'm confused here: You say some variables got imported as string or float or byte or ... . However, your import command contains the -all string- option (should be -all- or -allstring-), so they should all come in as strings. Can you elaborate? Differences on the order of 1e-8 are likely precision differences between double and float data types. However, except for integers, having data collected with that precision seems quite unusual, so I'm not sure you lost anything *real* from Excel.

    Regards, Mike

    Comment


    • #3
      HI Mike
      I am sorry for the confusion. I initially imported the excel files into stata without the "all string clear" code and then when I appended them to each other, I received an error message saying some variables were string and other were numeric, and they could not be appended to each other. So instead, what I decided to do was import these variables ALL as strings, append the .dta files, destring all the numeric variables, format all the numeric variables as %15.2f , and then recast them as double, hoping that sums of the stata variables would equal their "total" variable. And that is where i kept running into the problem that when I asserted the A+B=C (as it did on excel), i got some assertions that were false. I initially wondered if the numeric variables in excel had to be a particular format, or why this small discrepancy was occurring. Essentially, I wanted to run this assert command to be able to check the accuracy of the survey data that the survey respondents had provided--and when I ran assert commands and kept receiving that the assertion was false, I thought the "totals" in the survey were wrong. But on further evaluation, I realized that they were ever so slightly different in stata then they were on the survey, and I just didn't understand why that would be the case. I was hoping to use the assert command to act as a verification tool, and with these discrepancies arising, I can't do that.
      Best
      Margot

      Comment


      • #4
        Note that recast double can never restore precision lost at some earlier step.

        Comment


        • #5
          Is there a way to ensure that all data imported from excel to stata is precise to the same decimal place then? Should numeric cells be formatted in excel in a certain way that if I sum them in excel and i sum them in stata, the sums are identical (and the assert command does not return any errors)?

          Comment


          • #6
            I can't answer your question in the way that I think you want if only because I hardly ever use MS Excel and am in any case no sort of expert on how it handles quantities expressed with many decimal places.

            Many of the most active members here also hardly ever, or even never, use MS Excel, which cuts down on your readership among people who answer questions.

            Like Mike Lacy, perhaps, I am still very fuzzy on what the specific problem is here, absent any explicit examples. I can see that the data concern cups of coffee, presumably reported as integers, in which case how quantities of the order of 1e-8 appear is unclear.

            As advised in the FAQ (http://www.statalist.org/forums/help#stata) please consider giving specific examples with data and code of what you are actually doing in Stata. If that becomes clear the question of how to match Excel is likely to be easier to think about. dataex (SSC) could be useful here.

            Comment


            • #7
              I would suggest that the most straightforward way around this problem is to change your assertion from (I assume, since you haven't shared your code or data with us)
              Code:
              assert (A+B)==C
              to
              Code:
              assert abs(A+B-C)<0.0001
              where "0.0001" is chosen as the tolerance below which you will accept the difference as rounding error.

              Like Nick, in the absence of data and code (please see the Statalist FAQ linked from the top of each page for advice on presenting questions and problems in a way that enhances the reader's ability to understand and respond) I don't see how this problem could occur with "cups of coffee" (presumably whole numbers) which is the sole description of your data that you have provided. But if you indeed are adding fractional values, then there's a whole literature on precision that discusses problems of the sort you are having. See the output of help precision for a useful discussion. Your problem is in part compounded by the lack of control over how Stata chooses to import data from Excel. Perhaps if Stata could be persuaded to import everything directly as a double, rather than having to convert from a string, the calculation in Stata would match that in Excel. But why bother? Just decide what is close enough to equality for your purposes, and check if the calculation is indeed close enough.

              Comment


              • #8
                I apologize for not providing a more detailed example. Attached is an excel spreadsheet and my do file (code pasted below). The data is from a survey, where some respondents entered "na" if there was no data, and a blank if zero. I imported the excel file as string, changed the "na" to "" and then destringed all the numeric variables. In the excel spreadsheet "sales-cost+rebate"=net, so I calculated an identical variable in stata. And the assert command failed. If the original data was only at the level of precision of two decimal place, I still don't understand why this occurs.

                . import excel "/Users/margothollick/Desktop/test.xls", sheet("Sheet1") firstrow allstring clear
                . replace amt="" if amt=="na"
                (3 real changes made)
                . replace sales="" if sales=="na"
                (3 real changes made)
                . replace cost="" if cost=="na"
                (3 real changes made)
                . replace rebate="" if rebate=="na"
                (2 real changes made)
                . replace net="" if net=="na"
                (3 real changes made)

                .
                .
                . #delimit ;
                delimiter now ;
                . destring
                > amt
                > sales
                > cost
                > rebate
                > net, replace
                > ;
                amt has all characters numeric; replaced as double
                (3 missing values generated)
                sales has all characters numeric; replaced as double
                (3 missing values generated)
                cost has all characters numeric; replaced as double
                (3 missing values generated)
                rebate has all characters numeric; replaced as long
                (4 missing values generated)
                net has all characters numeric; replaced as double
                (3 missing values generated)

                . #delimit cr
                delimiter now cr
                .
                . replace cost=cost*-1
                (3 real changes made)

                .
                . egen netcheck=rowtotal(sales cost rebate)

                .
                . assert netcheck==net
                4 contradictions in 6 observations
                assertion is false
                Attached Files

                Comment


                • #9
                  Originally posted by Margot Hollick View Post
                  . . . Essentially, I wanted to run this assert command to be able to check the accuracy of the survey data that the survey respondents had provided. . . I was hoping to use the assert command to act as a verification tool, and with these discrepancies arising, I can't do that.
                  You can still use the assert command to do what you want. You just need to make sure that the precision in the two sides of the comparison is the same. You can do that in a couple of ways:
                  Code:
                  assert float(A) == float(B)
                  assert round(A, <desired precsion>) == round(B, <desired precision>)
                  I've illustrated both with your Excel file below, where the net now passes the net-check..

                  .ÿversionÿ14.0

                  .ÿ
                  .ÿclearÿ*

                  .ÿsetÿmoreÿoff

                  .ÿ
                  .ÿimportÿexcelÿ"test.xls",ÿsheet(Sheet1)ÿfirstrow

                  .ÿquietlyÿdestringÿ_all,ÿreplaceÿignore("na")

                  .ÿquietlyÿreplaceÿcostÿ=ÿ-cost

                  .ÿegenÿdoubleÿnet_checkÿ=ÿrowtotal(salesÿcostÿrebate)

                  .ÿ
                  .ÿ
                  .ÿgenerateÿdoubleÿdeltaÿ=ÿnetÿ-ÿnet_check
                  (3ÿmissingÿvaluesÿgenerated)

                  .ÿlist,ÿnoobsÿseparator(0)ÿabbreviate(20)

                  ÿÿ+---------------------------------------------------------------------------------------+
                  ÿÿ|ÿtypeÿÿÿÿÿÿÿamtÿÿÿÿÿÿÿsalesÿÿÿÿÿÿÿÿÿcostÿÿÿrebateÿÿÿÿÿÿÿÿÿnetÿÿÿnet_checkÿÿÿÿÿÿÿÿdeltaÿ|
                  ÿÿ|---------------------------------------------------------------------------------------|
                  ÿÿ|ÿÿÿÿ1ÿÿÿÿ4055.3ÿÿÿ1133795.3ÿÿÿ-747442.84ÿÿÿÿÿÿÿÿ.ÿÿÿ386352.41ÿÿÿ386352.41ÿÿÿ-5.821e-11ÿ|
                  ÿÿ|ÿÿÿÿ2ÿÿÿ13225.6ÿÿÿ5847016.2ÿÿÿ-1504622.6ÿÿÿ669387ÿÿÿ5011780.7ÿÿÿ5011780.7ÿÿÿÿÿÿÿÿÿÿÿÿ0ÿ|
                  ÿÿ|ÿÿÿÿ3ÿÿÿÿÿÿÿ773ÿÿÿ4135894.4ÿÿÿ-103949.04ÿÿÿÿÿ6879ÿÿÿ4038824.3ÿÿÿ4038824.3ÿÿÿÿÿÿÿÿÿÿÿÿ0ÿ|
                  ÿÿ|ÿÿÿÿ4ÿÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿÿÿ0ÿÿÿÿÿÿÿÿÿÿÿÿ.ÿ|
                  ÿÿ|ÿÿÿÿ5ÿÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿÿÿ0ÿÿÿÿÿÿÿÿÿÿÿÿ.ÿ|
                  ÿÿ|ÿÿÿÿ6ÿÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿÿÿ.ÿÿÿÿÿÿÿÿÿÿÿ0ÿÿÿÿÿÿÿÿÿÿÿÿ.ÿ|
                  ÿÿ+---------------------------------------------------------------------------------------+

                  .ÿ
                  .ÿ//ÿOneÿoption
                  .ÿassertÿround(net,ÿ0.01)ÿ==ÿround(net_check,ÿ0.01)ÿifÿ!mi(net)

                  .ÿ
                  .ÿ//ÿAnother
                  .ÿassertÿfloat(net)ÿ==ÿfloat(net_check)ÿifÿ!mi(net)

                  .ÿ
                  .ÿexit

                  endÿofÿdo-file


                  .


                  Comment

                  Working...
                  X