Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Precision loss by importing from Excel

    I am having trouble importing an Excel file. I have a number variable (used as a reference number so every single digit is equally important) in Excel which shows up perfectly in the sheet but gets truncated when using import.

    Please see the attached image to see the made-up Excel file used as an example.


    The exact command I use to import, using Stata 12.0, is:

    import excel "Classeur1.xlsx", sheet("Feuil1") firstrow allstring

    The result being:

    list

    +-----------
    | Variable
    |-----------|
    1. | 3.215e+09
    2. | 65151
    3. | n/a
    4. | 999
    5. |
    |-----------

    The result does not change (probably because of the "n/a") if I drop the allstring option from import.


    This was easily fixed by first exporting to CSV from Excel and then using insheet on the resulting file. Is there any way to do this directly from Stata?


    Thank you,
    Marco


  • #2
    Code:
     
    destring, ignore("n/a")
    should suffice.

    Comment


    • #3
      Originally posted by Nick Cox View Post
      Code:
      destring, ignore("n/a")
      should suffice.

      Thank you for your reply, Mr. Cox. Your answer makes me think I could have been a bit clearer on my first post about what exactly I wanted to fix, Statalist rookie mistake on my side. I will edit it shortly.

      Comment


      • #4
        Apparently, I have the option to edit replies but not the original post so I will elaborate on it in this reply instead.

        The result using the list command should be:

        list

        +-----------
        | Variable
        |-----------|
        1. | 3215118485
        2. | 65151
        3. | n/a
        4. | 999
        5. |
        |-----------

        instead of:

        list

        +-----------
        | Variable
        |-----------|
        1. | 3.215e+09
        2. | 65151
        3. | n/a
        4. | 999
        5. |
        |-----------

        So even converting to a numeric variable does not help as it will only drop the string from the variable. I previously tried the real function:

        gen v2 = real(Variable)

        And got this:

        list

        +----------------------+
        | Variable v2 |
        |----------------------|
        1. | 3.215e+09 3.22e+09 |
        2. | 65151 65151 |
        3. | n/a . |
        4. | 999 999 |
        5. | . |

        I then thought maybe it was only the display format and not the storage, so I did:

        format v2 %15.0g

        And then the result makes me believe that the problem is indeed at the import level:

        list

        +------------------------+
        | Variable v2 |
        |------------------------|
        1. | 3.215e+09 3215000064 |
        2. | 65151 65151 |
        3. | n/a . |
        4. | 999 999 |
        5. | . |
        |------------------------|
        6. | . |
        7. | . |
        +------------------------+


        Thank you again for your time and please do let me know if I haven't given enough information.

        Marco




        Comment


        • #5
          I have not got time at present for a complete answer, but there are various misconceptions here.

          Your use of a real() function is not evidence against real() as your generate statement created a float variable; it is necessary to specify double here.

          Changing the format only changes the display format; it cannot restore precision lost by forcing data into a variable type that can't satisfy the precision required.

          If you want the "n/a" to be preserved as is in a numeric variable, your only option is to use an extended missing value such as .a to which you attach a value label.

          I would tend to use strings to hold identifiers like these any way.

          Comment


          • #6
            Thank you again, Mr. Cox. I think I should insist on the fact that I do not really care about keeping the "n/a". My goal is to have the complete reference number (numerical value), either in string or in any other form, and I'm not able to import it unless I first convert to CSV, which I want to avoid.

            I simply changed the "n/a" in the XLSX file to 0 to make the problem clearer, then the list command:

            import excel "Classeur1.xlsx", sheet("Feuil1") firstrow

            . list

            +-----------+
            | Variable |
            |-----------|
            1. | 3.215e+09 |
            2. | 65151 |
            3. | 0 |
            4. | 999 |
            5. | . |

            Where it should be:


            list

            +-----------+
            | Variable |
            |-----------|
            1. | 3215118485 |
            2. | 65151 |
            3. | 0 |
            4. | 999 |
            5. | . |

            If I export to CSV and then insheet with Stata, here's the result:

            list

            +------------+
            | v1 |
            |------------|
            1. | Variable |
            2. | 3215118485 |
            3. | 65151 |
            4. | 0 |
            5. | 999 |


            Thank you again,

            Marco

            Comment


            • #7
              I'm confused because I am unable to reproduce your problem with the example given. What does the value of the cell in question look like if you open the file in excel? Is it showing up as "3215118485" or as "3.215e+09"? How is the cell formatted?

              If I enter your first list from #4 into excel and then import into Stata using import excel I get exactly the result you desire when I list it.

              Code:
              . import excel "Book1.xlsx", sheet("Sheet1") firstrow allstring clear
              
              . list
              
                   +------------+
                   |        var |
                   |------------|
                1. | 3215118485 |
                2. |      65151 |
                3. |        n/a |
                4. |        999 |
                   +------------+
              There are a couple of different possibilities. One is that I don't have the offending cell formatted the same in excel as it is in your version and that's causing the problem. Another possibility is that it's different because I'm using Stata 13. Maybe the newer version handles these issues better than Stata 12. Hard to tell.

              As a next step, I would heed Nick's warning about the precision issue and try real again. That is:
              Code:
              gen double v2 = real(Variable)
              I don't know if that will work since I can't reproduce your original problem.
              Beyond that if you have easy access to a computer with Stata 13 you might try importing the file there and see if it works better.

              Comment


              • #8
                I can confirm Sarah's results with Stata 12.1. I also tried changing the cell format in the .xlsx file to scientific notation and still could not reproduce the problem. Stata 12.1 does have a few updates to -import excel- but I don't know if any relate to your problem. The recommendation is still to update your Stata installation and try again. See -help update-.

                (I create .xlsx files with LibreOffice on Linux.)
                Last edited by Roberto Ferrer; 26 Nov 2014, 20:41.
                You should:

                1. Read the FAQ carefully.

                2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

                3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

                4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

                Comment


                • #9
                  Originally posted by Sarah Edgington View Post
                  I'm confused because I am unable to reproduce your problem with the example given. What does the value of the cell in question look like if you open the file in excel? Is it showing up as "3215118485" or as "3.215e+09"? How is the cell formatted?

                  If I enter your first list from #4 into excel and then import into Stata using import excel I get exactly the result you desire when I list it.

                  Code:
                  . import excel "Book1.xlsx", sheet("Sheet1") firstrow allstring clear
                  
                  . list
                  
                  +------------+
                  | var |
                  |------------|
                  1. | 3215118485 |
                  2. | 65151 |
                  3. | n/a |
                  4. | 999 |
                  +------------+
                  There are a couple of different possibilities. One is that I don't have the offending cell formatted the same in excel as it is in your version and that's causing the problem. Another possibility is that it's different because I'm using Stata 13. Maybe the newer version handles these issues better than Stata 12. Hard to tell.

                  As a next step, I would heed Nick's warning about the precision issue and try real again. That is:
                  Code:
                  gen double v2 = real(Variable)
                  I don't know if that will work since I can't reproduce your original problem.
                  Beyond that if you have easy access to a computer with Stata 13 you might try importing the file there and see if it works better.

                  Thank you for your reply Mrs. Edgington. I had "shared" a picture of my Excel workbook in the first post but thanks to your post I just realized that it does not show. In Excel (2010), it shows the full number (i.e. it is showing up as "3215118485").

                  I should have access to a computer with Stata 13 next week. I will try by then and post the results.

                  Forcing double on real() gives out a r(109) error (type mismatch).

                  Thank you again,
                  Marco

                  Comment


                  • #10
                    Code:
                      
                    gen double y = real(x)
                    will only fail with type mismatch if x is already numeric.

                    Comment

                    Working...
                    X