
  • Numbers that change - due to variable types?

    I have a dataset full of identifiers similar to the id below:
    id
    1093489745
    When I copy and paste this into Stata, the variable is automatically formatted as long %8.0g.

    When I run the following code . . . .

    format id %15.0g
    generate id2 = id
    format id2 %15.0g


    . . . id2 is "magically" different from id (see below).

    I notice when I click on each of the cells in Stata, id[1] is showing 1093489745 but id2[1] is showing 1.093e+09.
    id id2
    1093489745 1093489792
    Why is this happening, and what can I do to prevent it in the future?

    Thanks,
    Erika

  • #2
    Generate id2 as a long variable and use a different format.
    Code:
    generate long id2 = id
    format id2 %15.0f



    • #3
      Thanks for the speedy response. Can you explain why I need to specify that id2 be generated as long and format that variable using %15.0f versus %15.0g?

      And why does Stata convert id2 to a float when the original variable, id, was long?

      Sorry, just trying to understand.

      Erika



      • #4
        Originally posted by Erika Kociolek
        Can you explain why I need to specify that id2 be generated as long and format that variable using %15.0f versus %15.0g?

        And why does Stata convert id2 to a float when the original variable, id, was long?
        The answer to the second question is that the default type for numeric variables is float.
        Code:
        help generate
        If no type is specified, the new variable type is determined by the type of result returned by =exp. A float variable (or a double, according to set type) is created if the result is numeric, and a string variable is created if the result is a string.
        The format command is not necessary with your data, as long as you generate a long variable, but here is an excerpt from the relevant help file.
        Code:
        help format
        %g differs from %f in that (1) it decides how many digits to display to the right of the decimal point, and (2) it will switch to a %e format if the number is too large or too small.
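
        A minimal sketch of both answers on a one-observation toy dataset. The variable id3 and the use of describe are illustrative additions, and the default set type of float is assumed:

        ```stata
        clear
        set obs 1
        generate long id = 1093489745

        * No storage type given: the new variable is a float under the default set type.
        generate id2 = id
        * Storage type given explicitly: the value is copied exactly.
        generate long id3 = id

        * describe reports the storage type and display format of each variable.
        describe id id2 id3

        * %15.0f never switches to exponent notation, so every digit is displayed.
        format id id2 id3 %15.0f
        list id id2 id3
        ```

        Here id2 should list as 1093489792, the value reported in #1, while id3 matches id.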



        • #5
          Everything Friedrich says is correct. There is one more point that bears emphasis. The reason you need the long data type instead of the float data type is precision. Long and float types both occupy 4 bytes (32 bits) of data. However, the float data type spends 8 of those bits on the exponent and 1 on the sign, so only 23 bits (24 counting the implicit leading bit) are left to encode the mantissa. In particular, the float type cannot encode ten decimal digits: there isn't enough room allocated for that. Floats are good for about 7 digits of accuracy. The long type uses all of its bits, except for 1 sign bit, for precision, so it can accommodate longer integers and still retain accuracy. However, even the long type has its limitations. The largest value it can store is 2,147,483,620. So it can handle 9-digit numbers with no difficulty, and will accommodate ten digits if the number is not too large. The example number you gave is already about half of the upper limit, and I worry you may have other numbers in this variable that exceed even the limit of the long data type.
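
          The rounding can be reproduced without any dataset. This is only a sketch: float() rounds its argument to float precision, and c(maxlong) is the system value Stata reports for the largest long.

          ```stata
          * The id from #1 rounded to float precision; floats near 1.09 billion are 128 apart.
          display %15.0f float(1093489745)

          * Floats hold every integer exactly only up to 2^24 = 16,777,216.
          display %15.0f float(16777216)
          display %15.0f float(16777217)

          * Largest value a long can hold.
          display c(maxlong)

          assert float(1093489745) == 1093489792
          assert float(16777217) == 16777216
          assert c(maxlong) == 2147483620
          ```

          The first display should print 1093489792, which is exactly the value id2 took in #1.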

          I notice that you call the variable id, suggesting that it is an identifier and is not used in calculations. If that is the case, you are probably better off converting it to a string variable as there is, for practical purposes, no limit on the precision available for strings. If I am wrong and you do need to do calculations with this id variable, you might want to store it as a double, which will take up 8 bytes but will offer you greater precision: a double stores every integer up to 9,007,199,254,740,992 (2^53) exactly, which covers all integers of up to 15 digits.
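
          Both options might look like the following sketch on a one-observation toy dataset. The names id_str and id_d are hypothetical, and the %15.0f format passed to tostring is just one reasonable choice:

          ```stata
          clear
          set obs 1
          generate long id = 1093489745

          * Identifier not used in calculations: keep every digit as text.
          tostring id, generate(id_str) format(%15.0f)

          * Identifier needed in calculations: a double stores this value exactly.
          generate double id_d = id
          format id_d %15.0f

          list id id_str id_d
          assert id_str == "1093489745"
          assert id_d == 1093489745
          ```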



          • #6
            Thank you, Clyde. This is very helpful.

            The reason I will sometimes store id as a number versus a string is, if the variable is initially formatted as a string, converting to a number quickly gets rid of leading zeroes.
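
            To illustrate the leading-zero point, here is a small sketch on a made-up identifier (id_str, id_num and id_back are hypothetical names). It also shows one way to pad the zeros back with string() and a %0w.df format, should they be needed again later:

            ```stata
            clear
            input str10 id_str
            "0001234567"
            end

            * Converting to numeric drops the leading zeros.
            destring id_str, generate(id_num)

            * Pad back to a fixed width of 10 characters with leading zeros.
            generate str10 id_back = string(id_num, "%010.0f")

            list id_str id_num id_back
            assert id_num == 1234567
            assert id_back == "0001234567"
            ```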



            • #7
              Excellent explanations from Friedrich and Clyde.

              I would like to mention also clonevar as specifically designed to replicate every detail about a variable, including its storage type. Even if you intend to change something (indeed it's hard to imagine wanting to create it otherwise), you start with a clone, hence the name.
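
              A short sketch of that on a one-observation toy dataset; the local macro t is only there to show the storage type of the clone:

              ```stata
              clear
              set obs 1
              generate long id = 1093489745
              format id %15.0f

              * clonevar copies values, storage type, display format and labels.
              clonevar id2 = id

              * Retrieve the storage type of the clone; it should be long, not float.
              local t : type id2
              display "`t'"

              assert "`t'" == "long"
              assert id2 == id
              ```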




              • #8
                Hi. Can anyone please help me with an issue? I imported data on firms' shareholdings, which are percentage values reported in Stata with 2 decimal places, for example 43.95. I was about to create dummies for ownership categories, but Stata kept generating the wrong dummies. I eventually figured out that the numeric values had changed after the variable was encoded from string to numeric. The newly encoded variable appears to show the same figure, but when I click on a value of that variable, 43.95 shows as 309. After encoding, the storage type is float and the format is %17.1g. I ran the command suggested above, format var %15.0f, but the underlying value in that cell is still 309. Please guide.



                • #9
                  If you actually used the -encode- command (we don't know, because you did not follow the advice in the FAQ and show us exactly what your command was), then that was a mistake and you should have used -destring- instead; see
                  Code:
                  h destring
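
                  Why the two commands differ can be seen in a sketch on three made-up values (the names coded and as_number are hypothetical):

                  ```stata
                  clear
                  input str5 fam_own
                  "43.95"
                  "12.50"
                  "7.25"
                  end

                  * encode numbers the distinct strings 1, 2, 3, ... in alphabetical order
                  * and attaches the original text as value labels.
                  encode fam_own, generate(coded)

                  * destring converts the text itself into the number it spells out.
                  destring fam_own, generate(as_number)

                  * nolabel reveals the integer codes hidden behind the value labels.
                  list fam_own coded as_number, nolabel
                  ```

                  With these three values, "43.95" receives the code 2 from -encode-, because it sorts second as text, whereas -destring- yields 43.95. In a larger dataset the code can just as well be 309.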



                  • #10
                    Thank you so much, Rich, for your prompt and kind reply. Apologies for leaving out the details. The string variable is fam_own.

                    The code to encode was:
                    Code:
                    encode fam_own, generate(ownership)
                    drop fam_own
                    generate owndummy = 0
                    replace owndummy = 1 if ownership >= 20 & ownership < 40
                    replace owndummy = 2 if ownership >= 40 & ownership < 60
                    replace owndummy = 3 if ownership > 60

                    Lastly, I created labels for low, medium and high ownership.

                    However, the dummies created were wrong, and the data values of the ownership variable change when I click on any numeric value. In fact, the values of all the data that were encoded have changed. Non-numeric variables also show a numeric value when I click on their cells.
                    Last edited by Zeenat Murtaza; 19 Sep 2022, 07:48.



                    • #11
                      As I said in #9, use of -encode- is wrong in this case (apparently, but you still do not give a data example using -dataex-; please see the FAQ), so use -destring- instead of -encode-.
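
                      A sketch of the workflow from #10 with -destring- swapped in, run on four made-up values rather than the real data. Two choices here are assumptions, not part of #10: the top category uses >= 60 so that a value of exactly 60 is not left at 0, and missing values are kept out of every category.

                      ```stata
                      clear
                      input str5 fam_own
                      "43.95"
                      "12.50"
                      "60.00"
                      "25.10"
                      end

                      * Convert the text to the numbers it spells out.
                      destring fam_own, generate(ownership)

                      * Missing values compare as larger than any number, so exclude them explicitly.
                      generate byte owndummy = 0 if !missing(ownership)
                      replace owndummy = 1 if ownership >= 20 & ownership < 40
                      replace owndummy = 2 if ownership >= 40 & ownership < 60
                      replace owndummy = 3 if ownership >= 60 & !missing(ownership)

                      list fam_own ownership owndummy
                      ```

                      For these four values owndummy comes out as 2, 0, 3 and 1 respectively.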



                      • #12
                        Thank you so much, Rich. I shall follow your advice.
