Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • create a loop to convert string variables in numeric variables

    Hello,
    I am Salvatore, happy to join the Stata Forum community.
    I am a new user who recently started using Stata. For my thesis research, I am using panel data. At the time I imported this data on Stata all my variables has been converted to strings. Since my dataset consists of over 60 columns (due to the fact that I use six variables and 12 years-time period of 2021-2010), I would like to try to set up a loop that will allow me to convert all the strings to numbers without typing each code. Please help me and I will give every kind of information that would be helpful. The command I am trying to use is with foreach, but I don't know how to use it.
    [foreach var...]

    Thank you very much. Sorry for my mistakes in following the posting rules, but this is my first post and I am still learning. I apologize for not being able to use dataex, but Stata says "input statement exceeds linesize limit. Try specifying fewer variables" Hope you will understand.
    Salvatore

  • #2
    Use dataex for two variables you're interested in

    Comment


    • #3

      Code:
      destring *, replace
      embodies a loop over variables. Whether it works depends on the details. For example, one reason that data are imported as strings is that metadata are included at the top of the dataset.

      Comment


      • #4
        Sorry for the delay, I tried to get familiar with dataex and code commands. This is the variable I am trying to convert from string to number.


        Code:
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input str9 var4
        "51.485"   
        "220.345"  
        "114.535"  
        "57.840"   
        "49.607"   
        "52.350"   
        "42.270"   
        "65.637"   
        "7.023"    
        "158.303"  
        "100.527"  
        "165.390"  
        "10.731"   
        "115.047"  
        "34.019"   
        "36.422"   
        "58.265"   
        "257.750"  
        "55.376"   
        "36.916"   
        "57.143"   
        "102.392"  
        "2.734"    
        "27.939"   
        "37.173"   
        "36.173"   
        "5.403"    
        "19.683"   
        "63.643"   
        "15.755"   
        "15.262"   
        "68.489"   
        "47.065"   
        "29.694"   
        "70.464"   
        "33.707"   
        "6.740"    
        "2.225"    
        "9.412"    
        "3.348"    
        "42.510"   
        "31.884"   
        "21.913"   
        "781.845"  
        "27.912"   
        "61.866"   
        "98.603"   
        "47.384"   
        "241.609"  
        "25.417"   
        "121.679"  
        "38.248"   
        "0"        
        "13.621"   
        "74.687"   
        "58.827"   
        "41.803"   
        "49.859"   
        "25.748"   
        "12.326"   
        "57.587"   
        "55.765"   
        "66.315"   
        "18.051"   
        "10.251"   
        "103.938"  
        "61.670"   
        "3.433"    
        "0"        
        "12.177"   
        "48.294"   
        "6.376.178"
        "18.086"   
        "44.350"   
        "65.675"   
        "111.292"  
        "104.211"  
        "22.680"   
        "57.203"   
        "31.034"   
        "104.881"  
        "31.699"   
        "78.704"   
        "36.154"   
        "27.630"   
        "15.566"   
        "819.040"  
        "158.962"  
        "26.972"   
        "8.417"    
        "27.314"   
        "318.339"  
        "36.842"   
        "81.301"   
        "4.872"    
        "26.079"   
        "30.300"   
        "43.002"   
        "11.126"   
        "36.628"   
        end
        This is what the describe command says

        Code:
         describe var4
        
        Variable      Storage   Display    Value
            name         type    format    label      Variable label
        --------------------------------------------------------------------------------------------------------------------------------------
        var4            str9    %9s
        and this is what happen when I try to use destring command. It says

        Code:
         destring var4, replace
        var4: contains nonnumeric characters; no replace

        Can you plase help me understanidng why it is not working properly and would you tell me how to create a loop for convert string to numerics.

        Thanks a lot everybody for previous answers and following tips and advices

        Comment


        • #5
          Stata is telling you that, for at least one observation, there is some character that is not something that could be part of a number. A good way to find such problems is:
          Code:
          browse var4 if missing(real(var4))
          which will display in your browser all the observations that are a problem.
          One origin of a problem like this to have imported a spreadsheet file and included a row that has blanks in it, often an apparently innocent blank row at the end of the file.

          If/when you have verified that such problems cannot be fixed, you can likely make use of the -force- or -ignore- options described in -help destring-.

          Comment


          • #6
            I will note that one of the observations in the example data is found by the code in post #5:
            Code:
            . list if missing(real(var4)), clean
            
                        var4  
             72.   6.376.178  
            
            . destring var4, generate(num4) force
            var4: contains nonnumeric characters; num4 generated as double
            (1 missing value generated)
            
            . list if missing(num4), clean
            
                        var4   num4  
             72.   6.376.178      .
            What exactly do the "." (period) characters in your number mean? To have one decimal point may be regarded as precision, to have two looks like carelessness. Or so I thought; a little research in Wikipedia suggests that in many languages the comma and period interchange roles.

            If indeed that is the case in your data, then we can tell destring to ignore period characters in your data.
            Code:
            . destring var4, generate(num5) ignore(".")
            var4: character . removed; num5 generated as long
            
            . list if missing(num4), clean
            
                        var4      num5   num4  
             72.   6.376.178   6376178      .
            If you have numbers where the comma is used as the decimal separator, you will want to include the dpcomma option on your destring command.

            While all my examples used destring, generate() the lesson is also relevant to your task.
            Code:
            destring *, replace ignore(".")
            Last edited by William Lisowski; 01 Jul 2022, 14:59.

            Comment


            • #7
              Thanks a lot, I manage to convert the variable. I tried with several others but I have a doubt.
              As you can see below among all these positive and negative decimal numbers there are two numbers that are listed as follows: "-.066213037" " .672970656". These numbers should have a zero before ".". Therefore, they should be respectively "-0.066213037" and "0.672970656". How can I fix this problem without typing "0"? Thanks in advance.
              Code:
              1.600010282
              1.567243964
               1.51647772
              1.342612026
              1.295207237
              -.066213037
              1.508474983
              1.556633163
              1.104310165
              2.001075946
              1.876840095
               .672970656
              1.338913314
              1.552581534
              1.276709336
               1.32892402

              Comment


              • #8
                Code:
                h format

                Comment


                • #9
                  Thanks Rich for your advice. I implemented it but I have one doubt. I went to "Variable manager" in the "data" section and then I click on the variable I wanted to change. I clicked on "create" and then in "number type" section on "fixed numeric" and flag "pad with leading zeros". Now al the numbers have a leading zero. I don't know why but with dataex command the numbers are shown without this number but with the list command they appear with all zero in front, like this "netinterestmarginlastyear
                  Code:
                  01.8900058
                  01.4006527
                  02.0626731
                  01.6667864
                  01.7978883
                  02.4203497
                  01.9237474
                  01.6864332
                  01.8814430

                  With dataex command like this
                  Code:
                  1.436722123
                  1.617609686
                   2.28215512
                  1.267525516
                  1.190510799
                  1.979301351
                   .572746038
                  1.328277348
                  1.568863389
                   .646618303
                  1.835704345
                  2.157028587
                   .784943048
                   2.04759308
                  1.630280514
                  1.426836211
                  1.592734647
                  1.600010282
                  1.567243964
                   1.51647772
                  1.342612026
                  1.295207237
                  -.066213037
                  Please ignore if the numbers are not the same, they simmply are different sections of a main bigger dataset
                  I don't even know if I implemented the change (adding zeros in front) we were discussing before, since if you look at the dataset those number are still, with the dataset command, without a zero in front.
                  My questions are:
                  1) Did I successfully implemented the change of putting zeros in front of decimals numbers such as 0.22344 (invented number) in order to avoid having .22344 for example?
                  2) The fact the all numbers now have a zero in front matters for Stata, meaning that for example a number written as 01,234 in Stata is recognized as 01,234 or it is recognized in fact as 1,234 (e.g. the way we recognize numbers, giving no attention and value to the zero written before a number in its not decimal part)?
                  Hope to be clear.
                  Thanks to everybody in advance

                  Comment


                  • #10
                    You need to understand three things.

                    1) The choice of a variable's format — leading zeros, number of places to the right of the decimal point, presence of commas, formatting as a date — has no effect at all on the variable's value that is used in Stata's calculation. Consider the following example, and note that the displayed values for x3 and y3 are the same, even though the displayed values for x and y are different.
                    Code:
                    . generate x = 1/3
                    
                    . generate x3 = 3*x
                    
                    . format %9.3f x x3
                    
                    . generate y = 1/3
                    
                    . generate y3 = 3*y
                    
                    . format %9.0f y y3
                    
                    . list, clean
                    
                               x      x3   y   y3  
                      1.   0.333   1.000   0    1
                    2) The dataex command ignores the variable's format and instead uses one that ensures as accurate a representation of the variable's value as is possible. For the example above dataex gives us
                    Code:
                    * Example generated by -dataex-. For more info, type help dataex
                    clear
                    input float(x x3 y y3)
                    .3333333 1 .3333333 1
                    end
                    3) You did not need to add a leading zero to your numbers. You needed instead to choose a fixed numeric format rather than a general numeric format. In this example I use the format command to set the variable format, rather than the Variable Manager.
                    Code:
                    . input double a
                    
                                  a
                      1. 1.295207237
                      2. -.066213037
                      3.  .672970656
                      4. end
                    
                    . format a %12.9f
                    
                    . list, clean
                    
                                      a  
                      1.    1.295207237  
                      2.   -0.066213037  
                      3.    0.672970656

                    Comment


                    • #11
                      As a general guidance, I presume your data were in excel and the latter is using the "Italian" style to separate thousands (.) and decimals (,). Stata used the International standard, i.e. . for decimals and , for thousands. I experience most of the problems you mentioned at the beginning of my stata experience before I changed the setting of my excel. If you do that (in case you have not done this in the past), most of the problems will be solved. Of course your excel files will be unaffected by this change. Obviously, you cannot solve the issue if your original data contain nonnumerical elements, as NA or .. or so on.

                      Comment


                      • #12
                        Thanks for the advices. Therefore, the chosen format has no impact on the values Stata uses for its computation. .067 is considered by Stata as 0,067.
                        I really liked the format you use. What setting did I need to input to have a zero in front of number like ".067" but not zero in front of a number like "1.27"?
                        Thanks

                        Comment


                        • #13
                          To #12: Please read William's point 3) in #10 for the answer.

                          Comment

                          Working...
                          X