Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • chnaging values of a variable

    Dear Listers,

    I am using Stata 15.1.
    I am working with a dataset with more than a million observations.
    I would like to change the values of a variable in order to merge another data with one of the data having an "ID" which is one digit higher than the master data. So I would like to drop the last digit in the using data in order to merge the two datasets.

    What I want: In the following synthetic data, I would like to get rid of the last digit number from the values off variable
    length
    , so that it shortens to a two-digit value.
    I tried to use
    levelsof
    with
    usubstr
    . It did not work. The closest I found is this : https://www.statalist.org/forums/for...velsof-command.
    Code:
    sysuse auto, clear
    levelsof length, local(levels)
    
    foreach l of local levels{
    gen length2=usubstr("`levels'", 1, 2)
    }
    Thank you in advance for any tip.
    Last edited by Philmon Amasalu; 25 Jun 2022, 17:37. Reason: value of length, not the variable lenght.

  • #2
    There is no need to cycle through with a levelof, that'd just create some unnecessary looping. Try this:

    Code:
    clear
    input str5 id_str
    123
    456
    789
    end
    
    * Numeric
    gen id_str_2 = usubstr(id_str, 1, 2)
    
    list
    Results:

    Code:
         +-------------------+
         | id_str   id_str_2 |
         |-------------------|
      1. |    123         12 |
      2. |    456         45 |
      3. |    789         78 |
         +-------------------+
    And for the sake of learning, let's get back to your code:

    Code:
    foreach l of local levels{
        gen length2=usubstr("`levels'", 1, 2)
    }
    Suppose this is the correct approach, the reason that it's not working because the macro variable label is wrong. The whole list of levels is called "levels", and "l" is each item inside "levels." For this code, you'll need "l".

    Code:
    foreach l of local levels{
        gen length2=usubstr("`l'", 1, 2)
    }
    Second, the same variable cannot be generated twice. So, you'll get another error regarding overwriting variables. To move away from that, replace would be needed instead of generate:

    Code:
    gen length2 = ""
    foreach l of local levels{
        replace length2=usubstr("`l'", 1, 2)
    }
    Third, this code will act on the whole column, so basically the final variable will just be the first two digits of the last entry in "levels", in order to apply that change id by id, and if is needed:

    Code:
    gen length2 = ""
    foreach l of local levels{
        replace length2=usubstr("`l'", 1, 2) if length == "`l'"
    }
    Overall, it's not very efficient nor easy to read. I'd suggest just use usbstr on the original variable without looping.

    Comment


    • #3
      If, like your example variable length in the auto dataset, the ID variable is numeric rather than a string, then there is no need to convert the numeric variable to a string in order to use a string function to remove the rightmost digit. You can do it all with math, converting your numeric variable into a numeric variable with the rightmost digit removed.
      Code:
      . sysuse auto, clear
      (1978 automobile data)
      
      . generate long length2 = floor(length/10)
      
      . list length length2 in 1/10, clean
      
             length   length2  
        1.      186        18  
        2.      173        17  
        3.      168        16  
        4.      196        19  
        5.      222        22  
        6.      218        21  
        7.      170        17  
        8.      200        20  
        9.      207        20  
       10.      200        20
      I chose to create length2 as a long numeric variable, which is suitable for 9-digit numbers. But what you should do is make it the shortened ID variable the same numeric type as the original numeric ID variable - double will hold up to a 15-digit number.
      Last edited by William Lisowski; 25 Jun 2022, 18:47.

      Comment


      • #4
        Ken Chui Thank you very much for your detailed comment. I am sorry for the poor description of the issue.

        This is what I meant when referring to the two datasets:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input int ind str99 indname int ind2
        728 "Accounting, tax preparation, bookkeeping and payroll services" 7280
        957 "Administration of economic programs and space research"        9570
        949 "Administration of environmental quality and housing programs"  9490
        948 "Administration of human resource programs"                     9480
        747 "Advertising and related services"                              7470
        359 "Aerospace product and parts manufacturing"                     3590
        218 "Agricultural chemical manufacturing"                           2180
        307 "Agricultural implement manufacturing"                          3070
        607 "Air transportation"                                            6070
        358 "Aircraft and parts manufacturing"                              3580
        end

        ind
        and
        ind2
        have same industrial codes but
        ind2 with
        with a 0 at the end.

        So, I converted ind2 into string
        Code:
        tostring ind2, replace
        gen ind3=usubstr(ind2, 1, 3)
        This code worked, but I am not sure if it is right or legal to do it this way, so is the code you suggested.
        This doesn't work before changing "ind2" into string.

        Comment


        • #5
          William Lisowski Thank you very much. Your code did the trick.

          Code:
          * Example generated by -dataex-. To install: ssc install dataex
          clear
          input int ind str99 indname int ind2 long ind5
          728 "Accounting, tax preparation, bookkeeping and payroll services" 7280 728
          957 "Administration of economic programs and space research"        9570 957
          949 "Administration of environmental quality and housing programs"  9490 949
          948 "Administration of human resource programs"                     9480 948
          747 "Advertising and related services"                              7470 747
          359 "Aerospace product and parts manufacturing"                     3590 359
          218 "Agricultural chemical manufacturing"                           2180 218
          307 "Agricultural implement manufacturing"                          3070 307
          607 "Air transportation"                                            6070 607
          358 "Aircraft and parts manufacturing"                              3580 358
          end
          Thank you both for your help.

          Comment

          Working...
          X