  • destringing

    Hello,
    I was wondering if anyone could help me destring a certain variable in my dataset. The variable's values contain underscores ("_"), for example "448_12132_01", so Stata treats it as a string variable. However, I need to turn it into a numeric variable so that I can categorise the observations by creating dummy variables.
    OR
    is there any way I can create dummy variables without destringing this variable?

    * I used: destring X, gen(Y) force
    but it just generates blanks

    Cheers,
    Aslan

  • #2
    Without any other options, destring can't recognise a string with underscores as a valid numeric value. So the result will be missing (not "blank").
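
    As a minimal sketch of that behaviour, the following compares force with the ignore() option of destring on two made-up values in the format shown above. The variable names Y_force and Y_ignore are hypothetical, chosen only for this illustration.

    ```stata
    clear
    input str20 X
    "448_12132_01"
    "8_3423_01"
    end

    * force: anything that is not a valid number becomes missing
    destring X, gen(Y_force) force

    * ignore("_"): strip the underscores first, then convert
    destring X, gen(Y_ignore) ignore("_")

    * show all digits of the large values
    format Y_ignore %12.0f
    list
    ```

    Note that ignore("_") simply runs the digits together, so the structure of the identifier is lost; it is probably not what you want for categorising.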

    You don't need to destring to create dummy variables (I recommend the term indicator variable myself).

    Code:
    . clear
    
    . set obs 5
    number of observations (_N) was 0, now 5
    
    . gen whatever = cond(_n > 3, "toad", "frog")
    
    . list
    
         +----------+
         | whatever |
         |----------|
      1. |     frog |
      2. |     frog |
      3. |     frog |
      4. |     toad |
      5. |     toad |
         +----------+
    
    . tab whatever, gen(whatever)
    
       whatever |      Freq.     Percent        Cum.
    ------------+-----------------------------------
           frog |          3       60.00       60.00
           toad |          2       40.00      100.00
    ------------+-----------------------------------
          Total |          5      100.00
    
    . describe
    
    Contains data
      obs:             5                          
     vars:             3                          
     size:            30                          
    --------------------------------------------------------------------------------------------------------------------------------------------------
                  storage   display    value
    variable name   type    format     label      variable label
    --------------------------------------------------------------------------------------------------------------------------------------------------
    whatever        str4    %9s                   
    whatever1       byte    %8.0g                 whatever==frog
    whatever2       byte    %8.0g                 whatever==toad
    --------------------------------------------------------------------------------------------------------------------------------------------------
    Sorted by: 
         Note: Dataset has changed since last saved.



    • #3
      Thanks for the reply and help, Nick. But nearly all of the observations in this variable are unique (there are 660 observations in total). When I use the code you provided, it creates a separate dummy variable for nearly every observation. What I want to do is use the number before the first underscore (there are only 3 different numbers before the first underscore). For example:
      331_4314_12
      8_3423_01
      448_23123_02
      So do you know if there is a way I could use 331, 8 and 448 to categorise the observations and create dummies?

      Thanks again,
      Aslan



      • #4
        Originally posted by Aslan Ozat
        Thanks for the reply and help, Nick. But nearly all of the observations in this variable are unique (there are 660 observations in total). When I use the code you provided, it creates a separate dummy variable for nearly every observation. What I want to do is use the number before the first underscore (there are only 3 different numbers before the first underscore). For example:
        331_4314_12
        8_3423_01
        448_23123_02
        So do you know if there is a way I could use 331, 8 and 448 to categorise the observations and create dummies?

        Thanks again,
        Aslan
        I think the problem was the way you framed the question.
        Your problem could be solved by using the substr command first to generate three new variables, which you can then convert to numeric variables using the destring command.
        To read more about the substr command, type
        Code:
         help substr
        on the command line
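
        A minimal sketch of the "three new variables" idea, using the split command rather than substr() because it cuts at each underscore and can destring in the same step. The data are the example values from the thread; the stub name part is an assumption made for this illustration.

        ```stata
        clear
        input str20 X
        "331_4314_12"
        "8_3423_01"
        "448_23123_02"
        end

        * cut X at each underscore into part1, part2, part3
        * and convert the pieces to numeric in one go
        split X, parse("_") generate(part) destring
        list
        ```

        Leading zeros in a piece such as "01" are dropped on conversion, so leave out the destring option if they matter.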



        • #5
          If your categorical variable is (almost) an identifier, then there would indeed be little point in creating many, many indicator variables. But (1) you really didn't explain your data in enough detail for better advice to be possible (2) it just means that the syntax I gave is not helpful, not that it is wrong.

          Joseph's suggestion is helpful, but note that substr() is a function, not a command.

          You've yet to give a data example in the requested form (FAQ Advice #12), but something of the following kind may help


          Code:
          gen prefix = substr(X, 1, strpos(X, "_") - 1)
          tab prefix, gen(indicator)
          As before, destring is not necessary here, although likely to do no harm.
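
          For illustration, here is the same approach run on a toy dataset built from the values quoted in this thread. The data are not the real dataset, and X is assumed to be the name of the string variable.

          ```stata
          clear
          input str20 X
          "331_4314_12"
          "8_3423_01"
          "448_23123_02"
          "448_12132_01"
          end

          * everything before the first underscore
          gen prefix = substr(X, 1, strpos(X, "_") - 1)

          * one indicator variable per distinct prefix
          tab prefix, gen(indicator)

          * check: each observation belongs to exactly one category
          assert indicator1 + indicator2 + indicator3 == 1
          list X prefix indicator*
          ```

          The variable label of each indicator records which prefix it stands for, so describe shows the mapping.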
