  • How to use substring command

    Dear all,
    I have a dataset which contains an id number whose display format is %6.3f. I would like to use the substring command to create a new variable that takes the number before the dot '.'.
    I mean, if I have two subjects whose ids are
    Code:
    1. 23.149
    2. 24.001
    And I want to create a variable cluster which equals 23 for the first one and 24 for the second one. The problem is that the part of the id before the dot ranges from 1 to 32, so I cannot use the following command
    Code:
    gen cluster=substr(id,1,2)
    I read about and used the following command instead.
    Code:
    gen cluster=substr(id,-3,2)
    I know that is the wrong code, but how do I fix it?

    Thank you all in advance.
    Last edited by Thong Nguyen; 14 Jul 2016, 00:41.

  • #2
    Thong:
    you may want to try:
    Code:
    . set obs 2
    
    . g string="23.149" in 1
    
    . replace string="24.001" in 2
    
    . split string, p(.)
    
    . destring string1, g(cluster)
    
    . drop string2
    Kind regards,
    Carlo
    (Stata 19.0)



    • #3
      I figured it out by myself.
      Code:
      tostring id, gen(x)
      gen x2=x if nhom==1
      split x2, p(.)
      drop x22
      ren x21 cluster
      destring cluster, replace



      • #4
        Dear Carlo,
        Now I see your code. Thank you very much for your help.



        • #5
          Various small confusions here.

          First, substr() is a function, not a command. Commands and functions are disjoint in Stata. See e.g. http://www.stata-journal.com/sjpdf.h...iclenum=dm0058 for a tutorial making that point and others.
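
          A minimal illustration of the distinction, using a literal string rather than your data: display is a command, while substr() is a function evaluated inside it.

          Code:
          * display is the command; substr() is a function called within it
          display substr("23.149", 1, 2)
          * typing substr(...) on a line of its own would be an error, as it is not a command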

          As you have a numeric variable, e.g.

          Code:
          clear 
          input id 
          23.149
          24.001
          end 
          format id %6.3f
          you cannot apply substr() directly, as you realised.

          But there are easier solutions than posted. An integer identifier could be stored as such. Or you could hold it as string. For identifiers such as in your examples, there is not much in the choice. For long numeric identifiers, you need to be more careful.
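
          As a rough sketch of what "more careful" can mean, assume a hypothetical nine-digit identifier (the names bigid, s_default and s_full and the value are made up for illustration). The default float storage type cannot hold all nine digits exactly, and string() without a format falls back on a general format that may switch to scientific notation.

          Code:
          clear
          set obs 1
          * hypothetical nine-digit identifier; long holds it exactly, float would not
          gen long bigid = 123456789
          * without a format, string() uses %9.0g, which gives 1.23e+08 here
          gen s_default = string(bigid)
          * an explicit format keeps every digit
          gen s_full = string(bigid, "%12.0f")
          list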

          Consider this direct approach:

          Code:
          gen numid = floor(id) 
          gen strid = string(floor(id)) 
          
          list 
          
               +------------------------+
               |     id   numid   strid |
               |------------------------|
            1. | 23.149      23      23 |
            2. | 24.001      24      24 |
               +------------------------+
          For this kind of problem, destring and tostring are just over-elaborate. I have nothing against those commands (see the manual entry) but their main points are

          1. Convenience. You can apply them to several variables at once, even all the variables in the dataset. (In each case, the command will ignore what is irrelevant.)

          2. Security. Each command has extra bells and whistles to try to ensure that you don't lose information, unless you say you don't care.

          Those points are not pertinent here. You **know** you want to ignore the digits after the decimal point. You **know** you want just a single new variable.
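
          If you did want the substr() route anyway, a sketch along these lines should work, assuming id is numeric with a %6.3f format as in the example above (sid, cluster_s and cluster_n are names invented here). strpos() locates the decimal point, so the width of the integer part does not matter. strtrim() needs a recent Stata; trim() does the same job in older versions.

          Code:
          * string version of id; strtrim() removes the leading blank that %6.3f adds to ids below 10
          gen sid = strtrim(string(id, "%6.3f"))
          * keep everything before the decimal point
          gen cluster_s = substr(sid, 1, strpos(sid, ".") - 1)
          * numeric version, which should agree with floor(id) for positive ids
          gen cluster_n = real(cluster_s)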



          • #6
            Dear Nick,
            Thank you for your help. You showed me a better solution for this kind of problem, and I learned a lot.



            • #7
              Dear Nick,
              Thank you for the document you shared. It's really interesting, and I need to learn and master those functions.

