
  • Extracting a string that occurs before an underscore

    Hi,

    I have data as follows:

    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input str7 unique_id
    "2044_18"
    "205_16"
    "207_8"
    "2037_15"
    "2037_14"
    "2037_26"
    "2044_25"
    "2095_8"
    "2095_2"
    "2095_19"
    "2044_28"
    "2083_6"
    "2037_16"
    "2037_9"
    "2037_23"
    "2037_19"
    "2074_8"
    "2037_10"
    "2037_21"
    "2037_11"
    "2044_4"
    "2083_13"
    "2041_11"
    "2041_24"
    "2044_21"
    "2083_10"
    "2090_23"
    end
    I want to create a new variable that takes the value before the underscore, e.g. 2044, 205, 207.

    Thanks,
    Tan

  • #2
    Try the following two commands:
    Code:
    gen byte pos=strpos(unique_id, "_")
    gen newvar=substr(unique_id,1, pos-1)
    You can then keep or drop "pos" as you see fit, and you can choose whatever name you want instead of the "newvar" that I used. Of course, you can also -destring- it if you want.
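
    As a rough sketch of the same idea, the two commands can be combined into one and the result converted to numeric with -destring-. The name newvar is just the placeholder used above. The -split- line at the end is an alternative that was not proposed in this post, and the stub name part is hypothetical:

    ```stata
    * substring up to, but not including, the first underscore
    gen newvar = substr(unique_id, 1, strpos(unique_id, "_") - 1)
    * convert the string result to a numeric variable
    destring newvar, replace

    * alternative: split on the underscore and destring both pieces (creates part1 and part2)
    split unique_id, parse("_") generate(part) destring
    ```

    The -split- route also keeps the part after the underscore, which may or may not be wanted.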


    • #3
      Assuming that the part before the underscore is always digits, here is some alternative code using regular expressions:

      Code:
      gen int wanted = real(ustrregexs(1)) if ustrregexm(unique_id,"(\d+)_")
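
      A slightly stricter variant, offered only as a sketch: anchoring the pattern with ^ restricts the match to digits at the very start of the string, and an -assert- afterwards flags any observation where the pattern did not match. The name wanted2 is hypothetical. Note that int only holds values up to 32,740, so long would be the safer storage type for larger identifiers.

      ```stata
      * capture the leading digits only when they are followed by an underscore
      gen int wanted2 = real(ustrregexs(1)) if ustrregexm(unique_id, "^(\d+)_")
      * stop with an error if any identifier failed to match the pattern
      assert !missing(wanted2)
      ```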


      • #4
        Ref. #5: please ignore the following erroneous code:
        Code:
        gen long firstpart = real(subinstr(unique_id), "_", ".", 1)
        Correction:
        Code:
        gen long firstpart = real(subinstr(unique_id, "_", ".", 1))
        Alternatively, with the truncation made explicit via int():
        Code:
        gen firstpart = int(real(subinstr(unique_id, "_", ".", 1)))
        Last edited by Bjarte Aagnes; 20 Dec 2022, 09:22.


        • #5
          Bjarte Aagnes, can you help me understand how #4 manages to work?

          One part that seems crucial is setting the storage type to long, so that the non-integer part is dropped.

          What confuses me is that the line as you entered it treats subinstr() as a function that takes one argument and real() as a function that takes four, when it is the other way around. How is that code even legal, instead of needing to be
          Code:
          gen long firstpart = real(subinstr(unique_id, "_", ".", 1))
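
          To illustrate the storage-type point, here is a minimal sketch (all variable names are hypothetical) that stores the same non-integer result once as a long and once as a double, so the two can be compared side by side:

          ```stata
          clear
          set obs 1
          gen str7 id = "2044_18"
          * integer storage type: the fractional part cannot be stored
          gen long as_long = real(subinstr(id, "_", ".", 1))
          * double keeps the full value returned by real()
          gen double as_double = real(subinstr(id, "_", ".", 1))
          list
          ```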
          Last edited by Hemanshu Kumar; 20 Dec 2022, 09:01.


          • #6
            Bjarte Aagnes, thanks for the edits to #4, but I'm still confused: your original code actually works, whereas I would have thought it was syntactically wrong. I have no clue how it manages to work!

            Again, the original code is:

            Code:
            gen long firstpart = real(subinstr(unique_id), "_", ".", 1)


            • #7
              I have reported this issue to tech support:
              Code:
              display real(subinstr("2044_18"), "_", ".",1) // should fail like the mata code:
              mata : strtoreal(subinstr("2044_18"), "_", ".", 1)


              • #8
                Funnily enough:

                Code:
                . display real(subinstr("2044_18"), "_", ".",1)
                2044.18
                but
                Code:
                . display real(subinstr("2044_18"), "_", ".",1) // should fail like the mata code:
                unknown function ()
                r(133);


                • #9
                  #8 is probably the result of running the command in the Command window, where the /* */, //, and /// comment indicators cannot be used. The error message should be more precise, informative, and consistent when illegal comments are used interactively.


                  • #10
                    You're absolutely right. From a do-file, the result of both commands in #8 is identical, 2044.18. My bad.
