  • Deleting the letter part of a string variable and keeping only the numeric part

    Hello all, I'm very glad to be part of this forum, which has already helped me a lot in recent days.

    I would like to delete the letter part of a string variable and keep only the numeric part (so the variable becomes a numeric one).
    actual variable    expected variable
    robbery (342)      342
    assault (494)      494
    Moreover, I would like to label the new numeric variable with the deleted text of the old variable. There are as many categories as there are types of crime, so the "manual" way is not the best option for me.
    I hope I've been clear enough. Thanks!

  • #2
    Welcome to the Stata Forum / Statalist,

    Try this:

    Code:
    split actual_variable, parse("(" ")")
    I gather you can do the second part with Nick Cox's -labmask- program (Stata Journal).

    Just install it and follow the examples.
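
    As a rough sketch of the installation step, assuming internet access: one route that should work is the labutil package on SSC, which includes labmask. The SJ version can also be located with -search labmask-.

    ```stata
    * labmask is distributed as part of Nick Cox's labutil package on SSC
    ssc install labutil

    * Open the help file, which contains the examples mentioned above
    help labmask
    ```
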
    Last edited by Marcos Almeida; 14 Feb 2019, 14:11.
    Best regards,

    Marcos



    • #3
      Below, a toy example:

      Code:
       . set obs 2
      number of observations (_N) was 0, now 2
      
      . input str40 actual_variable
      
                                    actual_variable
        1. "robbery (342)"
        2. "assault (494)"
        3. end
      
      . split actual_variable, parse("(" ")")
      variables created as string: 
      actual_var~1  actual_var~2
      
      . destring actual_variable2, gen( expected)
      actual_variable2: all characters numeric; expected generated as int
      
      . labmask expected, values( actual_variable1 )
      
      . list
      
           +------------------------------------------------+
           | actual_vari~e   actual~1   actual~2   expected |
           |------------------------------------------------|
        1. | robbery (342)   robbery         342    robbery |
        2. | assault (494)   assault         494    assault |
           +------------------------------------------------+
      
      . codebook expected
      
      ----------------------------------------------------------------------------------------------------------
      expected                                                                                       (unlabeled)
      ----------------------------------------------------------------------------------------------------------
      
                        type:  numeric (int)
                       label:  expected
      
                       range:  [342,494]                    units:  1
               unique values:  2                        missing .:  0/2
      
                  tabulation:  Freq.   Numeric  Label
                                   1       342  robbery
                                   1       494  assault
      Hopefully that helps.
      Best regards,

      Marcos



      • #4
        Thank you, Marcos, for your welcome and, of course, for your quick and correct answer.
        Everything worked great until labmask. This error came out: actual_variable1 not constant within groups of expected.

        Surely what happened is that there are some categories written differently for the same number, or just misspelled.
        Do you have a hint about this? How could I "force" the command to work?



        • #5
          Hi Diego. For the examples you showed, I don't think -labmask- is required. Try this:

          Code:
          clear *
          input str40 oldvar
          "robbery (342)"
          "assault (494)"
          end
          
          split oldvar, parse("(" ")")
          
          destring oldvar2, gen(newvar)
          drop oldvar1 oldvar2
          list
          I get the following output:

          Code:
          . list
          
               +------------------------+
               |        oldvar   newvar |
               |------------------------|
            1. | robbery (342)      342 |
            2. | assault (494)      494 |
               +------------------------+
          HTH.
          --
          Bruce Weaver
          Email: [email protected]
          Web: http://sites.google.com/a/lakeheadu.ca/bweaver/
          Version: Stata/MP 18.0 (Windows)



          • #6
            Thank you, Bruce. That part worked OK. But I want to label newvar with the values of oldvar. I think Nick Cox's labmask is just great for that, but I'm dealing with different problems: missing values in newvar, misspelled categories in oldvar, and so on.

            Best regards,

            Diego



            • #7
              Ah yes, I see you said that pretty clearly in #1. I was in way too big a hurry earlier! Sorry.
              --
              Bruce Weaver
              Email: [email protected]
              Web: http://sites.google.com/a/lakeheadu.ca/bweaver/
              Version: Stata/MP 18.0 (Windows)



              • #8
                An alternative solution (although Marcos's is more robust, for reasons I explain below):

                Code:
                * Data shared via -dataex orig_text-. To install: ssc install dataex
                clear
                input str37 orig_text
                "robbery (342)"                        
                "assault (494)"                        
                "Homicide (187)"                      
                "Motor Vehicle Theft (487)"            
                "Forgery, Check and Access Cards (113)"
                end
                ------------------ copy up to and including the previous line ------------------
                
                
                ssc install strkeep  // in case you don't already have it
                strkeep orig_text, numeric gen(num_part)
                strkeep orig_text, alpha keep("," " ") gen(text_part)  // Keeps commas & spaces in the result
                strkeep orig_text, alpha  gen(text_only)
                
                . list, noobs
                
                  +------------------------------------------------------------------------------------------------------------------+
                  |                             orig_text   num_part                          text_part                    text_only |
                  |------------------------------------------------------------------------------------------------------------------|
                  |                         robbery (342)        342                           robbery                       robbery |
                  |                         assault (494)        494                           assault                       assault |
                  |                        Homicide (187)        187                          Homicide                      Homicide |
                  |             Motor Vehicle Theft (487)        487               Motor Vehicle Theft             MotorVehicleTheft |
                  | Forgery, Check and Access Cards (113)        113   Forgery, Check and Access Cards    ForgeryCheckandAccessCards |
                  +------------------------------------------------------------------------------------------------------------------+

                The weakness of strkeep is that it also keeps any digits that are part of the crime description and runs them together with the code, whereas Marcos's solution keeps only the number within the parentheses. Thus, his solution is more robust. But I thought I would point out this alternative in case it is helpful.
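
                If digits inside the description are a concern, another option is to pull out only the number enclosed in parentheses with Stata's built-in regular expression functions. What follows is a minimal sketch, not one of the solutions posted above: the variable names code and crime, and the test rows "Robbery 2nd degree (211)" and "assault", are made up for illustration.

                ```stata
                clear
                input str40 orig_text
                "robbery (342)"
                "Robbery 2nd degree (211)"
                "assault"
                end

                * Capture only the digits between "(" and ")".
                * Rows with no parenthesized number are left missing.
                gen code = real(regexs(1)) if regexm(orig_text, "\(([0-9]+)\)")

                * Text before the first "(", trimmed of surrounding blanks
                gen crime = strtrim(substr(orig_text, 1, strpos(orig_text, "(") - 1)) ///
                    if strpos(orig_text, "(") > 0

                list, noobs
                ```

                Here the "2" in "2nd degree" is ignored because it is not inside parentheses, and the row without a code ends up with missing values in both new variables rather than a wrong number.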



                • #9
                  With regard to the comments in #4, I gather you didn’t share all the information; I mean, maybe the variable was ‘miscoded’ previously.

                  If that is so (I hope not!), it is not -labmask-‘s fault. In short, it will work correctly if the data were typed correctly.
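
                  To see which codes are causing the "not constant within groups" error, one possible check is to flag every code that carries more than one spelling of its label. This is only a sketch on made-up rows; the helper variables inconsistent and firstpair are hypothetical names, and expected and actual_variable1 are taken from the toy example in #3.

                  ```stata
                  clear
                  input int expected str20 actual_variable1
                  342 "robbery"
                  342 "Robbery"
                  494 "assault"
                  494 "assault"
                  end

                  * Within each code, sort the label text; if the first and last
                  * spellings differ, the label is not constant for that code
                  bysort expected (actual_variable1): ///
                      gen byte inconsistent = actual_variable1[1] != actual_variable1[_N]

                  * Tag one observation per distinct code/spelling pair
                  egen byte firstpair = tag(expected actual_variable1)

                  * Show each offending code once per spelling
                  list expected actual_variable1 if inconsistent & firstpair, noobs
                  ```

                  If the differences turn out to be only in case or stray blanks, something like -replace actual_variable1 = lower(strtrim(actual_variable1))- before calling labmask may be enough; genuine misspellings still have to be corrected by hand.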

                  If you need further help, you’ll probably need to provide more information.

                  Using -dataex- is the best starting point!
                  Best regards,

                  Marcos
