
  • Concatenate contents of string variable / local macro

    I have a simple problem. I have the letters of the alphabet ordered differently according to two different lists (order_pre and order_post). I would like to create an artificial word, stored as a single string, that contains the letters in the order given by each order variable. That is, I would like to create a local macro or variable called word_pre for order_pre that reads "abc", and likewise word_post for order_post that reads "bac".


    Code:
    clear 
    input order_pre str1 letters order_post 
    1 "a" 2
    2 "b" 1
    3 "c" 3
    end 
    
    * word_pre should be "abc"
    preserve 
    sort order_pre
    levelsof letters, local(word_pre)
    restore 
    
    * word_post should be "bac"
    preserve 
    sort order_post
    levelsof letters, local(word_post)
    restore 
    
    gen word_pre = `word_pre'  // this does not work as the letters are not concatenated

  • #2
    You can't use -levelsof- to get your second word, because -levelsof- always returns the distinct values in sorted (alphabetical) order, not in their order of appearance in the data. So you have to build word_post directly in the data.

    As for word_pre, your approach is workable here because order_pre happens to coincide with alphabetical order. You just need to keep -levelsof- from splattering the result with quotes, and then edit out the blank spaces.

    Code:
    clear
    input order_pre str1 letters order_post
    1 "a" 2
    2 "b" 1
    3 "c" 3
    end
    
    * word_pre should be "abc"
    sort order_pre
    levelsof letters, local(word_pre) clean
    gen word_pre = subinstr(`"`word_pre'"', " ", "", .)
    
    * word_post should be "bac"
    sort order_post
    gen word_post = letters in 1
    replace word_post = word_post[_n-1] + letters if _n > 1
    replace word_post = word_post[_N]
    
    list, noobs clean
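
To see why -levelsof- cannot produce word_post, here is a small check on the same example data. It is only a sketch and not part of the original answer; the macro name lv is illustrative.

```stata
clear
input order_pre str1 letters order_post
1 "a" 2
2 "b" 1
3 "c" 3
end

* Put the data in order_post order: b, a, c
sort order_post

* -levelsof- ignores the sort order and returns sorted distinct values
levelsof letters, local(lv) clean
display "`lv'"    // shows: a b c (not b a c)
```

Because the values come back sorted and de-duplicated, any word with repeated letters or a non-alphabetical order has to be built from the data itself, as in the running concatenation above.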


    • #3
      Code:
      . ssc desc valuesof
      
      ----------------------------------------------------------------------------------
      package valuesof from http://fmwww.bc.edu/repec/bocode/v
      ----------------------------------------------------------------------------------
      
      TITLE
            'VALUESOF': module to return the contents of a variable in a macro
      
      DESCRIPTION/AUTHOR(S)
            
             valuesof displays and returns in r(values) the  values of a
            variable joined together in a single string. The values are
            listed in the current sort order of the  dataset and are
            separated by blanks.
            
            KW: variable
            KW: values
            KW: macro
            KW: return
            
            Requires: Stata version 9
            
            Distribution-Date: 20080526
            
            Author: Ben Jann, ETH Zurich
            Support: email [email protected]
            
      
      INSTALLATION FILES                             (type net install valuesof)
            valuesof.ado
            valuesof.hlp
      ----------------------------------------------------------------------------------
      (type ssc install valuesof to install)
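
A minimal sketch of how -valuesof- might be applied to the example from #1, assuming the package has been installed from SSC. It relies only on the behaviour described above (values returned in r(values), in current sort order, separated by blanks).

```stata
* Assumes: ssc install valuesof
clear
input order_pre str1 letters order_post
1 "a" 2
2 "b" 1
3 "c" 3
end

* word_pre: letters in order_pre order
sort order_pre
valuesof letters
local word_pre = subinstr("`r(values)'", " ", "", .)

* word_post: letters in order_post order
sort order_post
valuesof letters
local word_post = subinstr("`r(values)'", " ", "", .)

* Store both words as (constant) string variables
gen word_pre  = "`word_pre'"
gen word_post = "`word_post'"

list, noobs clean
```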


      • #4
        Thank you Nick and Clyde, both answers are really helpful!


        • #5
          Dear Clyde, dear Nick, I ran into another problem. The original dataset is a cross-section of 190,000 individuals, so my loop structure would take ages to calculate the word_pre and word_post variables. However, I struggle to find an alternative, as neither -valuesof- nor -levelsof- can be combined with by. I want to do exactly the same thing we discussed above, forming an artificial word that reflects the pre- and post-ranking, just this time by individual. Perhaps you could once again point me in the right direction!


          Code:
          clear 
          input ID order_pre str1 letters order_post 
          101 1 "a" 2
          101 2 "b" 1
          101 3 "c" 3
          205 1 "b" 1
          205 2 "c" 2
          205 3 "a" .
          end 
          
          
          preserve 
          drop _all
          tempfile cumulator
          quietly save `cumulator', emptyok
          restore 
          
          egen ID_levels = group(ID)
          su ID_levels, meanonly
          
          tempfile dataset 
          save `dataset' 
          
          forvalues i = 1/`r(max)' {
          
              use `dataset' , clear
          
              keep if ID_levels == `i' 
          
              preserve
              drop if missing(order_pre)
              sort order_pre
              valuesof letters
              local word_pre = subinstr("`r(values)'", " " , "" , .)
              restore 
          
              preserve
              drop if missing(order_post)
              sort order_post
              valuesof letters
              local word_post = subinstr("`r(values)'", " " , "" , .)
              restore 
          
              gen word_pre = "`word_pre'"
              gen word_post = "`word_post'" 
          
              tempfile `i' 
              save ``i'' 
          
              append using `cumulator'
              quietly save `cumulator', replace
          
          }


          • #6
            To be honest, 190,000 observations isn't that large a data set these days. And as one who sometimes has analyses that run for weeks, I don't get too worked up about one that will run for "ages," where "ages" really means hours or perhaps a day at worst.

            That said, the -runby- command, written by Robert Picard and me, available from SSC, is precisely what you are looking for here. All of that stuff inside your loop just needs to be packaged as a program designed to handle all the observations of a single value of ID, and then -runby- executes it iteratively over the values of ID using a very efficient algorithm. You won't need to build up that `cumulator' file either; -runby- will handle that for you automatically. There is also no need for all that disk-thrashing with -preserve- and -restore-, which probably accounts for 75% or more of the run-time in your loop. You won't even have to expend the time on creating ID_levels, because -runby- will be happy to use ID itself as the grouping variable.

            Code:
            capture program drop one_ID
            program define one_ID
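                * Build word_pre by running concatenation in order_pre order.
                * (-levelsof- would return the letters alphabetically, e.g. "abc"
                * rather than "bca" for ID 205.)
                sort order_pre
                gen word_pre = letters in 1
                replace word_pre = cond(!missing(order_pre), ///
                    word_pre[_n-1] + letters, word_pre[_n-1]) if _n > 1
                replace word_pre = word_pre[_N]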
            
                sort order_post
                gen word_post = letters in 1
                replace word_post = cond(!missing(order_post), ///
                    word_post[_n-1] + letters, word_post[_n-1]) if _n > 1
                replace word_post = word_post[_N]
                
                exit
            end
            
            runby one_ID, by(ID) status
            -runby-, try it, you'll like it.

            By the way, I'm sure the program above could be modified to use -valuesof- instead of my approach.
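
One way such a -valuesof- variant might look is sketched below. It assumes both -runby- and -valuesof- are installed from SSC, and that -valuesof- accepts an if qualifier (an assumption; check help valuesof). The program name one_ID_valuesof is made up for illustration.

```stata
* Assumes: ssc install runby, ssc install valuesof
clear
input ID order_pre str1 letters order_post
101 1 "a" 2
101 2 "b" 1
101 3 "c" 3
205 1 "b" 1
205 2 "c" 2
205 3 "a" .
end

capture program drop one_ID_valuesof
program define one_ID_valuesof
    * word_pre: letters in order_pre order, skipping missing ranks
    sort order_pre
    valuesof letters if !missing(order_pre)
    gen word_pre = subinstr(`"`r(values)'"', " ", "", .)

    * word_post: letters in order_post order, skipping missing ranks
    sort order_post
    valuesof letters if !missing(order_post)
    gen word_post = subinstr(`"`r(values)'"', " ", "", .)
end

* Run the program once per ID; -runby- stacks the results
runby one_ID_valuesof, by(ID) status

list, noobs clean sepby(ID)
```

Unlike -levelsof-, -valuesof- keeps the current sort order, so word_pre for ID 205 should come out as "bca" and word_post as "bc" (the letter with a missing order_post is skipped).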

            Added: I -expanded- your example data to 60,000 observations with 20,000 IDs and it took 6 seconds to run. Since this program involves sorting, the run time is more than linear in the number of observations, but less than quadratic. So your 190,000-observation data set will probably run in well under 1 minute if your setup is similar to mine. In any case, you will also get a progress report as it runs, with periodic updates about how many IDs have been processed, how much time has elapsed, and an estimate of the time remaining.
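
For anyone who wants to repeat a timing test of this kind, here is one possible way to blow the six-row example up to 60,000 observations with 20,000 IDs. The exact steps used in the post are not shown, so this is only a guess at a reasonable construction; the variable names copy and newID are illustrative.

```stata
clear
input ID order_pre str1 letters order_post
101 1 "a" 2
101 2 "b" 1
101 3 "c" 3
205 1 "b" 1
205 2 "c" 2
205 3 "a" .
end

* 6 rows x 10,000 copies = 60,000 observations
expand 10000

* Number the copies of each original row 1..10000
bysort ID order_pre: gen long copy = _n

* Each (original ID, copy) pair becomes its own ID: 2 x 10,000 = 20,000
egen long newID = group(ID copy)
drop ID copy
rename newID ID

count
```

The per-ID program can then be run on this data set with -runby- to get a feel for the run time on your own machine.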
            Last edited by Clyde Schechter; 09 Nov 2020, 17:53.
