Concatenate contents of string variable / local macro

Max Schwarzer

Join Date: Sep 2020

Posts: 10
#1

03 Nov 2020, 14:55

I have a simple problem. I have the letters of the alphabet ordered in two different ways, given by two order variables (order_pre and order_post). I would like to create an artificial word that contains the letters, concatenated into a single string, in the order given by each order variable. That is, I would like to create a local macro or variable called word_pre for order_pre that reads "abc", and likewise word_post for order_post that reads "bac".

Code:

clear
input order_pre str1 letters order_post
1 "a" 2
2 "b" 1
3 "c" 3
end

* word_pre should be "abc"
preserve
sort order_pre
levelsof letters, local(word_pre)
restore

* word_post should be "bac"
preserve
sort order_post
levelsof letters, local(word_post)
restore

gen word_pre = `word_pre' // this does not work as the letters are not concatenated
Clyde Schechter

Join Date: Apr 2014

Posts: 30075
#2

03 Nov 2020, 18:35

You can't use -levelsof- to get your second word, because -levelsof- always returns strings in alphabetical order, not order of appearance in the data. So you have to build word_post directly in the data.

As for word_pre, your approach is workable--you just need to keep -levelsof- from splattering the result with quotes, and then edit out the blank spaces.

Code:

clear
input order_pre str1 letters order_post
1 "a" 2
2 "b" 1
3 "c" 3
end

* word_pre should be "abc"
sort order_pre
levelsof letters, local(word_pre) clean
gen word_pre = subinstr(`"`word_pre'"', " ", "", .)

* word_post should be "bac"
sort order_post
gen word_post = letters in 1
replace word_post = word_post[_n-1] + letters if _n > 1
replace word_post = word_post[_N]

list, noobs clean

Nick Cox

Join Date: Mar 2014
Posts: 35657
#3

03 Nov 2020, 19:12

Code:

. ssc desc valuesof

----------------------------------------------------------------------------------
package valuesof from http://fmwww.bc.edu/repec/bocode/v
----------------------------------------------------------------------------------

TITLE
      'VALUESOF': module to return the contents of a variable in a macro

DESCRIPTION/AUTHOR(S)
      
       valuesof displays and returns in r(values) the  values of a
      variable joined together in a single string. The values are
      listed in the current sort order of the  dataset and are
      separated by blanks.
      
      KW: variable
      KW: values
      KW: macro
      KW: return
      
      Requires: Stata version 9
      
      Distribution-Date: 20080526
      
      Author: Ben Jann, ETH Zurich
      Support: email [email protected]
      

INSTALLATION FILES                             (type net install valuesof)
      valuesof.ado
      valuesof.hlp
----------------------------------------------------------------------------------
(type ssc install valuesof to install)
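
A minimal sketch of how -valuesof- might be applied to the example data in #1, assuming the package has already been installed from SSC. Because -valuesof- lists values in the current sort order, sorting by the order variable first should give the letters in the wanted sequence; the separating blanks are then stripped with subinstr().

```stata
* assumes -valuesof- is installed: ssc install valuesof
clear
input order_pre str1 letters order_post
1 "a" 2
2 "b" 1
3 "c" 3
end

foreach t in pre post {
    * put the letters in the order given by order_pre / order_post
    sort order_`t'
    * r(values) holds the letters in current sort order, separated by blanks
    quietly valuesof letters
    * remove the blanks to obtain a single word
    gen word_`t' = subinstr(`"`r(values)'"', " ", "", .)
}

list, noobs clean
```

For this data the two new variables should read "abc" and "bac" in every observation.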


Max Schwarzer

Join Date: Sep 2020

Posts: 10
#4

04 Nov 2020, 02:16

Thank you Nick and Clyde, both answers are really helpful!

Max Schwarzer

Join Date: Sep 2020
Posts: 10
#5

09 Nov 2020, 16:11

Dear Clyde, dear Nick, I ran into another problem. The original dataset is a cross-section of 190,000 individuals, so my loop structure would take ages to calculate the word_pre and word_post variables. However, I struggle to find an alternative, as neither -valuesof- nor -levelsof- can be combined with by. I want to do exactly the same as we discussed above, that is, form an artificial word that reflects the pre- and post-ranking, just this time by individual. Perhaps you could once again point me in the right direction!

Code:

clear 
input ID order_pre str1 letters order_post 
101 1 "a" 2
101 2 "b" 1
101 3 "c" 3
205 1 "b" 1
205 2 "c" 2
205 3 "a" .
end 


preserve 
drop _all
tempfile cumulator
quietly save `cumulator', emptyok
restore 

egen ID_levels = group(ID)
su ID_levels, meanonly

tempfile dataset 
save `dataset' 

forvalues i = 1/`r(max)' {

    use `dataset' , clear

    keep if ID_levels == `i' 

    preserve
    drop if missing(order_pre)
    sort order_pre
    valuesof letters
    local word_pre = subinstr("`r(values)'", " " , "" , .)
    restore 

    preserve
    drop if missing(order_post)
    sort order_post
    valuesof letters
    local word_post = subinstr("`r(values)'", " " , "" , .)
    restore 

    gen word_pre = "`word_pre'"
    gen word_post = "`word_post'" 

    tempfile `i' 
    save ``i'' 

    append using `cumulator'
    quietly save `cumulator', replace

}


Clyde Schechter

Join Date: Apr 2014

Posts: 30075
#6

09 Nov 2020, 17:24

To be honest, 190,000 observations isn't that large a data set these days. And as one who sometimes has analyses that run for weeks, I don't get too worked up about one that will run for "ages," where "ages" really means hours or perhaps a day at worst.

That said, the -runby- command, written by Robert Picard and me, available from SSC, is precisely what you are looking for here. All of that stuff inside your loop just needs to be packaged as a program designed to handle all the observations of a single value of ID, and then -runby- executes it iteratively over the values of ID using a very efficient algorithm. You won't need to build up that `cumulator' file either; -runby- will handle that for you automatically. Of course, there is no need for all that disk-thrashing with -preserve- and -restore-, which probably accounts for 75% or more of the run-time in your loop. You won't even have to expend the time on creating ID_levels, because -runby- will be happy to use ID itself as the grouping variable.

Code:

capture program drop one_ID
program define one_ID
    sort order_pre
    levelsof letters if !missing(order_pre), local(word_pre) clean
    gen word_pre = subinstr(`"`word_pre'"', " ", "", .)
    sort order_post
    gen word_post = letters in 1
    replace word_post = cond(!missing(order_post), ///
        word_post[_n-1] + letters, word_post[_n-1]) if _n > 1
    replace word_post = word_post[_N]
    exit
end

runby one_ID, by(ID) status

-runby-, try it, you'll like it.

By the way, I'm sure the program above could be modified to use -valuesof- instead of my approach.
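
One caveat, following from the remark in #2 that -levelsof- returns values in alphabetical order: for an ID such as 205, where order_pre does not follow the alphabet, the -levelsof- step would be expected to give "abc" rather than "bca". A minimal sketch of a variant that builds both words by running concatenation is shown here; the program name one_ID_v2 is hypothetical, and the sketch assumes the variable names from #5 and that -runby- is installed from SSC.

```stata
* hypothetical variant of one_ID: both words are built by running
* concatenation, so the result follows the sort order, not the alphabet
capture program drop one_ID_v2
program define one_ID_v2
    foreach t in pre post {
        * missing ranks sort last; their letters contribute nothing
        sort order_`t'
        gen word_`t' = cond(missing(order_`t'), "", letters) in 1
        replace word_`t' = word_`t'[_n-1] + ///
            cond(missing(order_`t'), "", letters) if _n > 1
        * copy the completed word to every observation of this ID
        replace word_`t' = word_`t'[_N]
    }
end

runby one_ID_v2, by(ID) status
```

On the example data in #5 this should give word_pre "abc" and word_post "bac" for ID 101, and word_pre "bca" and word_post "bc" for ID 205.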

Added: I -expanded- your example data to 60,000 observations with 20,000 IDs and it took 6 seconds to run. Now, since this program involves sorting, the run time is more than linear in the number of observations, but less than quadratic. So your 190,000 observation data set will probably run in well under 1 minute if your setup is similar to mine. In any case, you will also get a progress report as it runs with periodic updates about how many IDs have been processed, how much time has elapsed, and an estimate of the time remaining.
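
If installing -runby- is not an option, the same running concatenation can also be sketched with the by: prefix alone. This is a different technique from the -runby- solution above, not a replacement endorsed in the thread; the helper variable name piece is hypothetical, and the variable names are assumed to be those from #5.

```stata
* by-group running concatenation without -runby- (a sketch)
foreach t in pre post {
    * letters with a missing rank contribute an empty string
    gen piece = cond(missing(order_`t'), "", letters)
    * within each ID, sort by rank and start the word with the first letter
    bysort ID (order_`t'): gen word_`t' = piece if _n == 1
    * append each further letter to the word built so far
    by ID: replace word_`t' = word_`t'[_n-1] + piece if _n > 1
    * copy the completed word to every observation of the ID
    by ID: replace word_`t' = word_`t'[_N]
    drop piece
}

list, noobs sepby(ID)
```

Since this relies on two sorts of the whole data set rather than one program call per ID, it may well be fast enough for 190,000 observations, though that has not been timed here.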

Last edited by Clyde Schechter; 09 Nov 2020, 17:53.
