Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Generating new variables based on a variable naming convention

    Dear Statalist,

    I hope this finds you well - my first post so please be gentle.
    I regularly use data-sets which contain a very methodical naming convention and I'm trying to write some script to combine them.

    The naming convention is variable_e1_c1 (incomplete observations 1-10), variable_e1_c2 (incomplete observations 11-20), variable_e1_c3 (incomplete observations 21-30), variable_e1_c4 (incomplete observations 31-40) where the variables in the data set I want to combine will all contain the same variable-name_e-number. I've started with a few approaches, however, I don't know how to pass the variable name into logical test after running say:

    - ds, has(varl *_e1*).
    If varname == "*_c*" {then do}.

    I've also tried the approach with:

    - foreach x of varlist *_e2_* {

    but again I'm lost as to how to apply logic gates to the value of `x' as it takes the value of the first observation where .

    Ultimately I'm keen on generating a new variable where all the observations take a similar convention and I currently use the tedious approach of copy and paste a few lines of code as such:

    gen Bar=.
    replace Bar= real(biadlyn_e1_c1) if biadlyn_e1_c2==.
    replace Bar= biadlyn_e1_c2 if real(biadlyn_e1_c1)==.

    I apologies is this thread has been covered and would appreciate a link and also do note although I'm not new to programming I am new to STATA and it's handling power so any advice would be most welcome.

    Kind regards

  • #2
    Code:
    sysuse auto, clear
    
    // get me a list of all variables with names containing the letter m
    unab mlist : *m*
    di `"|`mlist'|"'
    
    // get me a list of all variables with names containing the letter e
    unab elist : *e*
    di `"|`elist'|"'
    
    // get the intersection of these lists (see: help macrolists)
    local combined_list : list mlist & elist
    di `"|`combined_list'|"'
    ---------------------------------
    Maarten L. Buis
    University of Konstanz
    Department of history and sociology
    box 40
    78457 Konstanz
    Germany
    http://www.maartenbuis.nl
    ---------------------------------

    Comment


    • #3
      Thanks Maarten,

      I've started implementing the code and have gotten to the next gap in knowledge so far I've got this:

      unab camlist : *_E2_*
      unab datlist : BRTHDAT*
      local combined_list : list camlist & datlist
      egen BRTHDAT = rowmax("`combined_list'")


      The egen command is the step which I'm struggling with I essentially would want to count through the macro list (combined_list) and assign the variables to a complete variable (BRTHDAT) with the command and augment this

      I also want to run this for all variables in the database so envisaged running it thought a loop and a variable naming truncation:

      forvalues i = 1/9{
      foreach n of varlist *_E`i'_*{
      local vargen = `"regexs(0) regexm("`n'", "[A-Z][A-Z][A-Z][A-Z][A-Z][A-Z]")"'
      display `vargen'
      }
      }


      A sample of the birthday fields below shows the naming convention which I'm keen on amalgamating (to generate a BRTHDAT_E2 = BRTHDAT_E2_C15 + BRTHDAT_E2_C16 + BRTHDAT_E2_C19 + BRTHDAT_E2_C20 if BRTHDAT == . output)

      BRTHDAT_E1_C1
      BRTHDAT_E1_C2
      BRTHDAT_E2_C15
      BRTHDAT_E2_C16
      BRTHDAT_E2_C19
      BRTHDAT_E2_C20
      BRTHDAT_E3_C21
      BRTHDAT_E3_C22
      BRTHDAT_E3_C25
      BRTHDAT_E3_C26
      BRTHDAT_E4_C27
      BRTHDAT_E4_C28
      BRTHDAT_E4_C32
      BRTHDAT_E4_C33
      BRTHDAT_E5_C34
      BRTHDAT_E5_C35
      BRTHDAT_E5_C43
      BRTHDAT_E5_C44
      BRTHDAT_E6_C45
      BRTHDAT_E6_C46


      The combination of the two codes at this point isn't even close to working - any assistance would be really great. I feel I'm chasing my tale with the macro tags atm `""' / `' / "`'" / ``'' / ``""'' ....

      Thanks again for your time,
      kind regards,

      Marcus

      Comment


      • #4
        I find this very hard to follow. egen, rowmax() is not for string manipulations. It's for finding numerical maxima, across observations, of several variables.

        I can't see that the variables named in #3 have any relation to what's in #1. Here's a very wild guess at what you're aiming at

        Code:
        forval j = 1/6 { 
              egen BRTHDAT_E`j' = rowtotal(BRTHDAT_E`j'_C*) 
        }
        If that is way off, and you don't get a better answer, http://xyproblem.info/ seems to apply.

        1. Tell us about your data. Give a data example. If your data are complicated, give us a simplified analogue that shows the problem.

        2. Tell us what you want to do. The clearest part of #3 is, or seems to be, that you want add variables of a certain kind, given by a shared name prefix.

        As you say, you have programming knowledge, but only some of will apply literally to Stata and you don't know which part at present.

        Comment


        • #5
          Thanks for following this up Nick and apologies for causing any frustration.

          Your code certainly helps with some elements of the dataset - and you're right I should have been more descriptive about my data-set and problem so apologies for that.

          - So the data I have is fragmented, across a number of variables in the data set, I want to combine (or de-fragment these variables into one or fewer variables) all the variables contain the same variable name (I've called variablecode) an event number(I've called En1) and a version or capture number(I've called Cn2). Across the database it has the repeated naming convention:- VARIABLECODE_En1_Cn2 (Where in the example above post#3 - the variables downloaded from Openclinica are BRTHDAT_En1_Cn2).

          - As you've shown above (in post #4) I'm keen to capture all of the observations in BRTHDAT_E1 from the BRTHDAT_E1_C* variables - BRTHDAT_En1 should contain all of the Birthday Observations (not the sum of the rows - apologies again) which will be recorded in the Cn2 convention.

          - I've started with the variables for BRTHDAT and BIADLYN and to help I'll give a visual key of what I have and how I'm keen to transform the dataset:


          Table 1 - The data I have
          BRTHDAT_E4_C28 BRTHDAT_E4_C32 BRTHDAT_E5_C34 BRTHDAT_E5_C35 BRTHDAT_E5_C43 BIADLYN_E1_C1 BIADLYN_E1_C2 BIADLYN_E5_C34 BIADLYN_E5_C35
          01/11/2025 01/11/2025 1 1
          01/12/2022 01/12/2022 01/12/2022 01/12/2022 1 1
          01/03/2027 01/03/2027 01/03/2027 1 1
          01/07/2014 01/07/2014 1
          04/06/2019 04/06/2019 04/06/2019 1 1
          03/08/2019 03/08/2019 03/08/2019 1 1
          03/09/2019 03/09/2019 03/09/2019 1 1
          01/07/2013 01/07/2013 1
          01/09/2020 1
          10/09/2019 10/09/2019 10/09/2019 1 1
          08/08/2019 08/08/2019 08/08/2019 1 1
          01/07/2025 01/07/2025 01/07/2025 1 1
          01/10/2019 01/10/2019 01/10/2019 1
          01/03/2027 01/03/2027 01/03/2027 01/03/2027 1 1
          01/06/2028 01/06/2028 01/06/2028 1 1
          01/06/2027 01/06/2027 1

          Table 2: The output of I'm keen to receive:
          BRTHDAT_E4 BRTHDAT_E5 BIADLYN_E1 BIADLYN_E5
          01/11/2025 01/11/2025 1 1
          01/12/2022 01/12/2022 1 1
          01/03/2027 01/03/2027 1 1
          01/07/2014 01/07/2014 1
          04/06/2019 04/06/2019 1 1
          03/08/2019 03/08/2019 1 1
          03/09/2019 03/09/2019 1 1
          01/07/2013 01/07/2013 1
          01/09/2020 1
          10/09/2019 10/09/2019 1 1
          08/08/2019 08/08/2019 1 1
          01/07/2025 01/07/2025 1 1
          01/10/2019 01/10/2019 1
          01/03/2027 01/03/2027 1 1
          01/06/2028 01/06/2028 1 1

          - I was keen to understand if there are/is a program or function in STATA which can transform all of my data like this i.e. loop for all 1700+ variables and defragment into VARIABLECODE_En1. from VARIABLECODE_En1_Cn2.

          My thought process was to:
          - loop through *_E`1 - 9'_* which is the only sequential part to the naming convention.

          - The second thought was to truncate the VARIABLECODE out from the variable name to generate a new variable.

          - Before combining, "defragmenting" the various versions of the variable VARIABLECODE_En1_C* together

          Thanks again for your time on my problem and sincerely, apologies for confusing a few of my ideas, scripts and concepts.
          Kind regards,

          Marcus

          Comment

          Working...
          X