Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identify variable names that differ only in first letter

    Dear all,

    I have a dataset with variables that differ only in the first letter of the variable name, e.g.
    datum_7 xdatum_7 wdatum_7 zdatum_7 ydatum_7
    rheum_7 wrheum_7 zrheum_7


    For re-organizing my data I would like to identify those variables that could be "grouped together", i.e. are identical, except for the first letter. For example, I would like to list the "groups" first, then put any "group" in a macro and do some recoding, renaming etc. The main problem here is that I have 2000+ variables in the dataset and I am looking for a handy way to identify the groups without searching for them manually. I tried different commands like renvars, findnames, order or rename - but was not successful so far.

    Anone an idea?

    Thanks,
    Alex

  • #2
    I tried different commands like renvars, findnames, order or rename - but was not successful so far.
    What does this mean? What did you type? What did Stata do (or not do) in response? Why did this not correspond to your goal?

    Code:
    order * , alphabetic
    will order the variables alphabetically, meaning all variables starting with d before those starting with e and so on. You can directly refer to these variables using wildcard characters, as in

    Code:
    describe d*
    describe e*
    As far as I can tell, this is what you ask for. If not, please explain.

    Best
    Daniel

    Comment


    • #3
      The statement about differing in the first letter is, in this case, intended to include the first letter being omitted! That is a complication here.

      Thus given

      Code:
      datum_7 xdatum_7 wdatum_7 zdatum_7 ydatum_7
      I would not say that datum_7 and xdatum_7 differ only in the first latter, as all the other letters are different.

      However, I think you might make progress by reversing the variable names and looking for common patterns.

      I think I would put all the variable names as a string variable in another dataset and work it out from there.



      Comment


      • #4
        I see, sorry for being ignorant here. This sounds interesting. Can we have a little more information on the the patterns of the variable names.

        In the example give, datum_7 and rheum_7 both have 7 characters, all other names have exactly 8 characters. Does this apply for all variable names, or are there differences in length?

        Best
        Daniel

        Comment


        • #5
          Daniel: I don't think you are ignorant at all here. The twist to the problem needs to be more explicit.

          Comment


          • #6
            Here is a try

            Code:
            cap pr drop namegroups
            m : mata clear
            
            *! version 1.1.0 27nov2015 daniel klein
            pr namegroups
                vers 11.2    
                m : get_namegroups()
            end
            
            vers 11.2
            
            m :
            
            void get_namegroups()
            {
                string rowvector name, group
                real scalar n_groups
                
                name = st_varname((1..st_nvar()))
                
                st_rclear()
                
                n_groups = 0
                for (i = 1; i <= cols(name); ++i) {
                    group = (name[i], select(name, (name[i] :== substr(name, 2, .))))
                    name = select(name, (name[i] :!= substr(name, 2, .)))
                    st_global("r(group" + strofreal(i) + ")", invtokens(group))
                    n_groups = n_groups + 1
                }
                st_numscalar("r(n_groups)", n_groups)
            }
            
            end
            And here is an example

            Code:
            . clear
            
            .
            . foreach x in ///
            >         datum_7 xdatum_7 wdatum_7 zdatum_7 ydatum_7 ///
            >         rheum_7 wrheum_7 zrheum_7 ///
            >         foo_7 afoo_7 {
              2.        
            .         generate byte `x' = 42
              3. }
            
            .
            . describe
            
            Contains data
              obs:             0                          
             vars:            10                          
             size:             0                          
            ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
                          storage  display     value
            variable name   type   format      label      variable label
            ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
            datum_7         byte   %8.0g                  
            xdatum_7        byte   %8.0g                  
            wdatum_7        byte   %8.0g                  
            zdatum_7        byte   %8.0g                  
            ydatum_7        byte   %8.0g                  
            rheum_7         byte   %8.0g                  
            wrheum_7        byte   %8.0g                  
            zrheum_7        byte   %8.0g                  
            foo_7           byte   %8.0g                  
            afoo_7          byte   %8.0g                  
            ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
            Sorted by:  
                 Note:  dataset has changed since last saved
            
            .
            . namegroups
            
            . 
            . return list
            
            scalars:
                       r(n_groups) =  3
            
            macros:
                         r(group3) : "foo_7 afoo_7"
                         r(group2) : "rheum_7 wrheum_7 zrheum_7"
                         r(group1) : "datum_7 xdatum_7 wdatum_7 zdatum_7 ydatum_7"
            
            . 
            end of do-file
            It is probably not the most efficient way of doing it, but it is a start.

            Best
            Daniel
            Last edited by daniel klein; 27 Nov 2015, 08:34. Reason: version 1.1.0 returns the number of groups, example includes description of dataset

            Comment


            • #7
              Good program!

              Comment


              • #8
                This is indeed an interesting problem. Here's another solution that focuses on identifying common suffixes. The program builds up suffixes as long as there are two or more variable names that match each suffix.

                Code:
                capture program drop varsuffix
                program varsuffix, rclass
                
                    version 12
                    
                    preserve
                    
                    qui {
                    
                        // make a dataset of variables
                        describe,  replace clear
                
                        gen pos = length(name)
                        gen suffix = ""
                        gen try = ""
                        local more 1
                    
                        // look for common suffixes
                        while `more'  {
                    
                            replace pos = pos - 1
                            replace try = substr(name, pos, .)
                            bysort try: replace suffix = try if _N > 1 & pos > 0
                            count if pos > 1
                            local more = r(N)
                            list
                    
                        }
                    
                        levelsof suffix, local(suffixes) clean
                        return local suffixes `suffixes'
                            
                    }
                    
                end
                Here's the output with the variable names provided

                Code:
                . * Example generated by -dataex-. To install: ssc install dataex
                . clear
                
                . input byte(id datum_7 xdatum_7 wdatum_7 zdatum_7 ydatum_7 rheum_7 wrheum_7 zrheum_7)
                
                           id   datum_7  xdatum_7  wdatum_7  zdatum_7  ydatum_7   rheum_7  wrheum_7  zrheum_7
                  1. 1 2 3 4 5 6 7 8 9
                  2. end
                
                . 
                . varsuffix
                
                . ret list
                
                macros:
                           r(suffixes) : "datum_7 rheum_7"
                
                . 
                . foreach suffix in `r(suffixes)' {
                  2.         des *`suffix'
                  3. }
                
                              storage   display    value
                variable name   type    format     label      variable label
                ------------------------------------------------------------------------------------------------------------------------------------
                datum_7         byte    %8.0g                 
                xdatum_7        byte    %8.0g                 
                wdatum_7        byte    %8.0g                 
                zdatum_7        byte    %8.0g                 
                ydatum_7        byte    %8.0g                 
                
                              storage   display    value
                variable name   type    format     label      variable label
                ------------------------------------------------------------------------------------------------------------------------------------
                rheum_7         byte    %8.0g                 
                wrheum_7        byte    %8.0g                 
                zrheum_7        byte    %8.0g
                Here's another example with more than one character before the common suffix:

                Code:
                . clear
                
                . input byte(id carprice busprice datum_7 xdatum_7 wdatum_7 y2015zdatum_7 y2015ydatum_7 y2015rheum_7 y2015wrheum_7 y2015zrheum_7)
                
                           id  carprice  busprice   datum_7  xdatum_7  wdatum_7  y2015zd~  y2015y~7  y2015r~7  y2015w~7  y2015zr~
                  1. 1 2 3 4 5 6 7 8 9 10 11
                  2. end
                
                . 
                . varsuffix
                
                . ret list
                
                macros:
                           r(suffixes) : "datum_7 price rheum_7"
                
                . 
                . foreach suffix in `r(suffixes)' {
                  2.         des *`suffix'
                  3. }
                
                              storage   display    value
                variable name   type    format     label      variable label
                ------------------------------------------------------------------------------------------------------------------------------------
                datum_7         byte    %8.0g                 
                xdatum_7        byte    %8.0g                 
                wdatum_7        byte    %8.0g                 
                y2015zdatum_7   byte    %8.0g                 
                y2015ydatum_7   byte    %8.0g                 
                
                              storage   display    value
                variable name   type    format     label      variable label
                ------------------------------------------------------------------------------------------------------------------------------------
                carprice        byte    %8.0g                 
                busprice        byte    %8.0g                 
                
                              storage   display    value
                variable name   type    format     label      variable label
                ------------------------------------------------------------------------------------------------------------------------------------
                y2015rheum_7    byte    %8.0g                 
                y2015wrheum_7   byte    %8.0g                 
                y2015zrheum_7   byte    %8.0g

                Comment


                • #9
                  Thank you all for your efforts! Both programs run perfectly. I'll use Daniel`s because it best meets my needs - now I have a total of 748 groups in r(list), which I can use to continue.

                  Best,
                  Alex
                  Last edited by Alex Pabst; 30 Nov 2015, 03:56.

                  Comment


                  • #10
                    Actually, Robert's program is the (much) better one. Note that my code assumes the "common" identifier name always occurs first, meaning that

                    Code:
                    datum_7
                    will occur before

                    Code:
                    xdatum_7
                    in the dataset. Thus, grouping depends on the order of the variables in the dataset, which should clearly not be the case.

                    I would go with Robert's approach, maybe with a little tweak, depending on your exact needs.

                    Best
                    Daniel

                    Comment

                    Working...
                    X