Identify variable names that differ only in first letter

Alex Pabst

Join Date: Nov 2015

Posts: 2
#1

Identify variable names that differ only in first letter

27 Nov 2015, 05:18

Dear all,

I have a dataset with variables that differ only in the first letter of the variable name, e.g.
datum_7 xdatum_7 wdatum_7 zdatum_7 ydatum_7
rheum_7 wrheum_7 zrheum_7

For re-organizing my data I would like to identify those variables that could be "grouped together", i.e. are identical, except for the first letter. For example, I would like to list the "groups" first, then put any "group" in a macro and do some recoding, renaming etc. The main problem here is that I have 2000+ variables in the dataset and I am looking for a handy way to identify the groups without searching for them manually. I tried different commands like renvars, findnames, order or rename - but was not successful so far.

Anone an idea?

Thanks,
Alex
Tags: foreach, label, loop, macro, variable names
daniel klein

Join Date: Mar 2014

Posts: 3847
#2

27 Nov 2015, 07:11

I tried different commands like renvars, findnames, order or rename - but was not successful so far.

What does this mean? What did you type? What did Stata do (or not do) in response? Why did this not correspond to your goal?

Code:

order * , alphabetic

will order the variables alphabetically, meaning all variables starting with d before those starting with e and so on. You can directly refer to these variables using wildcard characters, as in

Code:

describe d* describe e*

As far as I can tell, this is what you ask for. If not, please explain.

Best
Daniel
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35694
#3

27 Nov 2015, 07:22

The statement about differing in the first letter is, in this case, intended to include the first letter being omitted! That is a complication here.

Thus given

Code:

datum_7 xdatum_7 wdatum_7 zdatum_7 ydatum_7

I would not say that datum_7 and xdatum_7 differ only in the first latter, as all the other letters are different.

However, I think you might make progress by reversing the variable names and looking for common patterns.

I think I would put all the variable names as a string variable in another dataset and work it out from there.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3847
#4

27 Nov 2015, 07:56

I see, sorry for being ignorant here. This sounds interesting. Can we have a little more information on the the patterns of the variable names.

In the example give, datum_7 and rheum_7 both have 7 characters, all other names have exactly 8 characters. Does this apply for all variable names, or are there differences in length?

Best
Daniel
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35694
#5

27 Nov 2015, 08:04

Daniel: I don't think you are ignorant at all here. The twist to the problem needs to be more explicit.
Comment

daniel klein

Join Date: Mar 2014
Posts: 3847

27 Nov 2015, 08:28

Here is a try

Code:

cap pr drop namegroups
m : mata clear

*! version 1.1.0 27nov2015 daniel klein
pr namegroups
    vers 11.2    
    m : get_namegroups()
end

vers 11.2

m :

void get_namegroups()
{
    string rowvector name, group
    real scalar n_groups
    
    name = st_varname((1..st_nvar()))
    
    st_rclear()
    
    n_groups = 0
    for (i = 1; i <= cols(name); ++i) {
        group = (name[i], select(name, (name[i] :== substr(name, 2, .))))
        name = select(name, (name[i] :!= substr(name, 2, .)))
        st_global("r(group" + strofreal(i) + ")", invtokens(group))
        n_groups = n_groups + 1
    }
    st_numscalar("r(n_groups)", n_groups)
}

end

And here is an example

Code:

. clear

.
. foreach x in ///
>         datum_7 xdatum_7 wdatum_7 zdatum_7 ydatum_7 ///
>         rheum_7 wrheum_7 zrheum_7 ///
>         foo_7 afoo_7 {
  2.        
.         generate byte `x' = 42
  3. }

.
. describe

Contains data
  obs:             0                          
 vars:            10                          
 size:             0                          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
datum_7         byte   %8.0g                  
xdatum_7        byte   %8.0g                  
wdatum_7        byte   %8.0g                  
zdatum_7        byte   %8.0g                  
ydatum_7        byte   %8.0g                  
rheum_7         byte   %8.0g                  
wrheum_7        byte   %8.0g                  
zrheum_7        byte   %8.0g                  
foo_7           byte   %8.0g                  
afoo_7          byte   %8.0g                  
----------------------------------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:  
     Note:  dataset has changed since last saved

.
. namegroups

. 
. return list

scalars:
           r(n_groups) =  3

macros:
             r(group3) : "foo_7 afoo_7"
             r(group2) : "rheum_7 wrheum_7 zrheum_7"
             r(group1) : "datum_7 xdatum_7 wdatum_7 zdatum_7 ydatum_7"

. 
end of do-file

It is probably not the most efficient way of doing it, but it is a start.

Best
Daniel

Last edited by daniel klein; 27 Nov 2015, 08:34. Reason: version 1.1.0 returns the number of groups, example includes description of dataset

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35694
#7

27 Nov 2015, 08:34

Good program!
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

27 Nov 2015, 10:22

This is indeed an interesting problem. Here's another solution that focuses on identifying common suffixes. The program builds up suffixes as long as there are two or more variable names that match each suffix.

Code:

capture program drop varsuffix
program varsuffix, rclass

    version 12
    
    preserve
    
    qui {
    
        // make a dataset of variables
        describe,  replace clear

        gen pos = length(name)
        gen suffix = ""
        gen try = ""
        local more 1
    
        // look for common suffixes
        while `more'  {
    
            replace pos = pos - 1
            replace try = substr(name, pos, .)
            bysort try: replace suffix = try if _N > 1 & pos > 0
            count if pos > 1
            local more = r(N)
            list
    
        }
    
        levelsof suffix, local(suffixes) clean
        return local suffixes `suffixes'
            
    }
    
end

Here's the output with the variable names provided

Code:

. * Example generated by -dataex-. To install: ssc install dataex
. clear

. input byte(id datum_7 xdatum_7 wdatum_7 zdatum_7 ydatum_7 rheum_7 wrheum_7 zrheum_7)

           id   datum_7  xdatum_7  wdatum_7  zdatum_7  ydatum_7   rheum_7  wrheum_7  zrheum_7
  1. 1 2 3 4 5 6 7 8 9
  2. end

. 
. varsuffix

. ret list

macros:
           r(suffixes) : "datum_7 rheum_7"

. 
. foreach suffix in `r(suffixes)' {
  2.         des *`suffix'
  3. }

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------------------
datum_7         byte    %8.0g                 
xdatum_7        byte    %8.0g                 
wdatum_7        byte    %8.0g                 
zdatum_7        byte    %8.0g                 
ydatum_7        byte    %8.0g                 

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------------------
rheum_7         byte    %8.0g                 
wrheum_7        byte    %8.0g                 
zrheum_7        byte    %8.0g

Here's another example with more than one character before the common suffix:

Code:

. clear

. input byte(id carprice busprice datum_7 xdatum_7 wdatum_7 y2015zdatum_7 y2015ydatum_7 y2015rheum_7 y2015wrheum_7 y2015zrheum_7)

           id  carprice  busprice   datum_7  xdatum_7  wdatum_7  y2015zd~  y2015y~7  y2015r~7  y2015w~7  y2015zr~
  1. 1 2 3 4 5 6 7 8 9 10 11
  2. end

. 
. varsuffix

. ret list

macros:
           r(suffixes) : "datum_7 price rheum_7"

. 
. foreach suffix in `r(suffixes)' {
  2.         des *`suffix'
  3. }

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------------------
datum_7         byte    %8.0g                 
xdatum_7        byte    %8.0g                 
wdatum_7        byte    %8.0g                 
y2015zdatum_7   byte    %8.0g                 
y2015ydatum_7   byte    %8.0g                 

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------------------
carprice        byte    %8.0g                 
busprice        byte    %8.0g                 

              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------------------------------------------------------------
y2015rheum_7    byte    %8.0g                 
y2015wrheum_7   byte    %8.0g                 
y2015zrheum_7   byte    %8.0g

Comment

Alex Pabst

Join Date: Nov 2015

Posts: 2
#9

30 Nov 2015, 03:53

Thank you all for your efforts! Both programs run perfectly. I'll use Daniel`s because it best meets my needs - now I have a total of 748 groups in r(list), which I can use to continue.

Best,
Alex

Last edited by Alex Pabst; 30 Nov 2015, 03:56.
Comment
daniel klein

Join Date: Mar 2014

Posts: 3847
#10

30 Nov 2015, 04:19

Actually, Robert's program is the (much) better one. Note that my code assumes the "common" identifier name always occurs first, meaning that

Code:

datum_7

will occur before

Code:

xdatum_7

in the dataset. Thus, grouping depends on the order of the variables in the dataset, which should clearly not be the case.

I would go with Robert's approach, maybe with a little tweak, depending on your exact needs.

Best
Daniel
1 like
Comment

Announcement