Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Keeping only variables whose name is in a list

    Stata 15, Windows 10

    I'm iterating over a set of datasets looking for particular variables, some are not in all datasets.

    I can't figure out how to have stata drop variables that aren't in the list or keep only the variables that appear in the list.

    If I type keep var1, var2, var3, and that particular dataset doesn't have var2, I get an error and nothing happens. So as the do file iterates over the datasets, it stops when it encounters this and doesn't drop the variables I'm not interested in.

    Any suggestions?

  • #2
    Here's sample code that may point you in a useful direction. The key is using a macro list function to select variables that appear in both the list of variables to keep and the list of variables in the dataset. See help macrolists for details.
    Code:
    sysuse auto, clear
    local keep make price sales
    ds
    local vars `r(varlist)'
    local tokeep : list vars & keep
    display "`tokeep'"
    keep `tokeep'
    ds
    Code:
    . sysuse auto, clear
    (1978 Automobile Data)
    
    . local keep make price sales
    
    . ds
    make          mpg           headroom      weight        turn          gear_ratio
    price         rep78         trunk         length        displacement  foreign
    
    . local vars `r(varlist)'
    
    . local tokeep : list vars & keep
    
    . display "`tokeep'"
    make price
    
    . keep `tokeep'
    
    . ds
    make   price

    Comment


    • #3
      It's usually better practice to explain, as William does, how basic commands can get you what you need in a few lines. Nevertheless this problem has necessarily arisen many times before and there are customised solutions too. Consider also isvar (SSC):


      Code:
      . sysuse auto, clear
      (1978 Automobile Data)
      
      . isvar price mpg sales
      
      variables: price mpg
      not variable: sales
      
      . ret li
      
      macros:
                  r(badlist) : "sales"
                  r(varlist) : "price mpg"

      Comment


      • #4
        Could you kindly show me an efficient way of keeping certain variables with the same prefix across waves of a panel (noting some vars are not in every wave). For example, I have about 10 variables beginning with le (lemar, lemvd, lebth, ...) and there are four extensions of each (such as lemarq1, ... lemarq4). Using the local variables command within a loop I tried adding * at the end, e.g. lemar* to pick up the 'q1-q4' variations without success. That said, I would like to do the equivalent of lemar* or better, le*. Suggestions appreciated. I note that my local varlist comprises 30+ variables, so addressing this issue could cut this by up to one-half. I also note that I am using Stata 15 SE.

        Thank you in advance, Chris

        Comment


        • #5
          I don't really follow #4. I note that

          Code:
          keep le*
          will keep all variables starting that way and naturally you can and should also specify any others that you want. I don't see the connection here to local macros at all. If it helps to put some of the names in a local first, that's fine, but I can't see that is central to the quesiton.

          it's a hard lesson to learn, but #4 is talking about code rather than showing us exactly what you are doing....

          Comment


          • #6
            Thanks Nick Cox.

            I was shown to use a local macro to 'keep' the list of variables I wanted in my dataset (#9 at https://www.statalist.org/forums/for...tring-function), but I am not sure if there is a limit to how many variables one can list here. Below is an example of my code to this end, which includes those vars that I discuss in #4. Yes this is a coding issue but I also wonder if I am incorrectly using the local macro. Like I said, I tried to replace all le variables with le* but no variables beginning with le were kept in my dataset). I had the same outcome with cety*

            Code:
            local variables hhid age sex educ marstat lebth lefni lefnw lefrd leins lemar lemvd leprg lesep lercl ledrl ///
            cety01 cety02 cety03 cety04 cety05 cety06 cety07 cety08 cety09 cety10 lsvol lshrvol lsnwmc lsclubn lstrust ///
            losat losatnl losatlc losathl losatsf losatyh losatfs losateo
            If the * function does not work with a local macro, should I swap the local macro to the 'keep' command to achieve what I want in #4?

            Kind regards, Chris

            Comment


            • #7
              A local macro is just a place to put text. There is no problem in putting * into a macro as one character, very often among others as part of longer text. Its meaning outside that macro as a wildcard (not function) meaning all variable names is a different thing. Its meaning outside that macro as a multiplication operator is another. And so on.

              You're referring to some situation in which you used the local macro as argument to something else and didn't get what you want: perhaps an error message, or nothing, or something puzzling. That would be something explainable in those terms. It's not anything to do with * not working with local macros, because there isn't such a problem.

              Here the immediate worry seems to be a keep statement, but you don't show us what it is.

              There is a limit on the length of a local macro, but it's enormously more than you are using here.

              Comment


              • #8
                Thanks for your reply Nick Cox. When I replaced the list of variables beginning with le with just le*, none of the variables were kept. I've added the part of the code I'm using in hope that the potential error is clear.

                Code:
                // variables to keep (sample shown)
                local variables xwaveid hhpxid age sex educ marstat lebth lefni lefnw lefrd leins lemar lemvd leprg lesep lercl ledrl ///
                cety01 cety02 cety03 cety04 cety05 cety06 cety07 cety08 cety09 cety10 lsvol lshrvol lsnwmc lsclubn lstrust ///
                losat losatnl losatlc losathl losatsf losatyh losatfs losateo
                
                local filename partners
                clear
                save "`savingdir'/`filename'", replace emptyok
                
                // drawing individual data + gen wave (Code from link provided above by William Lisowski)
                
                forvalues wave=1/18 {                                    
                    local waveprefix = word(c(alpha), `wave')
                    use "`origdatadir'/Combined_`waveprefix'180c.dta", clear
                    rename `waveprefix'* *
                    drop if hhpxid==""
                   
                    quietly ds
                    local allvars `r(varlist)'
                    local thiswave : list variables & allvars
                    generate byte wave = `wave'
                    keep xwaveid hhpxid wave `thiswave'
                    display "Wave `wave' (`waveprefix') - kept `thiswave'
                    append using "`savingdir'/`filename'"
                    save "`savingdir'/`filename'", replace
                }

                Comment


                • #9
                  Code:
                   
                   local thiswave : list variables & allvars
                  isn't smart the way you want. It certainly won't expand wildcards. It's just looking at words in the macro as it sees them. If "le*" appears in one list and "le1 le2" in another the intersection is empty.

                  Comment


                  • #10
                    I'm still quite new to Stata and do not know how to address this. Some example code would be appreciated?

                    Comment


                    • #11
                      In turn there is no way I can test your code because you don't provide sample data. I appreciate that your datasets are likely to be large and complicated and can't be given here but the lack of example data is a problem for us in answering your questions, In programming forums there is much emphasis on stripping down problems to minimal, complete, verifiable examples, but that in turn requires some general fluency in a language.

                      Or -- more crucially -- you often don't give code or it's not relevant to your problem.

                      #8 is a case in point as nowhere can I see the use of a wildcard le* which you report as causing problems.

                      #9 explained why the code line cited there would not work if a wildcard were part of either local macro. The implication is not to do that -- but you aren't doing it in code you show us.

                      I can't do more to help.

                      I don't understand why the code in #8 wouldn't work. Further, you explain that it is just part of your code. Fair enough, but that doesn't make the problem easier in this case.

                      Conversely, it may be the code works, but you just want to make it a little more clever. OK, but I don't understand enough about your data to understand how.
                      Last edited by Nick Cox; 08 Feb 2020, 02:17.

                      Comment


                      • #12
                        Thanks Nick Cox. It seems I misunderstood what you want. Let me try to explain.

                        With respect to your request for sample data - Yes I am working with a very large dataset, but I'm really not sure what of my data you need to see, so I've taken a guess to provide a sample of some of the variables of interest in my original question. The 'le' variables are all 1, 2 responses ([1] No [2] Yes). There are 10 cety variables because the question asks the question about which type of school do you send your child (to which they have four response options) and they account for up to 10 children, hence cety01...cety10). I hope this gives you a better feel for my data. I can say that I have the crosswave id number for every respondent and a corresponding id for their partners, then a list of about 50 variables, some of which are very similar in name.
                        Code:
                         * Example generated by -dataex-. To install: ssc install dataex
                        clear
                        input byte(lebth lefni lefnw lefrd leins lemar lemvd leprg lesep lercl ledrl cety01 cety02 cety03 cety04 cety05 cety06 cety07 cety08 cety09 cety10)
                        
                        2 1 1 2 1 1 1 2 1 1 1 . . . . . . . . . .
                        1 1 1 1 2 1 1 2 1 1 1 . . . . . . . . . .
                        1 1 1 2 1 1 2 1 1 1 2 . . . . . . . . . .
                        1 2 1 1 1 2 1 2 1 1 1 . . . . . . . . . .
                        2 1 2 1 1 2 1 1 2 1 1 . . . . . . . . . .
                        1 1 1 1 2 1 1 1 2 1 1 1 . . . . . . . . .
                        2 1 1 1 1 1 1 2 1 1 1 . . . . . . . . . .
                        2 1 1 2 1 1 2 1 1 1 2 . . . . . . . . . .
                        1 1 1 1 1 1 1 1 1 1 1 . . . . . . . . . .
                        2 1 1 1 2 1 1 1 2 1 2 3 3 . . . . . . . .
                        1 1 1 2 1 1 2 1 1 1 1 . . . . . . . . . .
                        1 1 1 1 1 1 1 1 1 1 1 . . . . . . . . . .
                        1 1 1 1 1 1 1 1 1 1 1 . . . . . . . . . .
                        2 1 1 2 1 1 1 2 1 1 1 . . . . . . . . . .
                        1 1 1 1 2 1 1 2 1 1 1 . . . . . . . . . .
                        1 1 1 2 1 1 2 1 1 1 2 . . . . . . . . . .
                        1 2 1 1 1 2 1 2 1 1 1 . . . . . . . . . .
                        2 1 2 1 1 2 1 1 2 1 1 . . . . . . . . . .
                        1 1 1 1 2 1 1 1 2 1 1 1 . . . . . . . . .
                        2 1 1 1 1 1 1 2 1 1 1 . . . . . . . . . .
                        2 1 1 2 1 1 2 1 1 1 2 . . . . . . . . . .
                        1 1 1 1 1 1 1 1 1 1 1 . . . . . . . . . .
                        2 1 1 1 2 1 1 1 2 1 2 2 2 . . . . . . . .
                        1 1 1 1 1 1 1 1 1 1 1 . . . . . . . . . .
                        1 2 1 1 2 1 1 2 1 1 1 . . . . . . . . . .
                        1 1 1 1 1 1 1 2 1 1 1 . . . . . . . . . .
                        2 1 1 1 1 1 1 1 1 1 1 . . . . . . . . . .
                        2 1 1 2 1 1 2 1 1 1 1 1 1 . . . . . . . .
                        With respect to
                        Code:
                        you often don't give code or it's not relevant to your problem. #8 is a case in point as nowhere can I see the use of a wildcard le* which you report as causing problems
                        I thought the code I am using (in #8) so you could see it and help me find a more efficient way of listing the variables as the list of 'le..' variables is quite long, especially with the quarterly variations (noted in #4). If I have to list all the variations of the 'le' variables that will add another 40-odd variables to my list, hence my question to ask for a more efficient way to list them.
                        In #8, I noted that I tried swapping out those variables beginning with 'le' (and others). The code I used was
                        Code:
                         local variables xwaveid hhpxid age sex educ marstat le* cety* lsvol lshrvol lsnwmc lsclubn lstrust losat*
                        While you were kind enough to identify the problem line of code in #9 the explanation was too complicated for me - I do not understand what a 'wildcard' is. I thought it would be easy for you to show a way to improve the code I have, which is why I provided #8 - which seems not to have helped you.
                        The code in #8 works. As I stated in my original question, I would like to know if there is a more efficient way to list all the variables I want to keep in my dataset, especially those listed above. I thought there would be a way (using my code or alternative code) to replace my list of all variables beginning with 'le' (cety) to just one variable called 'le* (and cety*, etc), but while I was able to run my code without error, the variables beginning with 'le' or 'cety' were not kept.
                        You are therefore correct when you state
                        Code:
                        Conversely, it may be the code works, but you just want to make it a little more clever. OK, but I don't understand enough about your data to understand how.
                        I'm sorry, I thought I was clear in my post in #4, I have tried to be clearer here. I hope this helps. If not, maybe I should just use 'keep' instead of 'local variables'. As I said, I've had help with my code - and some of it I do not yet understand ...

                        Comment


                        • #13
                          A Google for wildcard leads to an informative first find and a search in Stata leads to informative links. le* is an example of a wildcard, the use of an abbreviation that may or may not be matched by one or more variable names.

                          Otherwise thanks for the extra detail, and the thread remains wide open to anyone else willing to help.

                          Comment


                          • #14
                            Yes I understand now thanks Nick Cox.

                            Comment


                            • #15
                              Why -isvar-, as suggested by Nick in #3, is not a solution? Besides, the code for keeping wanted variables could be, and should be, put outside the loop for appending.

                              Code:
                              *ssc install isvar
                              
                              local variables xwaveid hhpxid age sex educ marstat le* cety* losat*
                              
                              clear
                              save Output, replace emptyok
                              
                              foreach f in YourFileList {
                                  use `f', clear
                                  append using Output
                                  save Output, replace
                              }
                              
                              isvar `variables'
                              keep `r(varlist)'

                              Comment

                              Working...
                              X