  • How can I use the if command in Stata in a way analogous to other programming languages?

    Hello


    This is maybe a stupid question; however, I have tried hard to answer it without using the time of the Stata Forum. I have just started using Stata and would like to write little programs as I do in other languages, such as Pascal, SAS or R. Unfortunately, I seem to have a fundamental misconception about the workings of Stata.

    One of the most basic tools for programming is the if-else facility (with all its extensions and variations, such as while etc.).

    The problem is the following: whenever I try to use the if command, the commands specified after the { are not executed even though the specified condition is met in a given dataline (condition evaluates to "true"), as it would be the case in other programming languages. I searched in the FAQ files and found an answer, which I do not understand, however (underlined words, in particular). I quote

    "A common mistake is to use the if command as the argument. This is incorrect Stata syntax. The if command was designed to be used with a single expression (often a local macro) inside programs and do-files. Using this command incorrectly results in the evaluation of the expression using only the first observation of the variable."
    Example: there is a simple data file with the variable x with values 1 1 2 3 2 .... With the following program lines
    if x == 3 {
        replace x = 4
    }
    no changes are made. If the condition is "if x==1" instead, all values of x in the dataset are replaced with 4, because the condition is met for the first data line.

    FAQ says:
    "This is obviously not what the user intended. The solution to this problem is often to rewrite the command using if as a qualifier rather than as a command. As a qualifier, if will evaluate each observation as the command passes over the data." such as
    replace x = 4 if x == 3
    replace x = 4 if x == 1
    I cannot believe that this should be the only solution to the problem. The if qualifier is much less flexible and does not allow constructions with else. In other words: the if command would appear as quite useless to me if it were not executed for each line in a dataset (as I know it from other programming languages).

    Thank you very much in advance for the right hint

    Martin

  • #2
    Welcome to Stataland! There are many ways to achieve what you want to do ... and what is best (for you) depends quite a lot on precisely what you are trying to achieve. (And there is often a choice to be made between verbosity combined with transparency versus brevity.) So, I recommend that you set out a specific example (or several) of what you're trying to achieve, and members can make suggestions. Meanwhile, you might like to look at recode as well as replace, and also check out functions like inlist() or inrange() or cond(). I also find missing() very useful in this context as well (or rather constructions including !missing() ).
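
    As a rough illustration of the functions just mentioned, here is a minimal sketch on made-up data (the variable names x, low, mid, size and x2 are hypothetical). Note that missing() is tested first inside cond(), because a numeric missing value counts as larger than any number in Stata.

    ```stata
    * Toy data (hypothetical)
    clear
    input x
    1
    2
    3
    .
    7
    end

    * inlist() and inrange() return 1 or 0 for each observation
    gen byte low = inlist(x, 1, 2)
    gen byte mid = inrange(x, 3, 6)

    * cond() gives an observation-level if ... else ...
    gen str7 size = cond(missing(x), "unknown", cond(x >= 3, "large", "small"))

    * recode maps several values in one command; missing values are left alone
    recode x (1 2 = 10) (3/7 = 20), generate(x2)
    list, clean
    ```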



    • #3
      Most of this is expression of puzzlement rather than a specific question that can be answered! The second (third, ...) language you learned was like, yet also unlike, the previous language(s), and Stata is no different. Some of Stata's constructs will strike you at first as very odd, and some will, it is to be hoped, seem so smart that you will start missing them in other languages. I came to Stata with varying experience of about six languages and my initial experiences were a mix of "That's fine" and "That's weird".

      I've often seen posts elsewhere from people who learned Stata before R asking for equivalents of Stata in R, which irritates many experienced R users mightily. I wouldn't say otherwise, would I, but I suspect that (proportionally) more experienced Stata users would be willing to say that R is excellent; it's just a case of how much they also use it.

      Still, I have no interest in sparking debates about which software is better or best.

      Stata is as it is documented. Perhaps you should give some examples of your problems which evoked error messages, explaining what it was that you were trying to do.

      More specifically, if ... else ... is entirely possible at observation level: use cond()

      See its documentation or
      http://www.stata-journal.com/sjpdf.html?articlenum=pr0016

      In some languages, something of the logical form (if not literal syntax) if condition { action } is automatically implemented as a loop over observations. The Stata answer is that precisely that is available as

      action if condition

      which is the main point of the material you cite. You can explicitly set up a loop over observations, but it's slow (sometimes very slow).
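
      To make the contrast concrete, here is a minimal sketch using the toy values from the original question (x = 1 1 2 3 2). It is only meant to show the two behaviours side by side.

      ```stata
      * Toy data from the question: x takes the values 1 1 2 3 2
      clear
      input x
      1
      1
      2
      3
      2
      end

      * if COMMAND: the condition is evaluated once, as x[1] == 3.
      * x[1] is 1, so the block is skipped and nothing changes.
      if x == 3 {
          replace x = 4
      }
      count if x == 4      // 0

      * if QUALIFIER: the condition is checked for every observation.
      replace x = 4 if x == 3
      count if x == 4      // 1
      ```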


      Last edited by Nick Cox; 12 Jun 2014, 05:27.



      • #4
        There are 3 uses of the word if in Stata: (1) the if programming command in Stata, (2) the if programming command in Mata, and (3) the if qualifier that can appear within many (if not most) Stata commands. The last of these simply restricts the primary command to operating on those observations for which the expression following the if statement is true. The first two are programming commands of the type you are used to, which can be used with blocks of code, can accommodate else and else if constructions, etc.

        I think you'll find that your "problem" is not with the if command, but rather with the way in which variables work as data structures in Stata. In particular, Stata commands tend to operate on entire variables (rather than just on specific observations), and there are several powerful idioms that permit you to manipulate all of the observations in a variable very quickly and easily (e.g., the by construction). Thus, ~99% of the time, if you're thinking about operating on (i.e., looping through) the values of a variable individually, you're not thinking about the problem in the right (i.e., Stata) way. This may take some getting used to if you're coming from another language or environment, but it is very efficient for performing analyses, and is easy to teach to those without a programming background.
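
        As a small sketch of the by idiom mentioned above (made-up data; the names id, visit, value, seq, nvisits and change are hypothetical), each line below works on the whole variable within groups, with no explicit loop:

        ```stata
        * Toy follow-up data (hypothetical): several visits per person
        clear
        input id visit value
        1 1 5
        1 2 7
        2 1 3
        2 2 .
        2 3 9
        end

        * Within each id, ordered by visit: running number and group size
        bysort id (visit): gen seq = _n
        bysort id (visit): gen nvisits = _N

        * Change since the first visit of the same person
        bysort id (visit): gen change = value - value[1]
        list, sepby(id)
        ```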

        There are many other data structures in Stata/Mata, including scalars, vectors and matrices (like in R), arrays, associative arrays (i.e., dictionaries), etc. The nice thing about Stata, however, is that you don't need to (explicitly) use these for the majority of analyses or for simple data manipulation.

        My suggestion would be to jump in and consult the reference manuals, as needed; if you're having trouble performing a specific task, then ask questions here (people are glad to help). Once you have a bit of experience with Stata, it's likely to get easier.
        Last edited by Phil Schumm; 12 Jun 2014, 06:14.



        • #5
          Thank you very much, Stephen, Nick and Phil for this really quick reply.


          Nick and Phil proposed that I should give a specific example in order to see how it could best be dealt with in Stata. Here is one:

          I often have data on calendar dates, where day, month and year can be known or missing. In many contexts I need a true date, so I have to replace missing days / months with standardised values.

          In SAS, I have implemented a little macro for this, which first checks whether the year is known, if so then the month, and finally the day. Depending on the situation, a date variable is created.
          I tried to write a little Stata program which should do the same:

          program makedat

          /* takes four arguments: the first argument is the name of a date variable (target variable), ble is 1 or 2
          ........
          */

              version 13
              args datevar dd mm yyyy
              if (`yyyy'>0 & `yyyy'<2999) {
                  if (`mm'>0 & `mm'<13) {
                      if (`dd'>0 & `dd'<32) {
                          generate `datevar'=mdy(`mm',`dd',`yyyy')
                      }
                      else {
                          generate `datevar'=mdy(`mm',15,`yyyy')
                      }
                  }
                  else {
                      generate `datevar'=mdy(7,1,`yyyy')
                  }
              }
              else {
                  generate `datevar'=.
              }
          end

          The call from another program could be
          makedat diagdat ,
          and in a data line with day missing, month=11 and year=2012 the value of the new variable diagdat would be 15.11.2012.

          As the program did not work, I finally resorted to using the "if qualifier" solution:

          generate `datevar'=.
          replace `datevar'=mdy(7,1,`yyyy') if `yyyy'>0 & `yyyy'<2999
          replace `datevar'=mdy(`mm',15,`yyyy') if `yyyy'>0 & `yyyy'<2999 & `mm'>0 & `mm'<13
          replace `datevar'=mdy(`mm',`dd',`yyyy') if `yyyy'>0 & `yyyy'<2999 & `mm'>0 & `mm'<13 & `dd'>0 & `dd'<32

          This works. However, I was not really satisfied because it appears less efficient to me (or is this different in Stata?), because the replace commands are executed on `datevar' when day, month and year are known (in addition to the initial generate). With the nested structure above, there is only one generate command for each situation (I assume that the evaluations in the if commands are much less time consuming). Furthermore, I could not imagine that Stata could be so different in its inner workings compared to the other programming languages I know.

          If I have to adapt to a different way of programming, OK (my job requires me to use Stata), but it makes life a bit more complicated as I cannot simply "translate" the work I have done in the past.

          Thanks for a further comment on this.

          Best,
          Martin






          • #6
            Thanks for your example, but I don't get your rules. It seems that you want to supply

            one date, possibly partial, as one or more constants

            and then generate an entire variable containing a replacement date. That doesn't correspond to anything I can imagine in serious management of Stata data if only because it only handles one date at a time.

            Remember: a variable is an entire column or field in the dataset.



            • #7
              Originally posted by Martin Gebhardt
              I often have data on calendar dates, where day, month and year can be known or missing. In many contexts I need a true date, so I have to replace missing days / months with standardised values.
              If we let m, d and y represent the month, day and year, respectively, then
              gen mydate = cond(!mi(m,d), mdy(m,d,y), cond(!mi(m), mdy(m,15,y), mdy(7,1,y)))
              will mimic your rules, where it is assumed that if m, d and y are non-missing, then they are putatively valid. I'm not sure where your ranges (0,13), (0,32) and (0,2999) come from, but I would suggest separating the issue of missing from the issue of validation. The latter is best handled by Stata's own date function(s), which will catch errors that your ranges won't (e.g., February 29th in non-leap years).
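
              A minimal sketch of the validation point, on made-up values (the variable names d, m, y, date and invalid are hypothetical): mdy() returns missing when the components do not form a real calendar date, so a missing result with non-missing components flags an invalid date.

              ```stata
              * Toy data (hypothetical): two valid and two impossible dates
              clear
              input d m y
              29 2 2012
              29 2 2013
              31 4 2013
              15 6 2013
              end

              gen date = mdy(m, d, y)
              format date %td

              * Components are all present, yet no valid date could be formed
              gen byte invalid = missing(date) & !missing(m, d, y)
              list, clean
              ```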

              Originally posted by Martin Gebhardt
              This works. However, I was not really satisfied because it appears less efficient to me (or is this different in Stata?), because the replace commands are executed on `datevar' when day, month and year are known (in addition to the initial generate).
              You are correct that your 3 replace statements with if conditions will each run through all observations, and this is in principle inefficient. However, this is only going to matter in a large dataset, or if you're doing this type of thing over and over again. In those cases, then you do need to think about performance, and ways to speed things up (e.g., using a statement like that above, or writing a function in Mata). But, in general, don't worry about performance until you need to.

              Originally posted by Martin Gebhardt
              Furthermore, I could not imagine that Stata could be so different in its inner workings compared to the other programming languages I know.
              While it's true that different languages often share similarities in syntax, data structures, etc., and that relating things in a new language to things you know from previous languages can be helpful in learning a new language, this usually breaks down pretty quickly. For example, a Java programmer can quickly learn enough Python to write a functioning program, but it is unlikely to be recognized by Python programmers as efficient Python code. Each language has its own unique features and idioms, and learning these is required to use the language effectively. An experienced R user will miss a few things when working in Stata, to be sure, but she (or he) will also discover several gems not available in R.
              Last edited by Phil Schumm; 12 Jun 2014, 09:34.



              • #8
                It looks as if you are trying to write a program (-program makedat-) which takes as arguments a date variable, a day, a month, and year, and then creates a new date variable according to some logic. If that is correct, then I think you don't mean

                Code:
                makedat diagdat
                but
                Code:
                makedat diagdat dayvar monvar yearvar
                Is that correct? Otherwise, I cannot make sense of what you are trying to do. If this is correct, then the program you want is probably


                Code:
                program makedat
                   version 13
                   args datevar dd mm yyyy
                   tempvar day mon year
                   gen `day'  = cond(inrange(`dd',1,31),`dd',15)
                   gen `mon'  = cond(inrange(`mm',1,12),`mm',7)
                   gen `year' = cond(inrange(`yyyy',1,2999),`yyyy',2012)

                   gen `datevar' = mdy(`mon',`day',`year')
                end
                This creates three new variables for month, day, year where the missing values are filled in, and then creates the new datevar using those. Since they are temporary variables, they disappear when the program ends.

                If you really want to emulate the -if- logic (that is, avoid temporary variables), you could do it all on one line using nested -cond()- calls, but it will be very difficult to read. Note that neither your logic nor my program captures problems such as when the month is February and the day is 30.
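
                For what it is worth, here is a sketch of what such a nested version might look like. It follows the branch order of the original nested -if- program (year, then month, then day) rather than the temporary-variable program above, and it assumes integer components, so that year > 0 and year < 2999 becomes inrange(yyyy, 1, 2998). The variable names are hypothetical, and the /// line continuations need a do-file.

                ```stata
                * Toy data (hypothetical): complete, day missing, day and month missing, year missing
                clear
                input dd mm yyyy
                12  3 2010
                 . 11 2012
                 .  . 2011
                 5  6    .
                end

                * One pass over the data; each cond() is one if ... else ... branch
                gen diagdat = cond(!inrange(yyyy, 1, 2998), .,               ///
                              cond(!inrange(mm, 1, 12), mdy(7, 1, yyyy),     ///
                              cond(!inrange(dd, 1, 31), mdy(mm, 15, yyyy),   ///
                                   mdy(mm, dd, yyyy))))
                format diagdat %td
                list, clean
                ```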



                • #9
                  Dear all

                  thank you again for your helpful comments. My initial motivation for contacting the forum was not to get just a solution for the specific makedat problem (I had one already, the if-qualifier version) but to better understand why if-else constructs do not work as in the other languages I have been using so far. The makedat program served as a simple example. In all the other languages I am familiar with (SAS, SPSS, R, Splus, Visual Foxbase, Pascal, Fortran), the logic is that a record is read one at a time, and then processed. If-else constructs are used for implementing conditional branching, corresponding to the rhombus shapes in flow charts. It appears that the same logic does not apply to Stata.

                  That doesn't correspond to anything I can imagine in serious management of Stata data if only because it only handles one date at a time.
                  Nick, I can reassure you that it is serious. I am working with surveillance data that come often with incomplete information. If I need to calculate the width of a time interval [t1,t2] in days and only the year is known for t1, t2 complete, it is more accurate to approximate t1 by the first July of that year than to throw away the information on day and month in t2. Of course, the fact that an approximation was used needs to be considered when doing statistics.

                  gen mydate = cond(!mi(m,d), mdy(m,d,y), cond(!mi(m), mdy(m,15,y), mdy(7,1,y)))
                  will mimic your rules, where it is assumed that if m, d and y are non-missing, then they are putatively valid
                  Thank you, Phil, for this alternative solution. I agree that the data validation issue can be dealt with separately.

                  It looks as if you are trying to write a program (-program makedat-) which takes as arguments a date variable, a day, a month, and year, and then creates a new date variable according to some logic. If that is correct, then I think you don't mean
                  Code:
                  makedat diagdat
                  but
                  Code:
                  makedat diagdat dayvar monvar yearvar
                  Is that correct?
                  Of course you are right, Jeph, sorry for my mistake. I find the code you provide much better than mine.

                  Thanks again, I am sure your comments will help me in my efforts to build a toolbox similar to what I have in SAS etc.











                  • #10
                    Naturally I agree that needing to impute dates makes perfect sense: see e.g. http://www.stata-journal.com/article...article=dm0062

                    I am still puzzled at the impulse to create an entire Stata variable corresponding to a specific date. If you want to have a constant date, it would be better to hold it as a scalar.
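
                    A minimal sketch of the scalar idea, with made-up values (the names t2, refdate and days_since are hypothetical). Writing scalar(refdate) avoids any clash with a variable of the same name.

                    ```stata
                    * Toy data (hypothetical): t2 holds complete daily dates
                    clear
                    input t2
                    19000
                    19100
                    end
                    format t2 %td

                    * One fixed date, stored once as a scalar instead of in every observation
                    scalar refdate = mdy(7, 1, 2011)
                    display %td refdate

                    * The scalar can be used in an expression evaluated over the whole variable
                    gen days_since = t2 - scalar(refdate)
                    list, clean
                    ```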

                    Unfortunately a discussion on what Stata constructs are, or are not, natural or congenial is best not distracted by examples that don't appear natural themselves.



                    • #11
                      Martin: you wrote regarding other software that
                      the logic is that a record is read one at a time, and then processed.
                      May I suggest that you look again at the last line of Nick Cox's first post in this thread? (And, related, the last paragraph in Phil Schumm's.) What you may not have fully grasped -- I'm guessing -- is that, most of the time, one shouldn't think in terms of getting Stata to work observation by observation (working down each row of your data matrix, if you will). (The underlying compiled code may be doing that, but I'm referring to how the user thinks about it.) One of Stata's greatest advantages is that the whole variable (column in your data matrix, if you will) is worked on. Combined with the by construct, this makes Stata really powerful and efficient at many data management tasks. That is, in Stata, one only very very rarely loops over observations. (In Mata, it's different.)



                      • #12
                        Dear Stephen and Nick

                        I am very happy with the idea that I can learn and discover how powerfully and efficiently many tasks can be handled with Stata. And by no means do I want to suggest that other programs are superior just because I am more familiar with them. However, it is a fact that ALL other programs I have been working with (and there are quite a few) follow the observation-by-observation logic, which is a natural way of processing many types of observational data. In particular, this approach enables the use of if-else constructs (which is not possible in Stata, the reason why I approached the Forum), and although I see well that there are alternative ways to accomplish these tasks, I am still not convinced that data processing before statistical analysis begins is done more efficiently the way Stata wants me to do it. I really do not want to open a discussion between "schools" (Stata against, say, SAS or R). However, the fact that so many other programs have the obs-by-obs approach suggests to me that it may be useful in many instances. At the very least, it would have facilitated my actual task of building a toolbox in Stata as I have built it in SAS (which I wrongly imagined would be a kind of simple "translation", as it would be, to a large degree at least, between SAS and R, or SAS and Visual Fox). Now I will have to invest a bit more. However, there are positive sides: first, there is this wonderful Forum, which has already helped me a lot, and then there is the possibility that I can solve many other tasks more easily because Stata does it differently.


                        Nick, you write
                        I am still puzzled at the impulse to create an entire Stata variable corresponding to a specific date. If you want to have a constant date, it would be better to hold it as a scalar.
                        Unfortunately a discussion on what Stata constructs are, or are not, natural or congenial is best not distracted by examples that don't appear natural themselves.
                        The data I am working with are follow-up data on persons infected with HIV and/or other diseases. For each person, there are many dates (each corresponding to a variable, of course; why does this puzzle you?): birthdate, dates of specific symptoms, dates of various follow-up events, for foreigners the date of immigration, for some the date of death. Some of these dates are only partially known. Then I resort to approximations, which is not the same as using a "constant date". It is still a variable. Because I have to do calculations with the various dates (such as intervals), this is more easily done if I have the variables in date format instead of several variables for the components of the dates.
                        You are certainly right when you say that a discussion on Stata constructs should not be focussed on examples. However, I hope that you can see now that the example I have used is very natural, indeed.

                        I hope you do not find me stubborn now. It is clear that I am a Stata "greenhorn" and there may be more problems for which I would like to seek the advice of the Forum.




                        • #13
                          Just comparing SAS and Stata, there is a difference in the way the languages loop over observations. Both languages have an implied loop that acts as a shortcut. The implied loop allows you to treat variables as scalars in your code, when they are actually vectors.

                          In SAS, you have the data step, where the entire code block is executed observation-by-observation. So a series of commands can be run on each observation.

                          In Stata, there is no data step. Or, you could say each data-manipulation command is contained within its own data step. Each command loops over all observations before the next command is executed.

                          Using if as a programming command in Stata is only useful for controlling whether a set of commands is executed. It doesn't work observation-by-observation. If you want to work observation-by-observation, you have to use the if-qualifier. You have to break your command into steps that can be executed in separate loops. As suggested above, you could separate validation from variable creation. So you loop through the observations once for each variable validation, then a final time for variable creation.

                          If you want to use a SAS-style data step construction, you would have to loop over observations explicitly using subscripts. This is possible in Stata, but probably not recommended. I imagine it would be very slow. It would look like this:

                          Code:
                          version 13
                          args datevar dd mm yyyy

                          ** Note: generate variable vectors before looping through observations
                          generate `datevar' = .

                          * forvalues needs a literal upper bound, so _N is evaluated with `=_N'
                          forvalues i = 1/`=_N' {

                              if (`yyyy'[`i']>0 & `yyyy'[`i']<2999) {
                                  if (`mm'[`i']>0 & `mm'[`i']<13) {
                                      if (`dd'[`i']>0 & `dd'[`i']<32) {
                                          * the left-hand side cannot be subscripted; restrict with "in" instead
                                          replace `datevar' = mdy(`mm'[`i'],`dd'[`i'],`yyyy'[`i']) in `i'
                                      }
                                      else {
                          etc...

                          }
                          The help page for subscripts is help subscripts.
                          If you really have to use this type of construction, and need it to run efficiently, you should look into writing it in Mata.
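
                          A rough sketch of what a Mata version might look like, assuming the components are held in numeric variables. The function name makedat_mata and the variable names are hypothetical, and the branches copy the ranges from the original nested -if- program. Inside Mata the loop really does run observation by observation, with ordinary if/else blocks.

                          ```stata
                          * Toy data (hypothetical)
                          clear
                          input dd mm yyyy
                          12  3 2010
                           . 11 2012
                           .  . 2011
                           5  6    .
                          end

                          mata:
                          // Hypothetical helper: builds a date variable row by row with if/else
                          void makedat_mata(string scalar newvar, string scalar dvar,
                                            string scalar mvar, string scalar yvar)
                          {
                              real matrix    X
                              real colvector out
                              real scalar    i, d, m, y

                              X   = st_data(., (dvar, mvar, yvar))   // copy the three columns
                              out = J(rows(X), 1, .)                 // result starts as missing
                              for (i = 1; i <= rows(X); i++) {
                                  d = X[i, 1]
                                  m = X[i, 2]
                                  y = X[i, 3]
                                  if (y > 0 & y < 2999) {
                                      if (m > 0 & m < 13) {
                                          if (d > 0 & d < 32) {
                                              out[i] = mdy(m, d, y)
                                          }
                                          else {
                                              out[i] = mdy(m, 15, y)
                                          }
                                      }
                                      else {
                                          out[i] = mdy(7, 1, y)
                                      }
                                  }
                              }
                              // create the new variable and fill it in one step
                              st_store(., st_addvar("double", newvar), out)
                          }
                          end

                          mata: makedat_mata("diagdat", "dd", "mm", "yyyy")
                          format diagdat %td
                          list, clean
                          ```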

                          Mike



                          • #14
                            On your comments addressed generally:

                            The difference you are inferring is far less than you imagine. Stata too makes many changes observation by observation: how else could it work? The point is largely that this is not as explicit in the syntax as it is in some other languages.

                            For example, a standard idiom in Stata is something like

                            Code:
                             
                            replace y = y[_n-1] if missing(y)
                            This makes no sense unless you understand that it defines a loop over observations: look at the first, look at the second, ....
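
                            A small sketch on made-up values shows the implied loop at work: because the observations are processed in order, a value filled in at one observation is already available for the next, so runs of missing values are filled from the last known value.

                            ```stata
                            * Toy data (hypothetical): gaps after observations 1 and 4
                            clear
                            input y
                            1
                            .
                            .
                            4
                            .
                            end

                            * Observation 2 takes y[1]; observation 3 then takes the new y[2]
                            replace y = y[_n-1] if missing(y)
                            list, clean
                            ```

                            After this, y holds 1 1 1 4 4.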

                            On your comments addressed to me:

                            We do need examples for a discussion: the point is that those examples should make sense to everyone, so that they then don't distract from the principles being discussed.

                            Suppose you have thousands of observations and one partly known date. Your program puts the same imputed date in every one of those observations. I don't see why you want to do that. Conversely, if you want to edit specific observations the program needs handles to do that, namely some use of if and/or in qualifiers (not commands!).

                            In Stata, a variable is an entire field or column of the dataset. This is a quite specific use of the term that doesn't match widespread usage in discussions of programming.

                            By all means come back to the forum with other questions: that's why it's here. An old discussion at http://www.stata.com/statalist/archi.../msg01258.html may still be instructive.



                            • #15
                              Thank you very much for your very constructive remarks. I only have trouble seeing why you think that my program puts the "same imputed date in every one of those observations". An approximated date is created only if at least the year is known, and since year varies among observations, the approximated date is a variable, not a scalar. I also would not call this way of approximation an "imputation", which for me would be inferring an estimate for the missing parts based on other information available in the observation. The way the approximation works is rather that the missing part is replaced with its expected value (roughly), assuming a constant distribution (e.g. 15 for a "standard" month with 30 days).
