Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Variable observations get deleted after running execute

    Good evening dear stata community,

    I am currently using STATA 15 and have run into a problem that I cannot find the solution for myself.
    Since this is my first time posting into statalist, please excuse my lack of formatting.

    I am trying to prepare my dataset for a survival time analysis. To do that, I have to clone new variables from already existing ones, drop certain missing values (while keeping others) and then generate the duration and censor variables from those.

    An example of an already existing variable would be this:

    PHP Code:
     fre df05a

    df05a 
    -- MARRIEDYEAR OF WEDDINGMARRIAGE 1
    ----------------------------------------------------------------------------
                                   |      
    Freq.    Percent      Valid       Cum.
    -------------------------------+--------------------------------------------
    Valid   -10  NAPSCREENED OUT |       9566      14.03      14.03      14.03
            
    -9   NO ANSWER         |        204       0.30       0.30      14.33
            
    -7   REFUSED           |         14       0.02       0.02      14.35
            
    -1   NOT AVAILABLE     |      43375      63.64      63.64      77.99
            1918                   
    |          1       0.00       0.00      77.99
            1921                   
    |          2       0.00       0.00      77.99
            1922                   
    |          1       0.00       0.00      78.00
            1923                   
    |          5       0.01       0.01      78.00
            1924                   
    |          5       0.01       0.01      78.01
            1925                   
    |          9       0.01       0.01      78.02
            1926                   
    |         10       0.01       0.01      78.04
            1927                   
    |         16       0.02       0.02      78.06
            1928                   
    |         12       0.02       0.02      78.08
            1929                   
    |         18       0.03       0.03      78.11
            1930                   
    |         21       0.03       0.03      78.14
            1931                   
    |         28       0.04       0.04      78.18
            1932                   
    |         41       0.06       0.06      78.24
            1933                   
    |         43       0.06       0.06      78.30
            1934                   
    |         55       0.08       0.08      78.38
            1935                   
    |         54       0.08       0.08      78.46
            
    :                      |          :          :          :          :
            
    1991                   |         75       0.11       0.11      98.82
            1992                   
    |         72       0.11       0.11      98.92
            1993                   
    |         64       0.09       0.09      99.02
            1994                   
    |         63       0.09       0.09      99.11
            1995                   
    |         64       0.09       0.09      99.20
            1996                   
    |         61       0.09       0.09      99.29
            1997                   
    |         71       0.10       0.10      99.40
            1998                   
    |         68       0.10       0.10      99.50
            1999                   
    |         64       0.09       0.09      99.59
            2000                   
    |         42       0.06       0.06      99.65
            2001                   
    |         26       0.04       0.04      99.69
            2002                   
    |         31       0.05       0.05      99.74
            2003                   
    |         22       0.03       0.03      99.77
            2004                   
    |         21       0.03       0.03      99.80
            2005                   
    |         21       0.03       0.03      99.83
            2006                   
    |         19       0.03       0.03      99.86
            2007                   
    |         24       0.04       0.04      99.89
            2008                   
    |         29       0.04       0.04      99.94
            2009                   
    |         27       0.04       0.04      99.98
            2010                   
    |         17       0.02       0.02     100.00
            Total                  
    |      68161     100.00     100.00           
    ---------------------------------------------------------------------------- 
    . to create my new variable, I used the following code:

    PHP Code:
    clonevar ymar1 df05a                           // ymar1 == first year of marriage 
    replace ymar1 = . if inlist(ymar1, -10, -1)
    drop if ymar1 == . 
    replace ymar1 = . if inlist(ymar1, -7, -9)
    fre ymar1 
    meaning that I drop the undesired missing observations before replacing them with the missing observations that
    I do want to keep.
    The code is written this way because Stata kept deleting all of my observations if I typed it as

    PHP Code:
    clonevar ymar1 df05a
    replace ymar1 
    = . if inlist(ymar1, -7,-9
    drop if ymar1 == -10 | -
    ... so far this is the only working way that allows me to only keep the missing values I truly need for my analysis.

    I have done this exact procedure on a bunch of variables. However, once I press execute other variables that could previously be displayed via fre or tab won't show up now, as it is the case for my education variable.

    My education variable (when using fre right after loading the dataset) will look like this:


    PHP Code:
    fre educy

    educy 
    -- YEARS OF SCHOOLING
    ---------------------------------------------------------------------------
                                  |      
    Freq.    Percent      Valid       Cum.
    ------------------------------+--------------------------------------------
    Valid   -10 NAPSCREENED OUT |         73       0.11       0.11       0.11
            
    -9  NO ANSWER         |        551       0.81       0.81       0.92
            
    -8  DO NOT KNOW       |          1       0.00       0.00       0.92
            
    -1  NOT AVAILABLE     |      58504      85.83      85.83      86.75
            4                     
    |          1       0.00       0.00      86.75
            5                     
    |         16       0.02       0.02      86.77
            6                     
    |         40       0.06       0.06      86.83
            7                     
    |        130       0.19       0.19      87.02
            8                     
    |       3038       4.46       4.46      91.48
            9                     
    |        997       1.46       1.46      92.94
            10                    
    |       2252       3.30       3.30      96.25
            11                    
    |        407       0.60       0.60      96.84
            12                    
    |        490       0.72       0.72      97.56
            13                    
    |        518       0.76       0.76      98.32
            14                    
    |        190       0.28       0.28      98.60
            15                    
    |        181       0.27       0.27      98.87
            16                    
    |        213       0.31       0.31      99.18
            17                    
    |        207       0.30       0.30      99.48
            18                    
    |        148       0.22       0.22      99.70
            19                    
    |         82       0.12       0.12      99.82
            20                    
    |         75       0.11       0.11      99.93
            21                    
    |         19       0.03       0.03      99.96
            22                    
    |         12       0.02       0.02      99.98
            23                    
    |          9       0.01       0.01      99.99
            24                    
    |          4       0.01       0.01     100.00
            25                    
    |          3       0.00       0.00     100.00
            Total                 
    |      68161     100.00     100.00           
    --------------------------------------------------------------------------- 
    ... However, after I run execute, it will now only be displayed as:

    PHP Code:
    fre educy

    educy 
    -- YEARS OF SCHOOLING
    ----------------------------------------------------------------------
                             |      
    Freq.    Percent      Valid       Cum.
    -------------------------+--------------------------------------------
    Valid   -1 NOT AVAILABLE |        101     100.00     100.00     100.00
    ---------------------------------------------------------------------- 
    meaning that I can no longer work with it.


    If anyone of you knows why this is the case, I will be very grateful for your help.
    Thank you so much and have a good evening. Sorry again for the bad formatting - absolute newby here.



  • #2
    Thanks for the details, but they don't include exactly what you did to educy. Perhaps you did something different for that variable.

    Code:
    drop if ymar1 == -10 | -1
    will not do what you want. You want it to mean

    Code:
    drop if ymar1 == -10 | ymar1 == -1
    but it does not mean that. It means

    Code:
    drop if (ymar1 == -10) | -1
    and -1 is always nonzero and therefore true by Stata's rules. https://www.stata.com/support/faqs/data-management/true-and-false/

    Code:
    drop if inlist(ymar1, -1, -10)
    would work too. You're in good company here: I've seen people guess wrongly what Stata does here (and It's not exceptional among software in parsing expressions like this).

    Comment


    • #3
      Hello Nick,

      thank you so much for your reply and your help regarding dropping the missing values correctly.
      I have attached my code below - as you will see: I have not done anything to educy besides display it using the fre command:

      Code:
      clear all
      cd C:\Users\Melis\OneDrive\Desktop\Stata\Datensätze
      use C:\Users\Melis\OneDrive\Desktop\Stata\Datensätze\ZA5276_v1-0-0.dta
      desc
      
      
      //============================================================================//
      // Datenaufbereitung - Variablenerstellung 
      //============================================================================//
      
       
      clonevar yersteskind = kh1yborn
      fre yersteskind
      replace yersteskind = . if inlist(yersteskind, -41, -10, -1) 
      drop if yersteskind == . 
      replace yersteskind = . if inlist(yersteskind, -9, -8, -7) 
      
      clonevar ymar1 = df05a // Jahr der ersten Heirat
      replace ymar1 = . if inlist(ymar1, -10, -1)
      drop if ymar1 == . 
      replace ymar1 = . if inlist(ymar1, -7, -9)
      fre ymar1 
      
      clonevar ydiv1 = df07a // Jahr der ersten Scheidung
      replace ydiv1 = . if inlist(ydiv1, -10, -1)
      drop if ydiv1 == . 
      replace ydiv1 = . if ydiv1 == -9   
      fre ydiv1 
      
      lookfor education 
      fre educy
      (Please note that the code still contains the old (and highly unsightly) way of dropping the missing values)
      There is another variable similar to
      Code:
      educy
      called
      Code:
      educ
      Somehow, educ will work fine after executing everything.


      Code:
      . desc educ
      
                    storage   display    value
      variable name   type    format     label      variable label
      -----------------------------------------------------------------------------------------
      educ            byte    %12.0g     educ       RESP: GENERAL SCHOOL LEAVING CERTIFICATE

      Code:
      . desc educy
      
                    storage   display    value
      variable name   type    format     label      variable label
      -----------------------------------------------------------------------------------------
      educy           byte    %12.0g     educy      YEARS OF SCHOOLING

      I'll now exchange the code regarding the missing values for the correct way and see if that changes anything. Thank you so much again !

      Comment


      • #4
        Thanks. The implicaiion of what you've posted is that you didn't do anything to either education variable. Please show us the results of

        Code:
        use C:\Users\Melis\OneDrive\Desktop\Stata\Datensätze\ZA5276_v1-0-0.dta
        
        tab educy  
        
        tab educ
        Last edited by Nick Cox; 23 Mar 2021, 17:27.

        Comment


        • #5
          This is what Stata produced:
          Code:
          . use C:\Users\Melis\OneDrive\Desktop\Stata\Datensätze\ZA5276_v1-0-0.dta
          (ALLBUS/GGSS Cumulation 1980-2018)
          
          . 
          . tab educy 
          
                   YEARS OF |
                  SCHOOLING |      Freq.     Percent        Cum.
          ------------------+-----------------------------------
          NAP: SCREENED OUT |         73        0.11        0.11
                  NO ANSWER |        551        0.81        0.92
                DO NOT KNOW |          1        0.00        0.92
              NOT AVAILABLE |     58,504       85.83       86.75
                          4 |          1        0.00       86.75
                          5 |         16        0.02       86.77
                          6 |         40        0.06       86.83
                          7 |        130        0.19       87.02
                          8 |      3,038        4.46       91.48
                          9 |        997        1.46       92.94
                         10 |      2,252        3.30       96.25
                         11 |        407        0.60       96.84
                         12 |        490        0.72       97.56
                         13 |        518        0.76       98.32
                         14 |        190        0.28       98.60
                         15 |        181        0.27       98.87
                         16 |        213        0.31       99.18
                         17 |        207        0.30       99.48
                         18 |        148        0.22       99.70
                         19 |         82        0.12       99.82
                         20 |         75        0.11       99.93
                         21 |         19        0.03       99.96
                         22 |         12        0.02       99.98
                         23 |          9        0.01       99.99
                         24 |          4        0.01      100.00
                         25 |          3        0.00      100.00
          ------------------+-----------------------------------
                      Total |     68,161      100.00
          
          . tab educ 
          
          RESP: GENERAL SCHOOL |
           LEAVING CERTIFICATE |      Freq.     Percent        Cum.
          ---------------------+-----------------------------------
                    DATA ERROR |          1        0.00        0.00
                     NO ANSWER |        156        0.23        0.23
                NO CERTIFICATE |      1,348        1.98        2.21
                  LOWEST LEVEL |     28,771       42.21       44.42
            INTERMEDIARY LEVEL |     20,305       29.79       74.21
          QUALI.UNIV.APPL.SCI. |      3,824        5.61       79.82
          QUALI.FOR UNIVERSITY |     12,903       18.93       98.75
          OTHER SCHOOL CERTIF. |        322        0.47       99.22
               STILL AT SCHOOL |        531        0.78      100.00
          ---------------------+-----------------------------------
                         Total |     68,161      100.00
          
          . 
          end of do-file

          Comment


          • #6
            So, they look good when you read in the dataset. It's what is happening afterwards that you need to tell us. dropping observations according to other variables could be important.

            Comment


            • #7
              That is what I am starting to suspect as well.

              Take the variable ymar1 (year of first marriage) for example. If I drop -1 and -10, using

              Code:
              clonevar ymar1 = df05a
              replace ymar1 = . if inlist(ymar1, -7, -9)
              drop if inlist(ymar1, -1, -10) 
              fre ymar1
              I get these results:


              Code:
              . drop if inlist(ymar1, -1, -10) 
              (52,941 observations deleted)
              
              . fre ymar1 
              
              ymar1 -- MARRIED: YEAR OF WEDDING, MARRIAGE 1
              -----------------------------------------------------------
                            |      Freq.    Percent      Valid       Cum.
              --------------+--------------------------------------------
              Valid   1918  |          1       0.01       0.01       0.01
                      1921  |          2       0.01       0.01       0.02
                      1922  |          1       0.01       0.01       0.03
                      1923  |          5       0.03       0.03       0.06
                      1924  |          5       0.03       0.03       0.09
                      1925  |          9       0.06       0.06       0.15
                      1926  |         10       0.07       0.07       0.22
                      1927  |         16       0.11       0.11       0.33
                      1928  |         12       0.08       0.08       0.41
                      1929  |         18       0.12       0.12       0.53
                      1930  |         21       0.14       0.14       0.67
                      1931  |         28       0.18       0.19       0.85
                      1932  |         41       0.27       0.27       1.13
                      1933  |         43       0.28       0.29       1.41
                      1934  |         55       0.36       0.37       1.78
                      1935  |         54       0.35       0.36       2.14
                      1936  |         60       0.39       0.40       2.54
                      1937  |         79       0.52       0.53       3.07
                      1938  |         79       0.52       0.53       3.59
                      1939  |         96       0.63       0.64       4.23
                      :     |          :          :          :          :
                      1991  |         75       0.49       0.50      94.63
                      1992  |         72       0.47       0.48      95.11
                      1993  |         64       0.42       0.43      95.53
                      1994  |         63       0.41       0.42      95.95
                      1995  |         64       0.42       0.43      96.38
                      1996  |         61       0.40       0.41      96.79
                      1997  |         71       0.47       0.47      97.26
                      1998  |         68       0.45       0.45      97.71
                      1999  |         64       0.42       0.43      98.14
                      2000  |         42       0.28       0.28      98.42
                      2001  |         26       0.17       0.17      98.59
                      2002  |         31       0.20       0.21      98.80
                      2003  |         22       0.14       0.15      98.95
                      2004  |         21       0.14       0.14      99.09
                      2005  |         21       0.14       0.14      99.23
                      2006  |         19       0.12       0.13      99.35
                      2007  |         24       0.16       0.16      99.51
                      2008  |         29       0.19       0.19      99.71
                      2009  |         27       0.18       0.18      99.89
                      2010  |         17       0.11       0.11     100.00
                      Total |      15002      98.57     100.00           
              Missing .     |        218       1.43                      
              Total         |      15220     100.00                      
              -----------------------------------------------------------
              
              .
              please notice the total number of observations, which is 15002.

              if I now go ahead and do the same for the first year of divorce using

              Code:
              clonevar ydiv1 = df07a 
              replace ydiv1 = . if inlist(ydiv1, -9)  
              drop if inlist(ydiv1, -1, -10) 
              fre ydiv1
              The total number of observations for ymar has drastically reduced:

              Code:
              . fre ydiv1 
              
              ydiv1 -- MARRIED: YEAR OF DIVORCE, MARRIAGE 1
              ------------------------------------------------------------------
                                   |      Freq.    Percent      Valid       Cum.
              ---------------------+--------------------------------------------
              Valid   -7   REFUSED |          7       0.74       0.84       0.84
                      1932         |          1       0.11       0.12       0.96
                      1936         |          1       0.11       0.12       1.08
                      1938         |          2       0.21       0.24       1.32
                      1939         |          4       0.42       0.48       1.80
                      1940         |          4       0.42       0.48       2.28
                      1941         |          2       0.21       0.24       2.52
                      1942         |          1       0.11       0.12       2.64
                      1944         |          2       0.21       0.24       2.88
                      1945         |          5       0.53       0.60       3.48
                      1946         |          5       0.53       0.60       4.08
                      1947         |         10       1.06       1.20       5.28
                      1948         |          8       0.85       0.96       6.24
                      1949         |          3       0.32       0.36       6.60
                      1950         |          8       0.85       0.96       7.56
                      1951         |         11       1.17       1.32       8.88
                      1952         |          9       0.95       1.08       9.96
                      1953         |          4       0.42       0.48      10.44
                      1954         |         12       1.27       1.44      11.88
                      1955         |          5       0.53       0.60      12.48
                      :            |          :          :          :          :
                      1989         |         16       1.69       1.92      86.91
                      1990         |         11       1.17       1.32      88.24
                      1991         |          9       0.95       1.08      89.32
                      1992         |          8       0.85       0.96      90.28
                      1993         |          7       0.74       0.84      91.12
                      1994         |          3       0.32       0.36      91.48
                      1995         |          5       0.53       0.60      92.08
                      1996         |         11       1.17       1.32      93.40
                      1997         |          3       0.32       0.36      93.76
                      1998         |          7       0.74       0.84      94.60
                      1999         |          7       0.74       0.84      95.44
                      2000         |          7       0.74       0.84      96.28
                      2001         |          6       0.64       0.72      97.00
                      2002         |          4       0.42       0.48      97.48
                      2003         |          1       0.11       0.12      97.60
                      2004         |          5       0.53       0.60      98.20
                      2005         |          7       0.74       0.84      99.04
                      2006         |          3       0.32       0.36      99.40
                      2007         |          3       0.32       0.36      99.76
                      2008         |          2       0.21       0.24     100.00
                      Total        |        833      88.24     100.00           
              Missing .            |        111      11.76                      
              Total                |        944     100.00                      
              ------------------------------------------------------------------
              
              . fre ymar1 
              
              ymar1 -- MARRIED: YEAR OF WEDDING, MARRIAGE 1
              -----------------------------------------------------------
                            |      Freq.    Percent      Valid       Cum.
              --------------+--------------------------------------------
              Valid   1926  |          1       0.11       0.12       0.12
                      1927  |          3       0.32       0.35       0.47
                      1930  |          1       0.11       0.12       0.59
                      1931  |          2       0.21       0.24       0.82
                      1932  |          2       0.21       0.24       1.06
                      1933  |          3       0.32       0.35       1.41
                      1934  |          1       0.11       0.12       1.53
                      1935  |          1       0.11       0.12       1.65
                      1936  |          4       0.42       0.47       2.12
                      1937  |          8       0.85       0.94       3.06
                      1938  |          8       0.85       0.94       4.00
                      1939  |          3       0.32       0.35       4.35
                      1940  |          9       0.95       1.06       5.41
                      1941  |          6       0.64       0.71       6.12
                      1942  |          9       0.95       1.06       7.18
                      1943  |          2       0.21       0.24       7.41
                      1944  |         12       1.27       1.41       8.82
                      1945  |          3       0.32       0.35       9.18
                      1946  |         10       1.06       1.18      10.35
                      1947  |         14       1.48       1.65      12.00
                      :     |          :          :          :          :
                      1982  |         18       1.91       2.12      89.41
                      1983  |          7       0.74       0.82      90.24
                      1984  |         10       1.06       1.18      91.41
                      1985  |         11       1.17       1.29      92.71
                      1986  |          7       0.74       0.82      93.53
                      1987  |          6       0.64       0.71      94.24
                      1988  |          8       0.85       0.94      95.18
                      1989  |          4       0.42       0.47      95.65
                      1990  |          5       0.53       0.59      96.24
                      1991  |          4       0.42       0.47      96.71
                      1992  |          3       0.32       0.35      97.06
                      1993  |          1       0.11       0.12      97.18
                      1994  |          3       0.32       0.35      97.53
                      1995  |          4       0.42       0.47      98.00
                      1996  |          1       0.11       0.12      98.12
                      1997  |          2       0.21       0.24      98.35
                      1998  |          2       0.21       0.24      98.59
                      1999  |          5       0.53       0.59      99.18
                      2000  |          5       0.53       0.59      99.76
                      2001  |          2       0.21       0.24     100.00
                      Total |        850      90.04     100.00           
              Missing .     |         94       9.96                      
              Total         |        944     100.00                      
              -----------------------------------------------------------
              ... from 15002 observations to just 850.
              I take it that the problem probably lies with the dropping as with each variable I drop -1 and -10, the other variables loose observations as well until there is virtually nothing left at one point. I tried rephrasing the drop to my original method, but the problem stayed the same. I then suspected that it might have something to do with the "-1", but even replacing -1 as a random number and dropping said number instead hasn't changed the outcome

              I don't really understand why this is happening or how I may work around it.


              Thank you so much for your patience - as a university student at her wit's end, I really appreciate it.

              Comment


              • #8
                Sorry too, because as a more experienced user it wasn't occurring to me that you were misunderstanding how drop works. (That is factual and not intended to be patronising.)

                drop variable

                drops an entire variable (column, if you wish) and has no consequences for any other variables. On the other hand

                drop

                with a condition on a variable's values drops entire observations (rows, if you wish) and so removes values for other variables too.

                Your dataset size is moderate by present standards and you are unlikely to have memory or speed issues, so on the face of it you don't need to drop anything. You should focus on cleaning up codes using recode or some such command.

                Comment


                • #9
                  Hello Nick,
                  please excuse my late reply!
                  I see now, thank you so much, I've now perfectly understood why Stata did what it did. In this case, I'll keep the missings, even if my stsum sadly only shows missing values for the duration variable.
                  Thanks again for clearing everything up and have a nice evening,

                  Melissa

                  Comment

                  Working...
                  X