subtracting last from first rows

Don Spady

Join Date: Mar 2014

Posts: 47
#1

subtracting last from first rows

14 Mar 2015, 10:31

I have a dataset that has repeated values of a variable measured at different times as indicated by a variable 'seqsij' which takes the values of 1,2,3...N for each new value of a variable sij_sp within a given ID. Each value of seqsij and sij_sp constitues a new row. I want to subtract the value of the LAST measure of sij_sp from the FIRST measure of sij_sp.

I tried the codes suggested by Nick Cox in response to a query by Roman Mostazir on 02Mar 2015; i.e.

bysort id (pre) : gen change = kimsob[_n] - kimsob[1]

but my code is

#

Code:

bysort id (seqsij) :gen sijdif = sij_sp[_N] - sij_sp[1] #

Code:

When I do this I get all missing values.

I also tried the suggestion of Klaudia Erhardt in response to the same query where Klaudia used
bysort id: gen change = cond(kinsob != kinsob[_n-1], kinsob - kinsob[_n-1],.)

but my code was
#

Code:

bysort id (seqsij): gen sijdif =cond(sij_sp != sij_sp[_n-1], sij_sp - sij_sp[_n-1],.) #

and this worked, but it does not give the FIRST - LAST value; it just provides the value of the difference between the current value and the immediately preceding one.

If I try

#

Code:

bysort id (seqsij sij_sp): gen sijdif2 = cond(sij_sp[1]!= sij_sp[_N],sij_sp[1] - sij_sp[_N],.) #

I get all missing values.

Where is my error?

Many thanks.

Don
Tags: None
Nick Cox

Join Date: Mar 2014

Posts: 35724
#2

14 Mar 2015, 10:57

The allusion is to http://www.statalist.org/forums/foru...acting-by-rows where I suggested using _N (not _n) .

Let's start from the top and assume an identifier, time variable and outcome variable id time y. Then

Code:

bysort id (time) : gen change = y[_N] - y[1]

would be change from first to last except that if either of those values of y were missing, then so too would the change be missing. If you have missings, you would need to segregate them or otherwise ignore them. One way to do that would be

Code:

gen ispresent = !missing(y) bysort ispresent id (time) : gen change = y[_N] - y[1] bysort id (change) : replace change = change[1]

As I understand it, the equivalent for your code would be

Code:

gen ispresent = !missing(y) bysort ispresent id (seqsij) : gen sijdif = sij_sp[_N] - sig_sp[1] bysort id (seqsij) : replace sijdif = sijdif[1]
Comment
Don Spady

Join Date: Mar 2014

Posts: 47
#3

14 Mar 2015, 11:21

There is an abundance of missing data for each variable in the dataset, save for id and time (which I call 'edate'); however, for the variables seqsij and sij_sp there are no missing data. Any given id may have 'n' rows of data; seqsij and sij_sp may be present in only 3 or 4 of these rows, but for every seqsij value there is a corresponding sij_sp value and my interpretation of _N is that it reflects the last value of seqsij for that id, not the last observation for the ID; hence the need to bysort id (seqsij).

In such an instance, do I need the 'missing' step.
At any rate, my time variable is called edate. So I tried

#

Code:

gen ispresent = !missing(edate) bysort ispresent id (seqsij) : gen sijdif = sij_sp[_N] - sij[_sp[1] bysort id (seqsij) : replace sijdif = sijdif[1] #

When I did the first bysort command, and then did a sum sijdif, I got all missing values
The second bysort command had 0 changes made.

Don
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#4

14 Mar 2015, 11:45

You say that

1. For seqsij and sij_sp there are no missing data.

and also that

2. seqsij and sij_sp may be present in only 3 or 4 of the rows, meaning observations, for each id.

On the face of it #2 contradicts #1 and would explain missing values.

Missing naturally, means empty for string variables and system missing . or higher for numeric variables.

Presumably these data are confidential but the lack of a concrete example is most frustrating here. Is it possible to post
the results of

Code:

describe seqsij and sij_sp gen nmissing = missing(seqsij) + missing(sij_sp) tab nmissing

?

Last edited by Nick Cox; 14 Mar 2015, 12:02.
Comment

Don Spady

Join Date: Mar 2014
Posts: 47

14 Mar 2015, 12:06

Nick
I did what you suggested and got values of
nmissing 0 = 550
nmissing 2 = 5648
Meaning that there are far more missing than non-missing.

There is nothing overly confidential about the data. Here is a snippet.

#

Code:

 id
edate
seqsij
sij_sp

4
11-Feb-03
1
0

4
27-Jan-05
2
1

4
31-Aug-06
3
5

4
11-Sep-07
4
2

4
31-Jul-03
.
.

6
16-Nov-05
.
.

6
11-May-05
.
.

7
16-Jun-05
1
0

7
20-Dec-05
2
2

7
1-Jun-07
3
5

7
29-Aug-07
.
.

7
12-Feb-07
.
.

7
8-Jan-03
.
.

7
4-Feb-05
.
.

8
29-Nov-04
1
1

8
29-Dec-05
2
2

8
11-Nov-07
3
4

8
24-Apr-02
.
.


  #

Don

Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

14 Mar 2015, 12:32

Thanks. What I suggested works, but you really do need to do work with missings on seqsij not edate. (There is a stray typo in the last code block of my #2, but you translated to one of your variables, just the wrong one.)

Code:

. clear

. input id str9 edate     seqsij  sij_sp

            id      edate     seqsij     sij_sp
  1. 4       "11-Feb-03"     1       0
  2. 4       "27-Jan-05"     2       1
  3. 4       "31-Aug-06"     3       5
  4. 4       "11-Sep-07"     4       2
  5. 4       "31-Jul-03"     .       .
  6. 6       "16-Nov-05"     .       .
  7. 6       "11-May-05"     .       .
  8. 7       "16-Jun-05"     1       0
  9. 7       "20-Dec-05"     2       2
 10. 7       "1-Jun-07"      3       5
 11. 7       "29-Aug-07"     .       .
 12. 7       "12-Feb-07"     .       .
 13. 7       "8-Jan-03"      .       .
 14. 7       "4-Feb-05"      .       .
 15. 8       "29-Nov-04"     1       1
 16. 8       "29-Dec-05"     2       2
 17. 8       "11-Nov-07"     3       4
 18. 8       "24-Apr-02"     .       .
 19. end

.
. gen ispresent = !missing(seqsij)

. bysort ispresent id (seqsij) : gen diff = sij_sp[_N] - sij_sp[1]
(8 missing values generated)

. bysort id (seqsij) : replace diff = diff[1]
(6 real changes made)

.
. list , sepby(id)

     +----------------------------------------------------+
     | id       edate   seqsij   sij_sp   ispres~t   diff |
     |----------------------------------------------------|
  1. |  4   11-Feb-03        1        0          1      2 |
  2. |  4   27-Jan-05        2        1          1      2 |
  3. |  4   31-Aug-06        3        5          1      2 |
  4. |  4   11-Sep-07        4        2          1      2 |
  5. |  4   31-Jul-03        .        .          0      2 |
     |----------------------------------------------------|
  6. |  6   16-Nov-05        .        .          0      . |
  7. |  6   11-May-05        .        .          0      . |
     |----------------------------------------------------|
  8. |  7   16-Jun-05        1        0          1      5 |
  9. |  7   20-Dec-05        2        2          1      5 |
 10. |  7    1-Jun-07        3        5          1      5 |
 11. |  7   29-Aug-07        .        .          0      5 |
 12. |  7    8-Jan-03        .        .          0      5 |
 13. |  7    4-Feb-05        .        .          0      5 |
 14. |  7   12-Feb-07        .        .          0      5 |
     |----------------------------------------------------|
 15. |  8   29-Nov-04        1        1          1      3 |
 16. |  8   29-Dec-05        2        2          1      3 |
 17. |  8   11-Nov-07        3        4          1      3 |
 18. |  8   24-Apr-02        .        .          0      3 |
     +----------------------------------------------------+

Last edited by Nick Cox; 14 Mar 2015, 12:36.

Comment

Don Spady

Join Date: Mar 2014

Posts: 47
#7

14 Mar 2015, 12:48

Nick
It works just the way I want. MANY thanks.
Next time I will post sample data as a matter of course.
Don
Comment

Announcement