Identify first and last occurrences in different categories of a variable

Hanne Ehrlinder

Join Date: Apr 2021

Posts: 3
#1

Identify first and last occurrences in different categories of a variable

15 Apr 2021, 13:31

Hi,

I am in the process of cleaning health care data received from a drug registry and would be so grateful to get help with the following:

The data is panel data consisting of the identifier variable "StudyID", a categorical variable "ATC" constituting 9 different drug regimens and the date variable "EDATUM" which is the date that a patient collects a certain drug at the pharmacy. There is one row for every EDATUM and since most patients collect a drug multiple times and also switches between different drugs during our follow-up, the data file now consists of appx 100 000 rows.

I want to keep only the first and the last EDATUM for every drug that a patient uses. If a patient switch to a new drug and then switch back to the prior used drug, I want to keep the first and the last EDATUM for both the times the drug was used.

I will further convert the data into wide format but since the data now consists of (too) many rows per patient, I believe it is better to start with cleaning out the EDATUMS that are not of interest (all EDATUMS in-between the first and the last of every used drug).

ID ATC EDATUM

1 A 2010-01-01

1 A 2010-02-02

1 A 2010-03-03

1 B 2010-04-04

1 B 2010-05-05

1 B 2010-06-06

1 A 2010-07-07

1 A 2010-08-08

1 A 2010-09-09

Thank you in advance!

Kind regards,
Hanne

Last edited by Hanne Ehrlinder; 15 Apr 2021, 13:46.
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30147
#2

15 Apr 2021, 14:08

Code:

* Example generated by -dataex-. For more info, type help dataex clear input byte id str2 atc str10 edatum 1 "A " "2010-01-01" 1 "A " "2010-02-02" 1 "A " "2010-03-03" 1 "B " "2010-04-04" 1 "B " "2010-05-05" 1 "B " "2010-06-06" 1 "A " "2010-07-07" 1 "A " "2010-08-08" 1 "A " "2010-09-09" end gen date = daily(edatum, "YMD") assert missing(date) == missing(edatum) format date %td by id (date), sort: gen spell = sum(atc != atc[_n-1]) by id spell (date), sort: gen first_edatum = date[1] by id spell (date): gen last_edatum = date[_N] by id spell: keep if _n == 1 drop edatum date format *edatum %td

Because of the way you posted your example data, I cannot tell if your edatum variable is a Stata internal format numeric date, or if it is a string. I include the code to change it to string, but, evidently, skip that part if it is already a correct date variable, and just -rename- it date before the rest of the code. (By the way, with the examples given, I cannot tell if your edatum variable is YMD or YDM. I wroute the code assuming YMD--change it if necessary.) None of these ambiguities would have arisen had the example data been posted using the -dataex- command.

In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35754
#3

15 Apr 2021, 17:01

See also https://www.stata.com/support/faqs/d...t-occurrences/
Comment
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#4

15 Apr 2021, 19:16

See also the "First and last" section in Nick Cox's -egenmore- collection of user contributed egen functions. They do what Clyde does with explicit subscripting.
Comment
Hanne Ehrlinder

Join Date: Apr 2021

Posts: 3
#5

17 Apr 2021, 13:55

Thank you all of you for your help!

As you Clyde Schechter wrote, my example was unclear regarding the date format and I apologize for not using dataex, I will from now on.

This is a dataex showing one StudyID (LG692) with the variables ATC and EDATUM:

----------------------- copy starting from the next line -----------------------

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input str10 StudyID str7 ATC str10 EDATUM " LG692" "B01AC06" " 8/23/2016" " LG692" "B01AC06" " 2/3/2020" " LG692" "B01AC06" " 2/19/2018" " LG692" "B01AC06" " 7/20/2012" " LG692" "B01AC06" "11/11/2019" " LG692" "B01AC06" " 3/2/2020" " LG692" "B01AC06" " 11/7/2012" end

------------------ copy up to and including the previous line -----------------

The variable EDATUM is a str10 in the format %10s and displayed as MDY. Ex. "9/25/2017".

It seems like I need to convert it to YMD, that you wrote that you assumed it was, for your command to create the first_EDATUM and last_EDATUM variables to be correct.

Could you help me here? When I run this command:

gen date = daily(edatum, "YMD") assert missing(date) == missing(edatum) format date %td it says "91,738 missing values generated" after the first line and "assertion is false" after the second line. When I browse, there are no values in "date".

How can I convert the EDATUM variable (that is a str10 displayed as MDY) to the correct date format (which I believe is YMD)?

NB, the dataex above does not include the issue of identifying the first and last occurrences for different ATCs when a person switches back to a regimen that was already used, this dataex was only to show the format of the variables. I believe your suggestion of command above to create the "first and last variables" works just fine as long as the date variable is correct.

Thank you, I very much appreciate your help!

//Hanne
Comment
Joro Kolev

Join Date: Aug 2018

Posts: 3050
#6

18 Apr 2021, 00:39

Try :

Code:

. gen date = daily( EDATUM , "MDY") . format date %td
Comment
Hanne Ehrlinder

Join Date: Apr 2021

Posts: 3
#7

20 Apr 2021, 13:25

Thank you Joro Kolev, it worked with the date command! Thereafter Clyde Schecter's suggestion of command worked perfectly!

I am so grateful for the help from all of you! Now, my file of drug registry data is cleaned and well structured and ready to be analyzed. Thank you!

Kind regards,
Hanne
Comment

ID	ATC	EDATUM
1	A	2010-01-01
1	A	2010-02-02
1	A	2010-03-03
1	B	2010-04-04
1	B	2010-05-05
1	B	2010-06-06
1	A	2010-07-07
1	A	2010-08-08
1	A	2010-09-09

Announcement

Identify first and last occurrences in different categories of a variable

Comment

Comment

Comment

Comment

Comment

Comment