Creation of a longitudinal dataset from an original messy dataset

Michael Duarte Goncalves

Join Date: Oct 2022
Posts: 500

Creation of a longitudinal dataset from an original messy dataset

30 Apr 2024, 04:18

Hi everyone,

I would like to create a longitudinal dataset from an uncleaned dataset, and I don't know where to start. I need help, please. Basically, my dataset looks like:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str28 municipality_1 str32 municipality_2 str28 municipality_3 str32 municipality_4 str17 power_mw str7 power_kw str29 renewables_type int(publication_date_boe_ddmmyyyy boe_resolution_date_ddmmyyyy)
"Yebes"               " Horche"               ""                     ""                 "55"     "" "Solar" 23146 23120
"Yebes"               " Horche"               ""                     ""                 "67.5"   "" "Solar" 23145 23120
"Yepes"               " Numancia de La Sagra" " Pantoja"             " Añover de Tajo" "242.35" "" "Solar" 23162 23121
"Yunquera de Henares" " Fontanar"             " Tórtola de Henares" " Guadalajara"     "42,5"   "" "Solar" 23176 23166
"Zaragoza"            " Teruel"               " Tarragona"           "Barcelona."       "38.2"   "" "Solar" 23224 23204
end
format %td publication_date_boe_ddmmyyyy
format %td boe_resolution_date_ddmmyyyy

I have some variables representing each municipality involved in a renewable construction project in Spain, from -municipality_1- to -municipality_26-. Some projects could contain 1 or more municipalities involved. I have the full dates of publication of the status of the project (-publication_date_boe_ddmmyyyy-), and the decision (-boe_resolution_date_ddmmyyyy-).

Here's what I'd ideally like to have. This is an example and does not necessarily reflect the reality given above:

municipality_1	quarter	new_installations_dummy	cum_new_installations_dummy	new_capacity_installed_mw	cumul_new_capacity_installed_mw	new_capacity_installed_kw	cumul_new_capacity_installed_kw	renewables_type (encoded)	count_other_municipalities_wo_first	municipality_2	municipality_3
"Yebes"	2018q1	0	0	.	.	.	.	.	0
"Yebes"	2018q2	0	0	.	.	.	.	.	0
"Yebes"	2018q3	1	1	62.55	62.55	.	.	Solar	1	"Barcelona"
"Yebes"	2018q4	0	1	.	.	.	.	.	0
"Yebes"	2019q1	0	1	.	.	.	.	.	0
"Yebes"	2019q2	1	2	350.10	412.65	.	.	Solar	2	"Zaragoza"	"Teruel"
"Yebes"	2019q3	0	2	.	.	.	.	.	0
"Yebes"	2019q4	0	2	.	.	.	.	.	0
...	...	...	...	...	...	...	...	...	...	...	...

I'd like to have the same number of quarters. In other words, I'd like the time variable for each municipality to start at 2017q1 up to 2023q4, even if there are a lot of 0s and missing values. Is this possible?
Otherwise, is it possible to have as many rows as quarters?

Thank you very much in advance for your help and time!

Best regards,

Michael

Last edited by Michael Duarte Goncalves; 30 Apr 2024, 05:13.

Tags: None

Devra Golbe

Join Date: Apr 2014
Posts: 170

30 Apr 2024, 10:11

Does this do what you want?. The code assumes you want to assign observations to quarters using the publication date rather than the resolution date.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input str28 municipality_1 str32 municipality_2 str28 municipality_3 str32 municipality_4 str17 power_mw str7 power_kw str29 renewables_type int(publication_date_boe_ddmmyyyy boe_resolution_date_ddmmyyyy)
"Yebes"               " Horche"               ""                     ""                 "55"     "" "Solar" 23146 23120
"Yebes"               " Horche"               ""                     ""                 "67.5"   "" "Solar" 23145 23120
"Yepes"               " Numancia de La Sagra" " Pantoja"             " Añover de Tajo" "242.35" "" "Solar" 23162 23121
"Yunquera de Henares" " Fontanar"             " Tórtola de Henares" " Guadalajara"     "42,5"   "" "Solar" 23176 23166
"Zaragoza"            " Teruel"               " Tarragona"           "Barcelona."       "38.2"   "" "Solar" 23224 23204
end
format %td publication_date_boe_ddmmyyyy
format %td boe_resolution_date_ddmmyyyy

*Generate the quarterly date from the publication date
gen qtr=qofd(publication)
format qtr %tq

encode municipality_1, gen(muni_1)

tempfile raw
save `raw'
keep muni_1
duplicates drop
expand 28
bysort muni_1 : gen qtr = yq(2016, 4) + _n
format qtr %tq

tsset muni_1 qtr

*merge your original data back
merge 1:m muni_1 qtr using `raw'

list if municipality_1 !=""

Devra Golbe
Professor Emerita, Dept. of Economics
Hunter College, CUNY

Comment

Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#3

06 May 2024, 01:49

Good morning Professor Devra Golbe,

Sorry for the late reply.
I had a hard time finding this. You solved my problem in one post.

Thank you very much for your suggestion! It is exactly what I was looking for.

Kind regards,

Michael
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#4

06 May 2024, 04:41

Could anyone explain me why in post #2 expand 28 is used, please?

I understand that this command replaces each observation with 28 copies of the observation (so the first one is retained, and 27 new observations are created). But why exactly 28?
Is there a reason for that?

Thank you in advance for your help!

All the best,
Michael

Edit: Could it be that Prof.Golbe uses expand 28 since there are 28 quarters between 2017q1 and 2023q4?

Last edited by Michael Duarte Goncalves; 06 May 2024, 04:49.
Comment
Devra Golbe

Join Date: Apr 2014

Posts: 170
#5

06 May 2024, 06:26

Yes indeed- 28 quarters between 2017q1 and 2023q4. I should have commented that line to explain why I picked that number.

Devra Golbe
Professor Emerita, Dept. of Economics
Hunter College, CUNY
1 like
Comment
Michael Duarte Goncalves

Join Date: Oct 2022

Posts: 500
#6

06 May 2024, 06:36

Ok, I understand now.

Thank you so much for your answer, Prof. Devra Golbe.
Lovely day.

Michael
Comment

Announcement

Creation of a longitudinal dataset from an original messy dataset

Comment

Comment

Comment

Comment

Comment