Expanding the observations as much as the number of months between start month and last month

James Park

Join Date: May 2017
Posts: 97

Expanding the observations as much as the number of months between start month and last month

06 Nov 2023, 22:22

No the Start and End might be in different years. Then simply adding 1 will not work.

Company Start End
A 2018m12 2019m02
B 2016m12 2017m03

Using stata, from the above I want to make the below data.

Company Month
A 2018m12
A 2019m01
A 2019m02
B 2016m12
B 2017m01
B 2017m02
B 2017m03

Code:

clear
input str6 Company Start End
"A" "2018m12" "2019m02"
"B" "2016m12" "2017m03"
end

ChatGPT's solution is this, but it seems it doesn't use forval correctly, but maybe this provides a hint.

Code:

clear

* Create your original dataset
input str1 Company str7 Start str7 End
A 2018m12 2019m02
end

* Extract the year and month components from Start and End
gen start_year = year(monthly(Start, "Ym"))
gen start_month = month(monthly(Start, "Ym"))
gen end_year = year(monthly(End, "Ym"))
gen end_month = month(monthly(End, "Ym"))

* Initialize a counter variable
gen counter = 1

* Create an empty dataset to hold the results
tempfile expanded_data
save "`expanded_data'"

* Loop through the years and months, and generate Month variable
forval year = start_year(1)end_year {
    local end_month = 12
    if `year' == `end_year' {
        local end_month = `end_month'
    }
    forval month = start_month(1)`end_month' {
        gen Month = "`year'm0`month'"
        save "`expanded_data'", append
        local counter = `counter' + 1
    }
}

* Merge the expanded data with the original dataset to get the Company variable
use "`expanded_data'", clear
sort Month
gen counter = _n
sort Company counter
drop counter
merge 1:1 counter Company using original_dataset
drop counter

* Display the result
list Company Month

Tags: data cleaning, data management, panel

Wei Hai Deng

Join Date: Nov 2023

Posts: 35
#2

07 Nov 2023, 03:07

Hi James,

ChatGPT rarely has anything useful except for running a naive regression or summing some variables together. Some sort of STATA co-pilot would be awesome but alas we shall have to wait.

From your example, it looks like you're essentially converting your data from wide to long format. Three lines of code is enough:

Code:

*** Generate data *** clear input str1 Company str7 Start str7 End "A" "2018m12" "2019m02" "B" "2016m12" "2017m03" end *** Generate new variables *** encode Company, gen(Company2) // encode to numeric gen date1 = monthly(Start,"YM") // gen date variable gen date2 = monthly(End,"YM") format date1 date2 %tm reshape long date, i(Company2) j(tmp) // wide to long tsset Company2 date // declare as timeseries with Company2 as id tsfill // fill gaps in timeseries drop Company Start End tmp

First is just converting your string variables to something Stata can interpret. I named the date variables date1 and date2 so reshape can easily transform these to new rows, tsset as timeseries with "date" as your time variables, fill gaps with tsfill.

Alternatively, but messier, generate your own long format with expand:

Code:

gen datediff = date2 - date1 // difference in months expand = datediff sort Company date1

Then bysort Company (Start): Start = Start +1 or something, but haven't thought this one through as I think my first solution should be sufficient.

Best,
Wei
1 like
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35754

07 Nov 2023, 03:16

The problem here seems to start with your data example, which won't work as desired: your input statement must flag that the dates are string variables and further that they are at least str7.

Once that is fixed, the task is standard and can be achieved more directly.

Code:

clear
input str6 Company str7 (Start End) 
"A" "2018m12" "2019m02"
"B" "2016m12" "2017m03"
end

foreach v in Start End {
    
    gen `v'_Date = monthly(`v', "YM")
    
}

expand End_Date-Start_Date+1 

bysort Company : gen Date = Start_Date + _n - 1 
format *Date %tm 

list, sepby(Company)
    +-------------------------------------------------------------+
     | Company     Start       End   Start_~e   End_Date      Date |
     |-------------------------------------------------------------|
  1. |       A   2018m12   2019m02    2018m12     2019m2   2018m12 |
  2. |       A   2018m12   2019m02    2018m12     2019m2    2019m1 |
  3. |       A   2018m12   2019m02    2018m12     2019m2    2019m2 |
     |-------------------------------------------------------------|
  4. |       B   2016m12   2017m03    2016m12     2017m3   2016m12 |
  5. |       B   2016m12   2017m03    2016m12     2017m3    2017m1 |
  6. |       B   2016m12   2017m03    2016m12     2017m3    2017m2 |
  7. |       B   2016m12   2017m03    2016m12     2017m3    2017m3 |
     +-------------------------------------------------------------+

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35754
#4

07 Nov 2023, 04:08

#2 wasn’t visible to me when I was writing.
1 like
Comment
Wei Hai Deng

Join Date: Nov 2023

Posts: 35
#5

07 Nov 2023, 04:54

Seems we wrote our answers at the same time. Interestingly, you following through with the expand/bysort. Thank you Nick for teaching me about including expressions straight in expand. Guess it depends on what your goal is James. expand duplicates your observations for _N observations, tsfill creates blanks for all variables except Company and date.
Comment

Announcement

Expanding the observations as much as the number of months between start month and last month

Comment

Comment

Comment

Comment