Create a variable for every new observation and calculate the percentage for the new variables

Patrick Nussmann

Join Date: Mar 2017
Posts: 2

Create a variable for every new observation and calculate the percentage for the new variables

29 Mar 2017, 10:20

Hello,

I am doing some analysis in the US airline industry. At the moment I am trying to compute in which countries the airline fly (in percentage of the Airtime for each Airline in each Quarter), depending on the airtime. Therefore I have a table like this:

Airtime	Airline_ID	Origin	Destination	Year	Quarter
60	1	US	CA	2016	1
30	1	US	US	2016	1
30	1	CA	US	2016	1
60	2	US	JAM	2016	1
30	2	CA	US	2016	2

The Airline always start or lands in the US. Now I want to create for each country a new Variable and compute the percentage of airtime for each. If it is a international flight each country accounts for the half Airtime. So the result should be something like this:

Airtime	Airline_ID	Origin	Destination	Year	Quarter	Sum_Airtime	US	CA	JAM
60	1	US	CA	2016	1	120	0.25	0.25
30	1	US	US	2016	1	120	0.25
30	1	CA	US	2016	1	120	0.125	0.125
60	2	US	JAM	2016	1	60	0.5		0.5
30	2	CA	US	2016	2	30	0.5	0.5

After this there have to be one final step. Every Airline should only have one row per Quarter

Airline_ID	Year	Quarter	Sum_Airtime	US	CA	JAM
1	2016	1	120	0.625	0.375
2	2016	1	60	0.5		0.5
2	2016	2	30	0.5	0.5

The real table contains around 400,000 observations and probably every country there is in the world. I hope someone of you have a solution for my problem and sorry if my English isn´t the best =)

Thanks

Tags: None

William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

29 Mar 2017, 13:18

Welcome to Statalist, Patrick.

Here is some sample code to start you on your way.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte(airtime airline_id) str2 origin str3 destination int year byte quarter 60 1 "US" "CA" 2016 1 30 1 "US" "US" 2016 1 30 1 "CA" "US" 2016 1 60 2 "US" "JAM" 2016 1 30 2 "CA" "US" 2016 2 end generate yrqtr = yq(year,quarter) format yrqtr %tq drop year quarter bysort airline_id yrqtr: egen sum_airtime = total(airtime) rename (origin destination) (ctryorig ctrydest) generate timeorig = airtime/2 generate timedest = airtime/2 generate id = _n reshape long ctry time, i(id) j(od) string generate pcttime = time/sum_airtime collapse (max) sum_airtime (sum) pcttime, by(airline_id yrqtr ctry) order airline_id yrqtr sum_airtime

Code:

. list, sepby(airline_id yrqtr) +-----------------------------------------------+ | airlin~d yrqtr sum_ai~e ctry pcttime | |-----------------------------------------------| 1. | 1 2016q1 120 CA .375 | 2. | 1 2016q1 120 US .625 | |-----------------------------------------------| 3. | 2 2016q1 60 JAM .5 | 4. | 2 2016q1 60 US .5 | |-----------------------------------------------| 5. | 2 2016q2 30 CA .5 | 6. | 2 2016q2 30 US .5 | +-----------------------------------------------+

You will note some differences from what you requested.

1. I combined your year and quarter variables into a Stata Internal Format quarterly variable. For your analyses, you will generally want a single variable for your date, not a pair. If you are unfamiliar with Stata dates and times, Chapter 24 (Working with dates and times) of the Stata User's Guide PDF is well written and has lots of examples. But it is, of course, impossible to remember all the details for long. Everyone who uses Stata with any regularity needs to read this chapter, and probably re-read it periodically as well. Fortunately help datetime is also very well organized and has lots of internal links to help you quickly track down the right function. So if you are familiar with the general concept and have read the manual chapter a few times, most of the time you can find what you need in yq function.

2. I have a separate observation for each airline_id/year/quarter/country rather than string all the countries out onto one observation. That would be what is called a wide layout of the data; what I have provided is a long layout of the data. The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data. I leave it as an exercise for you to use the documentation in help reshape to learn how to use the reshape wide command to make a wide layout if you insist.

3. You will note that I have used the user-written dataex command to present your data in a form that Stata can read in. Luckily, your data was reasonably straightforward, but in general, data presented with dataex is easier for others to use to help you. Before your next post, I recommend you review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post, looking especially at sections 9-12 on how to best pose your question, including the use of code blocks for posting code and results and dataex for presenting data.
1 like
Comment
Patrick Nussmann

Join Date: Mar 2017

Posts: 2
#3

30 Mar 2017, 01:58

Hello William,

that was a very helpful answer and a very valuable lesson for me and my Stata skills, thank you so much. It took me a while to fully understand what you have done but it works better than my first idea. And thanks for the hint with the variable for date. This will definitely become very useful in further research.

However, I have one final question to your code. in your line:
collapse (max) sum_airtime (sum) pcttime, by(airline_id yrqtr ctry)
Why do you use (max). Isn´t it unnecessary in this case, because the sum_airtime (by (airline_id yrqtr ctry)) have to be the same?
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#4

30 Mar 2017, 09:04

Hello Patrick -

There is no good reason for using (max) sum_airtime rather than (mean) sum_airtime. As a mathematician by training, when I want one value of a set of identical values, it doesn't occur to me to add up n copies of sum_airtime, then divide the total by n to reconstitute the original value. Taking the max to pick out a single value is what came to mind.

But your question showed me that I'd followed your instructions for the calculations to be performed more closely than I need have. Here's a revised version of my code that avoids the "problem" entirely by deferring the calculation of sum_airtime until after the collapse.

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte(airtime airline_id) str2 origin str3 destination int year byte quarter 60 1 "US" "CA" 2016 1 30 1 "US" "US" 2016 1 30 1 "CA" "US" 2016 1 60 2 "US" "JAM" 2016 1 30 2 "CA" "US" 2016 2 end generate yrqtr = yq(year,quarter) format yrqtr %tq drop year quarter rename (origin destination) (ctryorig ctrydest) generate timeorig = airtime/2 generate timedest = airtime/2 generate id = _n reshape long ctry time, i(id) j(od) string collapse (sum) time, by(airline_id yrqtr ctry) bysort airline_id yrqtr: egen sum_airtime = total(time) generate pcttime = time/sum_airtime order airline_id yrqtr sum_airtime

Code:

. list, sepby(airline_id yrqtr) +------------------------------------------------------+ | airlin~d yrqtr sum_ai~e ctry time pcttime | |------------------------------------------------------| 1. | 1 2016q1 120 CA 45 .375 | 2. | 1 2016q1 120 US 75 .625 | |------------------------------------------------------| 3. | 2 2016q1 60 JAM 30 .5 | 4. | 2 2016q1 60 US 30 .5 | |------------------------------------------------------| 5. | 2 2016q2 30 CA 15 .5 | 6. | 2 2016q2 30 US 15 .5 | +------------------------------------------------------+

Let me close by thanking you for having the interest in learning and taking the time to puzzle out how my solution worked. Some members just copy-and-paste and hope it works. And the members who respond have many different motivations. I enjoy teaching, and quietly following Statalist was a big part of my learning Stata, so it's gratifying to know that someone has learned from my response.

I would like to pass on to you a piece of advice which may be unnecessary, but given your demonstrated interest in learning, I would not like to think you might not be aware of this approach to increasing your Stata knowledge. When I began using Stata in a serious way, I started as have others here, by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. All of these manuals are included as PDFs in the Stata installation (since version 11) and are accessible from within Stata - for example, through Stata's Help menu. The objective in doing this was not so much to master Stata as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax, and know how to find out more about them in the help files and manual.

Stata supplies exceptionally good documentation that amply repays the time spent studying it. The path I followed surfaces the things you need to know to get started in a hurry and to work effectively.
1 like
Comment

Announcement

Create a variable for every new observation and calculate the percentage for the new variables

Comment

Comment

Comment