Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Create a variable for every new observation and calculate the percentage for the new variables

    Hello,

    I am doing some analysis in the US airline industry. At the moment I am trying to compute in which countries the airline fly (in percentage of the Airtime for each Airline in each Quarter), depending on the airtime. Therefore I have a table like this:
    Airtime Airline_ID Origin Destination Year Quarter
    60 1 US CA 2016 1
    30 1 US US 2016 1
    30 1 CA US 2016 1
    60 2 US JAM 2016 1
    30 2 CA US 2016 2
    The Airline always start or lands in the US. Now I want to create for each country a new Variable and compute the percentage of airtime for each. If it is a international flight each country accounts for the half Airtime. So the result should be something like this:
    Airtime Airline_ID Origin Destination Year Quarter Sum_Airtime US CA JAM
    60 1 US CA 2016 1 120 0.25 0.25
    30 1 US US 2016 1 120 0.25
    30 1 CA US 2016 1 120 0.125 0.125
    60 2 US JAM 2016 1 60 0.5 0.5
    30 2 CA US 2016 2 30 0.5 0.5
    After this there have to be one final step. Every Airline should only have one row per Quarter
    Airline_ID Year Quarter Sum_Airtime US CA JAM
    1 2016 1 120 0.625 0.375
    2 2016 1 60 0.5 0.5
    2 2016 2 30 0.5 0.5
    The real table contains around 400,000 observations and probably every country there is in the world. I hope someone of you have a solution for my problem and sorry if my English isn´t the best =)

    Thanks

  • #2
    Welcome to Statalist, Patrick.

    Here is some sample code to start you on your way.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte(airtime airline_id) str2 origin str3 destination int year byte quarter
    60 1 "US" "CA"  2016 1
    30 1 "US" "US"  2016 1
    30 1 "CA" "US"  2016 1
    60 2 "US" "JAM" 2016 1
    30 2 "CA" "US"  2016 2
    end
    generate yrqtr = yq(year,quarter)
    format yrqtr %tq
    drop year quarter
    bysort airline_id yrqtr: egen sum_airtime = total(airtime)
    rename (origin destination) (ctryorig ctrydest)
    generate timeorig = airtime/2
    generate timedest = airtime/2
    generate id = _n
    reshape long ctry time, i(id) j(od) string
    generate pcttime = time/sum_airtime
    collapse (max) sum_airtime (sum) pcttime, by(airline_id yrqtr ctry)
    order airline_id yrqtr sum_airtime
    Code:
    . list, sepby(airline_id yrqtr)
    
         +-----------------------------------------------+
         | airlin~d    yrqtr   sum_ai~e   ctry   pcttime |
         |-----------------------------------------------|
      1. |        1   2016q1        120     CA      .375 |
      2. |        1   2016q1        120     US      .625 |
         |-----------------------------------------------|
      3. |        2   2016q1         60    JAM        .5 |
      4. |        2   2016q1         60     US        .5 |
         |-----------------------------------------------|
      5. |        2   2016q2         30     CA        .5 |
      6. |        2   2016q2         30     US        .5 |
         +-----------------------------------------------+
    You will note some differences from what you requested.

    1. I combined your year and quarter variables into a Stata Internal Format quarterly variable. For your analyses, you will generally want a single variable for your date, not a pair. If you are unfamiliar with Stata dates and times, Chapter 24 (Working with dates and times) of the Stata User's Guide PDF is well written and has lots of examples. But it is, of course, impossible to remember all the details for long. Everyone who uses Stata with any regularity needs to read this chapter, and probably re-read it periodically as well. Fortunately help datetime is also very well organized and has lots of internal links to help you quickly track down the right function. So if you are familiar with the general concept and have read the manual chapter a few times, most of the time you can find what you need in yq function.

    2. I have a separate observation for each airline_id/year/quarter/country rather than string all the countries out onto one observation. That would be what is called a wide layout of the data; what I have provided is a long layout of the data. The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data. I leave it as an exercise for you to use the documentation in help reshape to learn how to use the reshape wide command to make a wide layout if you insist.

    3. You will note that I have used the user-written dataex command to present your data in a form that Stata can read in. Luckily, your data was reasonably straightforward, but in general, data presented with dataex is easier for others to use to help you. Before your next post, I recommend you review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post, looking especially at sections 9-12 on how to best pose your question, including the use of code blocks for posting code and results and dataex for presenting data.

    Comment


    • #3
      Hello William,

      that was a very helpful answer and a very valuable lesson for me and my Stata skills, thank you so much. It took me a while to fully understand what you have done but it works better than my first idea. And thanks for the hint with the variable for date. This will definitely become very useful in further research.

      However, I have one final question to your code. in your line:
      collapse (max) sum_airtime (sum) pcttime, by(airline_id yrqtr ctry)
      Why do you use (max). Isn´t it unnecessary in this case, because the sum_airtime (by (airline_id yrqtr ctry)) have to be the same?

      Comment


      • #4
        Hello Patrick -

        There is no good reason for using (max) sum_airtime rather than (mean) sum_airtime. As a mathematician by training, when I want one value of a set of identical values, it doesn't occur to me to add up n copies of sum_airtime, then divide the total by n to reconstitute the original value. Taking the max to pick out a single value is what came to mind.

        But your question showed me that I'd followed your instructions for the calculations to be performed more closely than I need have. Here's a revised version of my code that avoids the "problem" entirely by deferring the calculation of sum_airtime until after the collapse.
        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input byte(airtime airline_id) str2 origin str3 destination int year byte quarter
        60 1 "US" "CA"  2016 1
        30 1 "US" "US"  2016 1
        30 1 "CA" "US"  2016 1
        60 2 "US" "JAM" 2016 1
        30 2 "CA" "US"  2016 2
        end
        generate yrqtr = yq(year,quarter)
        format yrqtr %tq
        drop year quarter
        rename (origin destination) (ctryorig ctrydest)
        generate timeorig = airtime/2
        generate timedest = airtime/2
        generate id = _n
        reshape long ctry time, i(id) j(od) string
        collapse (sum) time, by(airline_id yrqtr ctry)
        bysort airline_id yrqtr: egen sum_airtime = total(time)
        generate pcttime = time/sum_airtime
        order airline_id yrqtr sum_airtime
        Code:
        . list, sepby(airline_id yrqtr)
        
             +------------------------------------------------------+
             | airlin~d    yrqtr   sum_ai~e   ctry   time   pcttime |
             |------------------------------------------------------|
          1. |        1   2016q1        120     CA     45      .375 |
          2. |        1   2016q1        120     US     75      .625 |
             |------------------------------------------------------|
          3. |        2   2016q1         60    JAM     30        .5 |
          4. |        2   2016q1         60     US     30        .5 |
             |------------------------------------------------------|
          5. |        2   2016q2         30     CA     15        .5 |
          6. |        2   2016q2         30     US     15        .5 |
             +------------------------------------------------------+
        Let me close by thanking you for having the interest in learning and taking the time to puzzle out how my solution worked. Some members just copy-and-paste and hope it works. And the members who respond have many different motivations. I enjoy teaching, and quietly following Statalist was a big part of my learning Stata, so it's gratifying to know that someone has learned from my response.

        I would like to pass on to you a piece of advice which may be unnecessary, but given your demonstrated interest in learning, I would not like to think you might not be aware of this approach to increasing your Stata knowledge. When I began using Stata in a serious way, I started as have others here, by reading my way through the Getting Started with Stata manual relevant to my setup. Chapter 18 then gives suggested further reading, much of which is in the Stata User's Guide, and I worked my way through much of that reading as well. All of these manuals are included as PDFs in the Stata installation (since version 11) and are accessible from within Stata - for example, through Stata's Help menu. The objective in doing this was not so much to master Stata as to be sure I'd become familiar with a wide variety of important basic techniques, so that when the time came that I needed them, I might recall their existence, if not the full syntax, and know how to find out more about them in the help files and manual.

        Stata supplies exceptionally good documentation that amply repays the time spent studying it. The path I followed surfaces the things you need to know to get started in a hurry and to work effectively.

        Comment

        Working...
        X