Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • How to set the origin and destination for a connecting flight?

    I am working with airline data. The data looks like the following example:

    Itinerary Origin Destination Break
    1 A B
    1 B C X
    1 C B
    1 B A X

    Where itinerary is the ticket number of a specific person (I used the number 1 here to simplify), and 'break' creates an X after which flight a passenger took a break, meaning it was their final destination.
    In this example, I am looking at a round trip connecting flight between cities A and C (where the passenger flew from city A to a layover in city B and eventually to his destination in city C) (I don't care where the layover was), so what I want to get from the above data is the following:

    Itinerary Origin Destination
    1 A C
    1 C A

    How can I condense the data to not include the layover locations for each itinerary?
    Last edited by Irina Z; 03 May 2015, 15:05.

  • #2
    I'm going to assume that the data can be a bit more complex: sometimes there will be only one flight, sometimes there could be more than two. The most important assumption is that the data are sorted with the flights in consecutive order (as they are in your example) regardless of complexity. So this is a problem involving spells of data, where each spell consists of a series of flights leading up to and including a break == "X" observation..

    Code:
    gen obsno = _n // RETAIN SORT ORDER FOR USE LATER
    
    // IDENTIFY AND NUMBER THE SPELLS
    gen spell_start = (Itinerary!= Itinerary[_n-1] | Break[_n-1] == "X")
    gen spell = sum(spell_start)
    by spell (obsno), sort: gen initial_origin = Origin[1]
    by spell(obsno): gen final_destination = Destination[_N]
    
    // IF DESIRED YOU CAN DROP ALL BUT ONE OBSERVATION PER SPELL
    // NOW AND SHORTEN THE VARIABLE NAMES
    by spell (obsno): keep if _n == 1
    drop Origin Destination spell spell_start obsno Break
    rename initial_origin Origin
    rename final_destination Destination
    Notes: You didn't say if you wanted to add the overall origin and destination to the existing data set (which is what the first two blocks of code together accomplish), or if you want to replace this with a data set containing only the initial and final destinations of each itinerary.

    I deliberately did not include a verification that the origin of each flight within a spell equal the destination of the preceding flight in that spell, because I can imagine that people sometimes fly into cities served by multiple airports and then transfer among them, e.g. JFK->ORD, MDW->LAX is a possible trip from New York to Los Angeles with a layover involving both O'Hare and Midway in Chicago.

    Comment


    • #3
      Hi Clyde,

      I am doing the same thing as Irina and I have tried using your commands to generate the origin and destination endpoints based on the trip break (X). However my results were not what I had expected. My current data table is in this format:

      ID SeqNum Coupons OriginAirport OriginCity Quarter DestAirport DestCity
      20131423 1 4 10257 30257 1 14100 34100
      20131423 2 4 14100 34100 1 11066 31066 X
      20131423 3 4 11066 31066 1 14100 34100
      20131423 4 4 14100 34100 1 10257 30257 X

      But after running your commands, I get these results for some of the itinerary IDs:

      ID SeqNum Coupons OriginAirfield OriginCity Quarter DestAirport DestCity Origin Destination
      20131423 1 4 10257 30257 1 14100 34100 31066 30257
      20131423 2 4 14100 34100 1 11066 31066 X 31066 30257
      20131423 3 4 11066 31066 1 14100 34100 31066 30257
      20131423 4 4 14100 34100 1 10257 30257 X 31066 30257

      This is surprising as other itineraries showed the correct Origin and Destination, such as the one below

      ID SeqNum Coupons OriginAirfield OriginCity Quarter DestAirport DestCity Origin Destination
      20131424 1 4 10257 30257 1 14100 34100 30257 31066
      20131424 2 4 14100 34100 1 11066 31066 X 30257 31066
      20131424 3 4 11066 31066 1 14100 34100 31066 30257
      20131424 4 4 14100 34100 1 10257 30257 X 31066 30257

      This happens randomly throughout my whole dataset and I was wondering whether you will be able to tell me what is wrong with my codes? I have used the exact same codes that you have posted above and my data has being sorted correctly based on sequence.

      Thank you for your time.

      Tony

      Comment

      Working...
      X