Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Calculating distance between two variables and generating new variable

    I would like to create a variable called spill which is given as the sum of the distances between vectors of each row multiplied by the stock value. For example, consider

    firm us euro asia africa stock year
    A 1 4 3 5 46 2001
    A 2 0 1 3 889 2002
    B 2 3 1 1 343 2001
    B 0 2 1 3 43 2002
    C 1 3 4 2 345 2001
    I would like to create a vector which basically takes the distance between two firms at time t and generates the spill variable. For example, take for Firm A in the year 2001 it would be 0.204588 (which is the cosine distance between firm A and B at time t i.e, in 2001 ( 1, 4, 3,5) and (2,3,1,1) (i.e. similarity between the investments in us, euro, asia, africa) and then multiplied by 343, and then to calculate the distance between A and C in 2001 as .10528 * 345 , hence the spill variable is = .2 * 343+.1 * 345 = 103.1 for the year 2001 for firm A. Can anyone please advise?

    Thank you!


  • #2
    Any thoughts or ideas?

    Comment


    • #3
      You can use -joinby- to join the dataset to itself and create a dataset of pairs of firms. From there you could loop over your variable list to create the cosine distance *scalar*, whose exact formula I don't recall:

      Code:
      clear
      input str1 firm us euro asia africa stock year
      A 1 4 3 5 46 2001
      A 2 0 1 3 889 2002
      B 2 3 1 1 343 2001
      B 0 2 1 3 43 2002
      C 1 3 4 2 345 2001
      end
      //
      tempfile temp
      save `temp'
      rename (firm us euro asia africa) (firm_1 us_1 euro_1 asia_1 africa_1)
      joinby year using `temp'
      // Eliminate self-pairs and duplicates
      keep if (firm_1 < firm)
      gen distance = 0
      foreach c of varlist us euro asia africa {
         replace distance = distance + `c' * `c'_1
         etc.
      }
      replace distance = distance * stock

      Comment


      • #4
        Here's a no loop solution:

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str1 firm byte(us euro asia africa) int(stock year)
        "A" 1 4 3 5  46 2001
        "A" 2 0 1 3 889 2002
        "B" 2 3 1 1 343 2001
        "B" 0 2 1 3  43 2002
        "C" 1 3 4 2 345 2001
        end
        
        rename (us euro asia africa) value=
        reshape long value, i(firm year) j(region) string
        
        bysort firm year: egen double norm = total(value^2)
        replace norm = sqrt(norm)
        
        tempfile main
        save "`main'"
        
        rename (firm value norm) =0
        joinby year region using "`main'"
        
        keep if firm < firm0
        order year firm firm0 region
        
        sort year firm firm0 region
        by year firm firm0: egen double dist = total(value * value0)
        replace dist = 1 - dist / (norm * norm0)
        
        list , sepby(year firm firm0)
        and the results:
        Code:
        . list , sepby(year firm firm0)
        
             +-------------------------------------------------------------------------------------------+
             | year   firm   firm0   region   value0   stock       norm0   value        norm        dist |
             |-------------------------------------------------------------------------------------------|
          1. | 2001      A       B   africa        1     343   3.8729833       5   7.1414284   .20458832 |
          2. | 2001      A       B     asia        1     343   3.8729833       3   7.1414284   .20458832 |
          3. | 2001      A       B     euro        3     343   3.8729833       4   7.1414284   .20458832 |
          4. | 2001      A       B       us        2     343   3.8729833       1   7.1414284   .20458832 |
             |-------------------------------------------------------------------------------------------|
          5. | 2001      A       C   africa        2     345   5.4772256       5   7.1414284   .10520751 |
          6. | 2001      A       C     asia        4     345   5.4772256       3   7.1414284   .10520751 |
          7. | 2001      A       C     euro        3     345   5.4772256       4   7.1414284   .10520751 |
          8. | 2001      A       C       us        1     345   5.4772256       1   7.1414284   .10520751 |
             |-------------------------------------------------------------------------------------------|
          9. | 2001      B       C   africa        2     345   5.4772256       1   3.8729833   .19861231 |
         10. | 2001      B       C     asia        4     345   5.4772256       1   3.8729833   .19861231 |
         11. | 2001      B       C     euro        3     345   5.4772256       3   3.8729833   .19861231 |
         12. | 2001      B       C       us        1     345   5.4772256       2   3.8729833   .19861231 |
             |-------------------------------------------------------------------------------------------|
         13. | 2002      A       B   africa        3      43   3.7416574       3   3.7416574   .28571429 |
         14. | 2002      A       B     asia        1      43   3.7416574       1   3.7416574   .28571429 |
         15. | 2002      A       B     euro        2      43   3.7416574       0   3.7416574   .28571429 |
         16. | 2002      A       B       us        0      43   3.7416574       2   3.7416574   .28571429 |
             +-------------------------------------------------------------------------------------------+

        Comment


        • #5
          I am confused with the solutions , I have a huge dataset of over 1 million records and making a combinations would probably make it difficult for computation. I am little unclear with the second solution, I need to have values for firm for firm A in 2001 and a value called spill, I don't need any region values (they are used to create the distance vector).

          Comment


          • #6
            As you say, the region values are used to create the measure. Once you understand what's going on, you can discard them:
            Code:
            rename dist spill
            by year firm firm0: keep if _n == 1
            drop region value* norm*
            In terms of forming all pairwise combinations being infeasible, any advice will depend on details you have not shared.

            Comment


            • #7
              The problem you describe, *as I understand it*, by its essence involves a measure for every distinct pair of firms that occur within each year. Any way the dataset is organized, that's going to be big if you have one million observations. Perhaps I'm misunderstanding what you want? It would help if you could give an illustration of how you want the final dataset to be organized. And, if you could tell how many years you have and how many firms you have per year, that would give a sense of whether what you want is possible. If your 1e6 observations comprise lots of firms and not many years, it's going to be tough.

              Comment


              • #8
                Originally posted by Robert Picard View Post
                Here's a no loop solution:

                Code:
                * Example generated by -dataex-. To install: ssc install dataex
                clear
                input str1 firm byte(us euro asia africa) int(stock year)
                "A" 1 4 3 5 46 2001
                "A" 2 0 1 3 889 2002
                "B" 2 3 1 1 343 2001
                "B" 0 2 1 3 43 2002
                "C" 1 3 4 2 345 2001
                end
                
                rename (us euro asia africa) value=
                reshape long value, i(firm year) j(region) string
                
                bysort firm year: egen double norm = total(value^2)
                replace norm = sqrt(norm)
                
                tempfile main
                save "`main'"
                
                rename (firm value norm) =0
                joinby year region using "`main'"
                
                keep if firm < firm0
                order year firm firm0 region
                
                sort year firm firm0 region
                by year firm firm0: egen double dist = total(value * value0)
                replace dist = 1 - dist / (norm * norm0)
                
                list , sepby(year firm firm0)
                and the results:
                Code:
                . list , sepby(year firm firm0)
                
                +-------------------------------------------------------------------------------------------+
                | year firm firm0 region value0 stock norm0 value norm dist |
                |-------------------------------------------------------------------------------------------|
                1. | 2001 A B africa 1 343 3.8729833 5 7.1414284 .20458832 |
                2. | 2001 A B asia 1 343 3.8729833 3 7.1414284 .20458832 |
                3. | 2001 A B euro 3 343 3.8729833 4 7.1414284 .20458832 |
                4. | 2001 A B us 2 343 3.8729833 1 7.1414284 .20458832 |
                |-------------------------------------------------------------------------------------------|
                5. | 2001 A C africa 2 345 5.4772256 5 7.1414284 .10520751 |
                6. | 2001 A C asia 4 345 5.4772256 3 7.1414284 .10520751 |
                7. | 2001 A C euro 3 345 5.4772256 4 7.1414284 .10520751 |
                8. | 2001 A C us 1 345 5.4772256 1 7.1414284 .10520751 |
                |-------------------------------------------------------------------------------------------|
                9. | 2001 B C africa 2 345 5.4772256 1 3.8729833 .19861231 |
                10. | 2001 B C asia 4 345 5.4772256 1 3.8729833 .19861231 |
                11. | 2001 B C euro 3 345 5.4772256 3 3.8729833 .19861231 |
                12. | 2001 B C us 1 345 5.4772256 2 3.8729833 .19861231 |
                |-------------------------------------------------------------------------------------------|
                13. | 2002 A B africa 3 43 3.7416574 3 3.7416574 .28571429 |
                14. | 2002 A B asia 1 43 3.7416574 1 3.7416574 .28571429 |
                15. | 2002 A B euro 2 43 3.7416574 0 3.7416574 .28571429 |
                16. | 2002 A B us 0 43 3.7416574 2 3.7416574 .28571429 |
                +-------------------------------------------------------------------------------------------+
                I'm currently bothered with a similar question and your solution is of great help! Thank you sooooo much!

                Comment

                Working...
                X