Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dyad-Year to Directed-Dyad Year

    Hello,

    I apologize if this has been asked before. I tried to thoroughly look through the forum.

    I have trade data from COW that is in dyad-year (non-directed dyad), however, due to the nature of my DV, these data have to be converted to directed-dyad year.

    So,
    importer1 importer2 flow1 flow2
    A B .01 .02
    .
    .
    B C .03 .04
    .
    .


    has to be converted to
    importer1 importer2 flow1 flow2
    A B .01 .02
    .
    .
    B A .02 .01
    .
    .
    B C .03 .04
    .
    .
    C B .04 .03

    I have no idea how to begin to go about this.

    The closest links I found in this site are:
    http://www.statalist.org/forums/foru...ic?view=thread

    and

    http://www.statalist.org/forums/foru...ing-many-dyads

    However it seems the unit of analyses are from country-year to directed-dyad year.

    Many thanks in advanced,
    MW


  • #2
    You describe the problem in terms of dyad-year pairs, but you don't seem to have a year variable. I'll assume that's just an omission in the way you showed your example data and that there really is a variable called year. This should do it.

    Code:
    isid importer1 importer2 year
    gen temp_importer = importer1
    gen temp_flow = flow1
    expand 2
    by importer1 importer2 year, sort: replace importer1 = importer2 if _n == 2
    by importer1 importer2 year: replace importer2 = temp_importer if _n == 2
    by importer1 importer2 year: replace flow1 = flow2 if _n == 2
    by importer1 importer2 year: replace flow2 = temp_flow if _n == 2
    drop temp_importer temp_flow
    Note: Not tested. Beware of typos.

    The reason it's not tested is that the way you posted your data makes it difficult to import into Stata to try it. In the future, please use the -dataex- command to post example data on this forum. You can install the -dataex- command by running -ssc install dataex-. The simple instructions for use are in -help dataex-. By using -dataex- you enable those who want to help you to quickly and easily create a completely faithful replica of your example data in Stata with a simple copy/paste operation.

    Comment


    • #3
      Originally posted by Clyde Schechter View Post
      You describe the problem in terms of dyad-year pairs, but you don't seem to have a year variable. I'll assume that's just an omission in the way you showed your example data and that there really is a variable called year. This should do it.

      Code:
      isid importer1 importer2 year
      gen temp_importer = importer1
      gen temp_flow = flow1
      expand 2
      by importer1 importer2 year, sort: replace importer1 = importer2 if _n == 2
      by importer1 importer2 year: replace importer2 = temp_importer if _n == 2
      by importer1 importer2 year: replace flow1 = flow2 if _n == 2
      by importer1 importer2 year: replace flow2 = temp_flow if _n == 2
      drop temp_importer temp_flow
      Note: Not tested. Beware of typos.

      The reason it's not tested is that the way you posted your data makes it difficult to import into Stata to try it. In the future, please use the -dataex- command to post example data on this forum. You can install the -dataex- command by running -ssc install dataex-. The simple instructions for use are in -help dataex-. By using -dataex- you enable those who want to help you to quickly and easily create a completely faithful replica of your example data in Stata with a simple copy/paste operation.
      Clyde, my apologies. I'm terribly new here. But thank you for the dataex reference. I installed it and this is were the first few lines from the COW bilateral dataset:
      Code:
      input double ccode1 int(ccode2 year) str32(importer1 importer2) double(flow1 flow2) byte(source1 source2) double(bel_lux_alt_flow1 bel_lux_alt_flow2 china_alt_flow1 china_alt_flow2 version)
       2 20 1960 "United States of America" "Canada"    3452  4147.1 2 2 -9 -9 -9 -9 2.01
      20 20 1960 "United States of America" "Canada"    3452  4147.1 2 2 -9 -9 -9 -9 2.01
       2 20 1961 "United States of America" "Canada"    3555  4263.3 2 2 -9 -9 -9 -9 2.01
      I also apologize, forgot to mention the year variable. And obviously, the ccode1 and ccod2 variables are better, as they are not strings. Unfortunately, I tried to re-write your code using ccode1 and ccode2, and got the error message of:

      Code:
      variables ccode1 ccode2 year do not uniquely identify the observations
      I'm not sure what's going on here. Hopefully I can figure it out and run it again.

      Comment


      • #4
        Well, it means that for some combination(s) of ccode1 ccode2 and year, there is more than one observation. Based on the way you describe your data set, that most likely means that there are errors in your data that you will need to fix. After all, if there is more than one observation for each dyad-year your problem is indeterminate. Alternatively, your expectation that there should be just one observation for each such combination may be incorrect: perhaps the data are further subclassified by types of products (agricultural, electronics, etc.) and there is a variable somewhere that identifies this. (What is the version variable?)

        If you have obtained this data from an external source (I don't know what COW might be other than a shouting bovine ), and have not made modifications to it, I can only say that it is common for publicly released data sets to have errors. If you have curated this data set yourself, there may be errors in the code you used to create it. Anyway, the first step at this point is to find out why there are some multiple records for the same year-dyad. And that begins with identifying them. So I would start with:

        Code:
        duplicates tag ccode1 ccode2 year, gen(flag)
        browse if flag
        That will show you all the offending multiples. Then you can decide which are correct and which are errors, and ponder what to do about it. Sometimes when there are multiple records like this they are all exact duplicates of each other. In that case, -duplicates drop- solves the problem. If they differ in some ways, however, then you have to decide whether one is correct and the others are errors that should be dropped, or perhaps the multiple records need to be combined in some way (added up, averaged, whatever). In the latter situation it is likely that -collapse- will be your friend.

        I should point out that even in the short data example you posted there is a hint of a problem. In the first observation USA is associated with ccode2, but in the second observation USA is associated with ccode20, whereas ccode20 is elsewhere associated with Canada. If there are inconsistencies in the linkage between ccode and country, this could cause this problem and many others.

        Finally, as an aside, I notice a lot of -9's in your data. It is common in some alternative software to use numbers like -9 to stand for missing value. That practice does not work in Stata and the -9's will be treated as the number -9 in any calculations you perform with those variables. If those are supposed to represent missing values, you need to fix that. Fortunately, it is very easy:

        Code:
        mvdecode bel_lux_alt_flow* china_alt_flow*, mv(-9)

        Comment


        • #5
          Originally posted by Clyde Schechter View Post
          Well, it means that for some combination(s) of ccode1 ccode2 and year, there is more than one observation. Based on the way you describe your data set, that most likely means that there are errors in your data that you will need to fix. After all, if there is more than one observation for each dyad-year your problem is indeterminate. Alternatively, your expectation that there should be just one observation for each such combination may be incorrect: perhaps the data are further subclassified by types of products (agricultural, electronics, etc.) and there is a variable somewhere that identifies this. (What is the version variable?)

          If you have obtained this data from an external source (I don't know what COW might be other than a shouting bovine ), and have not made modifications to it, I can only say that it is common for publicly released data sets to have errors. If you have curated this data set yourself, there may be errors in the code you used to create it. Anyway, the first step at this point is to find out why there are some multiple records for the same year-dyad. And that begins with identifying them. So I would start with:

          Code:
          duplicates tag ccode1 ccode2 year, gen(flag)
          browse if flag
          That will show you all the offending multiples. Then you can decide which are correct and which are errors, and ponder what to do about it. Sometimes when there are multiple records like this they are all exact duplicates of each other. In that case, -duplicates drop- solves the problem. If they differ in some ways, however, then you have to decide whether one is correct and the others are errors that should be dropped, or perhaps the multiple records need to be combined in some way (added up, averaged, whatever). In the latter situation it is likely that -collapse- will be your friend.

          I should point out that even in the short data example you posted there is a hint of a problem. In the first observation USA is associated with ccode2, but in the second observation USA is associated with ccode20, whereas ccode20 is elsewhere associated with Canada. If there are inconsistencies in the linkage between ccode and country, this could cause this problem and many others.

          Finally, as an aside, I notice a lot of -9's in your data. It is common in some alternative software to use numbers like -9 to stand for missing value. That practice does not work in Stata and the -9's will be treated as the number -9 in any calculations you perform with those variables. If those are supposed to represent missing values, you need to fix that. Fortunately, it is very easy:

          Code:
          mvdecode bel_lux_alt_flow* china_alt_flow*, mv(-9)




          Clyde,

          Thanks again. You have been most helpful here and on the forum in general.

          For some reason, the variables are being doubled when I use your code. The only thing I did was that I changed the importer to ccode. Here is the original (prior to running the code):
          Code:
          input double ccode1 int(ccode2 year) str32(importer1 importer2) double(flow1 flow2) byte(source1 source2) double(bel_lux_alt_flow1 bel_lux_alt_flow2 china_alt_flow1 china_alt_flow2 version)
           2 20 1960 "United States of America" "Canada"     3452  4147.1 2 2 -9 -9 -9 -9 2.01
           2 20 1961 "United States of America" "Canada"     3555  4263.3 2 2 -9 -9 -9 -9 2.01
           2 20 1962 "United States of America" "Canada"   3979.1    4427 2 2 -9 -9 -9 -9 2.01
           2 20 1963 "United States of America" "Canada"   4169.3  4535.9 2 2 -9 -9 -9 -9 2.01
          This is the code I used:
          Code:
          
          isid ccode1 ccode2 year
          
          gen temp_importer = ccode1
          
          gen temp_flow = flow1
          
          expand 2
          
          by ccode1 ccode2 year, sort: replace ccode1 = ccode2 if _n == 2
          
          by ccode1 ccode2 year: replace ccode2 = temp_importer if _n == 2
          
          by ccode1 ccode2 year: replace flow1 = flow2 if _n == 2
          
          by ccode1 ccode2 year: replace flow2 = temp_flow if _n == 2
          
          drop temp_importer temp_flow
          Then, I get the error:
          Code:
          run "/var/folders/_4/yrdp6ksd4yq3kz2zd75qtph00000gn/T//SD91151.000000"
          not sorted
          r(5);
          I guess instead of making:

          Code:
          20 2 1960 "Canada" "United States"    4147.1  3452
          It's duplicating the values instead?

          And I'm terribly sorry, COW is where I got the data(from and external source). It came in an Excel sheet (.csv). I cleaned it (took out years that I didn't need), and I imported as an excel sheet (but not as CSV).

          Thank you for pointing out the missing values (-9). I hadn't done anything to them yet (again, thanks for the code!). I was just trying to see if I can get the dyad-year to directed dyad-year.
          Last edited by Monica White; 16 Jan 2017, 18:14.

          Comment


          • #6
            Update: I also used the importer variables, and they duplicated as well.

            Comment


            • #7
              Sorry, my mistake. I overlooked the fact that when I -replace ccode1 = ccode2...-, I break the sort order, and so the subsequent command doesn't work anymore. The code below takes a slightly different approach that doesn't founder on that problem:

              Code:
              clear
              input double ccode1 int(ccode2 year) str32(importer1 importer2) double(flow1 flow2) byte(source1 source2) double(bel_lux_alt_flow1 bel_lux_alt_flow2 china_alt_flow1 china_alt_flow2 version)
               2 20 1960 "United States of America" "Canada"     3452  4147.1 2 2 -9 -9 -9 -9 2.01
               2 20 1961 "United States of America" "Canada"     3555  4263.3 2 2 -9 -9 -9 -9 2.01
               2 20 1962 "United States of America" "Canada"   3979.1    4427 2 2 -9 -9 -9 -9 2.01
               2 20 1963 "United States of America" "Canada"   4169.3  4535.9 2 2 -9 -9 -9 -9 2.01
               end
               
               isid ccode1 ccode2 year
              
              gen temp_importer = ccode1
              
              gen temp_flow = flow1
              
              expand 2
              by ccode1 ccode2 year, sort: gen flag = (_n == 2)
              
              replace flow1 = flow2 if flag
              replace flow2 = temp_flow if flag
              replace ccode1 = ccode2 if flag
              replace ccode2 = temp_importer if flag
              drop temp_importer temp_flow flag
              And it produces correct results on your example.

              Comment


              • #8
                Originally posted by Clyde Schechter View Post
                Sorry, my mistake. I overlooked the fact that when I -replace ccode1 = ccode2...-, I break the sort order, and so the subsequent command doesn't work anymore. The code below takes a slightly different approach that doesn't founder on that problem:

                Code:
                clear
                input double ccode1 int(ccode2 year) str32(importer1 importer2) double(flow1 flow2) byte(source1 source2) double(bel_lux_alt_flow1 bel_lux_alt_flow2 china_alt_flow1 china_alt_flow2 version)
                2 20 1960 "United States of America" "Canada" 3452 4147.1 2 2 -9 -9 -9 -9 2.01
                2 20 1961 "United States of America" "Canada" 3555 4263.3 2 2 -9 -9 -9 -9 2.01
                2 20 1962 "United States of America" "Canada" 3979.1 4427 2 2 -9 -9 -9 -9 2.01
                2 20 1963 "United States of America" "Canada" 4169.3 4535.9 2 2 -9 -9 -9 -9 2.01
                end
                
                isid ccode1 ccode2 year
                
                gen temp_importer = ccode1
                
                gen temp_flow = flow1
                
                expand 2
                by ccode1 ccode2 year, sort: gen flag = (_n == 2)
                
                replace flow1 = flow2 if flag
                replace flow2 = temp_flow if flag
                replace ccode1 = ccode2 if flag
                replace ccode2 = temp_importer if flag
                drop temp_importer temp_flow flag
                And it produces correct results on your example.
                Clyde,
                Thank you so much! Also, I appreciate your patience! I've learned so much.

                Comment


                • #9
                  Monica, if you are going to work with COW (Correlates of War Project) data you might find it useful to use EUGene. You can download it at http://www.eugenesoftware.org/.

                  Comment


                  • #10
                    Originally posted by Richard Stoll View Post
                    Monica, if you are going to work with COW (Correlates of War Project) data you might find it useful to use EUGene. You can download it at http://www.eugenesoftware.org/.
                    Hi Richard,

                    EUGene was my first choice, but where I am currently, I only have Apple computers.

                    However, I can probably start learning EUGene next month. Do you know if there are any help forums for it online? Tried looking for them, but couldn't find anything. I'm working with both Correlates of War data, but other data as well.

                    Comment

                    Working...
                    X