
  • #31
    Perhaps somebody here is familiar enough with the particular survey data you are working with to advise you. But most of us here are not. So if you don't get a response from somebody within, say, 12 hours, I suggest you post back and use the -dataex- command to post example data from the two data sets you are trying to -merge-. The problem is that your data do not conform to the expectations of the code you have written. But without knowing what is actually in the data, it is impossible to advise.

    If you are running version 17, 16 or a fully updated version 15.1 or 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

    When asking for help with code, always show example data. When showing example data, always use -dataex-.
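    A minimal sketch of typical -dataex- usage (the data set and variable names here are hypothetical):
    Code:
    ssc install dataex                          // needed only if -dataex- is not already installed
    use mydata, clear
    dataex country_id year trade_flow in 1/10   // posts the first 10 observations in Statalist-ready form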



    • #32
      Hi everyone, I'm trying to estimate a spatial panel. I have two datasets: the first contains the variables _ID, _CY, and _CX, with 853 observations, one for each state municipality, and the second contains my panel data, with 11,089 observations from 2008 to 2020. The code I used is:
      Code:
      spshape2dta mg.shp, replace
      use mg
      spset
      merge m:1 _ID using "mg.dta"

      The result is 11,089 matched observations. But when I try to create the contiguity matrix using the command:

      Code:
      spmatrix create contiguity W

      Stata returns error r(459): variable _ID does not uniquely identify observations in the master data. Could somebody help me with this problem? What am I doing wrong?
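      One quick way to see what that error is about is to check how _ID repeats after the merge; a minimal diagnostic sketch (the variable n_obs is hypothetical):
      Code:
      isid _ID                      // errors out if _ID does not uniquely identify observations
      duplicates report _ID         // shows how many copies of each _ID exist
      bysort _ID: gen n_obs = _N    // in a yearly panel, each municipality appears once per year
      tab n_obs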




      • #33
        Dear all,
        I want to merge two data sets (DTA1 with DTA2). DTA1 contains information on bilateral trade flows between country_k and country_j in year t, whereas DTA2 is a gravity dataset obtained from CEPII
        (http://www.cepii.fr/CEPII/en/bdd_mod..._item.asp?id=8) which contains information on bilateral and unilateral variables between k and j in year t.
        Therefore I wrote the following code:
        Code:
        use DTA1, clear
        
        merge m:1 year country_k  country_j using DTA2
        I get the error message: variables year country_k country_j do not uniquely identify observations in the using data.
        Then I checked for duplicates in the using data by running

        Code:
        use DTA2, clear
        sort year country_k country_j
        quietly by year country_k country_j: gen dup = cond(_N==1,0,_n)
        Then I dropped the duplicates, keeping the first occurrence, by running
        Code:
        drop if dup > 1
        This worked fine. But I need to drop the duplicates more carefully, not exactly in this manner.

        I am posting the dataex example here to explain my point.

        Code:
        * Example generated by -dataex-. To install: ssc install dataex
        clear
        input str3(country_k country_j) int year double x1 byte x2 double x3 byte(x4 y1 y2) double(y3 f1 f2 f3 f4) byte dup
        "AAT" "BJS" 1991    23 43    45.01 . 1 0  5785.35    12     .    36    45 0
        "ABW" "ANT" 1990 21.36  .   567002 0 . . 2487.632 11.35 36.27     .    24 1
        "ABW" "ANT" 1990     .  .   567002 . . .        . 11.35     .     .     . 2
        "ABW" "ANT" 1990     .  .        . . . . 2487.632     .     . 12.56    24 3
        "ABW" "ANT" 1990     .  .        . . . .        .     .     .     .     . 4
        "STU" "BWF" 1993     .  .    37248 . 0 .        .     .     .     . 17.27 1
        "STU" "BWF" 1993  30.7  0    37248 . 0 1  7623.23     . 56.12 23.67 17.27 2
        "STU" "BWF" 1993     .  .        . . . .  7623.23     .     . 23.67     . 3
        "JKS" "FPW" 2010     .  . 4589.765 . . 1        .     .     . 23.14     . 1
        "JKS" "FPW" 2010 76.45  1 4589.765 0 . 1 8767.346     .     . 23.14 32.87 2
        end
        In this example, the first row in the triplet ABW-ANT-1990 contains the maximum information on the variables x1-f4, while the other duplicates in this triplet have some missing values, so drop if dup>1 will retain the row containing the maximum information across all the columns. However, for STU-BWF-1993 the second row contains data on most of the variables and the first row is mostly missing, so here drop if dup>1 will drop the more informative row. Similarly, in JKS-FPW-2010 the second row will be deleted, even though it contains more information than the first row in this triplet.

        I want to ask: how should I drop duplicates so that the rows (duplicates) containing less information are dropped, while the occurrence that contains information on most of the variables (columns) is retained within each triplet? That is,
        for ABW-ANT-1990, drop the 2nd, 3rd, and 4th rows (duplicates)
        for STU-BWF-1993, drop the 1st and 3rd rows
        for JKS-FPW-2010, drop the 1st row

        Please get back to me; I shall be very thankful.

        (Ridwan)






        • #34
          First, in order for this approach to be viable and meaningful, it is crucial that where two observations for the same country pair and year have non-missing values for a variable, they must have the same value. Otherwise the data are contradictory, which is a much bigger problem and requires a wholly different approach. Therefore this code begins by verifying that the non-missing data values are always consistent, before then retaining one of the observations (randomly and irreproducibly selected) that has the fewest missing values. (There could be two or more observations tied for the lowest number of missing values.)
          Code:
          * collect all variables other than the group identifiers and dup
          ds country_k country_j year dup, not
          local vbles `r(varlist)'

          * verify that where two observations in the same group have non-missing
          * values for a variable, those values agree
          foreach v of varlist `vbles' {
              by country_k country_j year (`v'), sort: assert `v' == `v'[1] | missing(`v')
          }

          * count missing values per observation, then keep one observation with
          * the fewest missing values in each country-pair-year group
          egen mcount = rowmiss(`vbles')
          by country_k country_j year (mcount), sort: keep if _n == 1
          That said, I think that rather than simply choosing one of the observations with the lowest number of missing values, it would be better to combine all of the observations for a country pair and year into a single observation that picks as many non-missing values from among them as possible. So what I would do instead is:
          Code:
          * collect all variables other than the group identifiers and dup
          ds country_k country_j year dup, not
          local vbles `r(varlist)'

          * verify that where two observations in the same group have non-missing
          * values for a variable, those values agree
          foreach v of varlist `vbles' {
              by country_k country_j year (`v'), sort: assert `v' == `v'[1] | missing(`v')
          }

          * combine each country-pair-year group into one observation, taking the
          * first non-missing value of each variable
          collapse (firstnm) `vbles', by(country_k country_j year)
          With this approach, if a variable has a non-missing value in any of the country pair year observations, you will retain it, even if the only non-missing value happens to fall in an observation that has a lot of other variables with missing values.
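          As a toy illustration of that behavior (the data below are made up), two partial observations for the same country pair and year are combined into one complete observation:
          Code:
          clear
          input str3(country_k country_j) int year double(x1 x2)
          "AAA" "BBB" 1990 5 .
          "AAA" "BBB" 1990 . 7
          end
          collapse (firstnm) x1 x2, by(country_k country_j year)
          list    // one observation remains: x1 = 5, x2 = 7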
          Last edited by Clyde Schechter; 21 Mar 2023, 09:33.



          • #35
            Thank you Clyde Schechter for explaining it so well. This was really helpful.

            Until now I was using a more ad hoc approach to address the issue at hand, running the following code:

            Code:
            * flag duplicates: dup = 0 for unique rows, dup = 1, 2, ... within duplicate groups
            sort year country_k country_j
            quietly by year country_k country_j: gen dup = cond(_N==1,0,_n)

            * count missing values across x1-f4 and keep the rows attaining the group minimum
            egen m1 = rowmiss(x1 x2 x3 x4 y1 y2 y3 f1 f2 f3 f4)

            bysort year country_k country_j: egen m2 = min(m1)
            sort year country_k country_j dup
            keep if m1==m2
            This dropped the duplicates by retaining the rows with the minimum number of missing values on the variables x1-f4. However, duplicates still existed for some country pairs in some years, with missing values on all the variables/columns. Therefore, I had to drop those duplicates again by running
            Code:
            drop m1 m2 dup
            sort year country_k country_j
            quietly by year country_k country_j: gen dup = cond(_N==1,0,_n)

            drop if dup > 1

            The code you just sent achieves the same thing as my ad hoc approach in a more systematic and clean way, and it saved me a lot of time. I am very thankful to you.



            Thanks and regards,
            (Ridwan)



            • #36
              Hi Ridwan,

              The Stata message "variables year country_k country_j do not uniquely identify observations in the using data" commonly happens when your data are not uniquely identified because of duplicates. You should drop the duplicates to proceed with the merging.

              Try the steps below to find the duplicates in your data set:

              Code:
              duplicates examples year country_k country_j
              duplicates list year country_k country_j
              duplicates tag year country_k country_j, gen(duple)   // this generates the new variable "duple"
              tab duple

              These commands analyse the duplicates within your data set (on the variables you specify).

              Then use this command to solve the problem:

              Code:
              duplicates drop year country_k country_j, force   // this deletes the duplicate values, so observations drop


              Now you can attempt the -merge m:1- or -merge 1:m-.



              • #37
                #36 gives good advice, until it reaches -duplicates drop year country_k country_j, force-. There are two reasons you shouldn't use the -force- option, at least not yet:
                1. There may be observations having the same values of country_k country_j and year that have inconsistent, conflicting values for other variables. If you use -force-, Stata will pick one of those observations (at random, and not reproducibly) to keep and discard the rest. But you need to keep the one that is correct! (If there is one. Maybe you need to combine them in some way. Or maybe all of them are just wrong and then you need to delete all of them without leaving one behind.)
                2. Even if all of the duplicate observations on country_k country_j and year are entirely consistent on all variables, you have to wonder why they are there in the first place. It usually means that there was an error in the creation of the data set. So you should trace back the creation of the data set and find the error that led to this condition. Where there is one mistake, you may find others as well. There is no point in analyzing incorrect data.
                So, after you have carefully inspected the offending observations and corrected the data set so that each combination of country_k country_j and year is associated with a single correct set of values for all variables, you will have no need to drop anything. In short, Stata is telling you that your data set is incorrect. Fix the data: don't just bludgeon it and blunder on.
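                For that inspection, a sketch along these lines may help (the varlist follows the example data posted earlier in the thread):
                Code:
                * flag and display the offending observations for visual inspection
                duplicates tag country_k country_j year, gen(flag)
                sort country_k country_j year
                list country_k country_j year x1-f4 if flag > 0, sepby(country_k country_j year)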

