Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Dropping observations that do not appear annually or for every id in a panel dataset

    Hello Everyone! I have a panel dataset that consists of about 100 exporting countries (id) and the products (product_cat) they exported over a period from 1995 to 2021 alongside the values of the exports (value). At the most disaggregated level, the panel has a total of over 5 million observations with each country exporting around 5000 products. What I hope to do is to have a final dataset wherein I KEEP observations that capture the products exported by every country in the panel from 1995 to 2018 and DROP the ones that are not exported by all countries in the same time period. With the given example below ( a quasi illustration of what my data currently looks like), I would like to keep product 10111 because it appears in both 2015 and 2016 and is exported by all sampled countries (4 5 and 6). However, I would like to drop 10519 because it only appears in the export baskets of country 4 and 5 but not 6. In the same vein, I would drop product 60310 because it does not appear for every recorded year. Is there a code that could do this without me manually telling Stata to drop category 60310 or the others mentioned. Any input would be appreciated and further clarity can be given. Thank you.
    Code:
    * Example generated by -dataex-. To install: ssc install dataex
    clear
    input byte id int year long(product_cat value)
    4 2015 10111      1
    4 2015 10519     45
    4 2015 10600      9
    4 2015 30110    568
    4 2015 30410      1
    4 2015 30490 786678
    4 2015 30613     77
    4 2015 30749     68
    4 2015 40630     26
    5 2015 10111      99
    5 2015 10519     88
    5 2015 10600      9
    5 2015 30110      3
    5 2015 30410     90
    5 2015 30490      1
    5 2015 30613     77
    5 2015 30749     66
    6 2015 10111      1677
    6 2015 10519      2
    6 2015 10600      9
    6 2015 30110    456
    6 2015 30613     77
    6 2015 30749     68
    6 2015 40630    657
    6 2015 60310      2
    4 2016 10111      1
    4 2016 10600      9
    4 2016 30110    568
    4 2016 30410      1
    4 2016 30490 786678
    4 2016 30613     77
    4 2016 30749     68
    4 2016 40630     26
    5 2016 10111      1
    5 2016 10600      9
    5 2016 30110      3
    5 2016 30490      1
    5 2016 30613     77
    5 2016 30749     66
    6 2016 10111      1
    6 2016 10519      2
    6 2016 10600      9
    6 2016 30110    456
    6 2016 30613     77
    6 2016 40630    657
    6 2016 60310      2
    end
    label var id "Country ID " 
    label var product_cat "HS6 digit product category" 
    label var value "export value in USD "

  • #2

    Code:
    distinct id
    local n_countries `r(ndistinct)'
    
    distinct year
    local n_years `r(ndistinct)'
    
    isid product id year, sort
    
    by product (id year): keep if _N == `n_countries'*`n_years'
    -distinct- is written by Nick Cox, and it is available from SSC.

    Comment


    • #3
      Thank you very much. This indeed was helpful.

      Comment


      • #4
        distinct was written by Gary Longton and myself and is maintained through the Stata Journal, except that the latest version is available in public only here at the time of writing.

        Code:
        *! 1.4.3 NJC 3 September 2023        
        * 1.4.2 NJC 15 August 2022          
        * 1.4.1 NJC 10 October 2020          
        * 1.4.0 NJC 24 August 2020          
        * 1.3.0 NJC/GML 22 July 2015         
        * 1.2.1 NJC 1 March 2012         
        * 1.2.0 NJC 15 September 2008
        * 1.1.1 GML NJC 26 February 2002
        * 1.1.0 GML NJC 25 February 2002
        * 1.0.0 21 November 2001
        program distinct, rclass sortpreserve byable(recall)
            version 8.0
            syntax [varlist] [if] [in] [, MISSing Abbrev(int 32) Joint ///
            MINimum(int 1) MAXimum(int -1) Width(int -1)               ///
            SORT SORT2(str asis) ]
        
            // sort undocumented 
        
            // options for minimum and maximum 
            if `maximum' == -1 local maximum . 
        
            if `minimum' > `maximum' { 
                local swap `minimum' 
                local minimum `maximum' 
                local maximum `swap' 
                di as txt "min(`maximum') max(`minimum') interpreted as min(`minimum') max(`maximum')" 
            }
        
            // column widths 
            if "`joint'" == "" { 
                local vnl = 1 
                foreach v of local varlist { 
                    local vnl = max(`vnl', length(abbrev("`v'", `abbrev')))
                }
                local w1 = `vnl' + 2 
            }
            else local w1 = 11 
        
            // width() option undocumented 
            if `width' > 0 { 
                local w2 = max(11, `width') 
            }
            else { 
                local w2 = 11 
                if _N > 1e7 local w2 = 5 + floor(log10(_N)) 
            } 
        
            // calculate body of table 
            if "`joint'" != "" { 
                if "`missing'" != "" marksample touse, novarlist 
                else marksample touse, strok 
        
                quietly count if `touse' 
                if r(N) == 0 error 2000 
         
                tempvar vals 
                bysort `touse' `varlist': gen byte `vals' = (_n == 1) * `touse'
                su `vals' if `touse', meanonly 
                
                local N = r(N) 
                local ndistinct = r(sum) 
        
                if r(sum) >= `minimum' & r(sum) <= `maximum' {
                    mata : mout = ("(jointly)", "`r(N)'", "`r(sum)'")   
                }
                else mata : mout = J(0, 0, .) 
            } 
        
            else { 
        
                if "`sort'" != "" local sort distinct 
                if "`sort2'" != "" local sort `sort' `sort2' 
        
                foreach opt in total distinct alpha descending { 
                    local `opt' 0 
                } 
        
                foreach word of local sort { 
                    local word = lower("`word'") 
                    local length = length("`word'")
        
                    if "`word'" == "d" { 
                        di as err "sort() ambiguous: distinct or descending?"
                        exit 198 
                    } 
         
                    if "`word'" == substr("total", 1, `length') {
                        local total 1 
                    } 
                    else if "`word'" == substr("distinct", 1, `length') { 
                        local distinct 1 
                    } 
                    else if "`word'" == substr("alpha", 1, `length') { 
                        local alpha 1 
                    } 
                    else if "`word'" == substr("descending", 1, `length') {  
                        local descending 1 
                    }
                    else { 
                        di as err "sort() option invalid?" 
                        exit 198
                    }
                }
        
                local tosort = `alpha' + `total' + `distinct' 
                if `tosort' > 1  | (`tosort' == 0 & `descending') { 
                    di as err "sort request invalid"
                    exit 198 
                }
                         
                mata : mout = J(0, 3, "") 
                foreach v of local varlist { 
                    tempvar touse vals 
                    mark `touse' `if' `in' 
                    if "`missing'" == "" markout `touse' `v', strok 
                    bys `touse' `v' : gen byte `vals' = (_n == 1) * `touse' 
                    su `vals' if `touse', meanonly
        
                    // last such assignments are returned 
                    local N = r(N) 
                    local ndistinct = r(sum) 
        
                    local vname = abbrev("`v'", `abbrev')  
                    if r(sum) >= `minimum' & r(sum) <= `maximum' { 
                        mata : mout = mout \ ("`vname'", "`r(N)'", "`r(sum)'") 
                    }
                    drop `touse' `nvals' 
                } 
        
                if `tosort' { 
                    local updn = cond(`descending', -1, 1) 
        
                    if `distinct' { 
                        mata : ndistinct = strtoreal(mout[, 3]) 
                        mata : mout = mout[order(ndistinct, `updn'),] 
                    } 
                    else if `total' { 
                        mata: ntotal = strtoreal(mout[, 2]) 
                        mata : mout = mout[order(ntotal, `updn'),] 
                    } 
                    else mata: mout = mout[order(mout, `updn'), ]
                }  
            }
            
            mata : st_local("I", strofreal(rows(mout))) 
        
            // exit if nothing to show 
            if `I' == 0 {
                noi display as txt ///
                "note: no distinct values satisfy specification" 
                return scalar N = 0 
                return scalar ndistinct = 0 
                exit 0 
            } 
         
            // show top of table 
            di 
            tempname mytab 
            .`mytab' = ._tab.new, col(3) lmargin(0)
               .`mytab'.width `w1'  | `w2' `w2' 
            .`mytab'.sep, top
              .`mytab'.titles " " "total"  "distinct"  
            .`mytab'.sep
        
            // show body of table 
            forval i = 1/`I' {
                forval j = 1/3 {  
                    mata: st_local("col`j'", mout[`i',`j'])
                }
                .`mytab'.strcolor "green" "yellow" "yellow" 
                .`mytab'.row "`col1'" "`col2'" "`col3'" 
            } 
        
            * commented out 1.4.3 
            * this was to tidy up, but it frustrated some uses 
            * mata mata clear 
        
            // show bottom of table 
            .`mytab'.sep, bottom
        
            return scalar N = `N'

        Comment

        Working...
        X