Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • duplicate observations by group following a pattern

    Hello,

    I have a dataset of 1000 firms (id_firm) over 10 years (year). Each year, a firm produces a single 2-digit output (idmain2_out) using several 2-digit inputs (id2_in), so that each observation is described by id_firm year idmain2_out id2_in as is shown below:

    Code:
    id_firm year idmain2_out id2_in
    1 1990 44 01
    1 1990 44 02
    1 1991 50 20
    ...
    2 1990 28 33
    ...
    3 1990 44 01
    3 1990 44 06
    It follows that every year, to produce each output (year idmain2_out), a group of inputs (id2_in) is used; and firms that produce an output use some of those inputs and some others not. For example, in 1990 idmain2_out={44} was produced using id2_in = {01, 02, 06}; firm 1 used id2_in = {01, 02} to produce idmain2_out={44}, while firm 3 used id2_in={02, 06} to do it.

    By id_firm year idmain2_out, I want to create(expand) new observations so that it shows all the inputs that were used by all firms that produced that output that year, along with an indicator variable (use) which takes the value 1 if the firm used that input and 0 if the firm didn't use that input. The new dataset should look the following way

    Code:
    id_firm year idmain2_out id2_in use
    1 1990 44 01 1
    1 1990 44 02 1
    1 1990 44 06 0
    1 1991 50 20 1
    ...
    2 1990 28 33 1
    ...
    3 1990 44 01 1
    3 1990 44 02 0
    3 1990 44 06 1
    I haven't figured out a way to do this so I appreciate your help very much. (I also want to know how to create appropriate lists of observations in the forum )

    Best,
    Gustavo

  • #2
    Code:
    clear*
    
    input id_firm year idmain2_out id2_in
    1 1990 44 01
    1 1990 44 02
    1 1991 50 20
    2 1990 28 33
    3 1990 44 01
    3 1990 44 06
    end
    format id2_in %02.0f
    
    list, noobs clean
    
    levelsof id2_in, local(inputs)
    foreach inp of local inputs {
     gen byte use`inp' = (id2_in == `inp')
    }
    collapse (max) use*, by(id_firm year idmain2_out)
    reshape long use, i(id_firm year idmain2_out) j(id2_in)
    format id2_in %02.0f
    
    list, noobs clean

    Comment


    • #3
      One way is

      Code:
      clear
      set more off
      
      input ///
      id_firm year idmain2_out id2_in
      1 1990 44 01
      1 1990 44 02
      1 1991 50 20
      2 1990 28 33
      3 1990 44 01
      3 1990 44 06
      end
      
      gen used = 1
      
      tempfile orig
      save "`orig'"
      
      list, sepby(id_firm)
      
      *----- what you want -----
      
      duplicates drop year idmain2_out id2_in, force
      drop id_firm used
      
      tempfile outin
      save "`outin'"
      
      // list all inputs used each year, for each output
      sort year idmain2_out id2_in
      list, sepby(year idmain2_out)
      
      clear
      use "`orig'"
      
      // keep one obs per (firm year output)
      bysort id_firm year : keep if _n == 1
      drop id2_in used
      
      // bring in inputs
      joinby year idmain2_out using "`outin'"  
      
      // bring in usage
      merge 1:1 id_firm year id2_in using "`orig'", nogen
      replace used = 0 if missing(used)
      
      // final result
      sort id_firm year id2_in
      list, sepby(id_firm year)

      You should:

      1. Read the FAQ carefully.

      2. "Say exactly what you typed and exactly what Stata typed (or did) in response. N.B. exactly!"

      3. Describe your dataset. Use list to list data when you are doing so. Use input to type in your own dataset fragment that others can experiment with.

      4. Use the advanced editing options to appropriately format quotes, data, code and Stata output. The advanced options can be toggled on/off using the A button in the top right corner of the text editor.

      Comment


      • #4
        Originally posted by Clyde Schechter View Post
        Code:
        clear*
        
        input id_firm year idmain2_out id2_in
        1 1990 44 01
        1 1990 44 02
        1 1991 50 20
        2 1990 28 33
        3 1990 44 01
        3 1990 44 06
        end
        format id2_in %02.0f
        
        list, noobs clean
        
        levelsof id2_in, local(inputs)
        foreach inp of local inputs {
        gen byte use`inp' = (id2_in == `inp')
        }
        collapse (max) use*, by(id_firm year idmain2_out)
        reshape long use, i(id_firm year idmain2_out) j(id2_in)
        format id2_in %02.0f
        
        list, noobs clean

        Thank you for your help Clyde. In fact, I wanted to reshape wide the database after adding the observations. You did it the other way round. However, I still need to remove the inputs that no firm used to produce the output in a particular year. As result of the code, the observation {id_firm, year, idmain2_out, id2_in, use} = {1,1990,44,20,0} should not exist or, alternatively, the variable use should display a missing value because in that year (1990) no firm used id2_in=20 to produce dimain2_out=44.

        Comment


        • #5
          So I think you can clean up the observations you don't want with this code immediately after what I showed you above.

          Code:
          egen ever_used = max(use), by(id2_in year idmain2_out)
          drop if ever_used == 0
          drop ever_used
          
          list, noobs clean
          And, of course, you can -reshape wide-, if you like, after that. Just remember, though, that the vast majority of Stata commands work much more felicitously with long layouts than with wide. So think about it carefully.

          Comment


          • #6
            clyde sir what is the command of finding out duplicate data??

            Comment


            • #7
              what is the command of finding out duplicate data?
              Either one of them :
              Code:
              duplicates report varlist
              duplicates list varlist
              duplicates tag varlist
              See help duplicates for further details.

              Comment

              Working...
              X