Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Replacing missing values with criterion

    Hello everyone,
    I am working with panel data, and I would like to ask you again for help.
    I would like to replace missing values in master data with values from using data, based on time criterion.

    Master file Merging.dta
    Code:
    clear
    input double(crsp_fundno crsp_portno) float mofd
    1788 . 405
    1788 . 406
    1788 . 407
    1788 . 408
    1788 . 409
    1788 . 410
    1788 . 411
    1788 . 412
    1788 . 413
    1788 . 414
    1788 . 415
    1788 . 416
    1788 . 417
    1788 . 418
    1788 . 419
    1788 . 420
    1788 . 421
    1788 . 422
    1788 . 423
    1788 . 424
    1788 . 425
    .... . ...
    1788 . 600
    1788 . 601
    1788 . 602
    1788 . 603
    1788 . 604
    1788 . 605
    1788 . 606
    end
    Using file FundnotoPortno.dta

    Code:
    clear
    input double(crsp_fundno crsp_portno) float(begmofd endmofd)
    1788 1009555 582 604
    1788 1023119 605 740
    2118 1028804 611 740
    end
    Begmofd - starting month
    Endmofd - ending month
    As in the example above of the using file, the data shows that a different crsp_portno can apply to the same crsp_fundno, but over different time period. Some crsp_fundnos can have only one range.
    The example of the master file is an extreme one, as the the earliest time record is 405, which is outside of the 582-604 month range. Although 405 is lower than 582, I still would like to replace those missing crsp_portno values from master file with the crsp_portno from the using file, using the lower range.
    May I kindly ask you to tell me how should I perform it? I have never before run such a command, and therefore it seems quite complex for me.

    Thank you very much for any help or advice!

    Best,
    RafaƂ

  • #2
    Code:
    clear
    input double(crsp_fundno crsp_portno) float mofd
    1788 . 405
    1788 . 406
    1788 . 407
    1788 . 408
    1788 . 409
    1788 . 410
    1788 . 411
    1788 . 412
    1788 . 413
    1788 . 414
    1788 . 415
    1788 . 416
    1788 . 417
    1788 . 418
    1788 . 419
    1788 . 420
    1788 . 421
    1788 . 422
    1788 . 423
    1788 . 424
    1788 . 425
    1788 . 600
    1788 . 601
    1788 . 602
    1788 . 603
    1788 . 604
    1788 . 605
    1788 . 606
    end
    tempfile merging
    save `merging'
    
    clear
    input double(crsp_fundno crsp_portno) float(begmofd endmofd)
    1788 1009555 582 604
    1788 1023119 605 740
    2118 1028804 611 740
    end
    tempfile fundnotoportno
    save `fundnotoportno'
    
    use `fundnotoportno', clear
    by crsp_fundno (begmofd), sort: assert begmofd[_n+1] > endmofd
    rename crsp_portno replacement
    by crsp_fundno (begmofd), sort: replace begmofd = . if _n == 1
    rangejoin mofd begmofd endmofd using `merging', by(crsp_fundno)
    drop if missing(mofd)
    replace crsp_portno = replacement if missing(crsp_portno)
    drop replacement begmofd endmofd
    -rangejoin- is written by Robert Picard. To use it, you must also have -rangestat-, by Robert Picard, Nick Cox, and Roberto Ferrer. Both are available from SSC.

    Note: As requested, if mofd is before the earliest begmofd in Fundnotoportno.dta, it will be matched with the first crsp_portno for that firm. You did not say what you want to do if mofd is after the latest endmofd. The above code will make no match at all for such observations. If, however, you wish to use the last crsp_portno for that firm, you can modify the code by inserting:
    Code:
    by crsp_fundno (begmofd): replace endmofd = . if _n == _N
    immediately before the -rangejoin- command.

    Comment


    • #3
      Thank you Clyde for your reply. I am more than grateful. Yet, I have some problem with first line...

      When I run the
      by crsp_fundno (begmofd), sort: assert begmofd[_n+1] > endmofd
      I got answer
      10 contradictions in 75,313 observations
      assertion is false
      r(9);
      75,313 means all observations.

      Then, it seems that for 10 fundnos the time ranges could be in wrong order.
      Do you know how can I find these contradictions?

      Comment


      • #4
        Ok I created a dummy and found those that were false. Sometimes it happens that the endmofd at range 1 is the same as the begmofd at the range 2. Now I have other problem - when I run rangejoin mofd begmofd endmofd using `merging', by(crsp_fundno) , invalid file specification warning r(198); pops out. How should I deal with this?
        Last edited by Rafal Krol; 22 Dec 2021, 08:34.

        Comment


        • #5
          Clyde Schechter it is done. I made a mistake in the process.

          Everything worked, but the problem now is that the created file has only crsp_fundo's that appeared in the "fundnotoportno.dta" and not the crsp_fundno's that were originally in the "merging.dta" before running the code.

          The thing is that in the "merging.dta" file there were crsp_fundos that already had their portnos and crsp_fundos with missing portnos. I have to use the fundnotoportno.dta to fill in the crsp_fundos with missing values.
          What would be the best way to merge the data now? Merge by crsp_fundno mofd using only the replacement variable and then run replace crsp_portno = replacement if missing(crsp_portno) ? Or is there other convenient method using rangejoin?

          Comment


          • #6
            The thing is that in the "merging.dta" file there were crsp_fundos that already had their portnos and crsp_fundos with missing portnos. I have to use the fundnotoportno.dta to fill in the crsp_fundos with missing values.
            What would be the best way to merge the data now? Merge by crsp_fundno mofd using only the replacement variable and then run replace crsp_portno = replacement if missing(crsp_portno) ? Or is there other convenient method using rangejoin?
            I don't understand what is happening. The very command -replace crsp_portno = replacement if missing(crsp_portno)-, is, in fact, already part of the code I proposed in #2, so I don't understand how you ended up with originally existing crsp_portno values being overwritten. I think you should post the exact code you are using along with a new example of your data (using -dataex-, of course) that illustrates the problem and I'll try to troubleshoot it.

            Comment


            • #7
              Yes, sure. Apologies for the problem.
              fundnotoportno.dta - file mapping available fundnos to portnos. Some fundnos keep same portno over entire life time, some change the portno over time.

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input double(crsp_fundno crsp_portno) float(begmofd endmofd)
                72 1005195 582 740
                73 1005195 582 740
               105 1000880 602 604
               105 1022431 605 740
               139 1003340 582 740
               184 1003387 582 740
               198 1021159 614 740
               883 1009701 582 604
               883 1028709 605 740
              1047 1018208 582 740
              1552 1024495 605 740
              1586 1018140 582 740
              1788 1009555 582 604
              1788 1023119 605 740
              1966 1025414 605 740
              1966 1001693 582 604
              1981 1001039 582 604
              1981 1022988 605 740
              1998 1024317 605 740
              2118 1028804 611 740
              2298 1005324 582 604
              2298 1022432 605 740
              2308 1003268 582 740
              2413 1003657 602 604
              2413 1023261 605 740
              2420 1022432 644 740
              2424 1003268 582 740
              2611 1005653 582 740
              2621 1003387 582 740
              2698 1004114 582 740
              2700 1004697 569 604
              2700 1027806 605 740
              2701 1027807 605 740
              2702 1027808 605 740
              2703 1021446 605 740
              2711 1026815 605 740
              2711 1004575 599 604
              2739 1018240 597 740
              2745 1002776 545 604
              2745 1027459 605 740
              2746 1027458 605 740
              2746 1002775 545 604
              2747 1001992 540 604
              2747 1028185 605 740
              2748 1001248 534 604
              2748 1023966 605 740
              2751 1005209 522 740
              2752 1004496 526 740
              2754 1001992 522 604
              2754 1028185 605 740
              2755 1001991 522 604
              2755 1021185 605 740
              2756 1004105 522 740
              2757 1000018 522 604
              2757 1021157 605 740
              2758 1021157 605 740
              2758 1000018 522 604
              2759 1021072 605 740
              2759 1001247 522 604
              2760 1001247 522 604
              2760 1021072 605 740
              2761 1000396 522 604
              2761 1024488 605 740
              2762 1024488 605 740
              2762 1000396 522 604
              2763 1023966 605 740
              2763 1001248 522 604
              2764 1021063 605 740
              2765 1021063 605 740
              2766 1018103 729 740
              2767 1005366 522 740
              2768 1001250 522 740
              2769 1010284 579 740
              2770 1010284 579 740
              2771 1000396 522 604
              2771 1024488 605 740
              2772 1010284 579 740
              2774 1005209 522 740
              2775 1003875 522 740
              2776 1003696 522 740
              2777 1026442 605 740
              2777 1002202 522 604
              2781 1017669 720 740
              2785 1010378 579 604
              2785 1021132 605 740
              2787 1017671 720 740
              2788 1017671 720 740
              2795 1003327 522 740
              2801 1009520 579 604
              2801 1021270 605 740
              2804 1010378 579 604
              2804 1021132 605 740
              2805 1021270 605 740
              2805 1009520 579 604
              2806 1004647 551 740
              2807 1002202 560 604
              2807 1026442 605 740
              2808 1002776 563 604
              2808 1027459 605 740
              2809 1003210 563 740
              end
              Merging.dta - main data sample that needs missing values to be replaced with the values from the fundnotoportno.dta.. Example consists of fundnos (1,2) that are not in the fundnotoportno.dta, fundnos that are in the sample but have their portno assigned before (72,73), and a fundno that changes its portno over time (1788)

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input double(crsp_fundno crsp_portno) float mofd
              1 . 446
              1 . 447
              1 . 448
              2 . 446
              2 . 447
              2 . 448
              2 . 449
              2 . 450
              2 . 451
              2 . 452
              2 . 453
              3 . 457
              3 . 458
              3 . 459
              3 . 460
              3 . 461
              3 . 462
              72 1005195 416
              72 1005195 417
              72 1005195 418
              72 1005195 419
              72 1005195 420
              72 1005195 421
              72 1005195 422
              72 1005195 423
              72 1005195 424
              72 1005195 425
              72 1005195 426
              72 1005195 427
              72 1005195 428
              72 1005195 429
              72 1005195 430
              72 1005195 431
              72 1005195 432
              72 1005195 433
              72 1005195 434
              72 1005195 435
              72 1005195 436
              72 1005195 437
              72 1005195 438
              72 1005195 439
              72 1005195 440
              72 1005195 441
              72 1005195 442
              72 1005195 443
              73 1005195 441
              73 1005195 442
              73 1005195 443
              73 1005195 444
              73 1005195 445
              73 1005195 446
              73 1005195 447
              73 1005195 448
              73 1005195 449
              73 1005195 450
              73 1005195 451
              73 1005195 452
              73 1005195 453
              73 1005195 454
              73 1005195 455
              73 1005195 456
              73 1005195 457
              73 1005195 458
              1788 . 405
              1788 . 406
              1788 . 407
              1788 . 408
              1788 . 409
              1788 . 410
              1788 . 411
              1788 . 412
              1788 . 413
              1788 . 414
              1788 . 415
              1788 . 416
              1788 . 417
              1788 . 418
              1788 . 419
              1788 . 420
              1788 . 421
              1788 . 422
              1788 . 423
              1788 . 424
              1788 . 425
              1788 . 600
              1788 . 601
              1788 . 602
              1788 . 603
              1788 . 604
              1788 . 605
              1788 . 606
              end
              Code used
              Code:
              use "C:\Users\User\Desktop\Data\merging.dta"
              tempfile merging
              save `merging'
              use "C:\Users\User\Desktop\Data\FundnoToPortno.dta"
              tempfile fundnotoportno
              save `fundnotoportno'
              use `fundnotoportno', clear
              by crsp_fundno (begmofd), sort: assert begmofd[_n+1] >= endmofd
              rename crsp_portno replacement
              by crsp_fundno (begmofd), sort: replace begmofd = . if _n == 1
              by crsp_fundno (begmofd), sort: replace endmofd = . if _n == _N
              ssc install rangejoin
              rangejoin mofd begmofd endmofd using `merging', by(crsp_fundno)
              drop if missing(mofd)
              replace crsp_portno = replacement if missing(crsp_portno)
              drop replacement begmofd endmofd
              end
              This is extract from the final file after running the code (from the first observation in the sample, thus fundos from 1 to 71, inclusive, are missing)

              Code:
              * Example generated by -dataex-. To install: ssc install dataex
              clear
              input double(crsp_fundno crsp_portno) float mofd
              72 1005195 416
              72 1005195 417
              72 1005195 418
              72 1005195 419
              72 1005195 420
              72 1005195 421
              72 1005195 422
              72 1005195 423
              72 1005195 424
              72 1005195 425
              72 1005195 426
              72 1005195 427
              72 1005195 428
              72 1005195 429
              72 1005195 430
              72 1005195 431
              72 1005195 432
              72 1005195 433
              72 1005195 434
              72 1005195 435
              72 1005195 436
              72 1005195 437
              72 1005195 438
              72 1005195 439
              72 1005195 440
              72 1005195 441
              72 1005195 442
              72 1005195 443
              72 1005195 444
              72 1005195 445
              72 1005195 446
              72 1005195 447
              72 1005195 448
              72 1005195 449
              72 1005195 450
              72 1005195 451
              72 1005195 452
              72 1005195 453
              72 1005195 454
              72 1005195 455
              72 1005195 456
              72 1005195 457
              72 1005195 458
              72 1005195 459
              72 1005195 460
              72 1005195 461
              72 1005195 462
              72 1005195 463
              72 1005195 464
              72 1005195 465
              72 1005195 466
              72 1005195 467
              72 1005195 468
              72 1005195 469
              72 1005195 470
              72 1005195 471
              72 1005195 472
              72 1005195 473
              72 1005195 474
              72 1005195 475
              72 1005195 476
              72 1005195 477
              72 1005195 478
              72 1005195 479
              72 1005195 480
              72 1005195 481
              72 1005195 482
              72 1005195 483
              72 1005195 484
              73 1005195 441
              73 1005195 442
              73 1005195 443
              73 1005195 444
              73 1005195 445
              73 1005195 446
              73 1005195 447
              73 1005195 448
              73 1005195 449
              73 1005195 450
              73 1005195 451
              73 1005195 452
              73 1005195 453
              73 1005195 454
              73 1005195 455
              73 1005195 456
              73 1005195 457
              73 1005195 458
              73 1005195 459
              73 1005195 460
              73 1005195 461
              73 1005195 462
              73 1005195 463
              73 1005195 464
              73 1005195 465
              73 1005195 466
              73 1005195 467
              73 1005195 468
              73 1005195 469
              73 1005195 470
              73 1005195 471
              end
              I hope it illustrates the problem better now. Thank you for help Sir.



              Comment


              • #8
                Well, I see what happened. It turns out we can't use -rangejoin- for this. But there is an easy workaround.

                Code:
                use `fundnotoportno', clear
                by crsp_fundno (begmofd), sort: assert begmofd[_n+1] >= endmofd
                rename crsp_portno replacement
                by crsp_fundno (begmofd), sort: replace begmofd = . if _n == 1
                by crsp_fundno (begmofd), sort: replace endmofd = . if _n == _N
                joinby crsp_fundno using `merging', unmatched(using)
                keep if inrange(mofd, begmofd, endmofd)
                replace crsp_portno = replacement if missing(crsp_portno)
                drop replacement begmofd endmofd
                Now the observations with crsp_fundno < 73 are preserved. They do not have any matching crsp_portno, because the merging file doesn't have this information.

                I want to point out a problem with your revision -by crsp_fundno (begmofd), sort: assert begmofd[_n+1] >= endmofd-. The original had >, not >=. And that's important. If you have observations where begmofd[_n+1] == endmofd, then the data are invalid. Because if the date in the file merging happens to be that value, then it is ambiguous which observation in fundnotoportno to pair it with: there are two conflicting values in play. The code above will actually end up with duplicate observations, one with each of the conflicting values. But it means you have not identified a solution.

                So you need to clean up that data set. If fundnotoportno has observations like that, you have to resolve the contradiction between them somehow. Otherwise the problem is ill-posed and has no consistent solution.

                Comment


                • #9
                  Yes, now the results are perfect! Thank you Sir!

                  In regards to -by crsp_fundno (begmofd), sort: assert begmofd[_n+1] >= endmofd-, I found the solution - 10 observations had overlapping ranges - thus I manually corrected them, and then run the code. In fact, there was extra "=" after ">" , but it changed nothing as none of the ranges overlapped after correction. Thus, if changes nothing - I think should be deleted to avoid such situations. Sorry for being too sloppy...

                  Comment


                  • #10
                    Hello once again.
                    At first I would like to thank you Clyde for your commitment in helping me.
                    I started to work with my project again (posponed due to health issues), and I noticed that the code doesn't work the way I wanted, again because of my faulty explanation.
                    The problem is that in the first message I wrote "I still would like to replace these missing crsp_portno values from master file with the crsp_portno from the using file, using the lower range" - relating to the extreme example provided in first message - AND forgot to add "only until the endmofd from the first range, and then apply right crsp_portno from the second range onwards according to right ranges. For example, this is what happens now.

                    This is the range
                    Click image for larger version

Name:	1.png
Views:	1
Size:	2.7 KB
ID:	1666096

                    And this is the result
                    Click image for larger version

Name:	2.png
Views:	1
Size:	6.2 KB
ID:	1666097

                    From mofd of 605 the right crsp_portno is 1027806 and currently it is the same as in the first range i.e. 1004697.

                    This is the code I used:
                    Code:
                    tempfile merging
                    save `merging'
                    use "$datafolder1\FundnoToPortno.dta"
                    tempfile fundnotoportno
                    save `fundnotoportno'
                    use `fundnotoportno', clear
                    by crsp_fundno (begmofd), sort: assert begmofd[_n+1] > endmofd
                    rename crsp_portno replacement
                    by crsp_fundno (begmofd), sort: replace begmofd = . if _n == 1
                    by crsp_fundno (begmofd), sort: replace endmofd = . if _n == _N
                    joinby crsp_fundno using `merging', unmatched(using)
                    keep if inrange(mofd, begmofd, endmofd)
                    replace crsp_portno = replacement if missing(crsp_portno)
                    drop replacement begmofd endmofd _merge

                    I will be more than grateful for any help.
                    Thank you and my apologies

                    Comment


                    • #11
                      The problem you are having comes from the lines
                      Code:
                      by crsp_fundno (begmofd), sort: replace begmofd = . if _n == 1
                      by crsp_fundno (begmofd), sort: replace endmofd = . if _n == _N
                      The difficulty is that the first one changes the value of begmofd to . in the first observation. Then when the second one sorts on begmofd (within crsp_fundno), what was originally the first observation is now sorted to last (because in every context except -inrange()- or -rangejoin-) missing value is larger than any real number. It, therefore, now also get endmofd replaced with missing value. Consequently, your crosswalk between crsp_fundno and crsp_portno looks like this:

                      Code:
                      . list, noobs clean
                      
                          crsp_f~o   replac~t   begmofd   endmofd  
                              1788    1023119       605       740  
                              1788    1009555         .         .  
                              2118    1028804         .         .  
                              2700    1027806       605       740  
                              2700    1004697         .         .
                      which is clearly not what you want.

                      The following code fixes this problem. (I use slightly different names for somethings, but this is of no importance here.)
                      Code:
                      clear
                      input double(crsp_fundno crsp_portno) float mofd
                      1788 . 405
                      1788 . 406
                      1788 . 407
                      1788 . 408
                      1788 . 409
                      1788 . 410
                      1788 . 411
                      1788 . 412
                      1788 . 413
                      1788 . 414
                      1788 . 415
                      1788 . 416
                      1788 . 417
                      1788 . 418
                      1788 . 419
                      1788 . 420
                      1788 . 421
                      1788 . 422
                      1788 . 423
                      1788 . 424
                      1788 . 425
                      1788 . 600
                      1788 . 601
                      1788 . 602
                      1788 . 603
                      1788 . 604
                      1788 . 605
                      1788 . 606
                      2700 . 596
                      2700 . 597
                      2700 . 598
                      2700 . 599
                      2700 . 600
                      2700 . 601
                      2700 . 602
                      2700 . 603
                      2700 . 604
                      2700 . 605
                      2700 . 606
                      2700 . 607
                      end
                      tempfile master
                      save `master'
                      
                      clear
                      input double(crsp_fundno crsp_portno) float(begmofd endmofd)
                      1788 1009555 582 604
                      1788 1023119 605 740
                      2118 1028804 611 740
                      2700 1004697 569 604
                      2700 1027806 605 740
                      end
                      tempfile fundno_to_portno
                      save `fundno_to_portno'
                      
                      use `fundno_to_portno', clear
                      by crsp_fundno (begmofd), sort: assert begmofd[_n+1] > endmofd
                      by crsp_fundno (begmofd): replace endmofd = . if _n == _N
                      by crsp_fundno (begmofd): replace begmofd = . if _n == 1
                      rename crsp_portno replacement
                      tempfile crosswalk
                      save `crosswalk'
                      
                      use `master', clear
                      joinby crsp_fundno using `crosswalk', unmatched(master)
                      keep if inrange(mofd, begmofd, endmofd)
                      replace crsp_portno = replacement if missing(crsp_portno)
                      and the output for crsp_fudno 2700 is just what you say you want:
                      Code:
                      . list crsp_fundno crsp_portno mofd if crsp_fundno == 2700, noobs clean
                      
                          crsp_f~o   crsp_p~o   mofd  
                              2700    1004697    596  
                              2700    1004697    597  
                              2700    1004697    598  
                              2700    1004697    599  
                              2700    1004697    600  
                              2700    1004697    601  
                              2700    1004697    602  
                              2700    1004697    603  
                              2700    1004697    604  
                              2700    1027806    605  
                              2700    1027806    606  
                              2700    1027806    607
                      Note that I have interchanged the order of the two replacement commands, so that the sort order is not changed between them. Note also that I use -by- without the -,sort-. This is a precaution: if I had made the same mistake you did, this would have thrown an error message on the spot because for the second one, the data would no longer be sorted in the way required for the -by-. So I would have discovered this problem immediately.

                      This is a good practice to follow in general when using a series of -by- commands. If the data should not change order during that series, use -, sort- only on the first one. That way if things go awry, the code will break and you find the problem right away, instead of ending up with puzzling results much later in the code.
                      Last edited by Clyde Schechter; 24 May 2022, 11:18.

                      Comment


                      • #12
                        I would never thought that such a simple idea would translate in a complex code such as this. Thank you very much Clyde; it worked! I cannot describe how grateful I am.

                        Comment

                        Working...
                        X