
  • Spell data and cross-tabulations

    Dear Statalist users,

    I have a question about a spell-type dataset. I am trying to merge survey data with an administrative dataset on the same respondents, regarding reported receipt of state benefits, in order to investigate the extent of misreporting. Test data are available for the administrative dataset but not for the survey dataset. The variable of interest in the survey data is hl0012 (1 = yes, benefit receipt; 2 = no; -1 = missing). I would like to compare the two sources based on the survey year variable (syear), which captures only the year in which the survey was conducted and respondents were asked. The spell data are more detailed and record exact dates, as shown in the example below.

    My first question is how best to compare the two sources using a dummy variable that would allow for cross-tabulation. Second, should I split the spells that stretch over multiple years for an individual, or leave the episodes as given in the original data? (I have code that was suggested for this episode splitting; it creates span_year, "Number of years spanned by the spell".) Ideally, the example below would be adjusted to include syear and the receipt dummy variable. My sequence of work in the actual data was as follows: I started with the administrative data, then merged in the survey data and kept observations with _merge==3. Initially I wanted to examine the extent of misreporting for two survey years only, but it could also be examined over the entire observed period.

    clear
    input long persnr byte(spell quelle) int(begepi endepi begorig endorig)
    457031 12 16 16437 16538 16437 17166
    457031 14 16 16539 16696 16437 17166
    457031 17 16 16697 17011 16437 17166
    457031 20 16 17012 17166 16437 17166
    457031 23 16 17167 17409 17167 17409
    457149 20 16 16437 16719 16437 17166
    457149 22 16 16720 17166 16437 17166
    457149 24 16 17167 17324 17167 17488
    457149 26 16 17325 17395 17167 17488
    457149 28 16 17396 17488 17167 17488
    457149 30 16 17489 17903 17489 17903
    457186 2 16 16607 16694 16607 16728
    457186 4 16 16695 16728 16607 16728
    457229 3 16 17442 17484 17442 17484
    457229 5 16 17485 17573 17485 18785
    457229 8 16 17574 17795 17485 18785
    457229 11 16 17796 17897 17485 18785
    457229 14 16 17898 17943 17485 18785
    457229 17 16 17944 18216 17485 18785
    457229 20 16 18217 18262 17485 18785
    457229 23 16 18263 18489 17485 18785
    457229 25 16 18490 18785 17485 18785
    457333 20 16 16437 16477 16437 16477
    457340 39 16 18723 18780 18723 18975
    457340 41 16 18781 18942 18723 18975
    457340 44 16 18943 18975 18723 18975
    457516 3 16 16437 16895 16437 17166
    457516 5 16 16896 17112 16437 17166
    457516 7 16 17113 17166 16437 17166
    457516 9 16 17167 17297 17167 17475
    457516 12 16 17298 17475 17167 17475
    457516 15 16 17476 17499 17476 17499
    457516 18 16 17500 17531 17500 18585
    457516 21 16 17532 17897 17500 18585
    457516 24 16 17898 18262 17500 18585
    457516 27 16 18263 18585 17500 18585
    457516 30 16 18586 18613 18586 18932
    457516 33 16 18614 18627 18586 18932
    457516 34 16 18614 18627 18614 18851
    457516 36 16 18628 18851 18586 18932
    457516 37 16 18628 18851 18614 18851
    457516 39 16 18852 18932 18586 18932
    457679 1 16 16492 16605 16492 16605
    457679 2 16 16606 16701 16606 16710
    457679 4 16 16702 16710 16606 16710
    457679 6 16 16711 16863 16711 16863
    457679 8 16 16864 16867 16864 16991
    457679 10 16 16868 16991 16864 16991
    457679 12 16 16992 17148 16992 17148
    457679 14 16 17149 17166 17149 17166
    457679 16 16 17167 17972 17167 17972
    457679 18 16 17973 17975 17973 17975
    457679 22 16 18092 18384 18092 18384
    457679 24 16 18385 18627 18385 18627
    457720 5 16 17025 17058 17025 17166
    457720 8 16 17059 17166 17025 17166
    457720 11 16 17167 17177 17167 17177
    457720 16 16 17295 17360 17295 17360
    457727 26 16 16437 16586 16437 16875
    457727 28 16 16587 16661 16437 16875
    457727 31 16 16662 16789 16437 16875
    457727 33 16 16790 16875 16437 16875
    457847 17 16 16581 16712 16581 16809
    457847 19 16 16713 16722 16581 16809
    457847 22 16 16723 16753 16581 16809
    457847 24 16 16754 16780 16581 16809
    457847 26 16 16781 16788 16581 16809
    457847 28 16 16789 16809 16581 16809
    457847 30 16 16810 16814 16810 16814
    457847 32 16 16815 17060 16815 17060
    457847 34 16 17061 17073 17061 17073
    457847 36 16 17074 17166 17074 17166
    457847 38 16 17167 17685 17167 17992
    457847 40 16 17686 17746 17167 17992
    457847 42 16 17747 17992 17167 17992
    457847 45 16 18272 18504 18272 18504
    457866 13 16 17287 17336 17287 17522
    457866 15 16 17337 17479 17287 17522
    457866 17 16 17480 17492 17287 17522
    457866 19 16 17493 17522 17287 17522
    457866 21 16 17523 17612 17523 17622
    457866 24 16 17613 17622 17523 17622
    457873 11 16 17194 17200 17194 17415
    457873 13 16 17201 17207 17194 17415
    457873 15 16 17208 17213 17194 17415
    457873 17 16 17214 17269 17194 17415
    457873 19 16 17270 17275 17194 17415
    457873 21 16 17276 17415 17194 17415
    457873 23 16 17416 17438 17416 17438
    457873 24 16 17439 17507 17439 17522
    457873 26 16 17508 17522 17439 17522
    458028 11 16 16520 16557 16520 16557
    458028 13 16 16558 17062 16558 17166
    458028 15 16 17063 17166 16558 17166
    458028 17 16 17167 17639 17167 18238
    458028 19 16 17640 17662 17167 18238
    458028 21 16 17663 17704 17167 18238
    458028 23 16 17705 17787 17167 18238
    458028 25 16 17788 17877 17167 18238
    458028 27 16 17878 17901 17167 18238
    end
    format %tdD_m_CY begepi
    format %tdD_m_CY endepi
    format %tdD_m_CY begorig
    format %tdD_m_CY endorig
    label values quelle quelle_de
    label def quelle_de 16 "16 benefit receipt", modify
    label values begorig miss_de
    label values endorig miss_de

    where:
    spell: observation counter per person
    begepi: episode start date
    endepi: episode end date
    begorig: original start date of the observation (corresponds to the original start date of the notification)
    endorig: original end date of the notification
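    To illustrate the episode-splitting step mentioned above, here is a minimal sketch (my own illustration, not necessarily the suggested code I have), which expands each spell into one record per calendar year it touches and creates span_year:

    ```stata
    * Sketch: split multi-year spells into one record per calendar year.
    gen begyear = year(begepi)
    gen endyear = year(endepi)
    gen span_year = endyear - begyear + 1   // "Number of years spanned by the spell"
    expand span_year
    bysort persnr spell: gen jahr = begyear + _n - 1
    ```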

    Thank you in advance!

  • #2
    Maybe somebody else understands what you have written and can respond directly. But I'm confused. You speak of two data sets, but only show example data for one. In the one you do show, there are two different time intervals shown: begepi to endepi and begorig to endorig. I can't figure out which of these intervals corresponds to the period in which the person is receiving benefits. If you show example data from the other data set and clarify the one you did show, I can probably show you an efficient way to combine the data sets for the purpose of cross-checking the survey self-report of benefits receipt with the administrative data set.



    • #3
      Dear Clyde,

      As always, many thanks.

      I wanted to clarify a few things. Unfortunately, test data are provided only for the administrative dataset, not for both. The test data I show above deviate slightly from my actual data (mainly in the categories of the quelle variable: in my dataset I kept only the category I am interested in, which is 16 in the test data). The survey data form a yearly panel, with pid (personal identifier), hid (household identifier), and syear (survey year). I can try to generate a synthetic example of this if needed. I would also be happy to know whether -dataex- can assign arbitrary IDs, before or after the code generation, that mimic the same ID structure.

      About the administrative data: the two intervals may indeed be confusing. I think begepi is the one to go with. The documentation defines it as the episode start date, and I would think of the episode as the spell (in this case, the spell of the benefit type I kept in my quelle/source variable). The long description of this variable says "the start date of the split episode is always equal to or greater than the start date of the original observation". The "orig" in the other interval variables thus refers to "original". This relates to my question of whether I have to run code to split this variable or can leave it intact. The dataex above uses the test data as is. I can show you the code I use for splitting if that step should be part of my data preparation; it is also addressed in the same link.

      My cross-checks of misreporting could cover either the entire period observed in the administrative data or just two years (e.g., 2017 and 2018). Since my original analysis was restricted to household heads (it is a household benefit), I was also thinking about limiting the checks to household heads only (say, if hhead==1).

      I imagined doing the analysis this way: I started with the administrative data and used gen jahr = year(begepi); in the survey data I generated jahr = syear; and I merged the two on jahr and pid. I merged the personal dataset first (as pid is what is present in the administrative data), and then merged the household dataset to obtain the household IDs (hid), since those are needed for the full structure.
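      A hedged sketch of the sequence just described (the dataset and variable names here are placeholders, not the actual file names):

      ```stata
      * Admin side: derive the calendar year of each episode start.
      use admin_data, clear
      gen jahr = year(begepi)

      * Merge the person-level survey file on pid and jahr, keeping matches only
      * (jahr = syear having been generated in the survey file beforehand).
      merge m:1 pid jahr using survey_person, keep(match) nogenerate

      * Then merge the household file to attach the household structure via hid.
      merge m:1 hid jahr using survey_household, keep(match) nogenerate
      ```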

      Additional info: -describe begepi- reports the storage type as int. The code below did not work, producing a type mismatch error (also when I removed 'double'):
      Code:
      gen double date_begepi = daily(begepi, "DMY")
      Last edited by Hend She; 02 Aug 2023, 16:11.



      • #4
        As you didn't show example data for the survey data set, I have made up a toy example to illustrate the approach I would take. I am working on the assumption, which I believe is consistent with your description, that in the survey data you do not have an exact date--you have only a year (jahr). This leads inherently to some inaccuracy when comparing with the administrative spell data. For example, if a spell runs from 1jun2007 to 30sep2007 and the survey for 2007 says either yes or no to having received benefits, you cannot tell whether the survey report is accurate, because that depends on what time within 2007 the respondent is referring to. There is no way to solve this problem without knowing to what date, or range of dates, the respondent is referring in the survey.

        Be that as it may, I would approach this problem like this:
        Code:
        clear
        input long persnr byte(spell quelle) int(begepi endepi begorig endorig)
        457031 12 16 16437 16538 16437 17166
        457031 14 16 16539 16696 16437 17166
        457031 17 16 16697 17011 16437 17166
        457031 20 16 17012 17166 16437 17166
        457031 23 16 17167 17409 17167 17409
        457149 20 16 16437 16719 16437 17166
        457149 22 16 16720 17166 16437 17166
        457149 24 16 17167 17324 17167 17488
        457149 26 16 17325 17395 17167 17488
        457149 28 16 17396 17488 17167 17488
        457149 30 16 17489 17903 17489 17903
        457186 2 16 16607 16694 16607 16728
        457186 4 16 16695 16728 16607 16728
        457229 3 16 17442 17484 17442 17484
        457229 5 16 17485 17573 17485 18785
        457229 8 16 17574 17795 17485 18785
        457229 11 16 17796 17897 17485 18785
        457229 14 16 17898 17943 17485 18785
        457229 17 16 17944 18216 17485 18785
        457229 20 16 18217 18262 17485 18785
        457229 23 16 18263 18489 17485 18785
        457229 25 16 18490 18785 17485 18785
        457333 20 16 16437 16477 16437 16477
        457340 39 16 18723 18780 18723 18975
        457340 41 16 18781 18942 18723 18975
        457340 44 16 18943 18975 18723 18975
        457516 3 16 16437 16895 16437 17166
        457516 5 16 16896 17112 16437 17166
        457516 7 16 17113 17166 16437 17166
        457516 9 16 17167 17297 17167 17475
        457516 12 16 17298 17475 17167 17475
        457516 15 16 17476 17499 17476 17499
        457516 18 16 17500 17531 17500 18585
        457516 21 16 17532 17897 17500 18585
        457516 24 16 17898 18262 17500 18585
        457516 27 16 18263 18585 17500 18585
        457516 30 16 18586 18613 18586 18932
        457516 33 16 18614 18627 18586 18932
        457516 34 16 18614 18627 18614 18851
        457516 36 16 18628 18851 18586 18932
        457516 37 16 18628 18851 18614 18851
        457516 39 16 18852 18932 18586 18932
        457679 1 16 16492 16605 16492 16605
        457679 2 16 16606 16701 16606 16710
        457679 4 16 16702 16710 16606 16710
        457679 6 16 16711 16863 16711 16863
        457679 8 16 16864 16867 16864 16991
        457679 10 16 16868 16991 16864 16991
        457679 12 16 16992 17148 16992 17148
        457679 14 16 17149 17166 17149 17166
        457679 16 16 17167 17972 17167 17972
        457679 18 16 17973 17975 17973 17975
        457679 22 16 18092 18384 18092 18384
        457679 24 16 18385 18627 18385 18627
        457720 5 16 17025 17058 17025 17166
        457720 8 16 17059 17166 17025 17166
        457720 11 16 17167 17177 17167 17177
        457720 16 16 17295 17360 17295 17360
        457727 26 16 16437 16586 16437 16875
        457727 28 16 16587 16661 16437 16875
        457727 31 16 16662 16789 16437 16875
        457727 33 16 16790 16875 16437 16875
        457847 17 16 16581 16712 16581 16809
        457847 19 16 16713 16722 16581 16809
        457847 22 16 16723 16753 16581 16809
        457847 24 16 16754 16780 16581 16809
        457847 26 16 16781 16788 16581 16809
        457847 28 16 16789 16809 16581 16809
        457847 30 16 16810 16814 16810 16814
        457847 32 16 16815 17060 16815 17060
        457847 34 16 17061 17073 17061 17073
        457847 36 16 17074 17166 17074 17166
        457847 38 16 17167 17685 17167 17992
        457847 40 16 17686 17746 17167 17992
        457847 42 16 17747 17992 17167 17992
        457847 45 16 18272 18504 18272 18504
        457866 13 16 17287 17336 17287 17522
        457866 15 16 17337 17479 17287 17522
        457866 17 16 17480 17492 17287 17522
        457866 19 16 17493 17522 17287 17522
        457866 21 16 17523 17612 17523 17622
        457866 24 16 17613 17622 17523 17622
        457873 11 16 17194 17200 17194 17415
        457873 13 16 17201 17207 17194 17415
        457873 15 16 17208 17213 17194 17415
        457873 17 16 17214 17269 17194 17415
        457873 19 16 17270 17275 17194 17415
        457873 21 16 17276 17415 17194 17415
        457873 23 16 17416 17438 17416 17438
        457873 24 16 17439 17507 17439 17522
        457873 26 16 17508 17522 17439 17522
        458028 11 16 16520 16557 16520 16557
        458028 13 16 16558 17062 16558 17166
        458028 15 16 17063 17166 16558 17166
        458028 17 16 17167 17639 17167 18238
        458028 19 16 17640 17662 17167 18238
        458028 21 16 17663 17704 17167 18238
        458028 23 16 17705 17787 17167 18238
        458028 25 16 17788 17877 17167 18238
        458028 27 16 17878 17901 17167 18238
        end
        format %tdD_m_CY begepi
        format %tdD_m_CY endepi
        format %tdD_m_CY begorig
        format %tdD_m_CY endorig
        label values quelle quelle_de
        label def quelle_de 16 "16 benefit receipt", modify
        label values begorig miss_de
        label values endorig miss_de
        tempfile admin_data
        save `admin_data'
        
        * Example generated by -dataex-. For more info, type help dataex
        clear
        input long persnr float(jahr benefits)
        457031 2005 1
        457031 2006 0
        457031 2007 1
        457031 2008 1
        457031 2009 0
        457031 2010 1
        457031 2011 1
        457149 2005 1
        457149 2006 0
        457149 2007 1
        457149 2008 1
        457149 2009 0
        457149 2010 0
        457149 2011 1
        457186 2005 0
        457186 2006 1
        457186 2007 0
        457186 2008 0
        457186 2009 0
        457186 2010 1
        457186 2011 0
        457229 2005 1
        457229 2006 0
        457229 2007 1
        457229 2008 0
        457229 2009 1
        457229 2010 1
        457229 2011 0
        457333 2005 0
        457333 2006 1
        457333 2007 1
        457333 2008 0
        457333 2009 1
        457333 2010 1
        457333 2011 1
        457340 2005 0
        457340 2006 0
        457340 2007 1
        457340 2008 1
        457340 2009 1
        457340 2010 0
        457340 2011 1
        457516 2005 1
        457516 2006 1
        457516 2007 0
        457516 2008 0
        457516 2009 0
        457516 2010 0
        457516 2011 1
        457679 2005 1
        457679 2006 1
        457679 2007 1
        457679 2008 0
        457679 2009 1
        457679 2010 0
        457679 2011 0
        457720 2005 0
        457720 2006 0
        457720 2007 0
        457720 2008 1
        457720 2009 1
        457720 2010 1
        457720 2011 1
        457727 2005 0
        457727 2006 1
        457727 2007 0
        457727 2008 1
        457727 2009 0
        457727 2010 1
        457727 2011 0
        457847 2005 0
        457847 2006 0
        457847 2007 0
        457847 2008 0
        457847 2009 1
        457847 2010 0
        457847 2011 0
        457866 2005 1
        457866 2006 0
        457866 2007 0
        457866 2008 1
        457866 2009 0
        457866 2010 1
        457866 2011 1
        457873 2005 0
        457873 2006 1
        457873 2007 1
        457873 2008 0
        457873 2009 0
        457873 2010 0
        457873 2011 0
        458028 2005 0
        458028 2006 0
        458028 2007 1
        458028 2008 1
        458028 2009 0
        458028 2010 0
        458028 2011 1
        end
        label values benefits benefits
        label def benefits 0 "No Benefits", modify
        label def benefits 1 "Benefits", modify
        tempfile survey_data
        save `survey_data'
        
        
        use `admin_data', clear
        gen begyear = year(begepi)
        gen endyear = year(endepi)
        assert quelle == 16
        keep persnr begyear endyear
        duplicates drop
        rangejoin jahr begyear endyear using `survey_data', by(persnr)
        merge m:1 persnr jahr using `survey_data', assert(match using)
        
        gen byte confirmed = benefits == 1 if _merge == 3
        replace confirmed = benefits == 0 if _merge == 2
        This code will link up each survey response with the administrative records whose episodes overlap the response year. For those with no such administrative records, the administrative variables will be unmatched. This enables us to then calculate the variable confirmed, which indicates whether the survey response agrees with the administrative records.

        Now, if a respondent has multiple administrative records in the same year, corresponding to different intervals within the year, that respondent will also end up with multiple records in the combined data set for that year. You may need to then combine these into a single record if your intent is to work with this as panel data. This raises a problem related to the inherent inaccuracy that I alluded to above: the multiple records for this person may be in some cases consistent with the administrative data and in others not. You will have to make some decision about how to reconcile any such inconsistencies when you combine all of a person's records for a given year.
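        One hedged sketch of such a reconciliation rule (an assumption on my part: any overlapping benefit spell in the year counts as confirming a "yes" response):

        ```stata
        * Collapse to one record per person-year; benefits is constant within
        * persnr-jahr since it comes from the one-row-per-year survey file.
        * The (max) rule means: confirmed if ANY overlapping record confirms.
        collapse (max) confirmed, by(persnr jahr benefits)
        ```

        Note that -collapse- discards all other variables, so keep anything else you need in the by() list or merge it back afterwards.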

        If your surveys have an actual response date (a variable I will call date), not just the year, then you can do a bit better, as you now can avoid the inaccuracy alluded to earlier. The code only changes slightly:
        Code:
        use `admin_data', clear
        assert quelle == 16
        keep persnr begepi endepi
        duplicates drop
        rangejoin date begepi endepi using `survey_data', by(persnr)
        merge m:1 persnr jahr using `survey_data', assert(match using)
        
        gen byte confirmed = benefits == 1 if _merge == 3
        replace confirmed = benefits == 0 if _merge == 2
        If you have a range of dates in the survey data, that is, a begin and end date, then it is also possible to put this all together, although the code gets a bit different. If this is your situation, do post back, being sure to show example survey data, and I will show you how that would be done.

        -rangejoin- is written by Robert Picard. It is available from SSC. To use it, you must also install -rangestat-, written by Robert Picard, Nick Cox, and Roberto Ferrer, also available from SSC.

        I would be also happy to know if the dataex can assign arbitrary IDs after or before the code generation that mimic the same ID structure.
        -dataex- cannot modify the data in any way: all it does is create Stata code that can reproduce the data as they are. If you need to mask IDs, then you must do that yourself in the data sets before using -dataex-. To use the same masked IDs in both data sets, I would do it as follows:
        Code:
        use `admin_data', clear
        keep persnr
        tempfile holding
        save `holding'
        use `survey_data', clear
        keep persnr
        append using `holding'
        duplicates drop
        set seed 1234
        gen double shuffle = runiform()
        sort shuffle
        gen `c(obs_t)' masked_id = _n
        keep persnr masked_id
        sort persnr
        save crosswalk, replace
        
        use `admin_data', clear
        merge m:1 persnr using crosswalk
        drop persnr
        save masked_admin_data, replace
        use `survey_data', clear
        merge m:1 persnr using crosswalk
        drop persnr
        save masked_survey_data, replace
        It is useful to retain the crosswalk file, at least temporarily, so that if questions arise using the data sets with the masked id variable you can easily trace the data back to the original data sources.

        Additional info.: describe begepi code gives integer. And this code below did not work due to a mismatch error-also when I remove 'double'
        Code:

        gen double date_begepi = daily(begepi, "DMY")
        I'm not sure what this refers to. I will only comment that there is no need to use a double for a Stata date variable. You do need a double to store a datetime variable, but we are not dealing with those here. Also, combining doubles and integers in Stata will not lead to a mismatch error: mismatch errors refer to combining string and numeric variables in ways not sanctioned by Stata syntax. Where you may have that here is where you write -daily(begepi, "DMY")-. At least in the example data you showed, begepi is a numeric variable (even though when you browse or list the data it looks like a string, because of its date format). But the -daily()- function expects its first argument to be a string. In any case, there is no need to transform the begepi and endepi variables, at least not if they are like the ones in your -dataex- example. They are fine for use as they are.
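        As a quick illustration of this point (a sketch against the posted example data; begepi_str is a hypothetical string variable, not one present in the data):

        ```stata
        * begepi is already a numeric Stata daily date; date functions apply directly.
        gen jahr = year(begepi)
        * -daily()- would only be needed if the date were stored as a string, e.g.:
        * gen date_begepi = daily(begepi_str, "DMY")
        ```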

        Added: You also asked about things like restricting your analysis to household heads or to certain years. The answers to those questions depend on the specifics of your research questions. If your research question is confined to household heads or certain years, then your analysis should be as well. If not, then, no.
        Last edited by Clyde Schechter; 02 Aug 2023, 17:21.



        • #5
          Awesome! Many thanks for the comprehensive and very helpful response, dear Clyde! The data example ran smoothly, but in my actual data I got the error "variable _merge already defined" after running this line of code:
          Code:
          merge m:1 persnr jahr using `survey_data', assert(match using)
          The assert command also gave me errors, maybe similar to this post here. To get the code running, I replaced
          Code:
          assert quelle == 16
          with
          Code:
          keep if quelle==16
          and it then stopped at the merge line above.



          • #6
            variable _merge already defined
            This error did not arise running the code on the example data. It means that in your full data set (as opposed to the posted example) there is already a variable named _merge. To deal with it, just add the -nogenerate- option to the -merge- command.
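            As a concrete sketch (assuming the leftover _merge comes from an earlier merge in your own pipeline; note that the -confirmed- step in #4 reads _merge, so the final merge should still generate it):

            ```stata
            * Option 1: clear the stale _merge before the final merge.
            capture drop _merge
            merge m:1 persnr jahr using `survey_data', assert(match using)

            * Option 2: add -nogenerate- to the EARLIER merge that created _merge.
            ```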



            • #7
              Thanks a lot. You have addressed the problem and answered my questions comprehensively above; I appreciate it. Yes, the error arose because in practice one variable needs to be merged in from another dataset so that the survey data become complete; I used -nogenerate- for that merge. I have some remarks on the above code:
              • In the posted example there are no missings in the benefits variable; however, missings do exist in my survey data, so I adjusted the confirmed variable:
              Code:
              replace confirmed=. if benefits==.
              • The "assert(match using)" part of the merge produced the error "after merge, not all observations from using or matched (merged result left in memory)", which stopped the code, so I removed the assert part and made it:
              Code:
              merge m:1 persnr jahr using `survey_data'
              • One important aspect, which also appears in the example data: some personal IDs are duplicated in the year variable (jahr). This is understandable in light of the begin- and end-year intervals, but without further adjustment it could make a tabulation of the confirmed variable imply double-counting (even if restricted to household heads only). This occurs, for example, for the first pid in the example data, where one spell has begyear and endyear in the same year and another spell has the same begyear but endyear = begyear + 1.



              • #8
                To avoid double-counting, you can do this:
                Code:
                egen person_year = tag(persnr jahr)
                whatever_command if person_year
                The "assert(match using)" part in the merging code produces this error "after merge, not all observations from using or matched (merged result left in memory)" that made the code stop running, so I removed this assert part and made it like this:
                This implies that there are people in the admin data who have no survey data. I suppose I could have anticipated that this would be the case; in that regard, the -assert()- was really not appropriate. The complication this creates is that your data set is now cluttered with people who are not part of the survey. So you should probably -drop if _merge == 1-, unless these extra people will play some role later in the analysis.
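                Putting these pieces together, a hedged sketch of the post-merge cleanup:

                ```stata
                * Drop admin-only people (no survey data), then tag one record
                * per person-year to avoid double-counting in tabulations.
                drop if _merge == 1
                egen person_year = tag(persnr jahr)
                tab confirmed if person_year
                ```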



                • #9
                  Thanks a lot, dear Clyde. This is very helpful! In this sense, the confirmed variable is set to 0 if two conditions are met: 1) the observation is in the using dataset ("survey_data") but not in the master dataset ("admin_data"), and 2) the reported value of benefits in the survey is 0. Thus, confirmed==0 would indicate consistent information on benefit non-receipt across the two data sources, and confirmed==1 reflects confirmed benefit-receipt cases. Of interest here are the misreporting cases that comparing the two sources would reveal, so as to gauge their extent. Another concern is that the administrative data cover two types of the same survey: the core one that we merge above, and an innovative survey type of the same study that takes place every few years (which, I think, has no benefit-receipt indicator). If I am not mistaken, this latter Innovation Sample was included in the administrative data only for 2020. So, when starting with the administrative data and afterwards merging the survey data, I expect beforehand that there would be unmatched administrative cases corresponding to that other survey type (the irregularly conducted innovative survey). Given this, don't you think one should start with the survey data instead?



                  • #10
                    Well, -merge- treats the master and using data sets the same way. It makes no difference which data set I start with and -merge- in the other. In the end, I will end up with the same result. The only issue comes when we subsequently decide to -keep- or -drop- some observations based on whether they were matched or, if unmatched, which data set they come from.

                    If I understand correctly, your purpose here is to cross-check the administrative data with the survey data, as a way of validating the survey responses. The only way you can cross-check that is with observations that match in both data sets. If a particular person and time period are present in only one of the data sets, there is no sense in which we can determine whether the two data sets agree. In terms of how to write the code, what matters, after the -merge- and the calculation of -confirmed- is which observations you will need for the rest of your analysis. If you are not going to include unconfirmed survey responses in your subsequent analyses, then there is no point in keeping unmatched observations. But if you are going to include them, then there is no point in dropping the unmatched observations. It all depends on what you are going to do next.



                    • #11
                      Yes, you are absolutely right. Cross-checking requires using only observations that match in both datasets. I wanted to validate the reporting in the survey data and see the extent of misreporting, thus cross-checking the administrative data against the survey data. I also found out that the question on that type of benefit receipt, "What is the situation now?", actually refers to the month of the interview, not just the survey year; this interview month already exists in the survey data (float, taking values 1-12), but no exact date exists. This makes me think of your second proposed approach, albeit adjusted from daily dates to a month-year format, especially given that the administrative data have intervals running from, say, month 1 to month 3. I don't know how -rangejoin- works in such a case.
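                      A sketch of how the month-level comparison might look, adapting the approach from #4 (intmonth is my placeholder name for the survey's interview-month variable):

                      ```stata
                      * Admin side: convert daily episode dates to monthly dates.
                      use `admin_data', clear
                      gen begm = mofd(begepi)
                      gen endm = mofd(endepi)
                      format begm endm %tm
                      keep persnr begm endm
                      duplicates drop

                      * The survey file must contain a matching monthly date,
                      * created before it was saved, e.g.:
                      *   gen intm = ym(syear, intmonth)

                      * Then join survey months falling inside admin intervals:
                      rangejoin intm begm endm using `survey_data', by(persnr)
                      ```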

                      I also realized that the dataset is described as capturing everyone who was surveyed and gave consent to the data linkage, since it also contains spells of employment history (quelle==1, for example). Based on what they wrote:
                      In the present case, the (admin data) contains data on all persons interviewed who agreed to be linked and could be linked, if they have at least one of the following states during the observation period:
                      • employment subject to social security (recorded from 1975 onwards)
                      • marginal part-time employment (recorded from 1999 onwards)
                      • receipt of benefits in accordance with the Social Code
                      • registered as a jobseeker (recorded from 1997 onwards)
                      • participation in an employment or training measure (recorded from 2000 onwards)
                      Based on this, the earlier "keep if quelle==16" should be reconsidered. Perhaps I should instead generate a variable equal to 0, replace it with 1 if quelle==16, and retain the other observations, if I am not mistaken.
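                      A sketch of that indicator approach (keeping all spell types and flagging benefit spells; begyear is assumed to have been generated from begepi as in #4):

                      ```stata
                      * Instead of -keep if quelle==16-, flag benefit-receipt
                      * spells while retaining the rest of the history.
                      gen byte benefit_spell = (quelle == 16)
                      * Per person-year: did ANY benefit spell occur?
                      bysort persnr begyear: egen any_benefit = max(benefit_spell)
                      ```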
                      Last edited by Hend She; 18 Aug 2023, 16:12.

