Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Identifying wave-sorted datasets of a given date and appending them

    Dear Statalisters,

    I hope my request will be clear. I have, in a folder, multiple datasets for different countries and waves but they all belong to the same standardized survey. Some countries have only one wave, other countries have more waves, there's no particular rule for this. The problem with this classification is that the time in which each wave of the survey is done really depends on the country. For instance, a third wave done in country A doesn't mean it will be done at the same time than wave 3 of country B, etc. It just depends on when the previous waves of the survey were conducted for each country

    So rather than keeping countries that were surveyed in a given wave, I'd like to keep countries that were surveyed in a given date regardless of the wave. Of course, all datasets have a standardized variable month and year. How would you proceed to do such a classification? For a demo, let's say my folder is composed of the following .dta datasets (the date of the survey is in parenthesis)

    country1_W1 (July 2018)
    country1_W2 (January 2019)
    country2_W1 (September 2018)
    country3_W1 (October 2018)
    country3_W2 (February 2019)
    country3_W3 (April 2019)

    As you can see, belonging to the first wave doesn't mean that the waves were done at the same month across countries. How should I proceed if I want to 1) identify datasets ranging in a given timespan, say from September 2018 to March 2019 2) appending those datasets together ?

    the variable month gives the month number and the variable year gives the year number. So only a few combination of these two variables should be kept (9-2018, ... 3-2019).

    One thing to be mindful about: I assume there will be cases where some waves of the survey will be done at the end of one month and at the beginning of another. Say, for instance, that country3_W3 is done at the end of March2019 but also at the beginning of April 2019. If I specify that I want to identify datasets from September 2018 to March 2019, as stated above, then I'd like Stata not to forget about observations from country3_W3 to be done in April 2019 and to be a little more flexible on the boundaries.

    I feel like there's no solution other than a case-by-case scenario where I look at the date of each wave of each country and delete the file if it doesn't match my timespan. If someone could save me countless hours of this repetitive work I would be forever grateful to them.

    I hope I was clear. I can explain in other words if it isn't.

    Thank you in advance for your time,

    Hugo
    Last edited by Hugo Denis; 21 May 2022, 17:04.

  • #2
    It's not entirely clear from your description what the month and year variables contain. You indicate that country1_W1 corresponds to July 2018. But you imply that it is possible that some of the surveys are actually spill over into early August 2018. So what is in the month variable for that file? Is it always July, or is does it show whichever month that particular survey was actually carried out in?

    If it's always July, then it sounds like there is nothing that will identify the August surveys, so I think what you ask for cannot be done: the information, on your description, does not exist.

    If the variables month and year always show the actual month and year of the survey, then I think the solution is as follows. You do not provide any example data, so what I show here is pseudocode that points the way, but you will have to mark it up. First I would build an omnibus data set by appending the files all together and identifying the original source for each record. You will also need a real Stata internal format monthly date variable calculated from the separate month and year variables. That will go something like this:

    Code:
    clear*
    local filelist: dir "." files "country*_w*.dta"
    
    tempfile building
    save `building', emptyok
    
    foreach f of local filelist {
        use `"`f'"', clear
        gen survey_date = ym(year, month) // ASSUMES year, month ARE NUMERIC
        format survey_date %tm
        gen sourcefile = `"`f'"'
        append using `building'
        save `"`building'"', replace
    }
    
    save omnibus, replace
    Be warned that most situations where a large number of files are being put together like this lead to problems due to inconsistencies/incompatibilities among the files themselves. This happens even with data from sources who have a good reputation for carefully curating their data. So I strongly recommend that before you do this, you install Mark Chatfield's -precombine- program, from Stata Journal, and run all these files through it first so that you can fix up the incompatibilities ahead of time. That will be less frustrating than repeatedly having the loop stop each time it discovers a new problem and sending you back to the beginning.

    Once you have the omnibus file, if you want to do an analysis on all surveys done between September 2018 and March 2019, you can do that as follows:
    Code:
    use omnibus, clear
    gen byte tokeep = inrange(survey_date, tm(2018m9), tm(2019m3))
    by sourcefile (tokeep), sort: replace tokeep = tokeep[_N]
    keep if tokeep
    This will leave you with all surveys that were done in the range of months specified, as well as any surveys done in other months that appear in the same country-wave file.
    Last edited by Clyde Schechter; 21 May 2022, 17:29.

    Comment


    • #3
      Dear Clyde,

      Your code worked perfectly well. It's always a pleasure to discover new functions such as ym(). I will also use the -precombine- program because as you wrote, there were format-related issues when I tried to append my datasets together.

      Thanks a lot for the countless hours saved!

      Hugo

      Comment

      Working...
      X