
  • Identifying matched observations in a large database

    Dear all,

    This is related to a question I asked a few days ago, which can be found here:

    I have encountered some difficulties due to the size of my database. Hence, I am starting a new thread regarding that topic.

    I will start with the following example. Assume that we have a database with the following variables: id, year, treated, x1, and x2.

    Treated is an indicator telling us that a firm had an event in a given year. My aim is to construct a new variable that shows the id of the firm matched to each treated firm, based on the criteria I name below.

    Three requirements must hold for a firm to qualify as a match.

    1. If a company is treated at time t, the matched company must not be treated within five years before or after that event year. For example, if a treated firm had an event in 1990, the matched company must not be treated at any point in [1985, 1995].

    2. Define z as the ratio of the untreated firm's x1 to the treated firm's x1, with both values taken one year before the event. The matched company must have z in the range [0.70, 1.30]. (A brief numeric illustration follows this list.)

    3. Among the companies that satisfy points 1 and 2, choose as the match the one whose x2 one year before the event is closest to the treated company's x2 one year before the event.
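
    To illustrate requirement 2 with made-up numbers: if a treated firm's event is in 1990 and its x1 in 1989 is 100, then a candidate firm whose x1 in 1989 is 85 has z = 85/100 = 0.85 and satisfies the range, while a candidate with x1 = 140 in 1989 has z = 1.40 and is excluded.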

    In my previous thread, a solution using the rangejoin command was provided. Although it works for the small sample below, it runs into problems with large databases: in my case, there are more than 5,000 unique ids, 30+ years, and more than 20,000 events.

    The problem arises with the rangejoin command: I get an "op. sys. refuses to provide memory" error.
    Is there a way to do the comparisons without the rangejoin command?

    Thank you for your time.

    A simple example is the following:

    Code:
    input id    year    treated    x1    x2
    1    1980    0    157    18
    1    1981    0    165    6
    1    1982    0    28    4
    1    1983    0    167    13
    1    1984    0    77    8
    1    1985    0    25    3
    1    1986    0    156    11
    1    1987    0    159    2
    1    1988    0    101    2
    1    1989    0    128    2
    1    1990    1    44    15
    1    1991    0    161    15
    1    1992    0    195    20
    1    1993    0    10    19
    1    1994    0    34    7
    1    1995    0    88    4
    1    1996    0    83    12
    1    1997    0    191    7
    1    1998    0    180    20
    1    1999    0    96    7
    1    2000    0    177    10
    2    1980    0    119    20
    2    1981    0    71    18
    2    1982    0    192    15
    2    1983    0    142    4
    2    1984    0    187    12
    2    1985    0    41    8
    2    1986    0    68    17
    2    1987    0    74    20
    2    1988    0    192    3
    2    1989    0    178    7
    2    1990    0    159    6
    2    1991    0    43    16
    2    1992    1    146    20
    2    1993    0    103    2
    2    1994    0    169    19
    2    1995    0    99    11
    2    1996    0    194    9
    2    1997    0    48    19
    2    1998    0    103    4
    2    1999    0    106    18
    2    2000    0    65    11
    3    1980    0    15    8
    3    1981    0    188    7
    3    1982    0    138    7
    3    1983    0    85    18
    3    1984    1    114    7
    3    1985    0    50    17
    3    1986    0    122    1
    3    1987    0    180    13
    3    1988    0    67    4
    3    1989    0    48    13
    3    1990    0    63    18
    3    1991    0    23    11
    3    1992    0    36    4
    3    1993    0    95    16
    3    1994    0    188    18
    3    1995    0    31    18
    3    1996    1    22    3
    3    1997    0    10    1
    3    1998    1    11    1
    3    1999    0    60    7
    3    2000    0    25    7
    4    1980    0    112    18
    4    1981    0    197    6
    4    1982    0    96    2
    4    1983    0    165    2
    4    1984    0    64    4
    4    1985    0    71    19
    4    1986    0    15    17
    4    1987    0    158    7
    4    1988    0    140    11
    4    1989    0    156    3
    4    1990    0    38    14
    4    1991    0    33    14
    4    1992    0    17    10
    4    1993    0    183    1
    4    1994    0    111    20
    4    1995    0    69    12
    4    1996    0    127    15
    4    1997    0    24    12
    4    1998    0    102    2
    4    1999    0    130    17
    4    2000    0    130    15
    end

  • #2
    To be honest, that doesn't sound like all that large a data set. I've run much bigger problems on my own computer, which is not especially well endowed with memory. So the first thing I'd do is try it again after a cold reboot of the computer, without launching any other applications (and after closing any that open automatically). That might just leave you enough memory to run it. If that doesn't solve the problem, I can see one tweak to the code that would also reduce the demand on memory. Change
    Code:
    // PAIR UP OBSERVATIONS WITHIN A +5 TO -5 YEAR WINDOW
    rangejoin year -5 5 using `copy'
    
    // ELIMINATE POSSIBLE CONTROLS THAT WERE TREATED IN THE WINDOW
    keep if treated == 1

    to

    Code:
    // PAIR UP OBSERVATIONS WITHIN A +5 TO -5 YEAR WINDOW
    
    keep if treated == 1
    rangejoin year -5 5 using `copy'

    It appears that the vast majority of your observations are untreated. The way I wrote the code earlier was wasteful of both memory and time: it first found matches for untreated observations and then dropped them. By dropping the untreated observations before -rangejoin-, you should save a great deal of memory and time. Sorry I didn't pay attention to that earlier.
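
    A quick way to gauge how much this reordering should save is to check how rare treated observations are before running the join; a minimal check along these lines would do it:

    Code:
    count if treated == 1
    local n_treated = r(N)
    count
    display "share of observations that are treated: " `n_treated'/r(N)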

    If that doesn't resolve the problem, post back and I'll show you other things you can do. But this should make a huge difference in the amount of memory required.




    • #3
      Thank you Clyde. Indeed, there are many more zeros than ones in the database.

      My code is currently running because I need to compute the "z" variable before doing the matching (see point 2 in #1). The code that computes it (see the previous post linked in #1) uses a loop and needs around 5-6 hours (if all goes well). I am not sure my code is efficient; I reckon it is not, as it takes too long. If there is a faster solution, I would be happy to apply it.

      Once this is done, I will reboot my PC and try the modified -rangejoin- code you suggested.



      • #4
        I don't understand why you need to compute the z variable first. In fact, I don't even see how it is possible to do so. The z variable requires identifying the year preceding the event of the matched treated = 1 case, so you can't compute z until you at least have a tentative match with a treated case that pins down the relevant event year. Are you trying to calculate z for every possible event year? That sounds both unnecessarily time-consuming and likely to produce a chaotic data set full of unneeded variables that will exacerbate the memory problem.

        At the original link, part of the code was:

        Code:
        // ELIMINATE POSSIBLE CONTROLS THAT HAVE X1 OUT OF [0.7 - 1.3] RANGE
        // IN THE YEAR BEFORE THE EVENT
        by id year id_U, sort: egen lagged_x1 = max(cond(year_U == year - 1, x1_U, .))
        drop if !inrange(lagged_x1, 0.70, 1.3)
        I see now that this code did not implement exactly what you were looking for, but a slight modification of it would:
        Code:
        // ELIMINATE POSSIBLE CONTROLS THAT HAVE Z OUT OF [0.7 - 1.3] RANGE
        // IN THE YEAR BEFORE THE EVENT
        by id year id_U, sort: egen lagged_x1 = max(cond(year_U == year - 1, x1_U, .))
        drop if !inrange(lagged_x1/x1, 0.70, 1.3)
        This calculates the z-ratio (without storing it in a variable) and drops those otherwise possible controls for which this ratio is out of range.
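
        If you prefer to keep z as a variable so you can inspect it afterwards, a minimal variant of the same step (using the same variable names as the code above) would be:
        Code:
        by id year id_U, sort: egen lagged_x1 = max(cond(year_U == year - 1, x1_U, .))
        gen z = lagged_x1/x1
        drop if !inrange(z, 0.70, 1.3)
        drop lagged_x1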



        • #5
          Thank you Clyde.

          I have made the necessary changes to my code. I did not get any memory error; however, the code is still running the -rangejoin- command. It has been about 2 hours now.
          When I look at the task manager, I can see the CPU and memory usage changing (memory usage is about 27 GB). However, after an hour of running, the status changed to "not responding".
          I wonder whether Stata is still working and whether it will eventually give an answer.



          • #6
            It's difficult to know what has happened in this circumstance. I have seen Stata "not responding" in situations where it is just in the middle of a long disk operation and then return to being shown as running. If at some point you decide to "pull the plug" on this, here is some code you can use that will probably reduce the memory burden still further, and it will have the advantage of giving you a progress report every few minutes telling you how much of the job has been completed and an estimate of the remaining time.

            Code:
            clear*
            input id    year    treated    x1    x2
            1    1980    0    157    18
            1    1981    0    165    6
            1    1982    0    28    4
            1    1983    0    167    13
            1    1984    0    77    8
            1    1985    0    25    3
            1    1986    0    156    11
            1    1987    0    159    2
            1    1988    0    101    2
            1    1989    0    128    2
            1    1990    1    44    15
            1    1991    0    161    15
            1    1992    0    195    20
            1    1993    0    10    19
            1    1994    0    34    7
            1    1995    0    88    4
            1    1996    0    83    12
            1    1997    0    191    7
            1    1998    0    180    20
            1    1999    0    96    7
            1    2000    0    177    10
            2    1980    0    119    20
            2    1981    0    71    18
            2    1982    0    192    15
            2    1983    0    142    4
            2    1984    0    187    12
            2    1985    0    41    8
            2    1986    0    68    17
            2    1987    0    74    20
            2    1988    0    192    3
            2    1989    0    178    7
            2    1990    0    159    6
            2    1991    0    43    16
            2    1992    1    146    20
            2    1993    0    103    2
            2    1994    0    169    19
            2    1995    0    99    11
            2    1996    0    194    9
            2    1997    0    48    19
            2    1998    0    103    4
            2    1999    0    106    18
            2    2000    0    65    11
            3    1980    0    15    8
            3    1981    0    188    7
            3    1982    0    138    7
            3    1983    0    85    18
            3    1984    1    114    7
            3    1985    0    50    17
            3    1986    0    122    1
            3    1987    0    180    13
            3    1988    0    67    4
            3    1989    0    48    13
            3    1990    0    63    18
            3    1991    0    23    11
            3    1992    0    36    4
            3    1993    0    95    16
            3    1994    0    188    18
            3    1995    0    31    18
            3    1996    1    22    3
            3    1997    0    10    1
            3    1998    1    11    1
            3    1999    0    60    7
            3    2000    0    25    7
            4    1980    0    112    18
            4    1981    0    197    6
            4    1982    0    96    2
            4    1983    0    165    2
            4    1984    0    64    4
            4    1985    0    71    19
            4    1986    0    15    17
            4    1987    0    158    7
            4    1988    0    140    11
            4    1989    0    156    3
            4    1990    0    38    14
            4    1991    0    33    14
            4    1992    0    17    10
            4    1993    0    183    1
            4    1994    0    111    20
            4    1995    0    69    12
            4    1996    0    127    15
            4    1997    0    24    12
            4    1998    0    102    2
            4    1999    0    130    17
            4    2000    0    130    15
            end
            
            tempfile copy
            save `copy'
            
            keep  if treated == 1
            gen file = `"`copy'"'
            
            capture program drop one_year
            program define one_year
                // PAIR UP WITH ALL OBSERVATIONS WITHIN A -5 TO +5 YEAR WINDOW
                local file = file[1]
                rangejoin year -5 5 using `file'
                //    ELIMINATE IF TREATED ANYWHERE IN THAT WINDOW
                by id_U, sort: egen byte ineligible = max(treated_U)
                drop if ineligible
                drop ineligible
                //    CALCULATE Z RATIO & KEEP ONLY IF IN 0.7-1.3 RANGE
                by id_U: egen lagged_x1 = max(cond(year_U == year-1, x1_U, .))
                gen z = lagged_x1/x1
                keep if inrange(z, 0.7, 1.3)
                drop lagged_x1
                //    NOW KEEP CLOSEST MATCH ON X2
                gen delta = abs(x2_U-x2)
                by id (delta), sort: keep if _n == 1
                keep id year x1 x2 id_U z x2_U
                exit
            end
            
            runby one_year, by(year) status
            At the end of this code, Stata will be holding data showing each "treated" observation's id, year, x1, and x2, along with the id of the chosen match (id_U) and the values of z and x2 (x2_U) that led to the choice of that particular match. As before, if there are two or more observations tied for the match on x2, the tie is broken randomly and irreproducibly. Note that, at least in the sample data, there are some events for which there is no eligible match.

            The gist of this approach is that it breaks up the problem into one year's worth of treatment events at a time. This reduces the memory burden because fewer observations need to be held in memory or -rangejoin-ed at any given time. It's basically breaking a bunch of sticks one at a time rather than trying to break the whole bound bundle at once. It's possible because the process can be done separately for each year's events.

            To use this code, you must install -runby-, written by Robert Picard and me, and available from SSC.
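
            If any of these commands are missing, they can be installed from SSC, for example:
            Code:
            ssc install runby
            ssc install rangejoin
            ssc install rangestat   // rangejoin requires rangestat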

            Added: I strongly recommend that before running this code you -drop- any variables in the actual data set other than id, year, treated, x1, and x2. These are the only variables needed for the matching (you can merge the results back with the original data after the matching is done if other variables are needed later), and by eliminating everything else you will reduce the demand on memory and also speed up all the disk read and write operations involved.
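
            A minimal sketch of that merge-back step, assuming (hypothetically) that the original full data set is saved as fulldata.dta and the matching results as matched_results.dta, might look like:
            Code:
            // hypothetical file names; adjust to your own setup
            use matched_results, clear
            merge 1:1 id year using fulldata, keep(match) nogenerate
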
            Last edited by Clyde Schechter; 05 Feb 2019, 09:16.



            • #7
              Thanks Clyde.

              I will try that code if the current one does not work. I made some changes to the database: first, I used the -compress- command to make the file smaller; I have also kept only id, year, treated, x1, and x2 to further reduce the burden.
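
              For reference, that trimming step amounts to something like:
              Code:
              compress
              keep id year treated x1 x2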

              I will leave my code running until late today. If it has not produced anything by then, I will switch to the new code you suggested.

              I appreciate your assistance in this matter.

