How to find all possible combinations of an indicator variable across a varying amount of observations

Andrew Curran

Join Date: Mar 2016

Posts: 2
#1

How to find all possible combinations of an indicator variable across a varying amount of observations

29 Mar 2016, 17:30

Hello everyone, I have used this forum many times in the past to try and troubleshoot my Stata issues but this is my first post. I have read the FAQ but if there is something I am not doing correctly, or if this is the wrong place, I apologize.

I am currently developing a paper on optimal road toll management systems and within my paper I am trying to see how the proposed system performs compared to the best possible allocation of traffic. As a part of this I am looking to also see how the system reacts to different traffic loads so it is important that the do file can easily be altered to reflect different traffic levels.

How I am accomplishing this so far is to create a certain number of observations (1 per simulated driver) that is assigned a value from a normal distribution (their unique value for using the road). Their total payoff is calculated as a function of how many other drivers have made the same lane selection, as more people enter one lane and congestion increases each individual driver receives a decreased payoff.

The segment of my do file that I use to control these factors is shown below:

clear
version 13.0
set more off

*DRIVER GENERATOR
loc Mean_Value "3" //Change these to update experiment values
loc SD_Value "0.5"
loc Drivers "25"
set obs `Drivers'
gen Driver_ID=_n
gen Value=rnormal(`Mean_Value',`SD_Value')

I have coded it so far so that there is a variable named "Lane_Choice" which will hold the values 0 or 1 depending which lane each driver is in. Then through some code which is somewhat irrelevant to my question I have the individual drivers rearrange until each one can no longer improve their payoff by switching to another lane (Note: each driver is maximizing their individual payoff, not the joint maximization of the group as a whole).

I would then like to compare the sum of all driver payoffs from this scenario with the case where they would be arranged to jointly maximize the total payoff. What I am thinking is to calculate the total payoff from each possible combination of drivers in lanes and then select the largest value. My issue lies in being able to have a code that lists each possible lane allocation that will still respond to my original macro for drivers. The point of this is to be able to quickly change the number of drivers I am simulating and run the file start to finish.

I have found some useful code that can manually be changed depending how many drivers I have but I am having issues finding a way to nicely have this expand or decrease with my number of drivers.

Example showing all possible combination of lanes for 3 drivers:

tempname myfile
postfile `myfile' str10 name time place using myfile, replace
local name "0 1" //
foreach n of local name {
forvalues time = 0/1 {

forvalues place = 0/1 {
post `myfile' ("`n'") (`time') (`place')
}
}
}
postclose `myfile'
use myfile, clear

I have also spent quite a bit of time trying to use the -input- command within a -while- loop but I always seem to receive some type of error called --Break--. Here is the code I had written:

clear
loc Drivers "4"
loc Cells=2^(`Drivers') //how many possible combinations there will be
set obs `Cells'

local w=1
while `w'<=`Drivers' {
local y=1
while `y'<=2^(`w'-1) {
input str1 Driver_`w'
1
end
local y = `y' + 1
}
local z=1
while `z'<=2^(`w'-1) {
input str1 Driver_`w'
0
end
local z = `z' + 1
}

}

What I would ideally be able to produce is some code that reacts to my specified driver amount to create a matrix of all possible combinations, such as what I am showing below:

2 Drivers
Driver_1_Choice Driver_2_Choice
1,1
0,1
1,0
0,0

But then be able to expand when I increase drivers
3 Drivers
Driver_1_Choice Driver_2_Choice Driver_3_Choice
1,1,1
0,1,1
1,0,1
0,0,1
1,1,0
0,1,0
1,0,0
0,0,0

etc.

Sorry if this has been a bit long, I just wanted to make sure I provided enough information to help me troubleshoot this.

Thanks,
Andrew
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30093
#2

29 Mar 2016, 18:00

I can show you code that will do this, but perhaps you want to look for a different approach instead. With 25 drivers, you will have 2^25 = 33,554,432 combinations, each of which will require calculation of a payoff, and then you will have to maximize over that payoff. This brute-force approach will be slow, and depending on what else is in your data or what the calculations involved are, could push the limits of your memory.
Here is code that can do this:

Code:

clear local n_drivers 10 // OR WHATEVER NUMBER YOU WANT local n_combinations = 2^`n_drivers' set obs `n_combinations' forvalues i = 1/`n_drivers' { gen driver_`i'_choice = . } // FILL THE DATA SET WITH BINARY DIGITS OF THE NUMBERS // BETWEEN 0 AND 2^`N_DRIVERS' - 1 tempname current forvalues b = 1/`n_combinations' { scalar `current' = `b' forvalues i = 1/`n_drivers' { quietly replace driver_`i'_choice = mod(`current', 2) in `b' scalar `current' = floor(`current'/2) } }

The problem with this approach is that the run time is O(2^n) where n is the number of drivers. So while it runs in about 6 seconds for 15 drivers on my setup, that translates into about 2 hours for 25 drivers. And that's just to set up the choices.

Is there no theory that can constrain the range of configurations that need be considered? Or some theoretical derivation of a good approximation to the maximum?

Added edit: Actually the run time is worse than O(2^n), it's O(n*2^n)!

Last edited by Clyde Schechter; 29 Mar 2016, 18:10.
1 like
Comment
Carole J. Wilson

Join Date: Jan 2015

Posts: 932
#3

29 Mar 2016, 18:18

Is your input code trying to change the drivers "on the fly" without changing the do file? If that is the case, you could use the following:

Code:

noi di "Number of Drivers:" _request(ndrivers) *now code in terms of $ndrivers di "There are $ndrivers drivers in this simulation"

Stata/MP 14.1 (64-bit x86-64)
Revision 19 May 2016
Win 8.1
1 like
Comment

daniel klein

Join Date: Mar 2014
Posts: 3847

29 Mar 2016, 22:32

I think this is about building up the indicator matrix step by step. Here is a general approach. I leave the details of implementing it to work with Stata datasets to others.

Code:

/*
    Define function add_drivers() */
    
mata : 

mata clear

void add_drivers(real matrix X, real scalar n)
{
    real scalar N
    real matrix A, a
    
    X = J(2^n, 1, X)
    N = rows(X)
    
    A = J(N, n, .)
    for (i = 1; i <= n; ++i) {
        N = N/2
        a = (J(N, 1, 1)\ J(N, 1, 0))
        A[., i] = J(2^(i-1), 1, a)
    }
    
    X = (A, X)
}

end

/*
    Demonstrate */

mata : 

// start with one driver
x = 1\0
x

// add two drivers
add_drivers(x, 2)
x

// add another three
add_drivers(x, 3)
x

end

As Clyde mentions, this approach will be very slow and memory consuming with a reasonable number of drivers.

Best
Daniel

Last edited by daniel klein; 29 Mar 2016, 22:37.

Comment

Andrew Curran

Join Date: Mar 2016

Posts: 2
#5

30 Mar 2016, 03:24

Thanks Clyde, Carole and Daniel!
I have a meeting today with my thesis advisor to discuss my simulation and maybe they will have some suggestions for an alternative. This simulation will usually be running with somewhere between 15 and 35 drivers, which I believe would put me into billions of possible combinations unfortunately. I have only spent a couple of days trying to develop this method of solving my issue but so far I am somewhat convinced that it may be the best option as it has no restrictions imposed on it.

Thanks again!
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30093
#6

30 Mar 2016, 09:20

Yes, with 35 drivers you have 34,359,738,368 combinations. Based on the O(n*2^n) run time, I estimate that just creating the indicators using my code in #2 would require 170 days of run time on my computer. Except that it wouldn't run at all because I don't have 34 GB of RAM installed.

Simulations are great -- I spend a large part of my professional time doing them. But simulations informed by mathematical derivation are even better.
Comment

Announcement

How to find all possible combinations of an indicator variable across a varying amount of observations

Comment

Comment

Comment

Comment

Comment