  • Generate population data based on known population characteristics

    Hello everyone,

    My goal is to draw a random sample from a population of registered voters. However, I only have information on the number of voters in each town and the related categories, which I want to use to generate the population list. I am struggling to program Stata to generate this list based on the relative proportions in each town and the age distribution. Any help or advice will be very much appreciated. Thanks in anticipation.

    Musah

  • #2
    It sounds like you want to draw a sample of random "individuals" from a probability distribution, and that you will define the probability distribution in terms of some town level data that you have.

    If so, that's a difficult problem. It sounds like you need to calculate the joint frequency of all combinations of characteristics, then divide each frequency by the total number of individuals, and finally draw a random sample from the resulting probability distribution, possibly using a cumulative probability distribution and a uniform random number between 0 and 1.
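    The steps just described (joint frequencies, division by the total, cumulative distribution, uniform draw) can be sketched roughly as follows. This is only an illustration in Python, not Stata, and the towns, age groups, counts, and seed are made-up placeholders rather than anything from the actual voter data.

```python
import bisect
import itertools
import random

# Hypothetical joint frequencies: (town, age_group) -> number of voters.
joint_freq = {
    ("A", "18-29"): 2,
    ("A", "30+"): 3,
    ("B", "18-29"): 4,
    ("B", "30+"): 1,
}

cells = list(joint_freq)
total = sum(joint_freq.values())

# Divide each frequency by the total, then accumulate into a CDF.
probs = [joint_freq[cell] / total for cell in cells]
cum_probs = list(itertools.accumulate(probs))
cum_probs[-1] = 1.0  # guard against floating-point drift in the last bin


def draw_individual(rng):
    """Draw one (town, age_group) cell with the inverse-CDF method."""
    u = rng.random()  # uniform on [0, 1)
    # Index of the first cumulative probability strictly greater than u.
    return cells[bisect.bisect_right(cum_probs, u)]


rng = random.Random(12345)  # arbitrary seed, for reproducibility only
sample = [draw_individual(rng) for _ in range(1000)]
```

    Note that this draws with replacement from the distribution, so it produces synthetic individuals rather than a sample of a fixed, enumerated population.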

    I'm not aware of any Stata magic for this, but if you do have to implement this yourself, I don't envy you.

    • #3
      Hi Daniel,

      Yes, I am trying something like that, but I want to avoid the complications of calculating joint probability distributions. I want to use only two variables: towns and age groups. Here is a simple example.

      Let's say we know there are 10 voters in 2 towns with the following age distributions:

      Voter_ID Town Age
      1 A 18
      2 A 20
      3 A 30
      4 A 55
      5 A 80
      6 B 45
      7 B 36
      8 B 22
      9 B 71
      10 B 25

      This is the kind of data I want to generate. I know the total number of voters by town and their age distribution. Once I have generated the data, I will sort it and randomly draw my sample from the population.
      Last edited by Musah Khalid; 13 Dec 2020, 20:38.

      • #4
        Thank you for clarifying,

        When you say you know the age distribution, do you mean you have proportions? Frequencies? For each town, or for the entire set of towns?

        An example showing what the input data looks like may be helpful.

        • #5
          Hello Daniel,

          I have the age distribution (frequencies) for each town. So, as in the example above, I know the number of 18-20 yr, 21-23 yr etc. in each town. The only data I have is the total number of voters by town and age. I hope this is clear? What I want is to generate the population, N, by town and age. I know how many voters are in each town and, within each town, how many are 18 yrs, 19 yrs etc.

          Musah
          Last edited by Musah Khalid; 13 Dec 2020, 22:14.

          • #6
            Hello Musah,

            That does simplify things a bit.

            So let's assume you have a table with town names or IDs as a unique identifier, and then one column for each year of age. Each cell holds the frequency of people of that age (given by the column) in that town (given by the row). Without a toy example it is hard to know how your data is formatted, but under these assumptions the problem is that the table you have in memory is not structured for individuals: observations are towns, not individuals. This may also make linear-algebra-like commands harder to work with, since you'll have more observations in the output than in the input.

            You could loop through each age variable, and for each age variable loop through each observation. For each observation, generate n (town, age) pairs, where n is the frequency in the given cell. You can write those pairs either to a new data frame (if you have that feature in your version of Stata) or line by line to an output file, probably with comma-separated values so that you can treat it like a CSV when done.

            Still messy, but much easier than what I originally thought you were asking.
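            As a rough sketch of that double loop, here is the idea in Python rather than Stata. The table layout, the town labels, the ages, and the counts are all assumptions made up for illustration, since the real input format isn't known.

```python
import csv
import io

# Hypothetical input: one row per town, one column per single year of age.
ages = [18, 19, 20]
wide_table = {
    "A": [2, 0, 1],  # 2 voters aged 18, none aged 19, 1 aged 20
    "B": [1, 3, 0],
}


def expand_to_individuals(table, age_columns):
    """Yield one (town, age) pair per person counted in the table."""
    for col, age in enumerate(age_columns):   # loop over the age variables
        for town, counts in table.items():    # loop over the observations (towns)
            for _ in range(counts[col]):      # n copies for a cell of frequency n
                yield town, age


# Written to an in-memory buffer here; an open file object works the same way.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["town", "age"])

n_rows = 0
for pair in expand_to_individuals(wide_table, ages):
    writer.writerow(pair)  # one line per individual
    n_rows += 1
```

            Because the pairs are generated and written one at a time, nothing here needs the full expanded population in memory at once.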
            Last edited by Daniel Schaefer; 14 Dec 2020, 09:55.

            • #7
              I notice, by the way, that we are talking past each other a bit with respect to the example. To clarify, I see in your second post an example of what you want the resulting (output) data to look like. I do not see an example of the table you currently have and how it is formatted (input).

              I hope I've correctly understood you and such an example is no longer relevant. I just wanted to take a moment to clarify, because I know miscommunications online can be needlessly frustrating.

              • #8
                Hello Daniel,

                Thanks for your response. I apologize if my explanations are still not clear. Let me run through what I want one more time. Let's say a state has 10 towns and 100 voters distributed equally across these towns. Let's also assume every voter in each town is either 20 or 30 years old. What I want is for Stata to generate this data in long form. The first column will be ID (1, 2, ..., 100). The second column will be town. We know there are 10 voters in each town, so we just assign 10 IDs to each town. The last column will be age; we will code 5 IDs in each town as 20 years old and the other 5 as 30 years old. Make sense? I am just looking for an easier way to do this because I have to generate more than a million observations.
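                To make the target layout concrete, here is a small sketch of it in Python (not Stata, just to illustrate the structure). The town names, the seed, and the sample size of 10 are placeholders I made up.

```python
import random

towns = [f"Town{t:02d}" for t in range(1, 11)]  # 10 hypothetical town names
ages = [20, 30]
voters_per_cell = 5  # 5 voters of each age in every town

# Build the long-form population: one record per voter.
population = []
for town in towns:
    for age in ages:
        for _ in range(voters_per_cell):
            population.append(
                {"id": len(population) + 1, "town": town, "age": age}
            )

# Draw a simple random sample without replacement.
rng = random.Random(2020)  # arbitrary seed, for reproducibility only
sample = rng.sample(population, k=10)  # sample size of 10 is an assumption
```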

                Thanks once again for your time.

                Musah
