Dropping variables by number of non-missing values

Alex Hendel

Join Date: Nov 2018

Posts: 2
#1

Dropping variables by number of non-missing values

23 Nov 2018, 15:17

Working with Stata 15 on Mac OS Sierra, I'm attempting to work with Oklahoma voting data (available by request at https://virs.okelections.us/) to create a dataset which contains one observation for every voter since 2000, with the variables voterid (which uniquely identifies each voter) and one variable for each election in Oklahoma since 2000.

Initially, the data is in the form of a long dataset, with only voterid and electiondate for each observation. There are thus 2 variables, and ~23 million observations. voterid repeats if the voter has voted in multiple elections. In order to reshape the data so that each electiondate is its own variable, I used the following code:

tab(electiondate), gen(election_)

Because there are so many electiondates, I am left with approx 240 variables for the election_* variables generated by the previous command. Each observation now contains the voterid variable, and 240 other variables with all but one containing a value of 0 and the other containing a value of 1 for each observation. I attemped to then collapse this data so that there is just one observation per voterid, with a value of 1 for every election a person voted in, and 0 for every election they didn't:

fcollapse (max) election_*, by(voterid)

This takes too long and my computed runs out of memory. Instead, I would like to drop each variable that has fewer than 20,000 1 values, to make the collapsing process quicker. I recoded all the 0 values to .a in the hopes of using:

foreach var of election_*{
drop `var' if (count `var')<20000
}

however this returns invalid syntax.

Is there a way to drop a variable if the number of non-missing values is below a certain threshold? Alternatively, is there a better way of accomplishing my final goal of collapsing the data so that I have one observation per voterid with different variables for most elections (even if I have to drop several smaller elections)?
Tags: None
William Lisowski

Join Date: Dec 2014

Posts: 10150
#2

23 Nov 2018, 16:04

Welcome to Statalist.

First, some advice on making effective use of Statalist. Please review the Statalist FAQ linked to from the top of the page, as well as from the Advice on Posting link on the page you used to create your post. Note especially sections 9-12 on how to best pose your question.

Even the best descriptions of data are no substitute for an actual example of the data. This problem is sufficiently complex that whoever helps you will want to be able to test their code in an example of your data. And they will want the data as you have it in Stata, to ensure they're starting from the same point as you. They won't want all the data, a small subset with a few voters and a few of the 240 elections, preferably ones that some of those voters have voted in.

Please be sure to use the dataex command to show your example data. If you are running version 15.1 or a fully updated version 14.2, it is already part of your official Stata installation. If not, run ssc install dataex to get it. Either way, run help dataex and read the simple instructions for using it. dataex will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use dataex.

The more you help others understand your problem, the more likely others are to be able to help you solve your problem.

With that said, I think the key to what you want is the reshape command. I started to include some hypothetical code, but there are too many possibilities for how your data are stored for hypothetical code to be reasonably accurate.
1 like
Comment
Mike Lacy

Join Date: Apr 2014

Posts: 2449
#3

23 Nov 2018, 16:31

I second William's comments about uncertainties, but it happens I've worked with some problems like yours, so I have a guess of what you're thinking of. My biggest presumption is that a voter only appears in your data file on dates on which s/he did vote.

Anyway: This can all be done in the long format, presuming your goal is to have a long file consisting of: voterid, date, and vote (1 if voted, 0 otherwise). Does this work for you?

[It's possible -fillin-, which does the hard work below, won't like the data type of your date. If so, use the -group- function of -egen- to code your date into an integer.]

The one difficulty here is that -fillin- becomes quite slow with large N, so your example with millions of voters might take a while. Perhaps someone else out there has an idea of how to efficiently do a fill-in for large N by means other than the built-in -fillin-. (Perhaps something involving -expand-??)

Here's an example of my approach with 5000 voters.

Code:

clear // simulate data set obs 5000 gen int date = ceil(runiform() * 100) // 100 possible elections, with 50% participating at each election gen voterid = _n gen byte voted = 1 tab date vote, missing // should be lots of voters missing on each date // end simulating data // // Create an observation for each voter in each election whether or not s/he voted fillin voterid date replace voted = 0 if (fillin ==1) drop fillin // tab date voted // check it out // Count number of voters in each election to facilitate dropping "small" elections. egen nvoters = total(voted), by(date) // Perhaps: drop if nvoters < 20000
1 like
Comment
Alex Hendel

Join Date: Nov 2018

Posts: 2
#4

23 Nov 2018, 17:16

Thank you to both posters above, and sorry for the poor posing of the question. I believe I've determined a solution to the problem similar to that Mike suggested. Instead of trying to drop variables after I've performed the -tab(electiondate), gen(election_)- command, I've attempted to do so before. Below is the new code that I am using. I'm currently still trying to find a balance between using as much data as possible while still allowing the -collapse- command to run.

Code:

*** Drop observations with election dates with low turnout egen gval = group(electiondate) // creates a numeric value for each date bysort gval : gen typ_freq = _N // generates a variable counting the number of observations sharing the same date drop if typ_freq < 70000 *** Reshape voter histories to be compatible for merge egen date_id = group(electiondate) tab(electiondate), gen(election_) save OK_VH_dropped.dta, replace fcollapse (max) election_*, by(voterid)
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#5

23 Nov 2018, 17:51

I agree with Mike's recommendation: you really don't want to transform your data to a wide layout with one observation per voterid, neither with tab and collapse as you are attempting nor with reshape as I described.

I'm currently still trying to find a balance between using as much data as possible while still allowing the -collapse- command to run.

Leave your data in a long layout, don't use the tab and collapse commands, and then you can use all of your data.

The experienced users here generally agree that, with few exceptions, Stata makes it much more straightforward to accomplish complex analyses using a long layout of your data rather than a wide layout of the same data. You should try to achieve what you need with the data organized as it currently is, and seek the help of Statalist in doing so. The sort of problems you will encounter trying to use your reshaped data will almost certainly be solved by reshaping the data.
Comment

Nick Cox

Join Date: Mar 2014
Posts: 36058

24 Nov 2018, 03:26

The thread has turned in a fruitful direction. Getting the OK data in a form that is OK [couldn't resist it; not sorry either] is now the focus, but here is something for anyone attracted by the thread title.

missings from the Stata Journal allows this kind of analysis:

Code:

. webuse nlswork, clear
(National Longitudinal Survey.  Young Women 14-26 years of age in 1968)

. missings report

Checking missings in all variables:
15082 observations with missing values

-------------------
          |      #
----------+--------
      age |     24
      msp |     16
  nev_mar |     16
    grade |      2
 not_smsa |      8
   c_city |      8
    south |      8
 ind_code |    341
 occ_code |    121
    union |   9296
   wks_ue |   5704
   tenure |    433
    hours |     67
 wks_work |    703
-------------------

. missings report, sort

Checking missings in all variables:
15082 observations with missing values

-------------------
          |      #
----------+--------
    union |   9296
   wks_ue |   5704
 wks_work |    703
   tenure |    433
 ind_code |    341
 occ_code |    121
    hours |     67
      age |     24
  nev_mar |     16
      msp |     16
   c_city |      8
    south |      8
 not_smsa |      8
    grade |      2
-------------------

. missings report, sort min(400)

Checking missings in all variables:
15082 observations with missing values

-------------------
          |      #
----------+--------
    union |   9296
   wks_ue |   5704
 wks_work |    703
   tenure |    433
-------------------

. return list

scalars:
                  r(N) =  28534

macros:
            r(varlist) : "union wks_ue tenure wks_work"

missings doesn't encourage or support it directly, but if you had personal rules that more than so many missings in a variable should mean that variables are dropped, then it should now be apparent what to do.

Doesn't encourage it because: you may be better off with multiple imputation, or even doing what you can with the data available. As a programmer I want that decision to rest squarely on the user's shoulders.

See https://www.stata-journal.com/articl...article=dm0085 (article will be visible to all on publication of Stata Journal 18(4)) or equally

https://www.statalist.org/forums/for...aging-missings

From that thread, please note that at the time of writing

Code:

. search dm0085_1, entry

Search of official help files, FAQs, Examples, SJs, and STBs

SJ-17-3 dm0085_1  . . . . . . . . . . . . . . . . Software update for missings
        (help missings if installed)  . . . . . . . . . . . . . . .  N. J. Cox
        Q3/17   SJ 17(3):779
        identify() and sort options have been added

provides a link to the latest version of the command files (always assuming that your Stata is updated to cover at least entries to Stata Journal 17(3)).

Announcement

Dropping variables by number of non-missing values

Comment

Comment

Comment

Comment

Comment