Difficulty aggregating variables to the unit of analysis without losing or altering data

Sean Houlihan

Join Date: Mar 2019
Posts: 3

Difficulty aggregating variables to the unit of analysis without losing or altering data

20 Mar 2019, 10:36

Hello,

I'm a first year grad student that is fairly new to Stata. I'm working on a project in which I have to clean and analyze a secondary dataset found online. The dataset I am using is Cook County, IL sentencing data. The link to the data is: https://datacatalog.cookcountyil.gov...tg8v-tm6u/data.

The dataset has 221,170 observations, with each observation being a separate charge. Multiple charges can belong to the same court case, so multiple rows can pertain to the same value for the case_id variable.There are 37 variables, but many are repetitive for the analysis I will be doing (e.g. arrest date/arraignment date/sentencing date when I only need the year). I do not need to use all of them. Essentially the data look something like this:

Charge_Id	Case_Id	Offense Category	Race	Age	Gender	Charge Count
1	1	Narcotics	White	27	Male	2
2	1	Narcotics	White	27	Male	1
3	1	Robbery	White	27	Male	1
4	2	Retail Theft	Black	33	Female	3
5	2	Burglary	Black	33	Female	1
6	3	Homicide	Hispanic	46	Male	1
7	4	Battery	White	22	Female	1

My question is: "Does the total amount of charges that a defendant receives on average vary by race?" My unit of analysis would be case_id, in other words each individual court case possibly containing multiple charges and offense categories. My DV would be total charge count, and my IV would be defendant race. I would then control for variables such as: offense category, law enforcement agency that made the arrest, age of defendant, gender, etc. The dataset is made entirely of string variables. So I destrung the ones I will use, and dummied out the non-numeric ones such as gender and race.

Where I am having difficulty is aggregating the data so that each row is a single case ID. As I said, each row represents a charge that is filed, and each charge filed could contain multiple counts. So one row might be the charge of robbery, but you were charged on 3 counts of it. Multiple rows, or charges, can pertain to a single case_id number. Therefore, each case_id number can contain charges from different offense categories (e.g. narcotics and robbery).

In an ideal world, my data would look something like this:

Case_Id	Offense Category	Race	Age	Gender	Charge Count
1	Narcotics, Robbery	White	27	Male	4
2	Retail theft, Burglary	Black	33	Female	4
3	Homicide	Hispanic	46	Male	1
4	Battery	White	22	Female	1

My main issue is aggregating the data to the case_id level without altering the values for age, race, gender, etc. I am also unsure if it is possible to combine the different offense categories from multiple charges into one variable that contains them all.

What I have tried is collapsing by case_id, but that seems to mess with some of these data values. For example, I coded race as white=0; black=1; hispanic=2, But some of the values are showing up as decimals. Because all of the values should be the same for the same case_id, I figured taking the mean value would ensure the values stayed the same. I suspect missing values or errors in the recording are responsible for the values changing when I collapse. For reference, here is my code after destringing and/or dummying out the variables:

Code:

collapse (mean) sent_phase (mean) age (mean) sex ///
 (mean) def_race (sum) charge_count (mean) arrest_year, by(case_id)

This method does give me the total charge count I want, but at the expense of altering other variables which is obviously a serious issue. I am pretty stuck with this right now and am not sure exactly the best way to proceed. Should I be approaching the collapse command differently, or not be using collapse at all to solve this issue?

Thank you in advance for your time.

Tags: None

Mike Lacy

Join Date: Apr 2014

Posts: 2415
#2

20 Mar 2019, 11:28

Using -collapse- is relevant, but you don't want things like (mean) def_race, as this asks for a variable that is the mean of the values of the race variable within case_id. Take another look at the statistics options listed under -help collapse-, which should help you out.
Comment
Clyde Schechter

Join Date: Apr 2014

Posts: 30091
#3

20 Mar 2019, 11:28

I think your problem raises a few separate issues. Let's look at them one at a time.

What I have tried is collapsing by case_id, but that seems to mess with some of these data values. For example, I coded race as white=0; black=1; hispanic=2, But some of the values are showing up as decimals. Because all of the values should be the same for the same case_id, I figured taking the mean value would ensure the values stayed the same. I suspect missing values or errors in the recording are responsible for the values changing when I collapse.

Missing values will not disturb the calculation of the mean. But errors in the recording probably will. My experience, working with medical data, is that race variables are usually shot through with inconsistencies. In fact, it is almost exceptional to find a series of observations on the same person where the race is recorded the same in all of them! There are several approaches that can be used in response to that. One is to simply conclude that race is so poorly measured that it is unusable and exclude it from the analytic data. That sounds like it would undermine one of your key objectives, however, so it wouldn't be suitable. The best approach then is to try to get some external information on the correct race value for each person. Unfortunately, such data is usually not available. The next best approach is to set up an algorithm for reconciling the discrepancies. Examples of that might include using the one that appears most frequently (which also requires some rule for breaking ties), or the one that appears first, or setting up a hierarchy (e.g. if both white and black, choose white, if both black and Hispanic, choose Black, ... whatever.)

You might need to similarly reconcile inconsistencies in some of the other variables, although, in my experience, those are less frequently messed up. Age may be more of a problem, however, as, in fact, age should vary unless the arrestee has no birthdays during the time period spanned by their data. Deciding how to handle that would involve delving more deeply into how you plan to model the effects of age in your analysis.

You can easily identify case_id's with race discrepancies as follows:

Code:

by case_id (race), sort: gen byte problem = race[1] != race[_N] browse if problem

I am also unsure if it is possible to combine the different offense categories from multiple charges into one variable that contains them all.

Combining the offense categories cannot be done in -collapse-, but you can do it before you collapse:

Code:

* Example generated by -dataex-. To install: ssc install dataex clear input byte(charge_id case_id) str12 offensecategory str8 race byte age str6 gender byte chargecount 1 1 "Narcotics" "White" 27 "Male" 2 2 1 "Narcotics" "White" 27 "Male" 1 3 1 "Robbery" "White" 27 "Male" 1 4 2 "Retail Theft" "Black" 33 "Female" 3 5 2 "Burglary" "Black" 33 "Female" 1 6 3 "Homicide" "Hispanic" 46 "Male" 1 7 4 "Battery" "White" 22 "Female" 1 end by case_id offensecategory, sort: gen byte first = (_n == 1) by case_id: replace offensecategory = ", " + offensecategory if _n > 1 by case_id: gen combined_offenses = offensecategory if _n == 1 by case_id: replace combined_offenses = combined_offenses[_n-1] /// + cond(first, offensecategory, "") if _n > 1 by case_id: replace combined_offenses = combined_offenses[_N]

Then in your -collapse- you can specify -(last) combined_offenses- to bring that along.

Added:

Crossed with #2.

In the future, when showing data examples, please use the -dataex- command to do so, as I have here. If you are running version 15.1 or a fully updated version 14.2, -dataex- is already part of your official Stata installation. If not, run -ssc install dataex- to get it. Either way, run -help dataex- to read the simple instructions for using it. -dataex- will save you time; it is easier and quicker than typing out tables. It includes complete information about aspects of the data that are often critical to answering your question but cannot be seen from tabular displays or screenshots. It also makes it possible for those who want to help you to create a faithful representation of your example to try out their code, which in turn makes it more likely that their answer will actually work in your data.

When asking for help with code, always show example data. When showing example data, always use -dataex-.
Comment
Sean Houlihan

Join Date: Mar 2019

Posts: 3
#4

21 Mar 2019, 07:27

Mike,
I was thinking that the value for race would be the same for all values under a single case_id number, so taking the mean wouldn't matter. This was obviously not correct thinking, though. Thank you for your help!
Comment
Sean Houlihan

Join Date: Mar 2019

Posts: 3
#5

21 Mar 2019, 07:32

Clyde,
Okay thanks a lot. I was thinking that was the issue with the race variable but wasn't sure how to go about fixing it. I think for the purposes of this project I will opt for a hierarchy algorithm.

Your code for combining the offense categories has also worked, so thank you very much for that! I was stuck on how to figure that out for a while.

And I apologize for not using the -dataex- command. I thought I had read through the FAQ thoroughly, but obviously not thoroughly enough. In the future I will use it. Thanks again!
Comment

Announcement

Difficulty aggregating variables to the unit of analysis without losing or altering data

Comment

Comment

Comment

Comment