Totaling one field within subsets of two or more others

Michael McCulloch

Join Date: Jul 2025

Posts: 24
#1

Totaling one field within subsets of two or more others

19 Oct 2014, 17:11

Hello,
I am cleaning a data table of chemical contaminants found in food samples, in which are found four variables:
1. Batch (denoting the batch in which samples were processed)
2. Food
3. Analyte (containing the names of various molecular compounds)
4. Quantity (containing the quantity detected by a laboratory)

In some cases, the chemical compound is found in several different isomers (molecular variations). For example:

Batch Food Analyte Quantity
1 apples a-BHC 0.1
1 apples b-BHC 0.1
1 apples g-BHC 0.1
1 apples DDT 0.7
2 oranges DIE 0.4
3 pineapples PCL 0.6

I am seeking advice on how to combine the quantities of related isomers (a-BHC, b-BHC, g-BHC) into a new field "BHC (all forms)", within unique batch/food subsets. In this table example, that would be (batch==1 & food==apples).

I can specify isomer sets (there are only a few), but am hoping to avoid manual search errors by a loop that can look for pre-specified isomer sets within any unique batch/food subsets (of which there are many).
There are 6210 observations, so I am hoping to find a way to automate this process within a loop, with the resulting table looking like this.

Batch Food Analyte Quantity
1 Apples BHC (all forms) 0.3
1 apples DDT 0.7
2 oranges DIE 0.4
3 pineapples PCL 0.6

Any guidance would be appreciated!
Tags: None
Clyde Schechter

Join Date: Apr 2014

Posts: 30100
#2

19 Oct 2014, 17:43

Since you can specify a small number of isomer sets, I would first create a separate Stata data set with two variables: Analyte and Analyte_group, that looks like:

Code:

Analyte Analyte_group a-BHC BHC b-BHC BHC g-BHC BHC ...

You only need to include in this Analytes that appears as distinct isomers in the data set. Let's call this data set isomers.dta

Now go back to your original data, -merge- it with isomers.dta, and then aggregate using -collapse-:

Code:

use original_data merge m:1 Analyte using isomers, keep(match master) // Set Analyte_group = Analyte if it is not found in isomers.dta replace Analyte_group = Analyte if _merge != 3 // Aggregate over analyte groups collapse (sum) Quantity, by(Batch Food Analyte_group)

Note: This code will fail if, as in your example of output you are seeking, "apples" and "Apples" refer to the same food. So to use this code you have to carefully clean your data so that each category of food is spelled in exactly the same case-sensitive way. You can also get tripped up by leading or trailing blanks in some instances. So, at a minimum, before trying to do this, you need to clean your data with

Code:

foreach v of varlist Food Analyte { replace `v' = trim(itrim(upper(`v'))) }

Additional cleaning may be needed to eliminate misspellings or other typographical errors.
Comment

Announcement

Totaling one field within subsets of two or more others

Comment