Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Totaling one field within subsets of two or more others

    Hello,
    I am cleaning a data table of chemical contaminants found in food samples, in which are found four variables:
    1. Batch (denoting the batch in which samples were processed)
    2. Food
    3. Analyte (containing the names of various molecular compounds)
    4. Quantity (containing the quantity detected by a laboratory)

    In some cases, the chemical compound is found in several different isomers (molecular variations). For example:

    Batch Food Analyte Quantity
    1 apples a-BHC 0.1
    1 apples b-BHC 0.1
    1 apples g-BHC 0.1
    1 apples DDT 0.7
    2 oranges DIE 0.4
    3 pineapples PCL 0.6

    I am seeking advice on how to combine the quantities of related isomers (a-BHC, b-BHC, g-BHC) into a new field "BHC (all forms)", within unique batch/food subsets. In this table example, that would be (batch==1 & food==apples).

    I can specify isomer sets (there are only a few), but am hoping to avoid manual search errors by a loop that can look for pre-specified isomer sets within any unique batch/food subsets (of which there are many).
    There are 6210 observations, so I am hoping to find a way to automate this process within a loop, with the resulting table looking like this.

    Batch Food Analyte Quantity
    1 Apples BHC (all forms) 0.3
    1 apples DDT 0.7
    2 oranges DIE 0.4
    3 pineapples PCL 0.6

    Any guidance would be appreciated!

  • #2
    Since you can specify a small number of isomer sets, I would first create a separate Stata data set with two variables: Analyte and Analyte_group, that looks like:

    Code:
    Analyte    Analyte_group
    a-BHC     BHC
    b-BHC     BHC
    g-BHC     BHC
    ...
    You only need to include in this Analytes that appears as distinct isomers in the data set. Let's call this data set isomers.dta

    Now go back to your original data, -merge- it with isomers.dta, and then aggregate using -collapse-:

    Code:
    use original_data
    merge m:1 Analyte using isomers, keep(match master)
    
    // Set Analyte_group = Analyte if it is not found in isomers.dta
    replace Analyte_group = Analyte if _merge != 3
    
    // Aggregate over analyte groups
    collapse (sum) Quantity, by(Batch Food Analyte_group)
    Note: This code will fail if, as in your example of output you are seeking, "apples" and "Apples" refer to the same food. So to use this code you have to carefully clean your data so that each category of food is spelled in exactly the same case-sensitive way. You can also get tripped up by leading or trailing blanks in some instances. So, at a minimum, before trying to do this, you need to clean your data with

    Code:
    foreach v of varlist Food Analyte {
        replace `v' = trim(itrim(upper(`v')))
    }
    Additional cleaning may be needed to eliminate misspellings or other typographical errors.

    Comment

    Working...
    X