Announcement

Collapse
No announcement yet.
X
  • Filter
  • Time
  • Show
Clear All
new posts

  • Using a Matrix to Identify Anaemia

    This started as a question, but I came up with a solution that I couldn't find documented elsewhere. The method uses a matrix. I'm very new to matrices and had no confidence that the method would work - but it did! I'm sure this method is far from novel, but it was fun to try it out and discover that it works! I share it here for discussion - is this a good solution? Are there better solutions?

    Here is the problem:
    I have a large dataset of blood values for children. I want to create a binary variable anaemia that is 0 if the child is not anaemic and 1 if the child is anaemic. The normal values for haemoglobin vary by age and sex. In the past I have used published mean (SD) values to generate Z-scores, and identified anaemia as children with a haemoglobin Z-score less than -2. There are some issues with the parametric assumptions underlying that approach. Here I will be using lower bound cut-offs as published in: Staffa et al, Pediatric hematology normal ranges derived from pediatric primary care patients. Am J Hematol [Internet]. 2020 Oct [cited 2022 Dec 17];95(10).

    The relevant variables in my dataset for achieving this are:
    • haemoglobin (path_hb)
    • sex (sex)
    • date of birth (dob)
    • date of blood test (path_date)
    From these I generated a categorical variable for age at blood test (path_age), with each category corresponding to the age categories used in the reference.

    Code:
    * Example generated by -dataex-. For more info, type help dataex
    clear
    input int path_hb byte sex int(dob path_date) float path_age
    128 1 20853 21570  7
    112 1 19185 21625  9
    117 2 21397 22484  8
    102 2 21642 22486  8
     92 1 22337 22465  6
    123 1 20591 22451  8
    108 2 22363 22468  6
    126 1 19719 22486  9
     81 1 21210 22540  8
     99 2 21917 22563  7
    111 2 20887 22564  8
    133 1 20385 21624  8
     99 1 22084 22575  7
     76 1 22257 22584  7
     99 2 21312 22580  8
    106 1 22279 22635  7
    105 1 22325 22636  7
     91 2 22481 22644  6
    120 1 19505 21627  8
    103 2 20331 21629  8
    130 2 19890 21631  8
    114 1 20886 21633  8
    119 1 20965 21644  7
     89 1 21510 21646  6
    105 2 20648 21650  8
    101 2 19456 21556  8
    108 1 20751 21678  8
    117 1 21525 21680  6
    100 2 19458 21693  9
     89 1 21452 21715  7
    139 1 18245 21720  9
    113 1 21175 21726  7
    121 1 19665 21735  8
    119 1 19253 21755  9
     80 1 21279 21749  7
     93 1 20515 21756  8
    113 2 21448 21557  6
    126 1 20900 21760  8
    132 1 20744 21764  8
     91 1 20860 21788  8
    end
    format %dM_d,_CY dob
    format %dM_d,_CY path_date
    label values sex sex_lbl
    label def sex_lbl 1 "Male", modify
    label def sex_lbl 2 "Female", modify
    label values path_age path_age_lbl
    label def path_age_lbl 5 "31d to 60d", modify
    label def path_age_lbl 6 "61d to 180d", modify
    label def path_age_lbl 7 "6m to <2y", modify
    label def path_age_lbl 8 "2y to <6y", modify
    label def path_age_lbl 9 "6y to <12y", modify
    label def path_age_lbl 10 "12y to <18y", modify
    The brief is to create a binary variable anaemia that is 0 if the haemoglobin is ≥ cut-off for age and sex, and 1 if the haemoglobin is < cut-off for age and sex. My first solution was simply many iterations of:

    Code:
    replace anaemia = 1 if sex == X & path_age = Y & path_hb < Z
    I thought that looked messy, so I tried another solution. I created a matrix for the reference lower cut-offs as follows:

    Code:
    matrix input hbRef = ( 128, 128 \ /// 1: 1-3 days
    133, 130 \ /// 2: 4-7 days
    110, 120 \ /// 3: 8-14 days
    98, 102 \ /// 4: 15-30 days
    90, 89 \ /// 5: 31-60 days
    94, 96 \ /// 6: 61-180 days
    102, 103 \ /// 7: 181 days to <2 years
    107, 107 \ /// 8: 2 years to <6 years
    113, 112 \ /// 9: 6 years to <12 years
    124, 114 ) // 10: 12 years to <18 years)
    matrix colnames hbRef = "Males" "Females"
    matrix rownames hbRef = "1d to 3d" "4d to 7d" "8d to 14d" "15d to 30d" "31d to 60d" "61d to 180d" "6m to <2y" "2y to <6y" "6y to <12y" "12y to <18y"
    Which gives the following:

    Code:
    . mat list hbRef
    
    hbRef[10,2]
                   Males  Females
       1d to 3d      128      128
       4d to 7d      133      130
      8d to 14d      110      120
     15d to 30d       98      102
     31d to 60d       90       89
    61d to 180d       94       96
      6m to <2y      102      103
      2y to <6y      107      107
     6y to <12y      113      112
    12y to <18y      124      114
    Note that the columns correspond to my sex codes and the rows correspond to my path_age codes. Now I can create my anaemia variable as follows:

    Code:
    gen anaemia = .
    replace anaemia = 0 if !missing(path_hb) & !missing(path_age)
    replace anaemia = 1 if path_hb < hbRef[path_age, sex]
    label variable anaemia "Anaemic for age"
    label values anaemia yesno_lbl
    How do these methods compare? Six in one, half a dozen in the other?

  • #2
    I'm so sorry, I don't understand the problem or the solution at all. Could you please rephrase it? What's the problem in simple terms, and how did your approach overcome it?

    Comment


    • #3
      Hi Jared,
      I apologise. The 'problem' was probably just that I'm still a novice. I'll try to be clearer.

      The referenced article gives normative ranges for haemoglobin for children, categorised by age and sex:

      Click image for larger version

Name:	Table1_Staffa_2020.jpg
Views:	1
Size:	212.9 KB
ID:	1693909


      I have haemoglobin values for lots of kids at different ages, and want to categorise them as anaemic or not. I am using the first column from the reference, taking just the lower-bound. My first approach worked, but I thought it was messy. It was as follows:

      Code:
      gen anaemia = .
      replace anaemia = 0 if !missing(path_hb) & !missing(path_age)    // I have some missing data in my dataset.
      label variable anaemia "Anaemic for age"
      label values anaemia yesno_lbl    // This is my yes/no label, 0 = no, 1 = yes
      
      //    Make anaemia = 1 if the child was male, between 1 and 3 days old at the time of the blood test,
      //    and the haemoglobin was less than the cut-off in the table (I use g/L rather than g/dL).
      replace anaemia = 1 if sex == 1 & path_age == 1 & path_hb < 128
      
      //    Make anaemia = 1 if the child was male, between 1 and 3 days old at the time of the blood test,
      //    and the haemoglobin was less than the cut-off in the table (I use g/L rather than g/dL).
      replace anaemia = 1 if sex == 2 & path_age == 1 & path_hb < 128
      //    Etcetera, for the rest of the table.
      This is a terrible solution: it is unsophisticated and it doesn't scale. My new solution was to enter the reference cut-off as a matrix and use it as a look-up table.

      Code:
      . mat list hbRef
      
      hbRef[10,2]
                     Males  Females
         1d to 3d      128      128
         4d to 7d      133      130
        8d to 14d      110      120
       15d to 30d       98      102
       31d to 60d       90       89
      61d to 180d       94       96
        6m to <2y      102      103
        2y to <6y      107      107
       6y to <12y      113      112
      12y to <18y      124      114

      Since I coded the my age and sex variables to match the rows and columns of the matrix, the following line works:

      Code:
      replace anaemia = 1 if path_hb < hbRef[path_age, sex]
      I hope that was clearer. This might be an obvious solution, I'm sure it isn't the best solution. I'm keen to know how others would have achieved the same thing.

      Comment


      • #4
        I think this is a perfectly reasonable approach, at least for one-time or occasional use.

        If you need to do this on a recurring basis, the drawback is that you need to code the creation of that matrix over and over again. So a more convenient approach would be to save the same information that is in the matrix in a data set in long layout. Then you could -merge- a patient data set on the age (suitably coded) and sex whenever you needed to do anemia classification.

        And if this were something that you were going to do on a very frequent basis, it would make sense to write a general program that would take the names of the age and sex variables in any data set as arguments, the name of a new variable to hold the anemia classification results as an argument, code the age and sex compatibly with your data set that has the hemoglobin cutoffs, and do the -merge-. That would require a small investment of time to setup, but then any time you needed to classify anemia it would just be a single command to do it.

        Comment

        Working...
        X