A little help with Heatmap and commands related

Elio Gentilini

Join Date: Feb 2025

Posts: 4
#1

A little help with Heatmap and commands related

12 Feb 2025, 16:00

Dear statalists, I always read your advice and now I need a specific help, if possible.

I have a dataset with some variables, but in this case I would like to focus on: the Group variable, that contains 3 group (HC, BF-, BF+) and the gene expression variable, that contains zscore data from my genes. I think Heatmap can show the difference in expression of my genes among groups. In order to do so
After creating a zscore of my genes data I proceeded with this code

reshape long z_, I(CODE) j(Gene) string
encode Gruppo, gen(Gruppo_num)
encode Gene, gen(Gene_num)
hmap z_ Gruppo_num Gene_num, color(reds)

but I have no results because stata can't generate the graph.
I wonder if there is another way in order to generate a hat map.
I tried also
twoway contour long z_ Gruppo_num Gene_num, heatmap

but the graph is so bad.
Can you help me in order to try to generate a good heat graph and see if this is for real the best way to show my data if gene expression in my 3 groups?
Thanks a lot
Tags: None
Hemanshu Kumar

Join Date: Mar 2015

Posts: 1411
#2

12 Feb 2025, 21:48

Hi Elio, welcome to Statalist!

Please go through the Statalist FAQ, especially #12. And do consider your post from the perspective of the reader.
how is the reader supposed to replicate your problem without some example data from you?

without knowing the data or the exact error generated, what can the reader learn from "because state can't generate the graph"?

without seeing your graph, what can the reader make of a statement like "the graph is so bad"?

In short, why would you expect any meaningful response to your post?
1 like
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#3

13 Feb 2025, 02:40

Hemanshu Kumar gives frank but I think utterly fair advice.

You have 3 groups and z-scores for each gene. That suggests to me something like a profile plot or dot plot. Encoding magnitude by colour intensity is rarely as successful as its proponents seem to hope. But much depends on

how many genes we're talking about

whether gene numbering is arbitrary or has scientific meaning (your encode will just yield alphanumeric sorting)

how far you want to focus on specific genes as compared with some general pattern

and so on.
Comment
Elio Gentilini

Join Date: Feb 2025

Posts: 4
#4

13 Feb 2025, 07:29

I will try to be more clear. After I use the
twoway contour long z_ Gruppo_num Gene_num, heatmap
Stata gives me this graph

However I would like something like this one

I have 7 gene expression, 3 groups and score of gene expression, transformed to long from wide, giving this result after transformation

reshape long z_, i(CODE) j(Gene) string
(note: j = IFNALPHAmRNA IFNAR1mRNA IFNAR2mRNA IFNBETAmRNA ISG15mRNA ISG56mRNA UNC93B1mRNA)

Data wide -> long
-----------------------------------------------------------------------------
Number of obs. 35 -> 245
Number of variables 221 -> 216
j variable (7 values) -> Gene
xij variables: Gene-expression1, Gene-expression2, Gene-expression3 etc.

I can't describe the error with hmap command because stata try to generate the graph, but although loading in generation of the chart for half or one hour, just nothing appear and continue to load.
I tried using just a dot plot for my data, however I don't think it is the right way to show my them, because I would like to show how expression of a gene (a continuous data) vary among groups and try to show clusters and patterns of expression among my 3 groups.

Attached Files
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

13 Feb 2025, 09:23

If I understand correctly you have z scores in a 3 x 7 array. I still recommend a profile or dot plot, showing value labels and using scientifically or clinically sensible sort order. To show more I would need a data example: see Hemanshu Kumar's link to FAQ Advice #12 in #2 of this thread.
Comment

Elio Gentilini

Join Date: Feb 2025
Posts: 4

13 Feb 2025, 16:25

This is the dataex after transformation from wide to long. Thanks for helping

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double z_ long(Gruppo_num Gene_num)
   3.169179376360953 3 1
 -1.1323372304728825 3 2
 -1.5850098279853462 3 3
    1.88085575817867 3 4
  -.8995726233284671 3 5
   2.501236621992729 3 6
 -1.4895210355230895 3 7
  -.2579666685396823 3 1
-.003676125639154604 3 2
   .1071359496299042 3 3
  -.5394920987957598 3 4
  .49930006183104886 3 5
 -.05069297294438386 3 6
 -.30294594900929483 3 7
 -.30532666647635326 3 1
  -.4840853336160408 3 2
  .06868309651254333 3 3
 -.33190702481729506 3 4
  -.8570887586490585 3 5
  -.4638801806571499 3 6
 -.24306016555676377 3 7
  -.4929844235699497 3 1
  -.6375519800023571 3 2
   .0839843239661045 3 3
  -.4308939719530998 3 4
  -.6671351854004521 3 5
  -.6713186688150485 3 6
  .06646770679456096 3 7
  .05540131953533644 3 1
  -.5781119896188455 3 2
  -.8720599106904778 3 3
  .08339300127327866 3 4
 -.10187934548964143 3 5
   .1851890273064244 3 6
 -.29759563565159264 3 7
 -.36195860146918696 3 1
  1.7350235478606284 3 2
  1.3447829153801332 3 3
  -.3970036402953043 3 4
  1.7910060540370842 3 5
   -.149453112957819 3 6
  .15451363655565772 3 7
  .09122271814609184 3 1
  2.1152145527587796 3 2
  1.9881577132538903 3 3
  .21247661904900778 3 4
  1.1185952133706716 3 5
    .941929021703666 3 6
  -.4061488391164188 3 7
  -.4132492880224612 3 1
  1.1919539369789702 3 2
   1.542975260411882 3 3
  -.3677839294106579 3 4
  .08766603345804426 3 5
 -.26909647891645266 3 6
 -.33466133865135916 3 7
 .055401319535340526 3 1
  .21705284157965485 3 2
  -.1690588925626785 3 3
-.060594292005382105 3 4
   .8925742089849635 3 5
 -.19559454705698387 3 6
   -.878805551997333 3 7
   1.777866014318057 3 1
 -.40277271391008707 3 2
 .045929201920346746 3 3
  1.3707307263341746 3 4
  -.9549060567428004 3 5
   .5892234013583248 3 6
 -1.1801818392451102 3 7
  2.0887782563607376 3 1
 -1.4118023066355216 3 2
 -1.9883416711056943 3 3
  4.0664211043180964 3 4
 -1.2336681349596144 3 5
   2.849940108008495 3 6
 -1.7757023312547313 3 7
     2.2633716054298 3 1
  -1.345302667831052 3 2
 -1.7819065352314225 3 3
  2.1064272493412584 3 4
   .5327016631274175 3 5
  1.8700505663654956 3 6
 -1.6598954715800376 3 7
  -.5174807026609172 1 1
  -.7731591841878861 1 2
  -.4171236288885825 1 3
  -.5023738976065352 1 4
  -.5414897783077025 1 5
  -.8353956030449888 1 6
  -.4210215081169135 1 7
  -.7816801916865697 2 1
  .14844938044566408 2 2
   .4488955669220739 2 3
  -.6866095522577673 2 4
   .5735551612605914 2 5
  -.5010093786807142 2 6
   .6335838774303268 2 7
  -.5483981795539848 2 1
    .082412069980542 2 2
end
label values Gruppo_num Gruppo_num
label def Gruppo_num 1 "BF+", modify
label def Gruppo_num 2 "BF-", modify
label def Gruppo_num 3 "HC", modify
label values Gene_num Gene_num
label def Gene_num 1 "IFNALPHAmRNA", modify
label def Gene_num 2 "IFNAR1mRNA", modify
label def Gene_num 3 "IFNAR2mRNA", modify
label def Gene_num 4 "IFNBETAmRNA", modify
label def Gene_num 5 "ISG15mRNA", modify
label def Gene_num 6 "ISG56mRNA", modify
label def Gene_num 7 "UNC93B1mRNA", modify

Comment

Chen Samulsion

Join Date: Jan 2018

Posts: 926
#7

13 Feb 2025, 18:56

I can't describe the error with hmap command because stata try to generate the graph, but although loading in generation of the chart for half or one hour, just nothing appear and continue to load.

I think you mistake grammar of -hmap-. Probably you write it as hamp z_ Gruppo_num Gene_num, however, this is wrong. The right syntax should be hamp Gruppo_num Gene_num z_. That is to say, you should put the "heat" variable of interest in the last.

Code:

hmap Gruppo_num Gene_num z_, noscatter
Comment
Chen Samulsion

Join Date: Jan 2018

Posts: 926
#8

13 Feb 2025, 19:27

PS. If you want to draw dendrogram like plot in #4, you'd better use R instead. As far as I know, there's no Stata package could draw dendrogram in heatplot. https://statisticsglobe.com/heatmap-in-r
Comment

Nick Cox

Join Date: Mar 2014
Posts: 35724

14 Feb 2025, 02:06

Thanks for posting a data example. I confess readily that I know very little about this kind of data or what is scientifically or clinically interesting or important in its analysis. More statistically, I note much variability that the heat map suppresses, guess that the data example isn't complete, and wonder whether there is a more direct way to show the data and what they imply.

The principle of a heat map is that people can detect patterns and anomalies by decoding colour differences, but that rarely works well for me.

I am far from clear that this particular graph is helpful, let alone ideal, but I'd encourage more open-mindedness about how to show the data. The use of medians is a little arbitrary but more of a default summary when faced with erratic-seeming data.

Code:

* Example generated by -dataex-. To install: ssc install dataex
clear
input double z_ long(Gruppo_num Gene_num)
   3.169179376360953 3 1
 -1.1323372304728825 3 2
 -1.5850098279853462 3 3
    1.88085575817867 3 4
  -.8995726233284671 3 5
   2.501236621992729 3 6
 -1.4895210355230895 3 7
  -.2579666685396823 3 1
-.003676125639154604 3 2
   .1071359496299042 3 3
  -.5394920987957598 3 4
  .49930006183104886 3 5
 -.05069297294438386 3 6
 -.30294594900929483 3 7
 -.30532666647635326 3 1
  -.4840853336160408 3 2
  .06868309651254333 3 3
 -.33190702481729506 3 4
  -.8570887586490585 3 5
  -.4638801806571499 3 6
 -.24306016555676377 3 7
  -.4929844235699497 3 1
  -.6375519800023571 3 2
   .0839843239661045 3 3
  -.4308939719530998 3 4
  -.6671351854004521 3 5
  -.6713186688150485 3 6
  .06646770679456096 3 7
  .05540131953533644 3 1
  -.5781119896188455 3 2
  -.8720599106904778 3 3
  .08339300127327866 3 4
 -.10187934548964143 3 5
   .1851890273064244 3 6
 -.29759563565159264 3 7
 -.36195860146918696 3 1
  1.7350235478606284 3 2
  1.3447829153801332 3 3
  -.3970036402953043 3 4
  1.7910060540370842 3 5
   -.149453112957819 3 6
  .15451363655565772 3 7
  .09122271814609184 3 1
  2.1152145527587796 3 2
  1.9881577132538903 3 3
  .21247661904900778 3 4
  1.1185952133706716 3 5
    .941929021703666 3 6
  -.4061488391164188 3 7
  -.4132492880224612 3 1
  1.1919539369789702 3 2
   1.542975260411882 3 3
  -.3677839294106579 3 4
  .08766603345804426 3 5
 -.26909647891645266 3 6
 -.33466133865135916 3 7
 .055401319535340526 3 1
  .21705284157965485 3 2
  -.1690588925626785 3 3
-.060594292005382105 3 4
   .8925742089849635 3 5
 -.19559454705698387 3 6
   -.878805551997333 3 7
   1.777866014318057 3 1
 -.40277271391008707 3 2
 .045929201920346746 3 3
  1.3707307263341746 3 4
  -.9549060567428004 3 5
   .5892234013583248 3 6
 -1.1801818392451102 3 7
  2.0887782563607376 3 1
 -1.4118023066355216 3 2
 -1.9883416711056943 3 3
  4.0664211043180964 3 4
 -1.2336681349596144 3 5
   2.849940108008495 3 6
 -1.7757023312547313 3 7
     2.2633716054298 3 1
  -1.345302667831052 3 2
 -1.7819065352314225 3 3
  2.1064272493412584 3 4
   .5327016631274175 3 5
  1.8700505663654956 3 6
 -1.6598954715800376 3 7
  -.5174807026609172 1 1
  -.7731591841878861 1 2
  -.4171236288885825 1 3
  -.5023738976065352 1 4
  -.5414897783077025 1 5
  -.8353956030449888 1 6
  -.4210215081169135 1 7
  -.7816801916865697 2 1
  .14844938044566408 2 2
   .4488955669220739 2 3
  -.6866095522577673 2 4
   .5735551612605914 2 5
  -.5010093786807142 2 6
   .6335838774303268 2 7
  -.5483981795539848 2 1
    .082412069980542 2 2
end
label values Gruppo_num Gruppo_num
label def Gruppo_num 1 "BF+", modify
label def Gruppo_num 2 "BF-", modify
label def Gruppo_num 3 "HC", modify
label values Gene_num Gene_num
label def Gene_num 1 "IFNALPHAmRNA", modify
label def Gene_num 2 "IFNAR1mRNA", modify
label def Gene_num 3 "IFNAR2mRNA", modify
label def Gene_num 4 "IFNBETAmRNA", modify
label def Gene_num 5 "ISG15mRNA", modify
label def Gene_num 6 "ISG56mRNA", modify
label def Gene_num 7 "UNC93B1mRNA", modify

egen median = median(z_), by(Ge Gr)

scatter z median Gruppo_num, ms(Oh Dh) msize(*1 *2) by(Gene_num, compact row(1) note("") legend(pos(12))) xsc(r(0.8 3.2)) xla(1/3, glp(solid) tlc(none) valuelabel) ytitle(z score) yla(-2/4) xtitle("") subtitle(, size(small)) legend(order(1 "values" 2 "median")  row(1))

Click image for larger version

Name: genes.png
Views: 1
Size: 75.2 KB
ID: 1772708

Comment

Chen Samulsion

Join Date: Jan 2018
Posts: 926

#10

14 Feb 2025, 07:22

Dear Nick Cox, how about using -zmap- in this case?

Code:

sum z_, detail
zmap z Gruppo_num Gene_num, breaks(`r(p5)' `r(p10)' `r(p20)' `r(p50)' `r(p70)' `r(p90)' `r(p95)') ysc(on) xsc(on) xlabel(1/7, valuelabel) ytitle("Gruppo_num") ylabel(1/3, valuelabel) msymbol(S ..) mcolor(stc1)

Click image for larger version

Name: Graph.png
Views: 1
Size: 163.5 KB
ID: 1772715

Comment

Nick Cox

Join Date: Mar 2014

Posts: 35724
#11

14 Feb 2025, 11:09

Chen Samulsion Thanks for your interest in my zmap command (from SSC; cf. https://www.stata.com/statalist/arch.../msg01084.html)

-- but it can only show one point for each (row = group, column = gene) pair.

Much of the rationale for my previous answer is that there are repeated observations for many such pairs. (I suspect that the data example in #6 is far from complete and truncated because dataex was used with its default, without use of count().)
Comment
Elio Gentilini

Join Date: Feb 2025

Posts: 4
#12

14 Feb 2025, 15:38

Thanks for helping,
After Chen Samulsion explanation I re-created the code using hmap Gene_num Gruppo_num z_, noscatter

Although without a legend is not so clear, I think I will e forced to use heatplot of hexplot.

Tanks also to Nick Cox because yes, a plot is necessary in order to show clear data, however I was focused on hmap because heatplots are very common in order to show gene expression in subjects group, as an example see this article in nature portfolio Queiroz, M.A.F., Brito, W.R.S., Pereira, K.A.S. et al. Severe COVID-19 and long COVID are associated with high expression of STING, cGAS and IFN-α. Sci Rep 14, 4974 (2024). https://doi.org/10.1038/s41598-024-55696-0
And as you can see in the chart some genes expression change in intensity from a group to another in a specific order (es. INFAR2 or UNC93B)

do you think that maybe a violin plot could help me showing these data in their continuous form in a better way than just dot plot?
However yes, I used dataex without count() but the graph is quite the same when I write your code

Last edited by Elio Gentilini; 14 Feb 2025, 15:40.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#13

14 Feb 2025, 15:49

I can’t imagine that a violin plot would help with data like yours. As presented in your example the number of values for each gene, group pair varies but is between 1 and about 10, so trying to get a kernel density estimate out of such subsamples appears to be not needed, not practical, or both. But perhaps I misunderstand what you are considering.
Comment

Announcement