key-value format

Luis Pecht

Join Date: May 2017
Posts: 151

05 Nov 2022, 03:50

Hi,

in IMHO, currently biggest Stata´s flaw, in data management area, is the native "lack" of support for key-value (including JSON, flexible schemas). Maybe, It relegates that to the python integration (sfi) or it will debut in v18.
Anyway, most data available, nowadays, comes in such format.

See toy example below:

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(month day) int year str8 product str45 tags
1 1 2022 "product1" `"{  "Dept": "HR",  "Proj": "alpha"}"'  
1 2 2022 "product3" `"{  "Dept": "OPS",  "Proj": "beta"}"'  
1 3 2022 "product3" `"{  "Dept": "IT",  "Proj": "delta"}"'  
1 4 2022 "product1" `"{  "Dept": "IT",  "Proj": "gamma"}"'  
1 5 2022 "product3" `"{  "Dept": "OPS",  "Proj": "beta"}"'  
1 6 2022 "product1" `"{  "Dept": "HR",  "Proj": "epsilon"}"'
1 7 2022 "product4" `"{  "Dept": "HR",  "Proj": "zeta"}"'   
1 8 2022 "product4" `"{  "Dept": "MKT",  "Proj": "gamma"}"' 
end

------------------ copy up to and including the previous line ------------------

Listed 8 out of 8 observations

. list

     +----------------------------------------------------------------------+
     | month   day   year    product                                   tags |
     |----------------------------------------------------------------------|
  1. |     1     1   2022   product1     {  "Dept": "HR",  "Proj": "alpha"} |
  2. |     1     2   2022   product3     {  "Dept": "OPS",  "Proj": "beta"} |
  3. |     1     3   2022   product3     {  "Dept": "IT",  "Proj": "delta"} |
  4. |     1     4   2022   product1     {  "Dept": "IT",  "Proj": "gamma"} |
  5. |     1     5   2022   product3     {  "Dept": "OPS",  "Proj": "beta"} |
     |----------------------------------------------------------------------|
  6. |     1     6   2022   product1   {  "Dept": "HR",  "Proj": "epsilon"} |
  7. |     1     7   2022   product4      {  "Dept": "HR",  "Proj": "zeta"} |
  8. |     1     8   2022   product4    {  "Dept": "MKT",  "Proj": "gamma"} |
     +----------------------------------------------------------------------+

My question is about an advice on how to convert it to tabular format, such as below.
Possible tools would be Regular expressions, SSC JSON commands, Associative Arrays.

output formats :

Code:

     +------------------------------------------------+
     | month   day   year    product   Dept      Proj |
     |------------------------------------------------|
  1. |     1     1   2022   product1     HR     alpha |
  2. |     1     2   2022   product3    OPS      beta |
  3. |     1     3   2022   product3     IT     delta |
  4. |     1     4   2022   product1     IT     gamma |
  5. |     1     5   2022   product3    OPS      beta |
     |------------------------------------------------|
  6. |     1     6   2022   product1     HR   epsilon |
  7. |     1     7   2022   product4     HR      zeta |
  8. |     1     8   2022   product4    MKT     gamma |
     +------------------------------------------------+

or, even better,

Code:

     +-------------------------------------------------------------------------------------------------------------------------------------------------------+
     | month   day   year    product   Dept_HR   Dept_IT   Dept_OPS   Dept_MKT   Proj_alpha   Proj_beta   Proj_gamma   Proj_delta   Proj_epsilon   Proj_zeta |
     |-------------------------------------------------------------------------------------------------------------------------------------------------------|
  1. |     1     1   2022   product1         1         0          0          0            1           0            0            0              0           0 |
  2. |     1     2   2022   product3         0         0          1          0            0           1            0            0              0           0 |
  3. |     1     3   2022   product3         0         1          0          0            0           0            0            1              0           0 |
  4. |     1     4   2022   product1         0         1          0          0            0           0            1            0              0           0 |
  5. |     1     5   2022   product3         0         0          1          0            0           1            0            0              0           0 |
     |-------------------------------------------------------------------------------------------------------------------------------------------------------|
  6. |     1     6   2022   product1         1         0          0          0            0           0            0            0              1           0 |
  7. |     1     7   2022   product4         1         0          0          0            0           0            0            0              0           1 |
  8. |     1     8   2022   product4         0         0          0          1            0           0            1            0              0           0 |
     +-------------------------------------------------------------------------------------------------------------------------------------------------------+

thanks in advance.

Tags: None

Nick Cox

Join Date: Mar 2014
Posts: 35724

05 Nov 2022, 04:29

Thanks for your clear data example.

most data available, nowadays, comes in such format

Never seen any such datasets in my life. I have seen several mentions of JSON. So, this must depend on your field.

Also, you don't need Python for this or to hope that Stata 18 will oblige. It yields to a few lines of commands and functions that have mostly been in Stata for years if not decades. So, writing your own dedicated do-file or command is feasible now. In fact, it's probably already been done but if I see the word JSON I just move on and don't register details mentally.

Code:

* Example generated by -dataex-. For more info, type help dataex
clear
input byte(month day) int year str8 product str45 tags
1 1 2022 "product1" `"{  "Dept": "HR",  "Proj": "alpha"}"'  
1 2 2022 "product3" `"{  "Dept": "OPS",  "Proj": "beta"}"'  
1 3 2022 "product3" `"{  "Dept": "IT",  "Proj": "delta"}"'  
1 4 2022 "product1" `"{  "Dept": "IT",  "Proj": "gamma"}"'  
1 5 2022 "product3" `"{  "Dept": "OPS",  "Proj": "beta"}"'  
1 6 2022 "product1" `"{  "Dept": "HR",  "Proj": "epsilon"}"'
1 7 2022 "product4" `"{  "Dept": "HR",  "Proj": "zeta"}"'  
1 8 2022 "product4" `"{  "Dept": "MKT",  "Proj": "gamma"}"'
end

replace tags = subinstr(tags, "{", "", .)
replace tags = subinstr(tags, "}", "", .)
* double quotation mark removed
replace tags = subinstr(tags, uchar(34), "", .)
split tags, parse(,)
local J = r(k_new)
drop tags

ret list

forval j = 1/`J' {
    split tags`j', parse(":")
    local newname = strtoname(tags`j'1[1])
    drop tags`j'1
    rename tags`j'2 `newname'
}

drop tags*

list


     +-------------------------------------------------+
     | month   day   year    product   Dept       Proj |
     |-------------------------------------------------|
  1. |     1     1   2022   product1     HR      alpha |
  2. |     1     2   2022   product3    OPS       beta |
  3. |     1     3   2022   product3     IT      delta |
  4. |     1     4   2022   product1     IT      gamma |
  5. |     1     5   2022   product3    OPS       beta |
     |-------------------------------------------------|
  6. |     1     6   2022   product1     HR    epsilon |
  7. |     1     7   2022   product4     HR       zeta |
  8. |     1     8   2022   product4    MKT      gamma |
     +-------------------------------------------------+

Your second dataset yields to an exercise in using tabulate to create indicator variables.

This assumes that the same keys always occur and in the same order. If that's not true, you will need more complicated code. I have focused on solving the particular example without reading anything external on what is and is not allowed.

Comment

Hemanshu Kumar

Join Date: Mar 2015
Posts: 1411

05 Nov 2022, 04:31

Code:

gen Dept = ustrregexs(1) if ustrregexm(tags,`""Dept": "(.+)","')
gen Proj = ustrregexs(1) if ustrregexm(tags,`""Proj": "(.+)""')

* version 1
. li month day year product Dept Proj, noobs sep(0)

  +------------------------------------------------+
  | month   day   year    product   Dept      Proj |
  |------------------------------------------------|
  |     1     1   2022   product1     HR     alpha |
  |     1     2   2022   product3    OPS      beta |
  |     1     3   2022   product3     IT     delta |
  |     1     4   2022   product1     IT     gamma |
  |     1     5   2022   product3    OPS      beta |
  |     1     6   2022   product1     HR   epsilon |
  |     1     7   2022   product4     HR      zeta |
  |     1     8   2022   product4    MKT     gamma |
  +------------------------------------------------+

* now for version 2
levelsof Dept, local(depts)
levelsof Proj, local(projs)
foreach d of local depts {
    gen byte Dept_`d' = (Dept == "`d'")
}
foreach p of local projs {
    gen byte Proj_`p' = (Proj == "`p'")
}
drop Dept Proj tags

li, noobs sep(0)

  +-------------------------------------------------------------------------------------------------------------------------------------------+
  | month   day   year    product   Dept_HR   Dept_IT   Dept_MKT   Dept_OPS   Proj_a~a   Proj_b~a   Proj_d~a   Proj_e~n   Proj_g~a   Proj_z~a |
  |-------------------------------------------------------------------------------------------------------------------------------------------|
  |     1     1   2022   product1         1         0          0          0          1          0          0          0          0          0 |
  |     1     2   2022   product3         0         0          0          1          0          1          0          0          0          0 |
  |     1     3   2022   product3         0         1          0          0          0          0          1          0          0          0 |
  |     1     4   2022   product1         0         1          0          0          0          0          0          0          1          0 |
  |     1     5   2022   product3         0         0          0          1          0          1          0          0          0          0 |
  |     1     6   2022   product1         1         0          0          0          0          0          0          1          0          0 |
  |     1     7   2022   product4         1         0          0          0          0          0          0          0          0          1 |
  |     1     8   2022   product4         0         0          1          0          0          0          0          0          1          0 |
  +-------------------------------------------------------------------------------------------------------------------------------------------+

Last edited by Hemanshu Kumar; 05 Nov 2022, 05:07.

Comment

Luis Pecht

Join Date: May 2017

Posts: 151
#4

05 Nov 2022, 04:56

Thanks Nick and Hemanshu,

Yes Nick, in my field, a retail private held company, 90% data comes form APIs returning JSON files and so maybe, I am a minority ,here.

I do not want to spark a debate here, but JSON format is the "new" CSV file and I think it already deserves a drop-down command in File section menu.

Is not easy to implement it, due to its flexibility, as you mention in the end of #2, however Stata Corp has all needed tools to build it up..

thanks, again
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35724
#5

05 Nov 2022, 05:47

Debate means discussion which is good and even disagreement can be fine too. In essence, StataCorp want to know what is needed and by how many people: one voice asking for say NetCDF support is just one voice but many voices asking for JSON is more leverage.

As a footnote, wanting the indicator variables from a string categorical variable as addressed by Hemanshu Kumar in #3 can also be addressed in this way

Code:

encode Dept, gen(dept) dummieslab dept

where dummieslab is from SSC. I've come to dislike the term "dummies" despite being second author of that command.
1 like
Comment
daniel klein

Join Date: Mar 2014

Posts: 3860
#6

05 Nov 2022, 06:33

Stata is designed for statistically analyzing rectangular datasets. There is no obvious one-to-one mapping from generic key-value pairs to rectangular datasets if values can, themselves, contain key-value pairs or even more complex nested structures. JSON is great at combining minimal complexity (in terms of specification) with maximum flexibility but I am not sure what JSON support in Stata would look like and whether it would be all that useful. As far as I can see, you would need to write customized code for anything but the most basic key-value pairs with scalar keys and scalar (or vector) values anyway. Mata provides associative arrays to do that.

By the way, StataCorp already uses JSON format under the hood of their collect commands.
2 likes
Comment
Marc Kaulisch

Join Date: Jan 2016

Posts: 184
#7

06 Nov 2022, 07:29

I would also love to see some built-in JSON functionality in Stata but Daniel and Nick have good points in what is already achievable and constraints.

I like to add that there are two community-packages that may be of help as well. "jsonio" (https://wbuchanan.github.io/StataJSON/) and -insheetjson- (http://fmwww.bc.edu/RePEc/bocode/i/insheetjson.html)
Comment
Luis Pecht

Join Date: May 2017

Posts: 151
#8

07 Nov 2022, 10:09

So based on Nick wise advice :

In essence, StataCorp want to know what is needed and by how many people: one voice asking for say NetCDF support is just one voice but many voices asking for JSON is more leverage.

'
and to keep some governance on the topics , in the forum, I´ve just created a new entry on "Wishlist for Stata 18" (sticky topic) #524 and welcome highly interested's user ,in such JSON features, to like the post.

https://www.statalist.org/forums/for...tata-18/page35

I will share to Stata Corp later on.
Comment

Announcement

key-value format

Comment

Comment

Comment

Comment

Comment

Comment

Comment