Creating a Stata data file from a JSON formatted file

Eric Gardner

Join Date: Sep 2016

Posts: 3
#1

Creating a Stata data file from a JSON formatted file

23 Sep 2016, 10:54

Hi,

I've seen other posts on this topic but they haven't helped for the particular file I'm working with. Here is the link to that file:
https://raw.githubusercontent.com/va...PTSD_FY15.json.

I've tried to simply import the file. I saved the text from the weblink and created a .json file in the do-file editor. The code ((import delimited using "PTSD FY15.json") created a file with lots of variables and no observations (I didn't specify a delimiter although commas separate the observations). I thought maybe the commas would be viewed automatically as the delimiter.
Stata returns: import delimited using "PTSD FY15.json"
(13,448 vars, 0 obs)

I installed and tried the insheetjson command. It returns this: insheetjson using "https://raw.githubusercontent.com/vacobrydsk/VHA-Files/master/NEPEC_AnnualDataSheet_PTSD_FY15.json"
3362 observations updated/written. When I try to save this file, Stata returns 'no vars defined' r(111)... which of course is true since no variables show up in the variables window.. I'm using Stata 14.1 (Windows).

So, one strategy returns vars and no observations while the other returns observations and no variables. Sadly, these clues haven't been enough for me to solve the puzzle.

Thanks,

Eric
Tags: None

Joe Canner

Join Date: Mar 2014
Posts: 580

23 Sep 2016, 11:26

Eric,

insheetjson is definitely the way to go, but you need to create the variables first and then include them in the insheetjson command. Here's an example:

Code:

gen str30 name=""
gen str15 screen_name=""
gen str30 location=""
gen str150 description=""
gen str50 url=""
gen str30 created_at=""

gen str10 followers_count=""
gen str10 friends_count=""
gen str10 listed_count=""
gen str10 favourites_count=""
gen str10 statuses_count=""

insheetjson using file.txt, col("name" "screen_name" "location" "description" "url" "created_at" "followers_count" "friends_count" "listed_count" "favourites_count" "statuses_count")
  
save filename, replace

Regards,
Joe

Comment

Eric Gardner

Join Date: Sep 2016

Posts: 3
#3

23 Sep 2016, 13:25

Thanks Joe. I ran the following code (I used the same string length as they seem to accommodate the length of the string). The gen code worked fine.

gen str50 category=""
gen str50 item=""
gen str50 station=""
gen str30 value=""

insheetjson using PTSDFY15.json, col("category" "prev_PTSD" "item" "total_VHA" "station" "value")

Here's what Stata returned

. insheetjson using PTSDFY15.json, col("category" "item" "station" "value")
libjson::parse: unexpected character 'c' at position 1 forced error #-10
libjson::parse(): 3301 subscript invalid [13]
injson_sheet(): - function returned error
<istmt>: - function returned error
r(3301);

Looking at the raw file from the link, this is the first line of data
[{"Category": "Prevalence of Confirmed PTSD Diagnosis", "Item": "Total VHA Service Users ", "Station": "402", "Value": "39999"}, However, here is a later line of data
{"Category": "MH Service Utilization with Confirmed PTSD Diagnosis", "Item": "Number of Service Users with a Confirmed Diagnosis of PTSD", "Station": "402", "Value": "4218"} Almost looks like there are multiple embedded files.

Eric
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#4

23 Sep 2016, 13:35

Eric,

I got a different error with your code, but here's what worked:

Code:

gen str50 category="" gen str50 item="" gen str50 station="" gen str30 value="" insheetjson using PTSDFY15.json, col("Category" "Item" "Station" "Value")

I think the capitalization of the columns in the JSON file are important. Also, only include the names of columns actually in the JSON file.

The JSON file is actually fine, although it is stored all as one line which makes it hard to follow. If you do the following:

Code:

insheetjson using PTSDFY15.json, showresponse

you can better see the actual structure of the data.

Regards,
Joe
Comment
Joe Canner

Join Date: Mar 2014

Posts: 580
#5

23 Sep 2016, 13:40

P.S. In case it helps, JSON files are stored as variable-value pairs. In your case, you have the first variable ("Category") followed by a colon and the value of Category ("Prevalence of..."), then a comma, then the next variable and its value, etc with four variables total. This whole thing is surrounded by {} and each collection of four variables is separated by a comma. This structure can be used to organize data in an outline structure, where some items have sub-items.
Comment

Robert Picard

Join Date: Mar 2014
Posts: 1536

24 Sep 2016, 09:10

Let me add that a JSON file is a regular text file with information that is structured according to the JSON format rules. Since it is a text file, it can be viewed with any text editor (including Stata's do-file editor). A quick look at the file in question suggest that it is a rectangular dataset. The main issue is that there are no carriage return/linefeed characters so the file is essentially one long line of data. This can be quickly addressed by noting that observation breaks occur at "}, {" text pattern and using the filefilter command to substitute this pattern of characters with a carriage return. Once this fix is in, the data can be input as a single string variable with double-quotes preserved.

There are many ways to parse the data once it is in memory. Here's one that uses moss (from SSC) to find all text enclosed in double-quotes.

Code:

* split original one line JSON data into rows
filefilter NEPEC_AnnualDataSheet_PTSD_FY15.json json.txt, from("}, {") to(\r) replace

clear
import delimited json.txt, delim("\t") stripquotes(no)

moss v1, match(`"("[^"]+")"') regex

There are 8 matches in this case. The odd matches are the item description and the even matches, the item value. Here are the first 5 observations:

Code:

. list _match1 _match3 _match5 _match7 in 1/5

     +--------------------------------------------+
     |    _match1   _match3     _match5   _match7 |
     |--------------------------------------------|
  1. | "Category"    "Item"   "Station"   "Value" |
  2. | "Category"    "Item"   "Station"   "Value" |
  3. | "Category"    "Item"   "Station"   "Value" |
  4. | "Category"    "Item"   "Station"   "Value" |
  5. | "Category"    "Item"   "Station"   "Value" |
     +--------------------------------------------+

. list _match2 _match4 _match6 _match8 in 1/5

     +-------------------------------------------------------------------------------------------+
     |                                  _match2                      _match4   _match6   _match8 |
     |-------------------------------------------------------------------------------------------|
  1. | "Prevalence of Confirmed PTSD Diagnosis"   "Total VHA Service Users "     "402"   "39999" |
  2. | "Prevalence of Confirmed PTSD Diagnosis"   "Total VHA Service Users "     "405"   "24493" |
  3. | "Prevalence of Confirmed PTSD Diagnosis"   "Total VHA Service Users "     "518"   "18324" |
  4. | "Prevalence of Confirmed PTSD Diagnosis"   "Total VHA Service Users "     "523"   "57821" |
  5. | "Prevalence of Confirmed PTSD Diagnosis"   "Total VHA Service Users "     "608"   "23942" |
     +-------------------------------------------------------------------------------------------+

Comment

Eric Gardner

Join Date: Sep 2016

Posts: 3
#7

26 Sep 2016, 14:50

Thanks Joe and Robert. Finally got the insheetjson code to work. I realized that (somehow) the .json file I created (on my end) was incorrect. No idea how I accomplished that but it finally occurred to me the issue must be with the file itself (and, not the code Joe sent).

Eric
Comment
wbuchanan

Join Date: Mar 2014

Posts: 1362
#8

01 Oct 2016, 16:49

Eric Gardner you could also use the -jsonio- program to load the data in Stata from a flat file or using API calls. There is documentation available at https://wbuchanan.github.io/StataJSON.
Comment

Kosta Lamprogiorgos

Join Date: Apr 2020
Posts: 3

01 Apr 2020, 19:17

Hi,

I am using Stata 16 in windows 10. I have the same problem as Eric's example above. However, the data in my JSON file is not structured, meaning that the same delimiters are used in different occasions.
For example, each observation is within {" "}, each varriable is seperated by "," and the values of each varriable comes after ":". The problem is that within each observation the symbols [{}] are used intercheangably and i cannot specify accurately where each observation starts. Furthermore, some of the varriables have multiple values and there are also subcategories of some of the varriables. Not all observations have exactly the same number of subvarriables.

The data in the JSON file for my first observation is (I have highlighted in red colour where the first observations begins and where it ends):

Code:

{"result": [{"issue_codes": ["BUD"], "received": 995428800, "has_amendments": false, "client_name": "TRANSPORTATION DISTRICT COMMISSION OF HAMPTON ROADS", "period": "Mid-Year (Jan 1 - Jun 30)", "registrant": "HOLLAND & KNIGHT LLP", "amount": 20000, "client": {"legal_name": "TRANSPORTATION DISTRICT COMMISSION OF HAMPTON ROADS", "name": "TRANSPORTATION DISTRICT COMMISSION OF HAMPTON ROADS", "naics": "485119", "gvkey": null, "ticker": "Unlisted", "id": null, "bvdid": "US132438635L"}, "specific_issues": [{"text": "", "gov_entities": ["Federal Transit Administration (FTA)", "U.S. House of Representatives", "U.S. Senate"], "lobbyists": ["Barkovic, Lisa A", "Whitestone, David C"], "year": 2001, "issue": "BUD", "id": "5U5WN9cPNKNj6hEynnVwck"}], "year": 2001, "is_latest_amendment": false, "type": "MID-YEAR REPORT", "id": "0CA354DE-D9EF-4D44-ADF3-7C50F22A9174", "alternate_name": "TRANSPORTATION DISTRICT COMMISSION OF HAMPTON ROADS"}, {"issue_codes": ["HCR"], "received": 1492187423, "

Intially I created all the varriable names and used the insheetjson code that Joe Canner suggested above.

Code:

Code:

clear

gen str300 result=""
gen str50 issue_codes=""
gen str50 received=""
gen str50 has_amendments=""
gen str150 client_name=""
gen str50 period=""
gen str100 registrant=""
gen str30 amount=""
gen str50 legal_name=""
gen str150 name=""
gen str30 naics=""
gen str30 gvkey=""
gen str30 ticker=""
gen str30 id=""
gen str30 bvdid=""
gen str350 text=""
gen str80 gov_entities=""
gen str80 lobbyists=""
gen str30 year=""
gen str50 issue=""
gen str30 is_latest_amendment=""
gen str50 type=""
gen str80 alternate_name=""
gen str350 bills_by_algo=""
gen str80 top_terms=""
gen str30 number=""
gen str80 session=""
gen str80 public_law=""
gen str80 sponsor=""
gen str80 firstname=""
gen str80 district=""
gen str50 title=""
gen str50 lastname=""
gen str50 religion=""
gen str80 state=""
gen str30 gender=""
gen str80 party=""
gen str350 committees=""


insheetjson  using response.json, col(result" "issue_codes" "received" "has_amendments" "client_name" "period" "period" "registrant" "amount" "legal_name" "name" "naics" "gvkey" "ticker" "id" "bvdid" "text" "gov_entities" "lobbyists" "year" "issue" "is_latest_amendment" "type" "alternate_name" "bills_by_algo" "top_terms" "number" "session" "public_law" "sponsor" "firstname" "district" "title" "lastname" "religion" "state" "gender" "party" "committees")

However, I get the following error message:

Code:

 Invalid column name/selector 'issue_codes'. (Possible name candidates are: "result" )
 Empty result returned; Nothing to do.

Then, in my text file I replaced }, {"issue with ????? . I included the word "issue" because the sequence of the following symbols }, { can also be found between my variables in each observation.

Code:

Code:

clear

filefilter lobbydata.txt lobbydata_adj.txt, from("?????") to(\r) replace /*

clear
import delimited lobbydata_adj.txt, delim("\t") stripquotes(no)

This created 1 varriable (v1) and 5000 observations. Each of these varriables contain all the other varriables and their values. The problem is that I do not know how to properly split each varriable.

I tried:

Code:

 split v1, generate(NewVar) parse(", ") limit(100)

By not including the "limit" command I get the following error message: "no room to add more variables
Up to 5,000 variables are currently allowed, although you could reset the
maximum using set maxvar; see help

end of do-file

However, some varriables take 1 value while others take more values seperated with ", ". That means it is included in different varriables (e.g., see observations 6,7 below):

Code:

.  list NewVar1 NewVar2 in 1/7

     +--------------------------------------------------------------+
     |                             NewVar1                  NewVar2 |
     |--------------------------------------------------------------|
  1. | {"result": [{"issue_codes": ["BUD"]    "received": 995428800 |
  2. |                    _codes": ["HCR"]   "received": 1492187423 |
  3. |                    _codes": ["DEF"]   "received": 1287628382 |
  4. |                    _codes": ["DEF"]   "received": 1216650203 |
  5. |                         _codes": []   "received": 1109048400 |
     |--------------------------------------------------------------|
  6. |                     _codes": ["HCR"                    "LBR" |
  7. |                     _codes": ["TAX"                    "TRD" |
     +--------------------------------------------------------------

Is there a better and more tidy way to do all this? Any help would be greatly appreciated.

Kind Regards,
Kosta

Comment

wbuchanan

Join Date: Mar 2014

Posts: 1362
#10

02 Apr 2020, 06:30

Kosta Lamprogiorgos
i think part of the problem is that issue_codes is an array and not an atomic element. Have you tried using `jsonio` to read your data?
Comment

Kosta Lamprogiorgos

Join Date: Apr 2020
Posts: 3

#11

04 Apr 2020, 08:29

wbuchanan
Thank you for the reply!

I used the following code:

Code:

jsonio kv, filenm ("response.json") nourl

but I got the following error message:

Code:

java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImp
> l.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAcc
> essorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at com.stata.Javacall.load(Javacall.java:132)
        at com.stata.Javacall.load(Javacall.java:92)
Caused by: java.lang.IndexOutOfBoundsException: Index: 4, Size: 4
        at java.util.ArrayList.rangeCheck(ArrayList.java:653)
        at java.util.ArrayList.get(ArrayList.java:429)
        at org.paces.Stata.Input.FlatJSON.flatten(FlatJSON.java:205)
        at org.paces.Stata.Input.FlatJSON.flatten(FlatJSON.java:194)
        at org.paces.Stata.Input.FlatJSON.flatten(FlatJSON.java:176)
        at org.paces.Stata.Input.FlatJSON.flatten(FlatJSON.java:194)
        at org.paces.Stata.Input.FlatJSON.flatten(FlatJSON.java:116)
        at org.paces.Stata.Input.InJSON.insheetFile(InJSON.java:144)
        ... 6 more
r(5100);

end of do-file

r(5100);

I uninstalled `jsonio` and I re-installed it, but I still get the same error message. I have not yet figured out why I get this error message.

Kind Regards,
Konstantinos

Comment

wbuchanan

Join Date: Mar 2014

Posts: 1362
#12

05 Apr 2020, 14:31

Kosta Lamprogiorgos
Which version of jsonio did you install?
Comment

Kosta Lamprogiorgos

Join Date: Apr 2020
Posts: 3

#13

18 Apr 2020, 07:24

@wbuchanan

I had installed jsonio from stata using ssc install jsonio because i thought that this would be the most updated version. So the version I had was the following:

Code:

[16]  package jsonio from http://fmwww.bc.edu/repec/bocode/j
      'JSONIO': module for I/O operations on JSON data

After I saw your message I found an update from github (see below).

Code:

[22] package jsonio from https://wbuchanan.github.io/StataJSON
      'JSONIO': A Stata plugin for I/O operations on JSON data

I installed it and it works now. I got two columns: the first one is the "key" and the second one is the "value". However, some of my varriables take many values. Therefore, they come as different varriables (e.g., see below "issue_codes").

Code:

key    value
/result_2478/alternate_name    Bockorny Group (on behalf of National Business Aviation Association)
/result_2479/issue_codes/element_1    HCR
/result_2479/issue_codes/element_2    LBR
/result_2479/issue_codes/element_3    FOR
/result_2479/issue_codes/element_4    HOU
/result_2479/issue_codes/element_5    ECN
/result_2479/issue_codes/element_6    IMM
/result_2479/issue_codes/element_7    TAX
/result_2479/issue_codes/element_8    MMM
/result_2479/issue_codes/element_9    LAW
/result_2479/issue_codes/element_10    BUD
/result_2479/issue_codes/element_11    ENV
/result_2479/issue_codes/element_12    WEL
/result_2479/issue_codes/element_13    AGR
/result_2479/issue_codes/element_14    DEF
/result_2479/received    1319057854

Then I used the "regexm" command to create new varribles for each of the varriables i need. For those that have many elements such as the "issue_codes" shown above i created a new varriable for each element. For example:

Code:

gen issue_codes1 = value if regexm(key,"[/]result[_][0-9][0-9][0-9][0-9][/][issue_codes[/]element_1") | regexm(key,"[/]result[_][0-9][0-9][0-9][/][issue_codes[/]element_1") | regexm(key,"[/]result[_][0-9][0-9][/][issue_codes[/]element_1") | regexm(key,"[/]result[_][0-9][/][issue_codes[/]element_1") /*need to create loop for each observation or create different issue codes varriables for each element and then use xfill each of them*/

gen issue_codes2 = value if regexm(key,"[/]result[_][0-9][0-9][0-9][0-9][/][issue_codes[/]element_2") | regexm(key,"[/]result[_][0-9][0-9][0-9][/][issue_codes[/]element_2") | regexm(key,"[/]result[_][0-9][0-9][/][issue_codes[/]element_2") | regexm(key,"[/]result[_][0-9][/][issue_codes[/]element_2")

Then i created a new varriable "n_obs" that shows the number of observation for each varriable:

Code:

split key, generate(observation) parse(/)
destring observation2 , gen(n_obs) i(/result_)
drop observation1 observation2 observation3 observation4 observation5 observation6

generate id = _n

Then I was able to use xfill command for each of my varriables to fill the empty cells within each observation:

Code:

xfill issue_codes1, i(n_obs)
xfill issue_codes2, i(n_obs)

That is what it worked for me. I do not know if there is any better way to do all this though!

Kosta

Comment

wbuchanan

Join Date: Mar 2014

Posts: 1362
#14

19 Apr 2020, 04:52

Kosta Lamprogiorgos
Since JSON can be arbitrarily complex with regards to its structure, the goal was to flatten the data into a single two dimensional structure that retained enough of the information about the structure of the data to allow it to be reshaped reasonably easily.

Aside from recommending the use of the Unicode regular expression commands to clean up what you have above, if you split the key variable on the “/“ character, you can reshape the data using the generated variable that contains the “element_#” values. I’m not certain what the end result you are looking for is, but the goal with jsonio was to get the data into a structure that could be more easily manipulated within Stata.
Comment
Mohamad Soltani

Join Date: Jun 2019

Posts: 6
#15

23 Feb 2023, 17:31

Hi,

I have been trying to import a .json file to Stata using insheetjson. Unfortunately, however, after taking a long time, it only imports the first row of data. Has anyone faced a similar issue and found a solution around it?

Here is the line of the code that I am using:

Code:

gen str40 user_id = "" gen str40 user_name = "" gen str40 review_time = "" gen str4 rating = "" gen str400 review_text = "" gen str240 pics = "" gen str40 resp_time = "" gen str400 resp_text = "" gen str240 gmap_id = "" insheetjson using "~/review-Vermont_10.json", col("user_id" "name" "time" "rating" "text" "pics" "time" "text" "gmap_id")

Here is where you can find and download the data: https://drive.google.com/file/d/1Gqs...AbeqX13dI/view

And here, you can read more about the data and its format: https://jiachengli1995.github.io/goo...x.html#subsets

I am facing the same issue when I use jsonio.
Comment

Announcement