
  • Import a text format data with different delimiters into Stata

    Hello,

    I have a series of text files that use the following delimiters:

    Column: '~' (including the single quotes)
    Row: #@#@#

    I have not been able to import this data into Stata. Is there a way to import it? I would appreciate any help.


    Kind regards,
    Shayan

    P.S: I leave a small part of the text file below
    Example :

    A'~'165390'~'1469'~'40'~'2022-01-01 00:00:00'~'Sazon Chileno C.A'~'Venezuela'~''~''~''~''~'Venezuela'~'128119198'~'SAZON CHILENO, C.A'~'PRINCIPAL LA GUAIRA'~''~''~''~''~''~''~'United States'~'ESTADOS UNIDOS'~'Maritime'~'18063120000'~'1'~'.11636363636363636'~'1422'~'1280'~'1065'~'4930'~'99'~'2022-08-15 00:00:00#@#@#A'~'165582'~'1469'~'40'~'2022-01-01 00:00:00'~'Recimetales C.A'~''~''~''~''~''~'Venezuela'~'87583138'~'RECIMETALES, C.A'~'PRINCIPAL LA GUAIRA'~'
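    For readers working outside Stata, the format above can be parsed directly once the two delimiters are known. A minimal Python sketch, assuming (as described above) that fields are separated by '~' and rows by #@#@#; the sample string below is a made-up miniature, not the real data:

    ```python
    # Custom delimiters as described in the post above.
    ROW_SEP = "#@#@#"
    COL_SEP = "'~'"

    def parse(raw: str):
        """Split raw text into rows, each a list of string fields."""
        rows = [r for r in raw.split(ROW_SEP) if r.strip()]
        return [r.split(COL_SEP) for r in rows]

    sample = "A'~'165390'~'1469#@#@#A'~'165582'~'1469"
    for row in parse(sample):
        print(row)
    ```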





  • #2
    Hi Tony,
    After you import the data, you could try:
    Code:
    split variable1, p("'~'")
    The result would be several variables that you can rename as you wish.
    Kind regards,
    Jorge



    • #3
      The data sample provided is not completely sufficient for an optimal answer, as among other things, we don't know how/if variable names are present in the file, and how *actual* line breaks are encoded. (The sample almost suggests that the whole file is just a string of characters without any end of line characters, but I'd suspect that's not true.) I'm assuming, though, that there are variable names and that preserving them, rather than having to rename each one, would be desirable.

      With those caveats, I would approach this by using fileread()/filewrite() to make the input more "standard," and then use -import delimited-. A similar approach could be done with -filefilter-, but I prefer fileread/filewrite. Either of these Stata commands should be very fast.
      Code:
      // Choose a delimiter character not used in the data file. I'm using tab, ascii 9.
      local delimiter = char(9)
      // Choose the end of line character(s) you want. I'm using a carriage return,
      // ascii 13, which I think will work under any OS.
      local eoln = char(13)
      // Read your file and change the odd characters to something standard.
      clear
      set obs 1
      gen strL s = fileread("original file.txt")
      replace s = subinstr(s, "'~'", "`delimiter'", .)
      replace s = subinstr(s, "#@#@#", "`eoln'", .)
      gen b = filewrite("new file.tab", s)
      clear
      import delimited using "new file.tab", delimiters("\t")
      // Where there were actual end of line characters in the
      // original file, you'll probably get blank lines,
      // and I'd delete them here.
      The preceding could be adapted to loop over multiple files, which I gather is relevant to the situation.
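      For anyone doing this step outside Stata, the same read/replace/write idea in a loop over files might look like the following. A minimal Python sketch; the "raw" folder name and the .tab extension are my own assumptions:

      ```python
      from pathlib import Path

      COL_SEP, ROW_SEP = "'~'", "#@#@#"

      def convert(src: Path, dst: Path, delim: str = "\t", eol: str = "\n") -> None:
          """Read one file whole, swap the odd delimiters, write a standard file."""
          text = src.read_text(encoding="utf-8")
          dst.write_text(text.replace(COL_SEP, delim).replace(ROW_SEP, eol),
                         encoding="utf-8")

      # Hypothetical layout: convert every .txt file found in a "raw" folder.
      for src in Path("raw").glob("*.txt"):
          convert(src, src.with_suffix(".tab"))
      ```

      Like the Stata code above, this reads each file into memory whole, so it is only suitable for files that fit in RAM.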

      Finally, I'd note that, because I didn't have a real example file to work with, there might well be some errors here, but the concept should be fine.






      • #4
        you might want to take a look at:
        Code:
        h filefilter



        • #5
          I strongly endorse Mike Lacy's solution in #3.

          That said, I would point out that for the column delimiter (~), you don't actually need to replace that with anything because -import delimited- will subsequently allow you to specify ~ as the delimiter there. And replacing ~ with tab characters could mess things up if there are variables in the data that, themselves, contain tab characters.

          It may well be that the data set was created in this odd way precisely because the data contains strings that include the usual delimiter characters, so they had to come up with some special delimiters that would never be confused with parts of a string variable.

          I also see in the example provided that there are columns whose content is " (or is that ' '?) If the former, you may have to either replace those by something else in the fileread/filewrite code and then replace them back once the data are in Stata, or play with various quote-related options in -import delimited- to avoid having them mess up the import.
          Last edited by Clyde Schechter; 07 Feb 2023, 13:35.



          • #6
            Like Clyde, I was thinking of suggesting changing the delimiter to "~", but my understanding is that the delimiter in this file actually includes the single quotes as well as the tilde. My understanding, perhaps wrong, is that -import delimited- can't handle a multi-character delimiter. I agree with Clyde about the possible danger of tab here, and had thought of suggesting some less common character code, say ascii 171 ("«"), but thought not to be too obscure <grin>



            • #7
              [quote]but my understanding is that the delimiter in this file actually contained the single quotes as well as the tilde.[/quote]
              Mike Lacy is right: it does say in #1 that the quotes are part of the delimiter. I missed that. And he is also right that the delimiter in -import delimited- has to be a single character. So my suggestion that that delimiter could be left out of the fileread/filewrite routine is not workable here. Sorry for any confusion.

              The concern about possible tabs included in the variables making it problematic to use tab as a delimiter remains.



              • #8
                Mike Lacy Thanks a lot for your help. The code worked fast and clean. Clyde Schechter thank you for your comment that "tab" might be a concern, too. It is relevant for the bigger dataset, and I guess using ascii 171 ("«") will remedy that (?).

                As for the variable names: this is also relevant. The dataset comes in batches of text files, which a customized app is supposed to join into a full dataset. Once they are combined, I will have the variable names. I could not share a complete text file here.

                Thanks again,
                Tony Shayan



                • #9
                  Regarding concerns about choosing a delimiter that might be a legitimate character in the data, you could trap that problem before proceeding:
                  Code:
                  clear
                  set obs 1
                  gen strL s = fileread("original file.txt")
                  gen check = strpos(s, "`delimiter'")
                  if (check[1] != 0) {
                     di as error "Warning, delimiter found in text at position = " check[1]
                     di "No further processing of this file."
                  }
                  else {
                     replace s = subinstr(s, "'~'", "`delimiter'", .)
                       ... etc.
                  }
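                  The same pre-check is easy in other tools as well. A minimal Python sketch (file name and chunk size are illustrative), which reads in chunks rather than all at once, keeping a small tail so a multi-character delimiter split across a chunk boundary is still detected:

                  ```python
                  def delimiter_is_safe(path: str, delim: str, chunk_size: int = 1 << 20) -> bool:
                      """True if the candidate delimiter never occurs in the file."""
                      tail = ""
                      with open(path, "r", encoding="utf-8") as f:
                          while True:
                              chunk = f.read(chunk_size)
                              if not chunk:
                                  return True
                              if delim in tail + chunk:
                                  return False
                              # Carry over enough characters to catch a straddling match.
                              tail = chunk[-(len(delim) - 1):] if len(delim) > 1 else ""
                  ```

                  If the check fails, pick a different candidate (say, another low-ASCII control character) and re-run it.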



                  • #10
                    Hello,

                    I hope you are doing well. I am reaching out to this forum once again to seek your valuable input on a new challenge I am facing with my dataset.

                    In summary, I am working with a dataset in the form of a text file (name.txt), which uses non-standard delimiters. I had previously resolved this issue with the help of Mike Lacy's solution and successfully proceeded with my analysis on a small sample. However, when transitioning to the full-sized dataset, I am encountering difficulties: the previous code cannot handle the larger data.

                    The dataset contains transaction-level international trade data, and for a single country like Indonesia (export side), I have 100GB of data after unzipping the file. The data size will increase significantly for larger economies like the US and China. Note that the dataset contains many string values (text such as addresses and names), which likely contribute to the file's large size. While I do not need most of these text fields, I have to open the files first before I can drop them from my dataset.

                    Unfortunately, Stata does not seem capable of handling this task using the previous code, even though I am using a powerful machine with a Core i9 processor, 64GB RAM, a 1TB SSD plus 10TB of cloud storage, and Stata SE.

                    I would greatly appreciate any suggestions on how to manage these text files. Should I continue seeking a solution within Stata, or should I first process and clean the data using another software and then return to Stata for further analysis? Is this even possible with the current machine I am working with?





                      • #12
                        Yes, -fileread- won't work for a very large file, as a strL can "only" hold about 2GB. So, as I mentioned at #3 above, -filefilter- is an alternative. I haven't used -filefilter- in a long time, but a quick reading of -help filefilter- makes me think that it doesn't try to hold the whole file in memory at one time, and so should work.

                        *Contrary* to what I posted above, using as a delimiter ASCII char(171) or another character above ASCII 127 does not work well with -import delimited-, at least not in my version of Stata. However, lower ASCII codes (e.g., char(17)) do seem to work. So, I *think* the relevant -filefilter- syntax for Tony's situation would be something like this:
                        Code:
                        // Change the variable delimiter.  -filefilter- has special notation for lower ASCII
                        filefilter "InputFile" "Output1" , from("‘~'") to(\017d) 
                        // Change the end of line marker
                        filefilter "Output1" "Output2" , from("#@#@#") to(\U)
                        The "\U", as documented in the help for -filefilter-, is a special code for a Unix-style end of line marker; use \W for Windows or \M for classic Mac, as appropriate to your operating system. I haven't tried this.

                        "Output2" should be importable as follows:
                        Code:
                        local delimiter = char(17)
                        import delimited "Output2", delimiters("`delimiter'")
                        I find my suggested solution here rather "kludgy," so other ideas are welcome.
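                        One alternative outside Stata is a streaming rewrite that never holds more than one chunk in memory, which sidesteps the strL limit entirely. A minimal Python sketch (chunk size and file names are illustrative); a short tail is carried between chunks so a delimiter split across a chunk boundary is still replaced:

                        ```python
                        def stream_replace(src: str, dst: str, mapping: dict,
                                           chunk_size: int = 1 << 20) -> None:
                            """Copy src to dst, replacing each key of `mapping` with its value.

                            Only one chunk is held in memory, so the file can be far larger
                            than RAM. A tail of len(longest key)-1 characters is carried over
                            so a delimiter straddling a chunk boundary is still found.
                            """
                            keep = max(len(k) for k in mapping) - 1
                            with open(src, "r", encoding="utf-8") as fin, \
                                 open(dst, "w", encoding="utf-8") as fout:
                                buf = ""
                                while True:
                                    chunk = fin.read(chunk_size)
                                    buf += chunk
                                    for old, new in mapping.items():
                                        buf = buf.replace(old, new)
                                    if not chunk:          # end of input: flush the remainder
                                        fout.write(buf)
                                        return
                                    fout.write(buf[:-keep] if keep else buf)
                                    buf = buf[-keep:] if keep else ""

                        # Same substitutions as above: tab-delimited columns, newline rows.
                        # stream_replace("name.txt", "name.tab", {"'~'": "\t", "#@#@#": "\n"})
                        ```

                        Note this assumes no replacement value is itself a prefix of a delimiter, which holds for tab and newline here.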



                        • #13
                          A small mistake: Tony's original delimiter contains two apostrophes and no backtick, contrary to my code above:
                          Code:
                          filefilter "InputFile" "Output1" , from("'~'") to(\017d)
