Tokenize and trimming

Daniel Bela

Join Date: Apr 2014

Posts: 246
#16

13 Jul 2016, 11:00

Clyde was quicker, again, in having a look at the documentation. Also, he gives nice insights into the history of Stata, which are always great to have, to get an idea of why things are the way they are.

Thanks!
Comment
daniel klein

Join Date: Mar 2014

Posts: 3890
#17

13 Jul 2016, 12:23

I suppose Daniel Klein didn't get my point with the problem of the leading and trailing blanks,

I object. I did completely get the point and I solved the problem. I introduced a new problem, however, referring to x when I should have referred to kw inside the loop.

As Daniel Bela mentioned, this is the most efficient, most readable and (therefore) least error-prone technique that has been proposed in this discussion.

Best
Daniel
1 like
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#18

13 Jul 2016, 12:24

From the pedantry corner also, I note in passing that when presented with the problem that needed to be solved (as recommended in post #2), the solution was not bound by the limitations of tokenize but instead made effective use of Stata's foreach command and an unconcatenated list of quoted strings to match.

Just as a disclaimer, I don't find anything ugly about copying a line and pasting, then editing, 5 copies. By my calculations, this thread took over six hours. Regardless of how complex the editing, I think I could have accomplished the editing much faster.
1 like
Comment
Kazi Bacsi

Join Date: Jul 2016

Posts: 59
#19

13 Jul 2016, 13:29

Just to sum up, foreach seems to be the most efficient way of doing the job, but I was unaware that I could get around the problem of leading and trailing blanks there. Thanks again for the idea! A strong reason against copy-pasting is the time spent debugging when you modify something, but it's a matter of taste, I admit.
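For readers landing here without the earlier posts, a minimal sketch of the foreach approach follows. It is not the code from earlier in the thread: the list items and the local names kw, clean and n are made up for illustration, and strtrim() assumes Stata 14 or later (older versions spell it trim()).

```stata
* Loop over a list of quoted strings and trim each element.
* The items and the names kw, clean and n are hypothetical.
local n = 0
foreach kw in " first item " "second item" "  third  " {
    * strtrim() removes leading and trailing blanks
    local clean = strtrim("`kw'")
    display "[`clean']"
    local ++n
}
display "items processed: `n'"
```

Each quoted element is handled as one item, embedded blanks included, so no parse characters are involved.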

A few unsolicited thoughts on the political-religious side of our conversation. I started with SPSS, then Matlab; at work I needed to master SAS, and for the last few years I've been playing with Stata. So I've been reading Statalist, and I've already found an immense number of nice solutions. But many times I bump into guys defending the famous tostring missing stuff or the tokenized parse characters, which may be documented but are still insane. Tokens are better implemented in the command prompt, and that was developed quite a while ago. Backward compatibility is a nice thing, but it's better to let old wrong decisions go. Or at least introduce a normal scan function in Stata 15, as Clyde proposed.

Sorry for being too outspoken, but there are quite a few guys outside Statalist who agree with me. Again, my goal is to improve usability, not to offend anyone.
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#20

13 Jul 2016, 14:43

I suspect that many users have other priorities for Stata's not-unlimited development effort than polishing up the arcane tokenize command whose purpose can often, as in this case, be accomplished by using a newer command designed, and more appropriate, for the purpose. Every developer has to decide where to put their efforts; luckily there's lots of competition in the statistical software space. I'm glad to have seen the last of the statistical analysis system I most recently used before Stata, but really miss some of the features of JMP.

I hope that along with Statalist you are availing yourself of Stata's excellent documentation, especially the User's Guide, Base Reference Manual, Data-Management Reference Manual, and Programming Reference Manual.

Let me add that another way to improve usability of Stata is to share your knowledge on Statalist with others. Before doing that, though, please click on "Contact Us" in the lower right corner, and request that your name be changed from Uncle Kazi to your real name, as the Statalist FAQ you were advised to read when you joined Statalist makes clear.

Last edited by William Lisowski; 13 Jul 2016, 15:24.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35809
#21

13 Jul 2016, 15:11

Kazi: Your colourful language varies from "insane" through "wrong decisions" and "political-religious" to "not [wanting] to offend anyone". That's good, but the latter implies something different for the tone and content of your posts. Technical arguments and technical details should be the focus, please.

tokenize by default parses on white space. Nevertheless it serves well for many parsing problems. If you want something else, you need something else.
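As a small illustration of that default, and of what changes once parse() is specified, here is a sketch using made-up input strings. With parse() given, the parse characters themselves come back as tokens.

```stata
* Default: split on white space into the macros `1', `2', ...
tokenize "alpha beta gamma"
display "first token:  `1'"
display "third token:  `3'"

* With parse(): split on + and -, and each parse character
* is itself stored as a token
tokenize "7+3-4", parse("+-")
display "tokens: `1' | `2' | `3' | `4' | `5'"
```

Once parse() no longer includes the blank, blanks stop acting as separators, which is where the leading and trailing blanks discussed in this thread come from.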

You mention one new specific point: you talk about "the famous tostring missing stuff". I am the original author of tostring but I have no idea whatsoever of what that means. If there is a bug or major problem it's StataCorp's responsibility now, but what do you mean? But if that's an example it's salutary because tostring was originally user-written and made public originally in response to other user requests. If you want something else, you need something else.

So, generally, the recurrent point here is remarkably simple. If you want some different behaviour then you can program it yourself or -- equally valid -- make a specific technical argument, for consideration, about what's wrong.
1 like
Comment
Kazi Bacsi

Join Date: Jul 2016

Posts: 59
#22

13 Jul 2016, 15:51

The destring and tostring commands are extremely powerful (and doing the same in SAS, for example, is considerably more difficult), except for one thing: converting a missing numeric value into a dot. I've seen you defending it, but at least admit that it's totally counter-intuitive. Regarding the manual, its content is really fab, even though the manual of SAS and Matlab have a handy web site, which is quicker and highly googleable. I suppose it'd be nice to move in that direction and provide the full content of the manual in HTML.
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35809
#23

13 Jul 2016, 16:53

You say what you think is wrong but don't spell out what would be right and why.

I imagine that you want string(.) to return empty string "". You can have that: one solution is to clone tostring and change its behaviour in that respect. Another is to follow tostring by

Code:

replace mystring = "" if mystring == "."

and you could have yet further solutions.
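A minimal sketch of the second route on toy data, to show the round trip. The variable names x, sx and x2 are made up for illustration.

```stata
* Toy data: three observations, the second one missing
clear
set obs 3
generate x = _n
replace x = . in 2

* tostring writes system missing as "."
tostring x, generate(sx)

* Blank out the isolated dots afterwards
replace sx = "" if sx == "."

* destring maps the empty string back to system missing
destring sx, generate(x2)
list
```

The blanked variable still converts back to the original numeric values, missing included.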

The rationale touched on in the post you cite is to make tostring reversible, and reversible explicitly, to the maximum extent. That's quite a deep principle: that commands (or functions) that go in opposite directions should yield inverses of each other.

I have learned to distrust mightily appeals to intuition. Most of the time what is claimed to be intuitive is just familiar because learned. If you and I have different intuitions, who is to give way? More crucially, if your intuition clashes with that of Stata developers, who is to give way?

Do you object to the fact that real("") is system missing? It's perhaps a practical or psychological decision that Stata has an explicit special symbol for system missing rather than say empty numbers. One gets used to it, however. If Stata had empty numbers, there would have to be a special syntax for them, but there is, namely system missing. So we just execute a perfect circle.
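The functions in question can be checked directly at the command line; this is only a quick sketch of current behaviour, not an argument either way.

```stata
* Both the empty string and "." convert to system missing
display real("")
display real(".")

* System missing converts to the string "."
display string(.)

* So string() followed by real() returns where it started
display real(string(.))
```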

Missings are always awkward to handle. SAS sorts numeric missings first; Stata sorts them last. Is that intuitive? Stata's developers had one reason for doing that, but it was at least a little arbitrary. I've heard two talks at users' meetings both based on the idea that Stata needs a three-way logic in which true, false and missing are possible results of logical operations. Both talks had interesting points about small awkwardnesses in Stata's treatment, but neither visibly convinced anyone except the speaker that the much more complicated scheme they proposed was at all attractive in practice.
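A small sketch of the sorting behaviour on toy data (the variable name x is made up). The count at the end shows one of the small awkwardnesses that follows from missing being treated as larger than any number.

```stata
* Toy data: values 2, missing, 1
clear
set obs 3
generate x = 2 in 1
replace x = 1 in 3

* Numeric missings sort last in Stata
sort x
list

* Because missing counts as larger than any number,
* a condition like x > 1 also picks up the missing value
count if x > 1
count if x > 1 & !missing(x)
```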
Comment
William Lisowski

Join Date: Dec 2014

Posts: 10150
#24

13 Jul 2016, 17:35

Regarding the manual, its content is really fab, even though the manual of SAS and Matlab have a handy web site, which is quicker and highly googleable.

Typing the command help tokenize seems reasonably quick, as long as you have Stata running; Googling stata help tokenize seems equally adequate and returns the entire entry from the programming manual.
1 like
Comment
Kazi Bacsi

Join Date: Jul 2016

Posts: 59
#25

14 Jul 2016, 03:00

Why don't we just collect the 10 most debated issues, and ask the dear Stata users to vote and give suggestions? I've proposed two; I'm sure other guys have more ideas. Regarding the help command, it's quick, but obviously lacks the details of the PDF. And if you Google, you get the quick guide or the PDF. And the latter is slow and clumsy. Again, we may put this into a poll, and let users decide. It may be that I'm alone with my opinion, maybe you're wrong, who knows?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35809
#26

14 Jul 2016, 03:17

I don't think you have yet explained what you think tostring should do instead. As already pointed out, you can write your own alternative. StataCorp are not going to change its behaviour because you find it counter-intuitive. What kind of caprice would that be?

Googling the .pdf documentation on the internet is a strange habit when you can use search to navigate through the same .pdf documentation on your own machine.
Comment
Kazi Bacsi

Join Date: Jul 2016

Posts: 59
#27

14 Jul 2016, 03:33

I've already written it, but I'll say it again explicitly: if you convert a missing numeric value with tostring, you should get "" instead of a dot. This way, if you destring this variable (and use force), you'd get back the original missing value. Of course I can write additional code to circumvent this, but it's inconvenient, and I suppose it is for most users (again, let's do a poll to find that out). Searching a 609-page PDF is just clumsy; moreover, with Google you can use better search phrases if you're looking for something specific.
Comment
Jesse Wursten

Join Date: Jan 2016

Posts: 915
#28

14 Jul 2016, 03:40

Originally posted by Nick Cox

I don't think you have yet explained what you think tostring should do instead. As already pointed out, you can write your own alternative. StataCorp are not going to change its behaviour because you find it counter-intuitive. What kind of caprice would that be?

Googling the .pdf documentation on the internet is a strange habit when you can use search to navigate through the same .pdf documentation on your own machine.

I have to side with Kazi Bacsi here - I've never really understood why string(.) returns "." instead of "". As real("") and real(".") both return a dot, I don't see where the lack of reversibility comes from if string(.) returns an empty string ("")?
Comment
Nick Cox

Join Date: Mar 2014

Posts: 35809
#29

14 Jul 2016, 04:22

So string "23.4" should be "234"? Or string(23) + string(.) + string(4) should return something different from string(23.4)?

I think not.

Force isolated "." to "" if you wish with a replace or your own program -- the user has that right and that freedom -- but please don't let syntax depend on semantics.
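A sketch of the "your own program" route, under the assumption that only values that are exactly "." should be blanked. The program name blankdots and the variable s are hypothetical, not anything shipped with Stata.

```stata
* Hypothetical helper: blank out values that are exactly "."
* in the named string variables
capture program drop blankdots
program define blankdots
    syntax varlist(string)
    foreach v of local varlist {
        quietly replace `v' = "" if `v' == "."
    }
end

* Toy data: a value containing a dot and an isolated dot
clear
set obs 2
generate str4 s = "23.4" in 1
replace s = "." in 2
blankdots s
list
```

Only the isolated dot is changed; "23.4" is left alone, and string() itself is untouched.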
Comment
Kazi Bacsi

Join Date: Jul 2016

Posts: 59
#30

14 Jul 2016, 04:59

With the current solution you have consistency with the string() function and inconsistency with tostring/destring; with my proposal you could make tostring/destring consistent and you wouldn't need to touch str. Maybe people would like to avoid tokenized parse characters, or at least to have a proper function like scan in SAS. I don't have the philosopher's stone, and neither do you, so it'd be better to ask the users which they prefer, wouldn't it? People can often fare well with inconsistencies (risk-averse people playing the lottery, etc.); maybe they'd choose a different solution than yours. Maybe your argument is valid from a certain point of view, but others think differently.
Comment
