Idea for blank-delimited concatenation
Robert Hodge
17-06-2013, 00:38
It is common when creating command strings of various kinds to concatenate pieces of a command together with blanks in between each section of the resulting string:
DIM ONE AS STRING = "ONE"
DIM TWO AS STRING = "TWO"
DIM THREE AS STRING = "THREE"
DIM CMD AS STRING = ONE & " " & TWO & " " & THREE
This is so common, it would be nice if there were shortcut syntax to do this automatically. Imagine there was an operator of &&, which would mean & " " &. Then, the statement above could be simplified to:
DIM CMD AS STRING = ONE && TWO && THREE
which is a lot easier to type and to read.
Since the other common concatenation would be comma-separated, a similar syntax of &, might be allowed:
DIM CMD AS STRING = ONE & "," & TWO & "," & THREE
' would become ...
DIM CMD AS STRING = ONE &, TWO &, THREE
It's possible other adaptations to this idea might be attempted, but I think blank and comma account for 99% of the likely use cases.
If implemented, the two characters that make up && or &, could not be separated by blanks.
Similar to the &= compound assignment operator, there could also be
x &&= y to mean x = x && y which in turn means x = x & " " & y
Likewise, a compound operator like &,= is also possible.
Note: When the right-hand operand of && is a zero-length string or contains only blanks, the && operator would produce only its left-hand operand. That way, it would not insert spurious blank delimiters when there is nothing to delimit.
For example,
DIM ONE AS STRING = "ONE"
DIM TWO AS STRING
DIM THREE AS STRING = "THREE"
DIM CMD AS STRING
' ... some code to initialize TWO as "" or " " or "  " etc.
CMD = ONE && TWO && THREE
' CMD now has "ONE THREE" rather than "ONE  THREE"
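The skip-blank rule in the note can be modelled outside thinBasic. What follows is a minimal Python sketch, not thinBasic code. The name `blank_concat` is made up for illustration, and the sketch assumes only what the note states: a right-hand operand that is empty or all blanks leaves the left-hand operand unchanged. The proposal does not say what happens when the left-hand operand is blank, so the sketch does nothing special in that case.

```python
from functools import reduce


def blank_concat(left: str, right: str) -> str:
    """Model of the proposed && operator (hypothetical name)."""
    # A right-hand operand that is empty or all blanks contributes nothing.
    if right.strip(" ") == "":
        return left
    return left + " " + right


# ONE && TWO && THREE, evaluated left to right, with TWO holding only blanks.
cmd = reduce(blank_concat, ["ONE", "  ", "THREE"])
print(repr(cmd))  # 'ONE THREE'

# The compound form x &&= y would simply be x = blank_concat(x, y).
x = "copy"
x = blank_concat(x, "a.txt")
print(repr(x))  # 'copy a.txt'
```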
Robert Hodge
03-07-2013, 23:22
While trying to figure out how I could achieve the effect of the proposed blank-delimited concatenation myself, assuming that Eros wasn't too excited about adding new syntax, it occurred to me that it's possible to define certain special characters as function names, such as _, $, # and %.
Let's say we used $ as the name of a function that takes a variable number of strings and concatenates them together with a blank between them, with the understanding that a value that is empty or all blanks gets ignored, so as to prevent extraneous blanks from being inserted.
It turns out that this functional notation is actually more concise than inventing an operator like &&.
Perhaps Eros could make a built-in function more efficient than what I wrote, but this is a working example:
DIM SKIP AS STRING = ""
MSGBOX 0, $ ("A1",SKIP, "B2", "C3", "D", "EFG")
STOP
FUNCTION $ (S01 AS STRING, _
OPTIONAL S02 AS STRING, _
S03 AS STRING, _
S04 AS STRING, _
S05 AS STRING, _
S06 AS STRING, _
S07 AS STRING, _
S08 AS STRING, _
S09 AS STRING, _
S10 AS STRING, _
S11 AS STRING, _
S12 AS STRING, _
S13 AS STRING, _
S14 AS STRING, _
S15 AS STRING, _
S16 AS STRING) AS STRING
DIM I AS LONG, X(16) AS STRING, GAP AS STRING, RESULT AS STRING
IF FUNCTION_CParams >= 01 THEN X(01) = S01
IF FUNCTION_CParams >= 02 THEN X(02) = S02
IF FUNCTION_CParams >= 03 THEN X(03) = S03
IF FUNCTION_CParams >= 04 THEN X(04) = S04
IF FUNCTION_CParams >= 05 THEN X(05) = S05
IF FUNCTION_CParams >= 06 THEN X(06) = S06
IF FUNCTION_CParams >= 07 THEN X(07) = S07
IF FUNCTION_CParams >= 08 THEN X(08) = S08
IF FUNCTION_CParams >= 09 THEN X(09) = S09
IF FUNCTION_CParams >= 10 THEN X(10) = S10
IF FUNCTION_CParams >= 11 THEN X(11) = S11
IF FUNCTION_CParams >= 12 THEN X(12) = S12
IF FUNCTION_CParams >= 13 THEN X(13) = S13
IF FUNCTION_CParams >= 14 THEN X(14) = S14
IF FUNCTION_CParams >= 15 THEN X(15) = S15
IF FUNCTION_CParams >= 16 THEN X(16) = S16
FOR I = 1 TO 16
IF VERIFY (X(I), " ") > 0 THEN
RESULT = RESULT & GAP & X(I)
GAP = " "
END IF
NEXT
RETURN RESULT
END FUNCTION
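For comparison, the same idea can be sketched in a language with true variable-length argument lists. This is Python, not thinBasic, and the name `join_nonblank` is hypothetical. It mirrors the intent of the VERIFY test above (keep only values that contain at least one non-blank character) without a fixed parameter count.

```python
def join_nonblank(*parts: str) -> str:
    """Join the arguments with single blanks, skipping empty or all-blank ones."""
    # Keep only values that contain at least one character other than a blank.
    kept = [p for p in parts if p.strip(" ") != ""]
    return " ".join(kept)


skip = ""
print(join_nonblank("A1", skip, "B2", "C3", "D", "EFG"))  # A1 B2 C3 D EFG
```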
Robert Hodge
04-07-2013, 15:26
I played around with your function a little bit. It works even if I omit the variable "Skip" and just call it as
MsgBox 0, $ ("A1", , "B2", "C3", "D", "EFG")
but I don't like the Optional-approach because it's limited to 32 parameters - so I tried an Array which resulted in some opposite of Parse-function - I called it "Unparse$":
It makes sense that it would work that way when SKIP was omitted. It appears that omitted optional STRING parameters are passed as "" empty strings. My SKIP test was just to show that empty or blank operands would be ignored, which they are. I also believe that checking the count of arguments passed to the function probably isn't necessary, for the same reason. I can just copy them all to the array, knowing that omitted ones behave like "". It works the same either way.
Your "unparse" is nice; it's kind of like a "collect" or "gather" function.
That would be good if we start with an array to begin with. In my case, I am creating command-line statements, which have various commands and keywords separated by spaces, and having values coming from a variety of sources. For me, I'd hardly ever have all these neatly stored somewhere in a string array.
You're right about the 32-argument limit, but in most practical cases, that wouldn't pose too great a hardship. If I really had to do more than 32, I could concatenate in 32-argument sections and then glue the sections together. For me, it would be rare that I would ever get even close to the 32 limit. My example only handled 16, but this was just a demo function.
Of course, if there were a built-in operator to do this, there wouldn't be *any* limits. The *real* limit is how much nagging and pleading for new features that Eros would be willing to put up with. ;-))
Robert Hodge
05-07-2013, 05:19
In a way it's too bad TB has so many different kinds of punctuation for concatenation. You can use either & or + or . to do it. All three of these do exactly the same thing. With all due respect to Eros, it's like somebody couldn't make up their mind.
Suppose we allowed the . to mean blank concatenation, as described above. Then, instead of
CMD = ONE & " " & TWO & " " & THREE
or using some function, we could just write,
CMD = ONE . TWO . THREE
So much cleaner and nicer notation.
I have no way of knowing how much TB code uses . to concatenate instead of the & or + operators. Maybe you couldn't just change the dot operator. But suppose you had a compiler directive that gave you a user-customizable dot operator, one that represented concatenation with any user-defined string literal. Call this the #DOT directive. Then, you could say
#DOT " "
to do what I have been discussing above. Suppose you needed to create some CSV files, and you really needed a "comma concatenate" operator. You could do this:
#DOT ","
etc. The directive would default to #DOT "" for compatibility.
... just ideas to think about ...
Petr Schreiber
05-07-2013, 09:06
Hi Robert,
the "+" and "&" are both commonly used in BASICs, so they are here for better code portability.
The "." is brought to ThinBASIC from PHP.
...just to give more light on "why so many".
The concept of separation by comma to automagically insert spaces is already present in Print/PrintL commands, so maybe it would be logical to use this notation for this case. Sample:
uses "console"
String One = "One"
String Two = "Two"
String Three = "Three"
PrintL One, Two, Three
Waitkey
Petr
Robert Hodge
05-07-2013, 15:30
Any guess as to how much the . dot notation is used? I can see the + and & as really common Basic operators for strings, but . dot is really pretty uncommon. Would you have any feel for how much impact there would be if the behavior of . dot as an operator were changed? Is there really that much demand for porting PHP code to TB that you would need this? I'm not a PHP coder, so I'd have no way to gauge this myself.
Robert Hodge
05-07-2013, 19:20
Personally, I think having three different operators + & and . all meaning the same thing is redundant. If it were me, I'd just go ahead and break things, and give . a new meaning. But, in the absence of doing that, there are other characters not being used right now. These are
accent `
tilde ~
exclamation ! (I think it's not used; could be wrong?)
semicolon ; (rather see ; used for other things, though)
vertical bar |
Of all of these, the | vertical bar is the most appealing. Here's your code above written using it:
Dim CMD As String = TrimFull$(ONE)|TrimFull$(TWO)|TrimFull$(THREE)
Not too bad.
In order to give the | operator the cleanest, simplest meaning (and the greatest chance of Eros adopting it), it would have the following meaning:
1. The | operator is a binary string operator.
2. The definition of LeftHand | RightHand is:
LeftHand & " " & RightHand
3. The composite operator |= would also be defined.
4. The definition of Variable |= expression is:
Variable = Variable | expression
which in turn is defined as:
Variable = Variable & " " & expression
5. No effort would be made to compress any extraneous blanks in any expression or term involved in the | or |= operations. The user would be responsible for performing any desired trimming operation.
Robert Hodge
05-07-2013, 20:14
I think the vertical bar means OR
Rats. The | character is not documented as an OR operator. Sure wish the doc was more accurate.
Robert Hodge
05-07-2013, 20:46
I'd really like to see something with good visual separation.
The . dot is nice, but taken.
The | vertical bar is nice, but (undocumented-ly) taken.
I am not sure if ! exclamation is taken; that's a possibility.
One thing that's not used that's possible is the ` accent:
DIM X AS STRING = ONE ` TWO ` THREE
Not too bad.
The other possibility is to use the PL/1 notation of || :
DIM X AS STRING = ONE || TWO || THREE
At least a few people actually *know* what || means ...
Robert Hodge
05-07-2013, 21:47
I bet you have a herd of those mexicans grazing in your back yard so you don't have to get off your lazy/racist ass to cut the lawn.
Keep it clean, both of you. Behave yourselves.
Robert Hodge
06-07-2013, 14:30
As often happens with me, I came upon the answer sometime between 5 AM and breakfast this morning.
The | vertical bar is the right thing to use for blank-delimited concatenation. Now, I was thrown off when it was pointed out that | was an OR operator (actually, I think it's ORb, but whatever ...).
That doesn't have to be a show-stopper. It's no different than + being used for addition and regular concatenation. We would make | a polymorphic operator.
Example:
DIM N, N3, N5 AS LONG
DIM S, S1, S2 AS STRING
N3 = 3
N5 = 5
S1 = "ONE"
S2 = "TWO"
N = N3 + N5 ' N = 8
S = S1 + S2 ' S = "ONETWO"
N = N3 | N5 ' N = 7
S = S1 | S2 ' S = S1 & " " & S2 = "ONE TWO"
The solution lies in making the | operator polymorphic, just like + is, varying its behavior based on the types of the operands.
Nobody gets confused or loses sleep over the fact that S1 + S2 is a concatenation, not an addition.
Likewise, no one will lose sleep because S1 | S2 is concatenation with a blank in between rather than an OR operation.
This will work.
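To show the type-based dispatch in runnable form, here is a small Python sketch (not thinBasic). The class name `BStr` is invented for the illustration. Python integers already use | for bitwise OR, and plain Python strings do not define | at all, so a string subclass can give it the blank-delimited meaning without disturbing the numeric one.

```python
class BStr(str):
    """String type whose | operator joins with one blank (hypothetical)."""

    def __or__(self, other: str) -> "BStr":
        return BStr(str(self) + " " + str(other))


n3, n5 = 3, 5
s1, s2 = BStr("ONE"), BStr("TWO")

print(n3 | n5)  # 7, bitwise OR on integers
print(s1 | s2)  # ONE TWO, blank-delimited concatenation on strings

# With no separate in-place method defined, |= falls back to |.
cmd = BStr("copy")
cmd |= "a.txt"
print(cmd)  # copy a.txt
```

As in point 5 of the earlier definition, this version does no trimming; a blank operand simply produces extra blanks.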
Robert Hodge
08-07-2013, 16:21
I would ask that the off-topic issues please stop. It is uncomfortable and embarrassing to read them. I started this with a serious idea - blank-delimited concatenation. Let's please keep this forum dignified, OK?
---
Referring to my prior post, I still believe that | and |= are good candidates for these operations, and I hope Eros considers them.
However, Rene brings up a good point about long strings. Standard Basic only has some clunky methods for defining long strings, like:
X = "Part one, " & _
"part two, " & _
"part three"
thinBasic has additional methods to handle long strings, but (with all due respect to Eros) I find those methods to be awkward and unwieldy.
As a long-time C coder, I am spoiled by how easy C makes this:
X = "Part one, "
"part two, "
"part three"
The C rule is that two quoted strings found adjacent to each other are concatenated, even if on separate lines, and even if separated by multiple blank lines; no muss, no fuss, no operators.
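Python happens to follow the same adjacent-literal rule, so it can serve as a runnable illustration here (the example is Python, not C or thinBasic). The one difference is that Python needs the parentheses for the expression to span several lines.

```python
# Adjacent string literals are joined at compile time, with no operator.
# The parentheses let the expression span several lines, blank ones included.
x = (
    "Part one, "
    "part two, "

    "part three"
)
print(x)  # Part one, part two, part three
```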
It might be a little harder, but with just a minor extension, the implicit concatenation of literals could be made to serve as an implicit operator.
For example, if literals were concatenated just by their mere appearance, then this should be possible too:
DIM S1 AS STRING = "ONE"
DIM S2 AS STRING = "TWO"
DIM X AS STRING
X = S1 & S2 ' standard Basic
X = S1 S2 ' Rexx-like concatenation ---- illegal, and not being proposed here
X = "ONE" "TWO" ' implied concatenation of literals
X = S1 " and " S2 ' implied concatenation of variables with a literal in between
X = S1 "" S2 ' use null string like an operator
Here, an empty (null) string is "concatenated" to the string expressions on its left and right. Since the null string has no data, no actual concatenation of the null is performed (at least, none is required). Instead, the mere presence of the null string - in effect - becomes an implicit concatenation operator.
Were such a facility available, this would "kill two birds with one stone" so to speak. It would provide a briefer way to put blanks between pieces of an expression, and also give us a clean, simple way to define long strings. Further, it should be relatively straightforward to say that if a line ended with a quoted literal and the next line began with a quoted literal, then not only are the "strings" concatenated, but so are the lines. That is,
X = "Part one, "
"part two"
should be sufficient to determine that this is one single Basic statement, and that an explicit line continuation of
X = "Part one, " _
"part two"
much less,
X = "Part one, " & _
"part two"
is redundant.
Robert Hodge
08-07-2013, 19:54
I guess you don't watch TV/news if my comments shock you. I think I'm done here and will start a new thread if Rene continues being a jerk.
I have no wish to perpetuate further off-topic debates, so I will try to keep this reply as brief and polite as possible, under the circumstances. As it so happens, I DON'T watch TV news. That of it which is not depressing and demoralizing merely trivializes and skews the facts to garner ratings. You are incorrect, though, and presumptuous, to suggest your words "shock" me. I am, however, profoundly disappointed by them. I have a background in computer science and compiler design. I am trying to have a serious (and hopefully, thoughtful) discussion about polymorphic string operators, and what I get in reply are a bunch of pointless, sarcastic, totally irrelevant insults, vulgarity and political bickering between the two of you. What does ANY of that have to do with the price of potatoes, as they say? This is not what I signed up for when I chose to involve myself with the thinBasic project.
The word "shock" is incorrect, since it implies a measure of surprise. Unfortunately, disrespect and vulgarity (from Latin vulgaris: common, ordinary) are far, far too UN-surprising in this world. I was expecting and hoping for better from all of you, and for that, I find it, not shocking, but discouraging and extremely disappointing.
Is that the best you can do?
Petr Schreiber
08-07-2013, 21:06
Moderation intervention: Renne, John - I have soft deleted your emotional exchange. I don't say it is not natural from time to time, but it has really started to drag this thread down. I hope it will give Robert more space to breathe.
My intention was not to offend John or Renne with this, but I started to feel I need to cut it or it will go for another 2 pages.
You are both smart guys and I have respect for both of you, but this really started to get too personal.
Make code, not war ;)
Thanks for understanding.
Petr
Charles Pegge
08-07-2013, 22:15
Hi Robert,
I like the idea of string literal fusion that ignores blank lines and comments.
As you may be aware, thinBasic also supports multiline strings, even supporting inner quotes, (providing the outer closing quote is on its own line). This is very useful for data lists and scripts.
Omitting concatenation '+' operators is also good for reducing noisy syntax.
Less certain about having a 'blank' operator though. I find defining local strings like SP,TAB,CM,CR easier to understand. They work very well with implicit concatenation:
dim string tab=chr(9),sp=chr(32),cm=chr(44),cr=chr(13,10)
print a tab b tab c cm d cr
Robert Hodge
09-07-2013, 03:02
Charles, there were a couple of reasons for the "blank" operator (I think you mean blank concatenation operator, not just 'blank' - trying to be precise here).
First, it's very, very common to need to piece together a bunch of operand strings for a command line. The most common of all delimiters for this are blank and comma. I'd say 99.9% of the time when strings aren't just joined together, there's a comma or a blank. There is definitely a use-case for having a comma-concatenate feature. It would take some doing to figure out the syntax, but I think the accent-quote would make an excellent candidate - it kind of "looks" like a comma, and it's not used for anything I know of in Basic. Of course, then someone else will say they want some concise way to add parens or something - there are limits to how far you can take this.
Second, there is a problem doing what you suggest in terms of your example, "print a tab b tab c ...". The problem has to do with the functioning of the parser. You have a whole bunch of user-defined symbols - a, b, c, d and also tab, cm, cr, etc. When the parser sees this, it has to generate implicit operators, determine precedence and associativity, etc. It turns out that since the variables a, b, c, and d, and the "named literals" tab, cm and cr, are essentially in the same "token classes", they are hard to distinguish. What happens is that if you try to define a grammar that allows such constructs, AND accepts all the other existing grammar before this change was added in, it will (most likely) create grammar conflicts and break the parser. It's not an unsolvable problem, but it's hard to solve. Eros uses what is called a recursive-descent parser, whereas I am more familiar with the LALR parsers in YACC. In YACC, this kind of syntax will create what are called shift/reduce or reduce/reduce conflicts. Regardless of the parsing technology, a great parser can't always fix a "bad" grammar.
As an example, suppose you had a grammar that accepted either:
KEYWORD expression
or
KEYWORD expression expression
And suppose expressions could include both unary and binary - (minus) operators. Now, suppose you had this input, where A and B were variables:
KEYWORD A - B
Is A-B a single expression, consisting of the value of A minus the value of B? Or, is A - B actually TWO expressions, where the first one is A and the second one is -B ? It's possible to interpret the input two different ways, and the grammar doesn't help us decide which is the "right" one. The result is that the grammar is broken and can't really be parsed. You could *assume* one rule or the other, but then you would not be using all of your grammar rules, and in fact you would have discarded part of your grammar as unparsable. Moral of the story: The two-expression grammar needs a token between them; otherwise we can't parse it:
KEYWORD expression , expression
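The ambiguity can be demonstrated mechanically. The Python sketch below is a toy, not a model of thinBasic's parser: it assumes a tiny expression grammar (a name, optionally preceded by unary minus signs, with binary minus between terms), and the function names are invented. It lists every way a token sequence can be read as either one expression or two adjacent ones.

```python
def is_expression(tokens):
    """True if tokens form: term ('-' term)*, where term is NAME or '-' term."""
    pos = 0

    def term():
        nonlocal pos
        # Any number of unary minus signs, then a name.
        while pos < len(tokens) and tokens[pos] == "-":
            pos += 1
        if pos < len(tokens) and tokens[pos].isalpha():
            pos += 1
            return True
        return False

    if not term():
        return False
    while pos < len(tokens) and tokens[pos] == "-":
        pos += 1  # binary minus
        if not term():
            return False
    return pos == len(tokens)


def parses(tokens):
    """All ways to read tokens as one expression or as two adjacent ones."""
    found = []
    if is_expression(tokens):
        found.append([tokens])
    for i in range(1, len(tokens)):
        left, right = tokens[:i], tokens[i:]
        if is_expression(left) and is_expression(right):
            found.append([left, right])
    return found


# The operand list of:  KEYWORD A - B
for reading in parses(["A", "-", "B"]):
    print(reading)
```

For A - B the loop prints two readings, the single expression A - B and the pair A, -B, which is exactly the conflict described above.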
The same issue holds true here. It's great fun to dream up ideas for new language features - hey, I do it all the time. But unless we adhere to sound grammar principles, they are ideas that Eros could not make a reality.
The way languages like C get around this is that they compile in two 'stages', a lexical scanner that grabs tokens, and a parser that processes them. In the lexical scanner, it does some fancy footwork by buffering the string tokens, and before it releases one it does a one-token lookahead and sees there is another string. When that happens, the two tokens are glued together as one, and the parser only sees one long string. That way, the language grammar doesn't even know it happened, and everyone is happy.
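As a rough sketch of that scanner-level trick (in Python, with invented token kinds and an invented function name, not taken from any real compiler), adjacent string tokens can be merged before the parser ever sees them:

```python
def fuse_string_tokens(tokens):
    """Merge runs of adjacent STRING tokens into one before parsing."""
    fused = []
    for kind, value in tokens:
        # Looking back one token here has the same effect as the scanner
        # holding a string and looking ahead one token before releasing it.
        if kind == "STRING" and fused and fused[-1][0] == "STRING":
            fused[-1] = ("STRING", fused[-1][1] + value)
        else:
            fused.append((kind, value))
    return fused


# Token stream for:  X = "Part one, " "part two"
stream = [
    ("NAME", "X"),
    ("OP", "="),
    ("STRING", "Part one, "),
    ("STRING", "part two"),
]
print(fuse_string_tokens(stream))
```

The parser then receives an ordinary assignment with a single string operand, so the grammar needs no new rule.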
One of the biggest tricks when maintaining a mature language like thinBasic is trying to add things so that they don't break the parser. The longer a language has been around, the harder it becomes to change it. Eventually, it reaches a point where all you can do is add new function names, or split it off as a new language and 'start all over again'. Hopefully, TB hasn't reached its 'brick wall' just yet.
Charles Pegge
09-07-2013, 22:36
Robert,
Basic parsing, in my view, is more complex than C parsing, but if you have your own parsing routines, then just about anything is possible. Breakages usually occur from name conflicts, where a new version introduces names that have already been used by prior app-code.
In OxygenBasic, the absence of an operator is usually taken to imply concatenation or addition, depending on operand type. It often, but not always, helps to produce cleaner code.
Robert Hodge
11-07-2013, 02:57
A parser can get "broken" when the grammar itself is broken. The "grammar" is simply a set of rules, which are independent of any implementation. There is a parsing book that talks about conflicts, and it points out that grammars that are hard for compilers to manage also turn out to be grammars that are hard for people to understand. So, it isn't merely a technology issue; the problem goes to the heart of the language rules themselves, and represents a fundamental design blunder that software can't fix. You might say, all the software in the world can't make a crazy thing un-crazy.
Having adjacent symbols mean concatenation is what Rexx does; if you say A B C, then A, B and C are concatenated with one blank in between each of them, while A(B)C or A'B'C will put them together without spaces.
Because Rexx is an interpreter, it has a lot in common with Basic, and it turns out that this particular feature is kind of tricky to do; I have seen the source code for Open Object Rexx where they do this, and it takes a lot of code to pull it off. But, they do it.
You're right about the 'cleaner code' aspect. The idea behind the implied blanks is that when you are building up commands, it's just so much cleaner not to have to specify every single blank separator in excruciating detail. Like, instead of:
CMD = KEYWORD & " " & FILENAME1 & " " & FILENAME2 & " " & OPTION
you could do this:
CMD = KEYWORD | FILENAME1 | FILENAME2 | OPTION
or, in the "Rexx-like" way, even shorter:
CMD = KEYWORD FILENAME1 FILENAME2 OPTION
There is a price for this, besides complicating the parser, and that is when you "imply" something by the absence of a token, there is nothing to "look for" as you're working on code. If I use "" or "|" as an operator, I can use an editor to search for it. What do I search for when the "operator" is implied and there's nothing there in the code? This can be a potential maintenance issue.
There's another problem, too, and that is that if you imply an operator by just having two expressions adjacent to each other, it can become very easy to inadvertently drop some OTHER operator by mistake. Suppose I meant to say this:
X = VARIABLE1 + VARIABLE2
and instead I forgot, missed the + sign key, or something, and did this:
X = VARIABLE1 VARIABLE2
If we make that into legal syntax, I no longer get a compiler warning message about a missing operator or illegal syntax, and I may end up with a lurking bug. That is a danger you run into if the grammar is too flexible and too tolerant. It stops catching things you really need it to catch.
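A well-known instance of this kind of lurking bug exists in Python, which joins adjacent string literals implicitly (literals only, not variables, so it is an analogy to the case above rather than the same thing). The option strings below are made up for the example.

```python
# Intended: three separate options.
intended = ["/quiet", "/norestart", "/log"]

# The comma after "/quiet" was dropped by mistake. Python accepts this
# silently and fuses the two adjacent literals into one string.
actual = ["/quiet" "/norestart", "/log"]

print(len(intended))  # 3
print(actual)         # ['/quiet/norestart', '/log']
```

No error or warning is raised; the list is simply one element short.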
Charles Pegge
11-07-2013, 08:13
Language grammars are riddled with implied symbols and operations. So omitting the '+' operator is not especially radical.
Other Examples:
Omission of brackets around procedure parameters
Operator precedence
Omission of 'this.' for properties inside a method
But I agree that this stuff can cause problems. While it makes programs easier to write, and easier to get the gist of a program, it is more difficult to prove whether a program does precisely what is intended. If your software is to be used in a critical application like controlling the engines of an AirBus, then the code has to be highly constrained to eliminate ambiguities.
Robert Hodge
11-07-2013, 15:00
If you ever take a course in what is called Theory of Computation, it turns out that in the general case it is impossible, or nearly impossible, to prove (a) that a program will terminate without being 'hung up in a loop', (b) that a program will work "correctly", or (c) that two different programs will produce the same output given the same input. Part (c) is especially troublesome, because a "source" program and its translated form into an "object" program, or a run-time representation in an interpreter, are two such "different" programs. That means, in 'mathematical' terms at least, you can't prove that a program has been compiled correctly. A corollary of this is that in general, there is no such thing as an "optimized" or "optimal" program. A program can be "improved" but you can't prove that any particular "improvement" is the "best" one, which is what an optimal/optimized one would be.
Because you CAN'T prove, you do the only thing you CAN do, which is to run (in commercial products, anyway) thousands of regression tests and sample programs in a test suite to make sure that good programs still work and bad ones are caught. It's a lot of work and a pain in the neck, but it's the only way it can be done, and the quality of your software then becomes totally dependent on your test/regression suite - better hope it's a good one.