Why are REs single-line?

Discussion:

Dmitry Ivankov

2007-03-01 20:17:11 UTC

Why can't I write an expression like ";[ \t\r\n]*;"?

The syntax I want to parse is a sequence of quoted strings, separated by
spaces, tabs and newlines, and prefixed by a special char.
For example $"abc" "def" "ghi"; or @"123" "456" "789";
And for different prefixes I want different schemes for the strings.

Without newlines it's simple:
block:
    start: \$\"
    end: \"
    scheme = S1
S1:
    regexp: \"[ \t]*\" //just eat that sequence
    //and of course other rules, concerning the content of the strings

A block can't be used instead of the regexp, because its start would match
any quote.

Igor Russkih

2007-03-01 21:49:51 UTC

I'm not sure why you need this in this particular case.

The HRC model is 'positive' parsing (in contrast with BNF and other forms
of grammar description). You don't have to eat all the characters,
just the things you need. If your syntax doesn't allow anything
except 'spaces' between your strings, you can 'tell' the user this:
just use
<regexp match="/\S/" region="def:Error"/>
at the end of your top-level scheme.

The user will then see all the bad characters "between" your strings.
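
For illustration, the idea of 'positive' parsing with a catch-all error rule
can be mimicked in a few lines of Python. This is only a sketch, not Colorer
code: the RULES table, the region names and the tokenize() helper are
hypothetical, and it assumes that rules are tried in order at each position
and that unmatched characters are simply skipped.

```python
import re

# Hypothetical rule table: at each position the first matching rule wins.
RULES = [
    ("String", re.compile(r'"[^"]*"')),
    ("Error", re.compile(r"\S")),  # catch-all, deliberately listed last
]

def tokenize(line):
    """Return (region, text) pairs for one line; whitespace is never tokenized."""
    pos, out = 0, []
    while pos < len(line):
        for region, rx in RULES:
            m = rx.match(line, pos)
            if m:
                out.append((region, m.group()))
                pos = m.end()
                break
        else:
            pos += 1  # no rule matched (a blank): just move on
    return out
```

With this table, a stray character between two strings shows up as an Error
token, while the blanks between them produce no token at all.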

I believe that in your case you don't need [ \t]* to be tokenized. You can
just leave your 'S1' scheme with the rules concerning the string's
content.

As for the single-line limitation of regexps, it has its reasons.
Indeed, it sometimes limits expressive power, but it allows HRC to
parse code rather fast. Otherwise many "free" HRC constructions would
take exponential time in the size of the whole file.

--
Igor

Dmitry Ivankov

2007-03-01 22:50:26 UTC

Post by Igor Russkih
HRC model is 'positive' parsing (in contrast with BNF and other forms
of grammar description). you don't have to eat all the characters -
just the things you need.

Yes, I know.
Its positiveness is a good thing, because looking at an HRC file I can see
the parsing algorithm.
But now I see an algorithm that is in HRC style, yet I can't express it in HRC ;)

Post by Igor Russkih
In case your syntax doesn't allow anything,
except 'spaces' between your strings...

Yes and no. The syntax of these sequences doesn't, but the top-level syntax
allows many things. For example, $"ab" "c"; "def"; @"ghi" should be parsed as
S1(both strings); S0; S2

Post by Igor Russkih
Believe in your case you don't need [ \t]* to be tokenized. You can
just leave your 'S1' scheme with rules, concerning that string's
content.

In that case S1 will end at the first ", and $"abc" "def" will be S1 S0, not
S1(2 strings).
By tokenizing the separator I extend S1 to the next string in the sequence,
if there is one.
I still haven't found another way to extend S1. If it is some hacky
block, its start is surely " and its end is " too, but once the " is matched
it can't roll back if the end is not found (or if there were bad symbols
before the end).
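
The effect of tokenizing the separator can be sketched in Python. This is a
simplified model under stated assumptions, not Colorer itself: the s1_end()
helper is hypothetical, inner rules are assumed to be tried before the block
end, and each call sees a single line, as HRC regexps do.

```python
import re

SEPARATOR = re.compile(r'"[ \t]*"')  # closing quote, blanks, next opening quote
END = re.compile(r'"')               # the block end: a plain quote

def s1_end(line, start):
    """Return the index just past the quote that closes S1, or None.

    `start` is the position right after the opening $" on this line.
    """
    pos = start
    while pos < len(line):
        m = SEPARATOR.match(line, pos)
        if m:
            pos = m.end()      # separator eaten: S1 continues into the next string
        elif END.match(line, pos):
            return pos + 1     # no following string on this line: S1 ends here
        else:
            pos += 1           # ordinary string content
    return None
```

On a single line the separator rule carries S1 over both strings. When the
next string sits on the following line, the separator cannot match, so S1
closes after the first string, which is exactly the limitation discussed here.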

Post by Igor Russkih
As for the regexp single line limitation, it has its own roots.
Indeed, sometimes it limits expression power, however it allows HRC to
parse code rather fast - in other case many of "free" HRC
constructions would give exponential time in scope of overall file
content.

Speed is critical, but limited use of "good" multiline regexps won't affect
it. (In my case those REs won't analyze any symbol more than once.) I
think that just two-line REs (\n isn't used within * or + operators, and
appears only once) will cover most of the constructions that require
multiline REs, and won't give a significant slowdown (it's just as if we
had lines twice as long as usual).
In my case two-line will be enough (there is no need to separate a sequence
with blank lines).

Or is there some other way? :)
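
The two-line idea can be sketched in Python as a regexp that is matched
against the current line plus the next one only. This is a hypothetical
illustration of the proposal, not an existing HRC feature; separator_at() and
its arguments are made-up names.

```python
import re

# Separator between two strings of a sequence. The single optional \n sits
# outside any * or + operator, so a match can touch at most two lines.
SEPARATOR = re.compile(r'"[ \t]*\n?[ \t]*"')

def separator_at(lines, row, col):
    """Try SEPARATOR at (row, col), looking at this line and the next one only."""
    window = lines[row]
    if row + 1 < len(lines):
        window += "\n" + lines[row + 1]
    m = SEPARATOR.match(window, col)
    return m.group() if m else None
```

With this window, $"long"\n"line" is joined, while a sequence split by a
blank line is not, which is the accepted trade-off.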

Igor Russkih

2007-03-02 07:36:15 UTC

Post by Igor Russkih
In case your syntax doesn't allow anything,
except 'spaces' between your strings...

Yes and no. Syntax of these sequences doesn't, but top level syntax allows
S1(both strings); S0; S2

OK, beginning to understand what you are trying to express ;-)

By tokenizing separator i extend S1 to next string in sequence if there is
next string.

And possibly this is the wrong thing: tokenizing the separator. Let's see
what we can do here.

I still haven't found another way to extend S1. Because if it is some hacky
block, it's start is surely " and end is " too, but once " is matched it
can't rollback if end is not found (or there were bad symbols before end).

got it.

In my case twoline will be enough. (there is no need in separating sequence
by blank lines)

I really can't understand now why 'two line' will be enough. Do you
mean your syntax allows
-------------
@"foo"
"bar";
-------------
but doesn't allow
-------------
@"foo"

"bar";
-------------
?
I believe not.

Or there is some other way? :)

HRC is still a context-free-like grammar language. To express such things
you have to find recursive constructions that don't need to be
rolled back.
I see that all prefixed strings end with ';'. Can it be used to match
the end of your string?

scheme top
    block
        \@
        \;
        scheme at_string

scheme at_string
    block
        "
        "
        scheme string_content
    re /./
        def:Error
        priority=low

Can the above fit your needs?

If no, I definitely need your language name and some concrete source
code samples ;)

The thing I often see with HRC is that if a language is difficult or
impossible to express in HRC, this means that the language's
grammar is poorly designed and requires heavy resources even to
compile/interpret...

Dmitry Ivankov

2007-03-02 08:17:44 UTC

Post by Igor Russkih
I really can't understand now why 'two line' will be enough?

Because a sequence of strings is used mostly to split a long string into lines,
just like consecutive string literals are concatenated in C/C++.
So $"long line" can be written as $"long"\n"line"
Two-line is enough or, better to say, acceptable for most texts. $"q"\n\n"w" is
legal and is S1(2 strings), but it will be OK if it's parsed as S1\n\nS0.

Post by Igor Russkih
I see that all prefixed string ends with ';'. Can it be used to match
end of your string?

No, it can end with any other separator. The end could maybe be
[;,.)+-/........], but that's a bad idea.

Post by Igor Russkih
If no, I definitely need your language name and some concrete source
code samples ;)

The language is Nemerle. Its syntax is similar to C/C++/C#.
You can imagine the syntax as C++ with an additional construction (that's not
true, but it describes the problem):
- $"the value of x is $x\n" "and y is $y."\n "and of course z = $z"
(a sequence of strings prefixed with "$" has a separate grammar, i.e. $x, $y, $z
are highlighted, at least it would be cool :) )
(if it's not prefixed then it is just many strings, with escape sequences
highlighted in each)

Or a simpler syntax: just C++,
but with paired quotes. In the case of "str1"\n "str2" the 1st and 4th
quotes are paired, and the 2nd and 3rd are just quotes,
highlighted differently from the strings' content.

Dmitry Ivankov

2007-03-02 11:18:15 UTC

Post by Igor Russkih
I see that all prefixed string ends with ';'. Can it be used to match
end of your string?

I've just realized that the sequence ends with..... whatever :))
So the following seems to work:
in top scheme:
    block:
        start: \$\M\"
        end: \M.
        scheme: strings

scheme strings:
    re: [ \t]
    block:
        start: \"
        end: \"
        scheme: inside_string

Hacks rock :)
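
Why this works across lines can be seen in a small Python model. It is a
sketch under assumptions, not Colorer's implementation: lookaheads stand in
for \M, the rules of scheme "strings" are assumed to be tried before the block
end, the inner quote block is collapsed into one regexp, and find_sequences()
is a hypothetical helper.

```python
import re

START = re.compile(r'\$(?=")')   # like \$\M\" : consume "$", only peek at the quote
END = re.compile(r'(?=.)')       # like \M.    : zero-width, any char still on the line
SPACE = re.compile(r'[ \t]')
STRING = re.compile(r'"[^"]*"')  # stands in for the inner quote block

def find_sequences(lines):
    """Return, for each $-prefixed sequence, the list of string literals in it."""
    sequences, current = [], None      # current is None outside the block
    for line in lines:                 # every regexp sees one line only
        pos = 0
        while pos < len(line):
            if current is None:
                if START.match(line, pos):
                    current = []       # block opened, "$" is consumed below
                pos += 1
                continue
            m = SPACE.match(line, pos) or STRING.match(line, pos)
            if m:
                if m.group().startswith('"'):
                    current.append(m.group())
                pos = m.end()
            elif END.match(line, pos):
                sequences.append(current)
                current = None         # END is zero-width: do not advance
            else:
                break                  # only a line terminator is left
        # At the end of a line END has nothing to match, so an open block
        # simply carries over to the next line.
    if current is not None:
        sequences.append(current)
    return sequences
```

Because the end pattern needs at least one character, it can never fire at
the end of a line, so the block survives newlines (and even blank lines)
without any multiline regexp.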

Igor Russkih

2007-03-02 11:51:13 UTC

Wow, that's really cool.

The only minor possible drawback here is that this block will last
until that 'whatever'. E.g. if you want to highlight the background
color of such strings, it will extend not just to the last " quote but
a little bit further.
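
The size of that overshoot can be shown on a single line with a short Python
sketch. The two patterns and the overshoot() helper are hypothetical stand-ins
for the block above, not HRC syntax.

```python
import re

BLOCK = re.compile(r'\$(?:[ \t]|"[^"]*")*')         # runs until the first "other" char
STRINGS_ONLY = re.compile(r'\$(?:[ \t]*"[^"]*")*')  # stops after the last closing quote

def overshoot(line):
    """Text inside the block but after the last string; line must start with "$"."""
    block_end = BLOCK.match(line).end()
    strings_end = STRINGS_ONLY.match(line).end()
    return line[strings_end:block_end]
```

Any blanks between the last closing quote and the terminating character
belong to the block, so a background color would cover them too.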
