Robelle Technical Database Reference KB15669

Product: Qedit, Qedit for Windows
Subject: Character class interpretation
Status: Signed-Off
Date Created: 2000.02.03  13:01
Date Modified: 2000.02.15  07:44
Short Description: Using ranges within a character class can be confusing, if not incorrect

Originator: FRANCOIS DESROCHERS

Within a character class, you can define character ranges using the hyphen character. For example, "[0-9]" means all digits between 0 and 9, and "[a-z]" means all lowercase letters. The range is actually based on the ASCII collating sequence, so "[0-Q]" is valid and would search for numeric digits, some punctuation characters and uppercase letters up to "Q".

The regexp engine in our products has a strange way of interpreting ranges.

Problem #1: You don't have to specify the smallest character first and the largest last as in "[a-z]". You can enter "[z-a]" and the engine will gladly take it. Internally, it reverses the order. When it reverses the order, the first character is excluded from the list. For example:

    qux/l"[m-p]"
       13 mmmmmmmmmmmmm
       14 nnnnnnnnnnnnnn
       15 oooooooooooooooo
       16 pppppppppppppppp

    qux/l"[p-m]"
       14 nnnnnnnnnnnnnn
       15 ooooooooooooooo
       16 pppppppppppppppp

Notice that the reversed range only selects "n" through "p". Egrep and perl do not accept that syntax, i.e. "[p-m]" is invalid.

Problem #2: If you enter another hyphen without another start character, the regexp engine sort of extends the previous range. For example, if you enter:

    qux/l"[m-p-s]"
       13 mmmmmmmmmmmmm
       14 nnnnnnnnnnnnnn
       15 ooooooooooooooo
       16 pppppppppppppppp
       17 qqqqqqqqqqqqqqqqq
       18 rrrrrrrrrrrrrrrrrr
       19 sssssssssssssssssss

This regexp is interpreted as "[m-pp-s]", which means all characters between "m" and "p" and all characters between "p" and "s". That's very confusing. Although egrep does the same thing, I think this is incorrect (not desirable). perl correctly (in my opinion) interprets this as all characters between "m" and "p", a hyphen or the letter "s".

Problem #3: If you forget to specify the range end character as in "[A-]", the regexp engine simply uses the character class close character as the end character. Thus, the above regexp is interpreted as all characters between "A" and "]", which corresponds to all uppercase letters, left square bracket, backslash and right square bracket. Egrep and perl interpret this correctly, i.e. "A-" is not a range.

Problem #4: If you forget to close the character class, the engine seems to take whatever character follows the "-" in memory. Currently, it appears to always be a null. Thus, "[$-" finds all characters between null and "$".

With all this, you have to be really careful how you code your regexp.
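
The ASCII-based ranges described above can be checked with a short Python sketch. This is only an illustration of the collating sequence, not code from Qedit; the helper name expand_range is hypothetical. It lists what "[0-Q]" covers, and what the range from "A" to "]" covers when "[A-]" is misread as in Problem #3.

```python
def expand_range(start, end):
    """List every character from start to end, inclusive, in ASCII order.

    Hypothetical helper for illustration only; it is not part of Qedit.
    """
    return "".join(chr(code) for code in range(ord(start), ord(end) + 1))


# "[0-Q]": digits, the punctuation between "9" and "A", then "A" to "Q".
print(expand_range("0", "Q"))

# "[A-]" misread as the range "A" to "]": uppercase letters plus [ \ ]
print(expand_range("A", "]"))
```

Printing the expansion makes it easy to see which punctuation characters fall inside a range that spans digits and letters.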

Append: (B:B) FRANCOIS DESROC  Other cases  09 Feb00 8:52 AM

Here are a few other cases where a character class is not interpreted correctly.

1) Escaped characters e.g. "\n", "\t", "\007", "\x007" can only be used as the first character of a range. For example, "[\007-\027]" is actually interpreted as:

    the bell character                    "\007"
    characters between 7 and backslash    "7-\" (see item #2)
    character 2                           "2"
    character 7                           "7"

2) The last character of an escaped sequence is also used as the start of a new range. For example, "[\t-0]" is interpreted as:

    the tab character                     "\t"
    characters between 0 and t            "t-0"

It is arguable whether escaped sequences should be allowed as range values. I think they should be allowed. That doesn't change the fact that character classes such as these return very unexpected results.

Append: (B:S) FRANCOIS DESROC  Fixed in 4.8.12  15 Feb00 7:44 AM

These problems have been fixed.

1) Reverse character range e.g. "[z-a]"
   This is still allowed. Qedit takes care of switching the start and end characters. It now extracts [a-z] correctly.

2) Extended range e.g. "[m-p-s]"
   This is valid syntax. However, the new version now interprets this as a character range [m-p], a hyphen or lowercase s.

3) Right bracket as end character range e.g. "[a-]"
   This is now correctly interpreted as a 2-character class: lowercase a or a hyphen.

4) Missing right bracket e.g. "[a-"
   If the right bracket is missing, Qedit assumes the character class ends with the regexp itself. The example is equivalent to "[a-]". Of course, if the right bracket is missing, the rest of the regexp is considered as part of the class.

5) Escaped characters in character class e.g. "[\7-\27]"
   Escaped characters such as "\7" (bell), "\t" (tab) can now be used as start and end characters for a range.

6) Octal values e.g. "\7"
   The first digit of all octal values was skipped, causing the calculated character value to be unpredictable. Octal values can really have 1, 2 or 3 digits now.
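
The rules listed for 4.8.12 can be modelled with a small Python sketch. This is a reading of the rules as stated in this entry, not Qedit's actual code; the names read_char and parse_class are hypothetical, and only the "\t", "\n" and octal escapes are handled.

```python
ESCAPES = {"t": "\t", "n": "\n"}  # assumption: only these named escapes


def read_char(s, i):
    """Read one literal or escaped character at s[i]; return (char, next_i)."""
    if s[i] != "\\" or i + 1 == len(s):
        return s[i], i + 1
    j = i + 1
    digits = ""
    # Octal values may have 1, 2 or 3 digits.
    while j < len(s) and len(digits) < 3 and s[j] in "01234567":
        digits += s[j]
        j += 1
    if digits:
        return chr(int(digits, 8)), j
    return ESCAPES.get(s[j], s[j]), j + 1


def parse_class(pattern):
    """Return the set of characters matched by a class such as "[m-p-s]"."""
    if not pattern.startswith("["):
        raise ValueError("expected a character class")
    chars = set()
    i, n = 1, len(pattern)
    # A missing "]" simply ends the class at the end of the regexp.
    while i < n and pattern[i] != "]":
        start, i = read_char(pattern, i)
        # A hyphen forms a range only if a real end character follows it.
        if i + 1 < n and pattern[i] == "-" and pattern[i + 1] != "]":
            end, i = read_char(pattern, i + 1)
            # Reversed ranges are allowed: swap start and end.
            lo, hi = sorted((ord(start), ord(end)))
            chars.update(chr(c) for c in range(lo, hi + 1))
        else:
            chars.add(start)
    return chars


print(sorted(parse_class("[m-p-s]")))
print(parse_class("[a-") == parse_class("[a-]"))
```

Under these assumptions, "[z-a]" yields the same set as "[a-z]", and "[\7-\27]" yields the characters with octal values 7 through 27.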

