author Arnold D. Robbins <arnold@skeeve.com> 2012-07-20 12:26:59 +0300
committer Arnold D. Robbins <arnold@skeeve.com> 2012-07-20 12:26:59 +0300
commit 7bfc288d27bacb715ff63dbf71be53304917685a
tree f575046eebd32bb710198e45072ec30e71255e7f
parent 4fe1f4ac1aa0e4b99c9abb26794fc0d10ebb77c6
Fix doc on ranges and locales.
-rw-r--r-- doc/ChangeLog 5
-rw-r--r-- doc/gawk.info 136
-rw-r--r-- doc/gawk.texi 26
3 files changed, 98 insertions, 69 deletions
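
The patch below corrects the claim that, in a dictionary-order locale, `[a-z]` matches every uppercase letter except `A` or `Z`, depending on the system. That claim can be checked with a small sketch. The Python here is illustration only and is not part of the commit or of gawk: the collation strings and the helper names `range_members` and `uppercase_missed` are assumptions made up for the example, and real locales order many more characters than these 52.

```python
import string

LOWER = string.ascii_lowercase
UPPER = string.ascii_uppercase

# Hypothetical collation orders (assumptions for illustration only).
LOWER_FIRST = "".join(l + u for l, u in zip(LOWER, UPPER))  # aAbB...zZ
UPPER_FIRST = "".join(u + l for l, u in zip(LOWER, UPPER))  # AaBb...Zz
C_ORDER = UPPER + LOWER                                     # code-point order


def range_members(lo, hi, collation):
    """Characters from lo through hi, inclusive, in the given collation order."""
    start = collation.index(lo)
    end = collation.index(hi)
    return collation[start:end + 1]


def uppercase_missed(collation):
    """Uppercase letters that the range a-z does NOT match in this order."""
    matched = set(range_members("a", "z", collation))
    return set(UPPER) - matched


print(sorted(uppercase_missed(LOWER_FIRST)))  # ['Z']
print(sorted(uppercase_missed(UPPER_FIRST)))  # ['A']
print(len(uppercase_missed(C_ORDER)))         # 26
```

Under the code-point ordering of the "C" locale the same range matches no uppercase letters at all, which is the original, pre-POSIX behavior that the patched text says gawk 4.0 adopts unless `--posix` is used.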
diff --git a/doc/ChangeLog b/doc/ChangeLog
index e56c35a5..75b39158 100644
--- a/doc/ChangeLog
+++ b/doc/ChangeLog
@@ -1,3 +1,8 @@
+2012-07-20 Arnold D. Robbins <arnold@skeeve.com>
+
+ * gawk.texi (Ranges and Locales): Clarified ranges and
+ locales.
+
2012-07-13 Arnold D. Robbins <arnold@skeeve.com>
* gawk.texi (Getline Notes): Discuss side effects in
diff --git a/doc/gawk.info b/doc/gawk.info
index c2781488..c485e4cb 100644
--- a/doc/gawk.info
+++ b/doc/gawk.info
@@ -20163,7 +20163,7 @@ additional, non-alphabetic characters as well.)
as working in this fashion, and in particular, would teach that the
"correct" way to match lowercase letters was with `[a-z]', and that
`[A-Z]' was the "correct" way to match uppercase letters. And indeed,
-this was true.
+this was true.(1)
The 1993 POSIX standard introduced the idea of locales (*note
Locales::). Since many locales include other letters besides the plain
@@ -20181,13 +20181,14 @@ outside those locales, the ordering was defined to be based on
In many locales, `A' and `a' are both less than `B'. In other
words, these locales sort characters in dictionary order, and
`[a-dx-z]' is typically not equivalent to `[abcdxyz]'; instead it might
-be equivalent to `[aBbCcdXxYyz]', for example.
+be equivalent to `[aBbCcDdXxYyZz]', for example. (And to make things
+worse, on other systems, it might be equivalent to `[aAbBcCdDxXyYz]'.)
This point needs to be emphasized: Much literature teaches that you
should use `[a-z]' to match a lowercase character. But on systems with
non-ASCII locales, this also matched all of the uppercase characters
-except `Z'! This was a continuous cause of confusion, even well into
-the twenty-first century.
+except `A' or `Z'! This was a continuous cause of confusion, even well
+into the twenty-first century.
To demonstrate these issues, the following example uses the `sub()'
function, which does text replacement (*note String Functions::). Here,
@@ -20218,12 +20219,12 @@ like "why does `[A-Z]' match lowercase letters?!?"
nicely standards-compliant, and that the issue was in the user's
locale. During the development of version 4.0, he modified `gawk' to
always treat ranges in the original, pre-POSIX fashion, unless
-`--posix' was used (*note Options::).
+`--posix' was used (*note Options::).(2)
Fortunately, shortly before the final release of `gawk' 4.0, the
maintainer learned that the 2008 standard had changed the definition of
ranges, such that outside the `"C"' and `"POSIX"' locales, the meaning
-of range expressions was _undefined_.(1)
+of range expressions was _undefined_.(3)
By using this lovely technical term, the standard gives license to
implementors to implement ranges in whatever way they choose. The
@@ -20233,7 +20234,14 @@ in all cases, `gawk' remains POSIX compliant.
---------- Footnotes ----------
- (1) See the standard
+ (1) And Life was good.
+
+ (2) And thus was born the Campain for Rational Range Interpretation
+(or RRI). A number of GNU tools, such as `grep' and `sed', have either
+implemented this change, or will soon. Thanks to Karl Berry for
+coining the phrase "Rational Range Interpretation."
+
+ (3) See the standard
(http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05)
and its rationale
(http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05).
@@ -27817,61 +27825,63 @@ Node: BTL798181
Node: POSIX/GNU798915
Node: Common Extensions804066
Node: Ranges and Locales805173
-Ref: Ranges and Locales-Footnote-1809777
-Node: Contributors809998
-Node: Installation814260
-Node: Gawk Distribution815154
-Node: Getting815638
-Node: Extracting816464
-Node: Distribution contents818156
-Node: Unix Installation823378
-Node: Quick Installation823995
-Node: Additional Configuration Options825957
-Node: Configuration Philosophy827434
-Node: Non-Unix Installation829776
-Node: PC Installation830234
-Node: PC Binary Installation831533
-Node: PC Compiling833381
-Node: PC Testing836325
-Node: PC Using837501
-Node: Cygwin841686
-Node: MSYS842686
-Node: VMS Installation843200
-Node: VMS Compilation843803
-Ref: VMS Compilation-Footnote-1844810
-Node: VMS Installation Details844868
-Node: VMS Running846503
-Node: VMS Old Gawk848110
-Node: Bugs848584
-Node: Other Versions852436
-Node: Notes857717
-Node: Compatibility Mode858409
-Node: Additions859192
-Node: Accessing The Source860004
-Node: Adding Code861429
-Node: New Ports867396
-Node: Dynamic Extensions871509
-Node: Internals872885
-Node: Plugin License881988
-Node: Sample Library882622
-Node: Internal File Description883308
-Node: Internal File Ops887023
-Ref: Internal File Ops-Footnote-1891804
-Node: Using Internal File Ops891944
-Node: Future Extensions894321
-Node: Basic Concepts896825
-Node: Basic High Level897582
-Ref: Basic High Level-Footnote-1901617
-Node: Basic Data Typing901802
-Node: Floating Point Issues906327
-Node: String Conversion Precision907410
-Ref: String Conversion Precision-Footnote-1909110
-Node: Unexpected Results909219
-Node: POSIX Floating Point Problems911045
-Ref: POSIX Floating Point Problems-Footnote-1914750
-Node: Glossary914788
-Node: Copying939764
-Node: GNU Free Documentation License977321
-Node: Index1002458
+Ref: Ranges and Locales-Footnote-1809884
+Ref: Ranges and Locales-Footnote-2809911
+Ref: Ranges and Locales-Footnote-3810171
+Node: Contributors810392
+Node: Installation814654
+Node: Gawk Distribution815548
+Node: Getting816032
+Node: Extracting816858
+Node: Distribution contents818550
+Node: Unix Installation823772
+Node: Quick Installation824389
+Node: Additional Configuration Options826351
+Node: Configuration Philosophy827828
+Node: Non-Unix Installation830170
+Node: PC Installation830628
+Node: PC Binary Installation831927
+Node: PC Compiling833775
+Node: PC Testing836719
+Node: PC Using837895
+Node: Cygwin842080
+Node: MSYS843080
+Node: VMS Installation843594
+Node: VMS Compilation844197
+Ref: VMS Compilation-Footnote-1845204
+Node: VMS Installation Details845262
+Node: VMS Running846897
+Node: VMS Old Gawk848504
+Node: Bugs848978
+Node: Other Versions852830
+Node: Notes858111
+Node: Compatibility Mode858803
+Node: Additions859586
+Node: Accessing The Source860398
+Node: Adding Code861823
+Node: New Ports867790
+Node: Dynamic Extensions871903
+Node: Internals873279
+Node: Plugin License882382
+Node: Sample Library883016
+Node: Internal File Description883702
+Node: Internal File Ops887417
+Ref: Internal File Ops-Footnote-1892198
+Node: Using Internal File Ops892338
+Node: Future Extensions894715
+Node: Basic Concepts897219
+Node: Basic High Level897976
+Ref: Basic High Level-Footnote-1902011
+Node: Basic Data Typing902196
+Node: Floating Point Issues906721
+Node: String Conversion Precision907804
+Ref: String Conversion Precision-Footnote-1909504
+Node: Unexpected Results909613
+Node: POSIX Floating Point Problems911439
+Ref: POSIX Floating Point Problems-Footnote-1915144
+Node: Glossary915182
+Node: Copying940158
+Node: GNU Free Documentation License977715
+Node: Index1002852

End Tag Table
diff --git a/doc/gawk.texi b/doc/gawk.texi
index fb17b716..bf30d012 100644
--- a/doc/gawk.texi
+++ b/doc/gawk.texi
@@ -66,6 +66,15 @@
@set DARKCORNER (d.c.)
@set COMMONEXT (c.e.)
@end ifdocbook
+@ifxml
+@set DOCUMENT book
+@set CHAPTER chapter
+@set APPENDIX appendix
+@set SECTION section
+@set SUBSECTION subsection
+@set DARKCORNER (d.c.)
+@set COMMONEXT (c.e.)
+@end ifxml
@ifplaintext
@set DOCUMENT book
@set CHAPTER chapter
@@ -27062,7 +27071,7 @@ Almost all introductory Unix literature explained range expressions
as working in this fashion, and in particular, would teach that the
``correct'' way to match lowercase letters was with @samp{[a-z]}, and
that @samp{[A-Z]} was the ``correct'' way to match uppercase letters.
-And indeed, this was true.
+And indeed, this was true.@footnote{And Life was good.}
The 1993 POSIX standard introduced the idea of locales (@pxref{Locales}).
Since many locales include other letters besides the plain twenty-six
@@ -27080,12 +27089,14 @@ But outside those locales, the ordering was defined to be based on
In many locales, @samp{A} and @samp{a} are both less than @samp{B}.
In other words, these locales sort characters in dictionary order,
and @samp{[a-dx-z]} is typically not equivalent to @samp{[abcdxyz]};
-instead it might be equivalent to @samp{[aBbCcdXxYyz]}, for example.
+instead it might be equivalent to @samp{[aBbCcDdXxYyZz]}, for example.
+(And to make things worse, on other systems, it might be equivalent to
+@samp{[aAbBcCdDxXyYz]}.)
This point needs to be emphasized: Much literature teaches that you should
use @samp{[a-z]} to match a lowercase character. But on systems with
non-ASCII locales, this also matched all of the uppercase characters
-except @samp{Z}! This was a continuous cause of confusion, even well
+except @samp{A} or @samp{Z}! This was a continuous cause of confusion, even well
into the twenty-first century.
To demonstrate these issues, the following example uses the @code{sub()}
@@ -27121,13 +27132,16 @@ the @command{gawk} maintainer grew weary of trying to explain that
@command{gawk} was being nicely standards-compliant, and that the issue
was in the user's locale. During the development of version 4.0,
he modified @command{gawk} to always treat ranges in the original,
-pre-POSIX fashion, unless @option{--posix} was used (@pxref{Options}).
+pre-POSIX fashion, unless @option{--posix} was used (@pxref{Options}).@footnote{And
+thus was born the Campain for Rational Range Interpretation (or RRI). A number
+of GNU tools, such as @command{grep} and @command{sed}, have either
+implemented this change, or will soon. Thanks to Karl Berry for coining the phrase
+``Rational Range Interpretation.''}
Fortunately, shortly before the final release of @command{gawk} 4.0,
the maintainer learned that the 2008 standard had changed the
definition of ranges, such that outside the @code{"C"} and @code{"POSIX"}
-locales, the meaning of range expressions was
-@emph{undefined}.@footnote{See
+locales, the meaning of range expressions was @emph{undefined}.@footnote{See
@uref{http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05, the standard}
and
@uref{http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05, its rationale}.}