author | Arnold D. Robbins <arnold@skeeve.com> | 2012-07-20 12:26:59 +0300 |
---|---|---|
committer | Arnold D. Robbins <arnold@skeeve.com> | 2012-07-20 12:26:59 +0300 |
commit | 7bfc288d27bacb715ff63dbf71be53304917685a (patch) | |
tree | f575046eebd32bb710198e45072ec30e71255e7f | |
parent | 4fe1f4ac1aa0e4b99c9abb26794fc0d10ebb77c6 (diff) | |
Fix doc on ranges and locales.
-rw-r--r-- | doc/ChangeLog | 5 |
-rw-r--r-- | doc/gawk.info | 136 |
-rw-r--r-- | doc/gawk.texi | 26 |
3 files changed, 98 insertions, 69 deletions
diff --git a/doc/ChangeLog b/doc/ChangeLog
index e56c35a5..75b39158 100644
--- a/doc/ChangeLog
+++ b/doc/ChangeLog
@@ -1,3 +1,8 @@
+2012-07-20 Arnold D. Robbins <arnold@skeeve.com>
+
+ * gawk.texi (Ranges and Locales): Clarified ranges and
+ locales.
+
 2012-07-13 Arnold D. Robbins <arnold@skeeve.com>

 * gawk.texi (Getline Notes): Discuss side effects in
diff --git a/doc/gawk.info b/doc/gawk.info
index c2781488..c485e4cb 100644
--- a/doc/gawk.info
+++ b/doc/gawk.info
@@ -20163,7 +20163,7 @@ additional, non-alphabetic characters as well.)
 as working in this fashion, and in particular, would teach that the
 "correct" way to match lowercase letters was with `[a-z]', and that
 `[A-Z]' was the "correct" way to match uppercase letters. And indeed,
-this was true.
+this was true.(1)

 The 1993 POSIX standard introduced the idea of locales (*note
 Locales::). Since many locales include other letters besides the plain
@@ -20181,13 +20181,14 @@ outside those locales, the ordering was defined to be based on
 In many locales, `A' and `a' are both less than `B'. In other
 words, these locales sort characters in dictionary order, and
 `[a-dx-z]' is typically not equivalent to `[abcdxyz]'; instead it might
-be equivalent to `[aBbCcdXxYyz]', for example.
+be equivalent to `[aBbCcDdXxYyZz]', for example. (And to make things
+worse, on other systems, it might be equivalent to `[aAbBcCdDxXyYz]'.)

 This point needs to be emphasized: Much literature teaches that you
 should use `[a-z]' to match a lowercase character. But on systems with
 non-ASCII locales, this also matched all of the uppercase characters
-except `Z'! This was a continuous cause of confusion, even well into
-the twenty-first century.
+except `A' or `Z'! This was a continuous cause of confusion, even well
+into the twenty-first century.

 To demonstrate these issues, the following example uses the `sub()'
 function, which does text replacement (*note String Functions::). Here,
@@ -20218,12 +20219,12 @@ like "why does `[A-Z]' match lowercase letters?!?"
 nicely standards-compliant, and that the issue was in the user's
 locale. During the development of version 4.0, he modified `gawk' to
 always treat ranges in the original, pre-POSIX fashion, unless
-`--posix' was used (*note Options::).
+`--posix' was used (*note Options::).(2)

 Fortunately, shortly before the final release of `gawk' 4.0, the
 maintainer learned that the 2008 standard had changed the definition of
 ranges, such that outside the `"C"' and `"POSIX"' locales, the meaning
-of range expressions was _undefined_.(1)
+of range expressions was _undefined_.(3)

 By using this lovely technical term, the standard gives license to
 implementors to implement ranges in whatever way they choose. The
@@ -20233,7 +20234,14 @@ in all cases, `gawk' remains POSIX compliant.

 ---------- Footnotes ----------

- (1) See the standard
+ (1) And Life was good.
+
+ (2) And thus was born the Campain for Rational Range Interpretation
+(or RRI). A number of GNU tools, such as `grep' and `sed', have either
+implemented this change, or will soon. Thanks to Karl Berry for
+coining the phrase "Rational Range Interpretation."
+
+ (3) See the standard
 (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05)
 and its rationale
 (http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05).
@@ -27817,61 +27825,63 @@ Node: BTL798181
 Node: POSIX/GNU798915
 Node: Common Extensions804066
 Node: Ranges and Locales805173
-Ref: Ranges and Locales-Footnote-1809777
-Node: Contributors809998
-Node: Installation814260
-Node: Gawk Distribution815154
-Node: Getting815638
-Node: Extracting816464
-Node: Distribution contents818156
-Node: Unix Installation823378
-Node: Quick Installation823995
-Node: Additional Configuration Options825957
-Node: Configuration Philosophy827434
-Node: Non-Unix Installation829776
-Node: PC Installation830234
-Node: PC Binary Installation831533
-Node: PC Compiling833381
-Node: PC Testing836325
-Node: PC Using837501
-Node: Cygwin841686
-Node: MSYS842686
-Node: VMS Installation843200
-Node: VMS Compilation843803
-Ref: VMS Compilation-Footnote-1844810
-Node: VMS Installation Details844868
-Node: VMS Running846503
-Node: VMS Old Gawk848110
-Node: Bugs848584
-Node: Other Versions852436
-Node: Notes857717
-Node: Compatibility Mode858409
-Node: Additions859192
-Node: Accessing The Source860004
-Node: Adding Code861429
-Node: New Ports867396
-Node: Dynamic Extensions871509
-Node: Internals872885
-Node: Plugin License881988
-Node: Sample Library882622
-Node: Internal File Description883308
-Node: Internal File Ops887023
-Ref: Internal File Ops-Footnote-1891804
-Node: Using Internal File Ops891944
-Node: Future Extensions894321
-Node: Basic Concepts896825
-Node: Basic High Level897582
-Ref: Basic High Level-Footnote-1901617
-Node: Basic Data Typing901802
-Node: Floating Point Issues906327
-Node: String Conversion Precision907410
-Ref: String Conversion Precision-Footnote-1909110
-Node: Unexpected Results909219
-Node: POSIX Floating Point Problems911045
-Ref: POSIX Floating Point Problems-Footnote-1914750
-Node: Glossary914788
-Node: Copying939764
-Node: GNU Free Documentation License977321
-Node: Index1002458
+Ref: Ranges and Locales-Footnote-1809884
+Ref: Ranges and Locales-Footnote-2809911
+Ref: Ranges and Locales-Footnote-3810171
+Node: Contributors810392
+Node: Installation814654
+Node: Gawk Distribution815548
+Node: Getting816032
+Node: Extracting816858
+Node: Distribution contents818550
+Node: Unix Installation823772
+Node: Quick Installation824389
+Node: Additional Configuration Options826351
+Node: Configuration Philosophy827828
+Node: Non-Unix Installation830170
+Node: PC Installation830628
+Node: PC Binary Installation831927
+Node: PC Compiling833775
+Node: PC Testing836719
+Node: PC Using837895
+Node: Cygwin842080
+Node: MSYS843080
+Node: VMS Installation843594
+Node: VMS Compilation844197
+Ref: VMS Compilation-Footnote-1845204
+Node: VMS Installation Details845262
+Node: VMS Running846897
+Node: VMS Old Gawk848504
+Node: Bugs848978
+Node: Other Versions852830
+Node: Notes858111
+Node: Compatibility Mode858803
+Node: Additions859586
+Node: Accessing The Source860398
+Node: Adding Code861823
+Node: New Ports867790
+Node: Dynamic Extensions871903
+Node: Internals873279
+Node: Plugin License882382
+Node: Sample Library883016
+Node: Internal File Description883702
+Node: Internal File Ops887417
+Ref: Internal File Ops-Footnote-1892198
+Node: Using Internal File Ops892338
+Node: Future Extensions894715
+Node: Basic Concepts897219
+Node: Basic High Level897976
+Ref: Basic High Level-Footnote-1902011
+Node: Basic Data Typing902196
+Node: Floating Point Issues906721
+Node: String Conversion Precision907804
+Ref: String Conversion Precision-Footnote-1909504
+Node: Unexpected Results909613
+Node: POSIX Floating Point Problems911439
+Ref: POSIX Floating Point Problems-Footnote-1915144
+Node: Glossary915182
+Node: Copying940158
+Node: GNU Free Documentation License977715
+Node: Index1002852
 End Tag Table
diff --git a/doc/gawk.texi b/doc/gawk.texi
index fb17b716..bf30d012 100644
--- a/doc/gawk.texi
+++ b/doc/gawk.texi
@@ -66,6 +66,15 @@
 @set DARKCORNER (d.c.)
 @set COMMONEXT (c.e.)
 @end ifdocbook
+@ifxml
+@set DOCUMENT book
+@set CHAPTER chapter
+@set APPENDIX appendix
+@set SECTION section
+@set SUBSECTION subsection
+@set DARKCORNER (d.c.)
+@set COMMONEXT (c.e.)
+@end ifxml
 @ifplaintext
 @set DOCUMENT book
 @set CHAPTER chapter
@@ -27062,7 +27071,7 @@ Almost all introductory Unix literature explained range expressions
 as working in this fashion, and in particular, would teach that the
 ``correct'' way to match lowercase letters was with @samp{[a-z]}, and
 that @samp{[A-Z]} was the ``correct'' way to match uppercase letters.
-And indeed, this was true.
+And indeed, this was true.@footnote{And Life was good.}

 The 1993 POSIX standard introduced the idea of locales (@pxref{Locales}).
 Since many locales include other letters besides the plain twenty-six
@@ -27080,12 +27089,14 @@ But outside those locales, the ordering was defined to be based on
 In many locales, @samp{A} and @samp{a} are both less than @samp{B}.
 In other words, these locales sort characters in dictionary order,
 and @samp{[a-dx-z]} is typically not equivalent to @samp{[abcdxyz]};
-instead it might be equivalent to @samp{[aBbCcdXxYyz]}, for example.
+instead it might be equivalent to @samp{[aBbCcDdXxYyZz]}, for example.
+(And to make things worse, on other systems, it might be equivalent to
+@samp{[aAbBcCdDxXyYz]}.)

 This point needs to be emphasized: Much literature teaches that
 you should use @samp{[a-z]} to match a lowercase character. But on systems
 with non-ASCII locales, this also matched all of the uppercase characters
-except @samp{Z}! This was a continuous cause of confusion, even well
+except @samp{A} or @samp{Z}! This was a continuous cause of confusion, even well
 into the twenty-first century.

 To demonstrate these issues, the following example uses the @code{sub()}
@@ -27121,13 +27132,16 @@ the @command{gawk} maintainer grew weary of trying to explain that
 @command{gawk} was being nicely standards-compliant, and that the issue
 was in the user's locale. During the development of version 4.0,
 he modified @command{gawk} to always treat ranges in the original,
-pre-POSIX fashion, unless @option{--posix} was used (@pxref{Options}).
+pre-POSIX fashion, unless @option{--posix} was used (@pxref{Options}).@footnote{And
+thus was born the Campain for Rational Range Interpretation (or RRI). A number
+of GNU tools, such as @command{grep} and @command{sed}, have either
+implemented this change, or will soon. Thanks to Karl Berry for coining the phrase
+``Rational Range Interpretation.''}

 Fortunately, shortly before the final release of @command{gawk} 4.0,
 the maintainer learned that the 2008 standard had changed the
 definition of ranges, such that outside the @code{"C"} and @code{"POSIX"}
-locales, the meaning of range expressions was
-@emph{undefined}.@footnote{See
+locales, the meaning of range expressions was @emph{undefined}.@footnote{See
 @uref{http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05, the standard}
 and
 @uref{http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05, its rationale}.}
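
The patched text says that, under dictionary-order collation, `[a-z]` matches every uppercase letter except `A' or `Z', depending on the system. The following is a minimal Python sketch of why that happens, not gawk's implementation. The three collating sequences and the helper names (`expand_range`, `uppercase_missed`) are assumptions made for illustration; real locale tables contain many more characters and are not simple interleavings.

```python
import string

LOWER = string.ascii_lowercase
UPPER = string.ascii_uppercase

# Hypothetical, simplified collating sequences (assumptions for illustration).
C_ORDER = UPPER + LOWER                                      # ABC...Zabc...z
UPPER_FIRST = "".join(u + l for u, l in zip(UPPER, LOWER))   # AaBbCc...Zz
LOWER_FIRST = "".join(l + u for l, u in zip(LOWER, UPPER))   # aAbBcC...zZ


def expand_range(first, last, order):
    """Return every character from first to last, inclusive, in the given order."""
    start, end = order.index(first), order.index(last)
    if start > end:
        raise ValueError(f"invalid range {first}-{last} in this collating sequence")
    return order[start:end + 1]


def uppercase_missed(order):
    """Uppercase letters that [a-z] does NOT match under the given order."""
    matched = set(expand_range("a", "z", order))
    return sorted(set(UPPER) - matched)


if __name__ == "__main__":
    for name, order in (("C_ORDER", C_ORDER),
                        ("UPPER_FIRST", UPPER_FIRST),
                        ("LOWER_FIRST", LOWER_FIRST)):
        missed = "".join(uppercase_missed(order))
        print(f"{name}: [a-z] misses uppercase {missed}")
```

In this model the ASCII-style ordering keeps `[a-z]` strictly lowercase, while each interleaved ordering pulls in all uppercase letters but one, which is the behavior the "Rational Range Interpretation" change sidesteps by always using the first ordering.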