author | Arnold D. Robbins <arnold@skeeve.com> | 2012-07-20 12:26:59 +0300 |
---|---|---|
committer | Arnold D. Robbins <arnold@skeeve.com> | 2012-07-20 12:26:59 +0300 |
commit | 7bfc288d27bacb715ff63dbf71be53304917685a (patch) | |
tree | f575046eebd32bb710198e45072ec30e71255e7f /doc/gawk.texi | |
parent | 4fe1f4ac1aa0e4b99c9abb26794fc0d10ebb77c6 (diff) | |
Fix doc on ranges and locales.
Diffstat (limited to 'doc/gawk.texi')
-rw-r--r-- | doc/gawk.texi | 26 |
1 files changed, 20 insertions, 6 deletions
diff --git a/doc/gawk.texi b/doc/gawk.texi
index fb17b716..bf30d012 100644
--- a/doc/gawk.texi
+++ b/doc/gawk.texi
@@ -66,6 +66,15 @@
 @set DARKCORNER (d.c.)
 @set COMMONEXT (c.e.)
 @end ifdocbook
+@ifxml
+@set DOCUMENT book
+@set CHAPTER chapter
+@set APPENDIX appendix
+@set SECTION section
+@set SUBSECTION subsection
+@set DARKCORNER (d.c.)
+@set COMMONEXT (c.e.)
+@end ifxml
 @ifplaintext
 @set DOCUMENT book
 @set CHAPTER chapter
@@ -27062,7 +27071,7 @@ Almost all introductory Unix literature explained range expressions
 as working in this fashion, and in particular, would teach that the
 ``correct'' way to match lowercase letters was with @samp{[a-z]}, and
 that @samp{[A-Z]} was the ``correct'' way to match uppercase letters.
-And indeed, this was true.
+And indeed, this was true.@footnote{And Life was good.}

 The 1993 POSIX standard introduced the idea of locales (@pxref{Locales}).
 Since many locales include other letters besides the plain twenty-six
@@ -27080,12 +27089,14 @@ But outside those locales, the ordering was defined to be based on
 In many locales, @samp{A} and @samp{a} are both less than @samp{B}.
 In other words, these locales sort characters in dictionary order,
 and @samp{[a-dx-z]} is typically not equivalent to @samp{[abcdxyz]};
-instead it might be equivalent to @samp{[aBbCcdXxYyz]}, for example.
+instead it might be equivalent to @samp{[aBbCcDdXxYyZz]}, for example.
+(And to make things worse, on other systems, it might be equivalent to
+@samp{[aAbBcCdDxXyYz]}.)

 This point needs to be emphasized: Much literature teaches that you should
 use @samp{[a-z]} to match a lowercase character. But on systems with
 non-ASCII locales, this also matched all of the uppercase characters
-except @samp{Z}! This was a continuous cause of confusion, even well
+except @samp{A} or @samp{Z}! This was a continuous cause of confusion, even well
 into the twenty-first century.

 To demonstrate these issues, the following example uses the @code{sub()}
@@ -27121,13 +27132,16 @@ the @command{gawk} maintainer grew weary of trying to explain that
 @command{gawk} was being nicely standards-compliant, and that the issue
 was in the user's locale. During the development of version 4.0,
 he modified @command{gawk} to always treat ranges in the original,
-pre-POSIX fashion, unless @option{--posix} was used (@pxref{Options}).
+pre-POSIX fashion, unless @option{--posix} was used (@pxref{Options}).@footnote{And
+thus was born the Campain for Rational Range Interpretation (or RRI). A number
+of GNU tools, such as @command{grep} and @command{sed}, have either
+implemented this change, or will soon. Thanks to Karl Berry for coining the phrase
+``Rational Range Interpretation.''}

 Fortunately, shortly before the final release of @command{gawk} 4.0,
 the maintainer learned that the 2008 standard had changed the
 definition of ranges, such that outside the @code{"C"} and @code{"POSIX"}
-locales, the meaning of range expressions was
-@emph{undefined}.@footnote{See
+locales, the meaning of range expressions was @emph{undefined}.@footnote{See
 @uref{http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html#tag_09_03_05, the standard}
 and
 @uref{http://pubs.opengroup.org/onlinepubs/9699919799/xrat/V4_xbd_chap09.html#tag_21_09_03_05, its rationale}.}
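
The patched text changes "except Z" to "except A or Z" because the outcome depends on whether a locale collates uppercase before lowercase or the reverse. The following is a minimal Python sketch of that reasoning, not gawk's implementation. The two interleaved orderings and the helper names `expand_range` and `missing_uppercase` are assumptions made up for illustration and are not taken from any real locale definition.

```python
import string

UPPER = string.ascii_uppercase
LOWER = string.ascii_lowercase

# Hypothetical collation orders, for illustration only.
upper_first = "".join(u + l for u, l in zip(UPPER, LOWER))  # AaBbCc...Zz
lower_first = "".join(l + u for u, l in zip(UPPER, LOWER))  # aAbBcC...zZ
ascii_order = UPPER + LOWER                                 # ABC...Zabc...z


def expand_range(lo, hi, order):
    """Return every character from lo through hi in the given collation order."""
    return order[order.index(lo):order.index(hi) + 1]


def missing_uppercase(order):
    """List the uppercase letters that the range a-z does NOT cover in this order."""
    matched = set(expand_range("a", "z", order))
    return [c for c in UPPER if c not in matched]


print(missing_uppercase(upper_first))       # ['A']
print(missing_uppercase(lower_first))       # ['Z']
print(len(missing_uppercase(ascii_order)))  # 26
```

Under the uppercase-first ordering only `A` falls outside `a-z`, under the lowercase-first ordering only `Z` does, and under plain ASCII ordering the range covers no uppercase letters at all, which is the pre-POSIX behavior the patch describes gawk 4.0 restoring.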