Diffstat (limited to 'doc')
-rw-r--r-- doc/mmutf8fix.html 29
1 files changed, 19 insertions, 10 deletions
diff --git a/doc/mmutf8fix.html b/doc/mmutf8fix.html
index c75e71bc..1a98f660 100644
--- a/doc/mmutf8fix.html
+++ b/doc/mmutf8fix.html
@@ -17,17 +17,12 @@ in non-UTF character sets, e.g. ISO 8859. As syslog does not have a way
to convey the character set information, these sequences are not properly
handled. While they are typically uncritical with plain text files, they can
cause big headache with database sources as well as systems like ElasticSearch.
-<p>The module is an experiement at "fixing" such encoding problems. It
-begun as a very simple replacer of non-control characters, and actually breaks
-some UTF-8 encoding right now. If the module turns out to be useful, it
-should be enhanced to support modes that really detect invalid UTF8. In the longer term
+<p>The module supports different "fixing" modes and fixes. The current
+implementation will always replace invalid bytes with a single US ASCII
+character. Additional replacement modes will probably be added in the future,
+depending on user demand. In the longer term
it could also be evolved into an any-charset-to-UTF8 converter. But
first let's see if it really gets into widespread enough use.
-<p>What it currently does is simply replace all US-ASCII control characters
-(characters ouside the range of 32 to 126) by a configured replacement
-character. For forward compatibility, this will remain the default mode
-in the future. However, as said above, more useful modes will be added
-based on user feedback and demand.
<p><b>Proper Usage</b>:</p>
<p>Some notes are due for proper use of this module. This is a message modification
@@ -50,8 +45,22 @@ ruleset.
<p>&nbsp;</p>
<p><b>Action Confguration Parameters</b>:</p>
<ul>
+<li><b>mode</b> - <b>utf8</b>/controlcharacters<br>
+This sets the basic detection mode.
+<br>In <b>utf8</b> mode (the default), proper
+UTF-8 encoding is checked and bytes which are not proper UTF-8 sequences
+are acted on. If a proper multi-byte start sequence byte is detected but
+any of the following bytes is invalid, the whole sequence is replaced by
+the replacement method. This mode is most useful with non-US-ASCII character
+sets, which validly includes multibyte sequences. Note that in this mode
+control characters are NOT being replaced, because they are valid UTF-8.
+<br>In <b>controlcharacters</b> mode, all bytes which do not represent a
+printable US-ASCII character (codes 32 to 126) are replaced. Note that this
+also mangles valid UTF-8 multi-byte sequences, as these are (deliberately) outside
+of that character range.
<li><b>replacementChar</b> - default " " (space), a single character<br>
-This is the character that invalid sequences are replaced by.
+This is the character that invalid sequences are replaced by. Currently, it
+MUST be a <b>printable</b> US-ASCII character.
</ul>
<p><b>Caveats/Known Bugs:</b>
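
Note on the two modes documented in this patch: the following is a minimal Python sketch of the described behavior, not the module's actual C implementation. The function names (`fix_controlcharacters`, `fix_utf8`, `check_replacement`) are hypothetical. The sketch assumes that "the whole sequence is replaced" means one replacement character per byte, covering the start byte and any continuation bytes read before the sequence broke off. It only checks the lead-byte/continuation-byte structure and does not reject overlong or surrogate encodings.

```python
def check_replacement(repl: bytes) -> int:
    """Return the replacement byte; it must be one printable US-ASCII char."""
    if len(repl) != 1 or not 32 <= repl[0] <= 126:
        raise ValueError("replacementChar must be a printable US-ASCII character")
    return repl[0]


def fix_controlcharacters(data: bytes, repl: bytes = b" ") -> bytes:
    """controlcharacters mode: replace every byte outside codes 32..126."""
    r = check_replacement(repl)
    return bytes(b if 32 <= b <= 126 else r for b in data)


def fix_utf8(data: bytes, repl: bytes = b" ") -> bytes:
    """utf8 mode: replace bytes that are not part of a well-formed sequence."""
    r = check_replacement(repl)
    out = bytearray(data)
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:            # plain ASCII, including control characters
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:   # start byte of a 2-byte sequence
            need = 1
        elif 0xE0 <= b <= 0xEF: # start byte of a 3-byte sequence
            need = 2
        elif 0xF0 <= b <= 0xF4: # start byte of a 4-byte sequence
            need = 3
        else:                   # stray continuation byte or invalid start byte
            out[i] = r
            i += 1
            continue
        # Consume the continuation bytes (10xxxxxx) that actually follow.
        j = i + 1
        while j <= i + need and j < n and (data[j] & 0xC0) == 0x80:
            j += 1
        if j - i - 1 != need:
            # Incomplete sequence: blank out every byte read so far
            # (assumed interpretation of "the whole sequence is replaced").
            for k in range(i, j):
                out[k] = r
        i = j                   # resume at the first byte not consumed
    return bytes(out)
```

With these definitions, a Latin-1 "café" (`b"caf\xe9"`) becomes `b"caf "` in either mode, while the valid UTF-8 form `b"caf\xc3\xa9"` survives `fix_utf8` unchanged but is mangled to `b"caf  "` by `fix_controlcharacters`, matching the trade-off the new text describes.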