1 files changed, 105 insertions, 0 deletions
diff --git a/doc/mmutf8fix.html b/doc/mmutf8fix.html
new file mode 100644
index 00000000..6275c17e
--- /dev/null
+++ b/doc/mmutf8fix.html
@@ -0,0 +1,105 @@
+<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
+<html><head>
+<meta http-equiv="Content-Language" content="en">
+<title>Fix invalid UTF-8 Sequences (mmutf8fix)</title></head>
+
+<body>
+<a href="rsyslog_conf_modules.html">back</a>
+
+<h1>Fix invalid UTF-8 Sequences (mmutf8fix)</h1>
+<p><b>Module Name:&nbsp;&nbsp;&nbsp; mmutf8fix</b></p>
+<p><b>Author: </b>Rainer Gerhards &lt;rgerhards@adiscon.com&gt;</p>
+<p><b>Available since</b>: 7.5.4</p>
+<p><b>Description</b>:</p>
+<p>The mmutf8fix module permits to fix invalid UTF-8 sequences.
+Most often, such invalid sequences result from syslog sources sending
+in non-UTF character sets, e.g. ISO 8859. As syslog does not have a way
+to convey the character set information, these sequences are not properly
+handled. While they are typically uncritical with plain text files, they can
+cause big headache with database sources as well as systems like ElasticSearch.
+<p>The module supports different "fixing" modes and fixes. The current
+implementation will always replace invalid bytes with a single US ASCII
+character. Additional replacement modes will probably be added in the future,
+depending on user demand.  In the longer term
+it could also be evolved into an any-charset-to-UTF8 converter. But
+first let's see if it really gets into widespread enough use.
+
+<p><b>Proper Usage</b>:</p>
+<p>Some notes are due for proper use of this module. This is a message modification
+module utilizing the action interface, which means you call it like an action.
+This gives great flexibility on the question on when and how to call this module.
+Note that once it has been called, it actually modifies the message. The original
+messsage is then no longer available. However, this does <b>not</b> change any
+properties set, used or extracted before the modification is done.
+<p>One potential use case is to normalize all messages. This is done by simply calling
+mmutf8fix right in front of all other actions.
+<p>If only a specific source (or set of sources) is known to cause problems,
+mmutf8fix can be conditionally called only on messages from them. This also offers
+performance benefits. If such multiple sources exists, it probably is a good idea
+to define different listeners for their incoming traffic, bind them to specific
+<a href="multi_ruleset.html">ruleset</a> and call mmutf8fix as first action in this
+ruleset.
+
+<p><b>Module Configuration Parameters</b>:</p>
+<p>Currently none.
+<p>&nbsp;</p>
+<p><b>Action Confguration Parameters</b>:</p>
+<ul>
+<li><b>mode</b> - <b>utf-8</b>/controlcharacters<br>
+This sets the basic detection mode.
+<br>In <b>utf-8</b> mode (the default), proper
+UTF-8 encoding is checked and bytes which are not proper UTF-8 sequences
+are acted on. If a proper multi-byte start sequence byte is detected but
+any of the following bytes is invalid, the whole sequence is replaced by
+the replacement method. This mode is most useful with non-US-ASCII character
+sets, which validly includes multibyte sequences. Note that in this mode
+control characters are NOT being replaced, because they are valid UTF-8.
+<br>In <b>controlcharacters</b> mode, all bytes which do not represent a
+printable US-ASCII character (codes 32 to 126) are replaced. Note that this
+also mangles valid UTF-8 multi-byte sequences, as these are (deliberately) outside
+of that character range. This mode is most useful if it is known that no
+characters outside of the US-ASCII alphabet need to be processed.
+<li><b>replacementChar</b> - default " " (space), a single character<br>
+This is the character that invalid sequences are replaced by. Currently, it
+MUST be a <b>printable</b> US-ASCII character.
+</ul>
+
+<p><b>Caveats/Known Bugs:</b>
+<ul>
+<li>overlong UTF-8 encodings are currently not detected in utf-8 mode.
+</ul>
+
+<p><b>Samples:</b></p>
+<p>In this snippet, we write one file without fixing UTF-8 and another one
+with the message fixed. Note that once mmutf8fix has run, access to the 
+original message is no longer possible.
+<p><textarea rows="5" cols="60">module(load="mmutf8fix")
+action(type="omfile" file="/path/to/non-fixed.log")
+action(type="mmutf8fix")
+action(type="omfile" file="/path/to/fixed.log")
+</textarea>
+
+<p>In this sample, we fix only message originating from host 10.0.0.1.
+<p><textarea rows="5" cols="60">module(load="mmutf8fix")
+if $fromhost-ip == "10.0.0.1" then
+    action(type="mmutf8fix")
+# all other actions here...
+</textarea>
+
+<p>This is mostly the same as the previous sample, but uses "controlcharacters"
+processing mode.
+<p><textarea rows="5" cols="60">module(load="mmutf8fix")
+if $fromhost-ip == "10.0.0.1" then
+    action(type="mmutf8fix" mode="controlcharacters")
+# all other actions here...
+</textarea>
+
+<p>[<a href="rsyslog_conf.html">rsyslog.conf overview</a>] [<a href="manual.html">manual 
+index</a>] [<a href="http://www.rsyslog.com/">rsyslog site</a>]</p>
+<p><font size="2">This documentation is part of the
+<a href="http://www.rsyslog.com/">rsyslog</a> project.<br>
+Copyright &copy; 2013 by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a> and
+<a href="http://www.adiscon.com/">Adiscon</a>. Released under the GNU GPL 
+version 3 or higher.</font></p>
+
+</body></html>