diff options
Diffstat (limited to 'doc/mmutf8fix.html')
-rw-r--r-- | doc/mmutf8fix.html | 105 |
1 files changed, 105 insertions, 0 deletions
diff --git a/doc/mmutf8fix.html b/doc/mmutf8fix.html new file mode 100644 index 00000000..6275c17e --- /dev/null +++ b/doc/mmutf8fix.html @@ -0,0 +1,105 @@ +<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"> +<html><head> +<meta http-equiv="Content-Language" content="en"> +<title>Fix invalid UTF-8 Sequences (mmutf8fix)</title></head> + +<body> +<a href="rsyslog_conf_modules.html">back</a> + +<h1>Fix invalid UTF-8 Sequences (mmutf8fix)</h1> +<p><b>Module Name: mmutf8fix</b></p> +<p><b>Author: </b>Rainer Gerhards <rgerhards@adiscon.com></p> +<p><b>Available since</b>: 7.5.4</p> +<p><b>Description</b>:</p> +<p>The mmutf8fix module permits to fix invalid UTF-8 sequences. +Most often, such invalid sequences result from syslog sources sending +in non-UTF character sets, e.g. ISO 8859. As syslog does not have a way +to convey the character set information, these sequences are not properly +handled. While they are typically uncritical with plain text files, they can +cause big headache with database sources as well as systems like ElasticSearch. +<p>The module supports different "fixing" modes and fixes. The current +implementation will always replace invalid bytes with a single US ASCII +character. Additional replacement modes will probably be added in the future, +depending on user demand. In the longer term +it could also be evolved into an any-charset-to-UTF8 converter. But +first let's see if it really gets into widespread enough use. + +<p><b>Proper Usage</b>:</p> +<p>Some notes are due for proper use of this module. This is a message modification +module utilizing the action interface, which means you call it like an action. +This gives great flexibility on the question on when and how to call this module. +Note that once it has been called, it actually modifies the message. The original +messsage is then no longer available. However, this does <b>not</b> change any +properties set, used or extracted before the modification is done. +<p>One potential use case is to normalize all messages. This is done by simply calling +mmutf8fix right in front of all other actions. +<p>If only a specific source (or set of sources) is known to cause problems, +mmutf8fix can be conditionally called only on messages from them. This also offers +performance benefits. If such multiple sources exists, it probably is a good idea +to define different listeners for their incoming traffic, bind them to specific +<a href="multi_ruleset.html">ruleset</a> and call mmutf8fix as first action in this +ruleset. + +<p><b>Module Configuration Parameters</b>:</p> +<p>Currently none. +<p> </p> +<p><b>Action Confguration Parameters</b>:</p> +<ul> +<li><b>mode</b> - <b>utf-8</b>/controlcharacters<br> +This sets the basic detection mode. +<br>In <b>utf-8</b> mode (the default), proper +UTF-8 encoding is checked and bytes which are not proper UTF-8 sequences +are acted on. If a proper multi-byte start sequence byte is detected but +any of the following bytes is invalid, the whole sequence is replaced by +the replacement method. This mode is most useful with non-US-ASCII character +sets, which validly includes multibyte sequences. Note that in this mode +control characters are NOT being replaced, because they are valid UTF-8. +<br>In <b>controlcharacters</b> mode, all bytes which do not represent a +printable US-ASCII character (codes 32 to 126) are replaced. Note that this +also mangles valid UTF-8 multi-byte sequences, as these are (deliberately) outside +of that character range. This mode is most useful if it is known that no +characters outside of the US-ASCII alphabet need to be processed. +<li><b>replacementChar</b> - default " " (space), a single character<br> +This is the character that invalid sequences are replaced by. Currently, it +MUST be a <b>printable</b> US-ASCII character. +</ul> + +<p><b>Caveats/Known Bugs:</b> +<ul> +<li>overlong UTF-8 encodings are currently not detected in utf-8 mode. +</ul> + +<p><b>Samples:</b></p> +<p>In this snippet, we write one file without fixing UTF-8 and another one +with the message fixed. Note that once mmutf8fix has run, access to the +original message is no longer possible. +<p><textarea rows="5" cols="60">module(load="mmutf8fix") +action(type="omfile" file="/path/to/non-fixed.log") +action(type="mmutf8fix") +action(type="omfile" file="/path/to/fixed.log") +</textarea> + +<p>In this sample, we fix only message originating from host 10.0.0.1. +<p><textarea rows="5" cols="60">module(load="mmutf8fix") +if $fromhost-ip == "10.0.0.1" then + action(type="mmutf8fix") +# all other actions here... +</textarea> + +<p>This is mostly the same as the previous sample, but uses "controlcharacters" +processing mode. +<p><textarea rows="5" cols="60">module(load="mmutf8fix") +if $fromhost-ip == "10.0.0.1" then + action(type="mmutf8fix" mode="controlcharacters") +# all other actions here... +</textarea> + +<p>[<a href="rsyslog_conf.html">rsyslog.conf overview</a>] [<a href="manual.html">manual +index</a>] [<a href="http://www.rsyslog.com/">rsyslog site</a>]</p> +<p><font size="2">This documentation is part of the +<a href="http://www.rsyslog.com/">rsyslog</a> project.<br> +Copyright © 2013 by <a href="http://www.gerhards.net/rainer">Rainer Gerhards</a> and +<a href="http://www.adiscon.com/">Adiscon</a>. Released under the GNU GPL +version 3 or higher.</font></p> + +</body></html> |